WAL logging problem in 9.4.3?
Hoi,
I ran into this in our CI setup and I think it's an actual bug. The
issue appears to be that when a table is created *and truncated* in a
single transaction, the WAL logs a truncate record it
shouldn't, such that if the database crashes you end up with a broken index.
It would also lose any data that was in the table at commit time.
I didn't test 9.4.4 yet, though I don't see anything in the release
notes that resembles this.
Detail:
=== Start with an empty database
martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.
ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
relname | relfilenode
-------------+-------------
test | 16389
test_id_seq | 16387
test_pkey | 16393
(3 rows)
=== Note: if you do a CHECKPOINT here the issue doesn't happen
=== obviously.
ctmp=# \q
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
[sudo] password for martijn:
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:34 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:34 /data/postgres/base/16385/16393
=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in).
=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.
martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740
=== Start the DB up again
database_1 | LOG: database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG: database system was not properly shut down; automatic recovery in progress
database_1 | LOG: redo starts at 0/16A92A8
database_1 | LOG: record with zero length at 0/16BE740
database_1 | LOG: redo done at 0/16BE710
database_1 | LOG: last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG: database system is ready to accept connections
database_1 | LOG: autovacuum launcher started
=== Oops, the index file is empty now
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:37 /data/postgres/base/16385/16393
martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.
=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK. But if
=== the table had data before it would be gone now.
ctmp=# select * from test;
ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
ctmp=# select version();
version
-----------------------------------------------------------------------------------------------
PostgreSQL 9.4.3 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)
Hope this helps.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Hi,
On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
=== Start with an empty database
My guess is you have wal_level = minimal?
ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
relname | relfilenode
-------------+-------------
test | 16389
test_id_seq | 16387
test_pkey | 16393
(3 rows)
=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in).
How did you continue from there? The container has persistent storage?
Or are you reapplying the WAL to somewhere else?
=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.
That should be ok.
martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740
Note that the truncate will lead to a new, different, relfilenode.
=== Start the DB up again
database_1 | LOG: database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG: database system was not properly shut down; automatic recovery in progress
database_1 | LOG: redo starts at 0/16A92A8
database_1 | LOG: record with zero length at 0/16BE740
database_1 | LOG: redo done at 0/16BE710
database_1 | LOG: last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG: database system is ready to accept connections
database_1 | LOG: autovacuum launcher started
=== Oops, the index file is empty now
That's probably just the old index file?
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:37 /data/postgres/base/16385/16393
martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK. But if
=== the table had data before it would be gone now.
ctmp=# select * from test;
ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?
Greetings,
Andres Freund
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
Hi,
On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
=== Start with an empty database
My guess is you have wal_level = minimal?
Default config, was just initdb'd. So yes, the default wal_level =
minimal.
=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in).
How did you continue from there? The container has persistent storage?
Or are you reapplying the WAL to somewhere else?
The container has persistent storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec', so postgres never gets the signal to shut down.
martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740
Note that the truncate will lead to a new, different, relfilenode.
Really? Comparing the relfilenodes gives the same values before and
after the truncate.
ctmp=# select * from test;
ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?
Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
Hi,
On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
=== Start with an empty database
My guess is you have wal_level = minimal?
Default config, was just initdb'd. So yes, the default wal_level =
minimal.
=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in).
How did you continue from there? The container has persistent storage?
Or are you reapplying the WAL to somewhere else?
The container has persistent storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec" so postgres never gets the signal to shutdown.martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740
Note that the truncate will lead to a new, different, relfilenode.
Really? Comparing the relfilenodes gives the same values before and
after the truncate.
Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.
ctmp=# select * from test;
ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?
Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.
I could reproduce the problem in the master branch by doing
the following steps.
1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
begin;
create table test(id serial primary key);
truncate table test;
commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
select * from test;
The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..
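(For illustration only, a rough sketch of what "log the index creation" could mean here: after the in-transaction truncation, emit a WAL record that recreates an empty but valid btree, i.e. a fresh metapage, much like btbuildempty() does. The helper name and call site are hypothetical; _bt_initmetapage() and log_newpage() are real backend functions, but their exact signatures and header locations vary a little between branches.)

```c
#include "postgres.h"

#include "access/nbtree.h"
#include "access/xlog.h"		/* log_newpage(); access/xloginsert.h in newer branches */
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Hypothetical helper: after truncating an index whose build was not
 * WAL-logged (wal_level = minimal), write a fresh btree metapage into WAL
 * so that crash recovery leaves behind a readable, empty index instead of
 * a zero-length file.
 */
static void
log_empty_btree_metapage(Relation indexrel)
{
	Page		metapage = (Page) palloc(BLCKSZ);

	/* an empty btree: metapage with no root and level 0 */
	_bt_initmetapage(metapage, P_NONE, 0);

	/* WAL-log it as a full page image of block 0 of the index */
	log_newpage(&indexrel->rd_node, MAIN_FORKNUM, BTREE_METAPAGE,
				metapage, false);

	pfree(metapage);
}
```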
Regards,
--
Fujii Masao
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.
I could reproduce the problem in the master branch by doing
the following steps.
Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.
1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
begin;
create table test(id serial primary key);
truncate table test;
commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
select * from test;
The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..
Looks plausible to me.
For reference I attach a small tarball for reproduction with docker.
1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test
Data dir is in /tmp/pgtest
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
On Fri, Jul 3, 2015 at 3:01 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
Hmm, for me it is 100% reproducible. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.
I could reproduce the problem in the master branch by doing
the following steps.
Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.
1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
begin;
create table test(id serial primary key);
truncate table test;
commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
select * from test;
The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..
In 9.2 or before, this problem doesn't occur because no such error is thrown
even if an index file size is zero. But in 9.3 or later, since the planner
tries to read a meta page of an index to get the height of the btree,
an empty index file causes such error. The planner was changed that way by
commit 31f38f28, and the problem seems to be an oversight of that commit.
I'm not familiar with that change of the planner, but ISTM that we can
simply change _bt_getrootheight() so that 0 is returned if an index file is
empty, i.e., meta page cannot be read, in order to work around the problem.
Thought?
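(Purely to illustrate that suggestion, which Tom argues against just below, a sketch of such a guard; the wrapper function is made up, and only the zero-block check is the point.)

```c
#include "postgres.h"

#include "access/nbtree.h"		/* _bt_getrootheight() */
#include "storage/bufmgr.h"		/* RelationGetNumberOfBlocks() */
#include "utils/rel.h"

/*
 * Hypothetical guard: treat an index whose file contains no blocks at all
 * (not even a metapage) as a zero-height tree instead of erroring out when
 * the planner asks for its height.  This papers over the symptom only; the
 * underlying file is still not a valid btree.
 */
static int
bt_getrootheight_tolerant(Relation indexrel)
{
	if (RelationGetNumberOfBlocks(indexrel) == 0)
		return 0;

	return _bt_getrootheight(indexrel); /* existing code path */
}
```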
Regards,
--
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..
In 9.2 or before, this problem doesn't occur because no such error is thrown
even if an index file size is zero. But in 9.3 or later, since the planner
tries to read a meta page of an index to get the height of the btree,
an empty index file causes such error. The planner was changed that way by
commit 31f38f28, and the problem seems to be an oversight of that commit.
What? You want to blame the planner for failing because the index was
left corrupt by broken WAL replay? A failure would occur anyway at
execution.
regards, tom lane
On Fri, Jul 3, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..
In 9.2 or before, this problem doesn't occur because no such error is thrown
even if an index file size is zero. But in 9.3 or later, since the planner
tries to read a meta page of an index to get the height of the btree,
an empty index file causes such error. The planner was changed that way by
commit 31f38f28, and the problem seems to be an oversight of that commit.
What? You want to blame the planner for failing because the index was
left corrupt by broken WAL replay? A failure would occur anyway at
execution.
Yep, right. I was not thinking of such an index with file size 0 as corrupted
because the reported problem didn't happen before that commit was added.
But that's my fault. Such an index can cause an error even in other code paths.
Okay, so probably we need to change WAL replay of TRUNCATE so that
the index file is truncated to one containing only meta page instead of
empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate() maybe.
Then how should we implement that? Invent new WAL record type that
calls smgrtruncate() and index_build() during WAL replay? Or add the
special flag to XLOG_SMGR_TRUNCATE record, and make WAL replay
call index_build() only if the flag is found? Any other good idea?
Anyway ISTM that we might need to add or modify WAL record.
Regards,
--
Fujii Masao
On 2015-07-04 01:39:42 +0900, Fujii Masao wrote:
Okay, so probably we need to change WAL replay of TRUNCATE so that
the index file is truncated to one containing only meta page instead of
empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate() maybe.
Then how should we implement that? Invent new WAL record type that
calls smgrtruncate() and index_build() during WAL replay? Or add the
special flag to XLOG_SMGR_TRUNCATE record, and make WAL replay
call index_build() only if the flag is found? Any other good idea?
Anyway ISTM that we might need to add or modify WAL record.
It's easy enough to log something like a metapage with
log_newpage().
But the more interesting question is why that's not happening
today. RelationTruncateIndexes() does call the index_build() which
should end up WAL logging the index creation.
Fujii Masao <masao.fujii@gmail.com> writes:
Okay, so probably we need to change WAL replay of TRUNCATE so that
the index file is truncated to one containing only meta page instead of
empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate() maybe.
That seems completely unworkable. For one thing, index_build would expect
to be able to do catalog lookups, but we can't assume that the catalogs
are in a good state yet.
I think the responsibility has to be on the WAL-writing end to emit WAL
instructions that lead to a correct on-disk state. Putting complex
behavior into the reading side is fundamentally misguided.
regards, tom lane
On 2015-07-03 18:49:31 +0200, Andres Freund wrote:
But the more interesting question is why that's not happening
today. RelationTruncateIndexes() does call the index_build() which
should end up WAL logging the index creation.
So that's because there's an XLogIsNeeded() preventing it.
Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
right now missing how the whole "skip wal logging if relation has just
been truncated" optimization can ever actually be crashsafe unless we
use a new relfilenode (which we don't!).
Sure, we do a heap_sync() at the end of the transaction. That's
nice and all. But it doesn't help if we crash and re-start WAL apply
from a checkpoint before the table was created. Because that'll replay
the truncation.
That's much worse than just the indexes - the rows added by a COPY
without WAL logging will also be truncated away, no?
On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
Okay, so probably we need to change WAL replay of TRUNCATE so that
the index file is truncated to one containing only meta page instead of
empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate() maybe.
That seems completely unworkable. For one thing, index_build would expect
to be able to do catalog lookups, but we can't assume that the catalogs
are in a good state yet.
I think the responsibility has to be on the WAL-writing end to emit WAL
instructions that lead to a correct on-disk state. Putting complex
behavior into the reading side is fundamentally misguided.
Am I missing something? ISTM that if the truncate record was simply not
logged at all everything would work fine. The whole point is that the
table was created in this transaction and so if it exists the table on
disk must be the correct representation.
The broken index is just one symptom. The heap also shouldn't be
truncated at all. If you insert a row before commit then after replay
the tuple should be there still.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
Am I missing something? ISTM that if the truncate record was simply not
logged at all everything would work fine. The whole point is that the
table was created in this transaction and so if it exists the table on
disk must be the correct representation.
That'd not work either. Consider:
BEGIN;
CREATE TABLE ...
INSERT;
TRUNCATE;
INSERT;
COMMIT;
If you replay that without a truncation wal record the second INSERT
will try to add stuff to already occupied space. And they can have
different lengths and stuff, so you cannot just ignore that fact.
The broken index is just one symptom.
Agreed. I think the problem is something else though. Namely that we
reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
broken afaics. We need to allocate a new relfilenode and write stuff
into that. Then we can forgo WAL logging the truncation record.
If you insert a row before commit then after replay the tuple should be there still.
The insert would be WAL logged. COPY skips wal logging tho.
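(A sketch of that idea, for illustration only: give the relation a fresh relfilenode instead of truncating the existing file in place, so there is never a truncate record to replay over un-WAL-logged contents. The helper name is made up; RelationSetNewRelfilenode() is real, but its exact signature differs between branches, and toast/index handling is only hinted at.)

```c
#include "postgres.h"

#include "access/multixact.h"
#include "access/transam.h"
#include "utils/rel.h"
#include "utils/relcache.h"

/*
 * Hypothetical variant of heap_truncate_one_rel(): rather than truncating
 * the current relfilenode (which needs a WAL truncate record that replay
 * will apply unconditionally), assign a brand-new empty relfilenode.  The
 * old file is dropped at commit/abort by the pending-deletes machinery, so
 * nothing about its contents ever has to be replayed.
 */
static void
truncate_by_new_relfilenode(Relation rel)
{
	/* 9.4-era call; newer branches also take a relpersistence argument */
	RelationSetNewRelfilenode(rel, RecentXmin, GetOldestMultiXactId());

	/*
	 * The toast table and indexes would need the same treatment, much as
	 * ExecuteTruncate() already arranges for the non-optimized path.
	 */
}
```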
On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
right now missing how the whole "skip wal logging if relation has just
been truncated" optimization can ever actually be crashsafe unless we
use a new relfilenode (which we don't!).
We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869
commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Aug 23 19:23:41 2009 +0000
Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction. This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway. This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated. Per a performance
gripe a couple weeks ago from Todd Cook.
to me the reasoning here looks flawed.
On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:
On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
Am I missing something. ISTM that if the truncate record was simply not
logged at all everything would work fine. The whole point is that the
table was created in this transaction and so if it exists the table on
disk must be the correct representation.
That'd not work either. Consider:
BEGIN;
CREATE TABLE ...
INSERT;
TRUNCATE;
INSERT;
COMMIT;
If you replay that without a truncation wal record the second INSERT
will try to add stuff to already occupied space. And they can have
different lengths and stuff, so you cannot just ignore that fact.
I was about to disagree with you by suggesting that if the table was
created in this transaction then WAL logging is skipped. But testing
shows that inserts are indeed logged, as you point out.
With inserts the WAL records look as follows (relfilenodes changed):
martijn@martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16386|16384|16390'
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: 0000, desc: file create: base/12139/16384
rmgr: Sequence len (rec/tot): 158/ 190, tx: 683, lsn: 0/016B4258, prev 0/016B2508, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016B4318, prev 0/016B4258, bkp: 0000, desc: file create: base/12139/16386
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016B9468, prev 0/016B9418, bkp: 0000, desc: file create: base/12139/16390
rmgr: Sequence len (rec/tot): 158/ 190, tx: 683, lsn: 0/016BC938, prev 0/016BC880, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Sequence len (rec/tot): 158/ 190, tx: 683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Heap len (rec/tot): 35/ 67, tx: 683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree len (rec/tot): 20/ 52, tx: 683, lsn: 0/016BCBF8, prev 0/016BCBB0, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree len (rec/tot): 34/ 66, tx: 683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: 0000, desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: 0000, desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap len (rec/tot): 35/ 67, tx: 683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree len (rec/tot): 20/ 52, tx: 683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree len (rec/tot): 34/ 66, tx: 683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
relname | relfilenode
-------------+-------------
test | 16386
test_id_seq | 16384
test_pkey | 16390
(3 rows)
And amazingly, the database cluster successfully recovers and there's no
error now. So the problem is *only* because there is no data in the
table at commit time. Which indicates that it's the 'newroot' record
that saves the day normally. And it's apparently generated by the
first insert.
Agreed. I think the problem is something else though. Namely that we
reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
broken afaics. We need to allocate a new relfilenode and write stuff
into that. Then we can forgo WAL logging the truncation record.
Would that properly initialise the index though?
Anyway, this is way outside my expertise, so I'll bow out now. Let me
know if I can be of more assistance.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Martijn van Oosterhout <kleptog@svana.org> writes:
With inserts the WAL records look as follows (relfilenodes changed):
...
And amazingly, the database cluster successfully recovers and there's no
error now. So the problem is *only* because there is no data in the
table at commit time. Which indicates that it's the 'newroot' record
that saves the day normally. And it's apparently generated by the
first insert.
Yeah, because the correct "empty" state of a btree index is to have a
metapage but no root page, so the first insert forces creation of a root
page. And, by chance, btree_xlog_newroot restores the metapage from
scratch, so this works even if the metapage had been missing or corrupt.
However, things would still break if the first access to the index was
a read attempt rather than an insert.
regards, tom lane
On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
right now missing how the whole "skip wal logging if relation has just
been truncated" optimization can ever actually be crashsafe unless we
use a new relfilenode (which we don't!).We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869
commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Aug 23 19:23:41 2009 +0000
Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction. This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway. This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated. Per a performance
gripe a couple weeks ago from Todd Cook.
to me the reasoning here looks flawed.
It looks to me like we need to renege on this a bit. I think we can still be
more efficient than the general codepath: We can drop the old
relfilenode immediately. But pg_class.relfilenode has to differ from the
old after the truncation.
Andres Freund <andres@anarazel.de> writes:
On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Aug 23 19:23:41 2009 +0000
Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction. This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway. This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated. Per a performance
gripe a couple weeks ago from Todd Cook.
to me the reasoning here looks flawed.
It looks to me like we need to renege on this a bit. I think we can still be
more efficient than the general codepath: We can drop the old
relfilenode immediately. But pg_class.relfilenode has to differ from the
old after the truncation.
Why exactly? The first truncation in the (sub)xact would have assigned a
new relfilenode, why do we need another one? The file in question will
go away on crash/rollback in any case, and no other transaction can see
it yet.
I'm prepared to believe that some bit of logic is doing the wrong thing in
this state, but I do not agree that truncate-in-place is unworkable.
regards, tom lane
On 2015-07-03 18:38:37 -0400, Tom Lane wrote:
Why exactly? The first truncation in the (sub)xact would have assigned a
new relfilenode, why do we need another one? The file in question will
go away on crash/rollback in any case, and no other transaction can see
it yet.
Consider:
BEGIN;
CREATE TABLE;
INSERT largeval;
TRUNCATE;
INSERT 1;
COPY;
INSERT 2;
COMMIT;
INSERT 1 is going to be WAL logged. For that to work correctly TRUNCATE
has to be WAL logged, as otherwise there'll be conflicting/overlapping
tuples on the target page.
But:
The truncation itself is not fully wal logged, neither is the COPY. Both
rely on heap_sync()/immedsync(). For that to be correct the current
relfilenode's truncation may *not* be wal-logged, because the contents
of the COPY or the truncation itself will only be on-disk, not in the
WAL.
Only being on-disk but not in the WAL is a problem if we crash and
replay the truncate record.
I'm prepared to believe that some bit of logic is doing the wrong
thing in this state, but I do not agree that truncate-in-place is
unworkable.
Unless we're prepared to make everything that potentially WAL logs
something do the rel->rd_createSubid == mySubid && dance, I can't see
that working.
On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
right now missing how the whole "skip wal logging if relation has just
been truncated" optimization can ever actually be crashsafe unless we
use a new relfilenode (which we don't!).
Agreed... When I ran the following test scenario, I found that
the loaded data disappeared after the crash recovery.
1. start PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
\copy (SELECT num FROM generate_series(1,10) num) to /tmp/num.csv with csv
BEGIN;
CREATE TABLE test (i int primary key);
TRUNCATE TABLE test;
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test; -- returns 10
3. shutdown the server with immediate mode
4. restart the server
5. execute the following SQL statement after crash recovery ends
SELECT COUNT(*) FROM test; -- returns 0..
In #2, 10 rows were copied and the transaction was committed.
The subsequent statement of "select count(*)" obviously returned 10.
However, after crash recovery, in #5, the same statement returned 0.
That is, the 10 loaded (and committed) rows were lost after the crash.
We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869
commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Aug 23 19:23:41 2009 +0000
Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction. This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway. This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated. Per a performance
gripe a couple weeks ago from Todd Cook.
to me the reasoning here looks flawed.
Before this commit, when I ran the above test scenario, no data loss happened.
Regards,
--
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund <andres@anarazel.de> wrote:
We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869
commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Aug 23 19:23:41 2009 +0000
Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction. This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway. This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated. Per a performance
gripe a couple weeks ago from Todd Cook.
to me the reasoning here looks flawed.
Before this commit, when I ran the above test scenario, no data loss happened.
Actually I think what is broken here is COPY's test to decide whether it
can omit writing WAL:
* Check to see if we can avoid writing WAL
*
* If archive logging/streaming is not enabled *and* either
* - table was created in same transaction as this COPY
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
* If it does commit, we'll have done the heap_sync at the bottom of this
* routine first.
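(For reference, the test that comment describes looks roughly like the following in 9.4-era copy.c; paraphrased, and wrapped in a free-standing helper purely for illustration.)

```c
#include "postgres.h"

#include "access/heapam.h"		/* HEAP_INSERT_SKIP_WAL, HEAP_INSERT_SKIP_FSM */
#include "access/xlog.h"		/* XLogIsNeeded() */
#include "utils/rel.h"

/*
 * Paraphrase of the check in CopyFrom(): WAL may be skipped only when the
 * relation, or its current relfilenode, was created in this transaction
 * and no archiving/streaming is configured.
 */
static int
copy_heap_insert_options(Relation rel)
{
	int			hi_options = 0;

	if (rel->rd_createSubid != InvalidSubTransactionId ||
		rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
	{
		hi_options |= HEAP_INSERT_SKIP_FSM;
		if (!XLogIsNeeded())
			hi_options |= HEAP_INSERT_SKIP_WAL; /* the part at issue here */
	}
	return hi_options;
}
```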
The problem with that analysis is that it supposes that, if we crash and
recover, the WAL replay sequence will not touch the data. What's killing
us in this example is the replay of the TRUNCATE, but that is not the only
possibility. For example consider this modification of Fujii-san's test
case:
BEGIN;
CREATE TABLE test (i int primary key);
INSERT INTO test VALUES(-1);
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;
The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there. This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.
We might have to give up on this COPY optimization :-(. I'm not
sure what would be a safe rule for deciding that we can skip WAL
logging in this situation, but I am pretty sure that it would
require keeping information we don't currently keep about what's
happened earlier in the transaction.
regards, tom lane
On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
BEGIN;
CREATE TABLE test (i int primary key);
INSERT INTO test VALUES(-1);
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;
The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there. This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.
ISTM any WAL logged action that touches a relfilenode essentially needs
to disable further optimization based on the knowledge that the relation
is new.
We might have to give up on this COPY optimization :-(.
A crazy, not well thought through, bandaid for the INSERT+INIT case would
be to force COPY to use a new page when using the SKIP_WAL codepath.
I'm not sure what would be a safe rule for deciding that we can skip
WAL logging in this situation, but I am pretty sure that it would
require keeping information we don't currently keep about what's
happened earlier in the transaction.
It'd not be impossible to add more state to the relcache entry for the
relation. Whether it's likely that we'd find all the places that'd need
updating that state, I'm not sure.
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there. This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.
ISTM any WAL logged action that touches a relfilenode essentially needs
to disable further optimization based on the knowledge that the relation
is new.
After a bit more thought, I think it's not so much "any WAL logged action"
as "any unconditionally-replayed action". INSERT+INIT breaks this
example because heap_xlog_insert will unconditionally replay the action,
even if the page is valid and has same or newer LSN. Similarly, TRUNCATE
is problematic because we redo it unconditionally (and in that case it's
hard to see an alternative).
It'd not be impossible to add more state to the relcache entry for the
relation. Whether it's likely that we'd find all the places that'd need
updating that state, I'm not sure.
Yeah, the sticking point is mainly being sure that the state is correctly
tracked, both now and after future changes. We'd need to identify a state
invariant that we could be pretty confident we'd not break.
One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts. That would still be
able to optimize in all the cases we care about making COPY fast for.
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
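(A sketch of that zero-length test, in the same illustrative helper shape as the copy.c paraphrase shown earlier in the thread; the "switch to a new relfilenode" half of the idea is not shown, and none of this is committed code.)

```c
#include "postgres.h"

#include "access/heapam.h"
#include "access/xlog.h"
#include "storage/bufmgr.h"		/* RelationGetNumberOfBlocks() */
#include "utils/rel.h"

/*
 * Hypothetical tightening of the COPY WAL-skip test: in addition to the
 * relation being new in this transaction, require the heap to be
 * physically empty when COPY starts.
 */
static int
copy_heap_insert_options_strict(Relation rel)
{
	int			hi_options = 0;

	if (rel->rd_createSubid != InvalidSubTransactionId ||
		rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
	{
		hi_options |= HEAP_INSERT_SKIP_FSM;
		/* new condition: only skip WAL for a physically empty heap */
		if (!XLogIsNeeded() &&
			RelationGetNumberOfBlocks(rel) == 0)
			hi_options |= HEAP_INSERT_SKIP_WAL;
	}
	return hi_options;
}
```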
regards, tom lane
On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there. This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.
ISTM any WAL logged action that touches a relfilenode essentially needs
to disable further optimization based on the knowledge that the relation
is new.
After a bit more thought, I think it's not so much "any WAL logged action"
as "any unconditionally-replayed action". INSERT+INIT breaks this
example because heap_xlog_insert will unconditionally replay the action,
even if the page is valid and has same or newer LSN. Similarly, TRUNCATE
is problematic because we redo it unconditionally (and in that case it's
hard to see an alternative).
It'd not be impossible to add more state to the relcache entry for the
relation. Whether it's likely that we'd find all the places that'd need
updating that state, I'm not sure.
Yeah, the sticking point is mainly being sure that the state is correctly
tracked, both now and after future changes. We'd need to identify a state
invariant that we could be pretty confident we'd not break.
One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts.
This seems not helpful for the case where TRUNCATE is executed
before COPY. No?
That would still be
able to optimize in all the cases we care about making COPY fast for.
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
So, if COPY is executed multiple times in the same transaction,
only the first COPY can be optimized?
On second thought, I'm thinking that we can safely optimize
COPY if no problematic WAL records like INSERT+INIT or TRUNCATE
are generated since current REDO location or the table was created
at the same transaction. That is, if INSERT or TRUNCATE is executed
after the table creation, but if CHECKPOINT happens subsequently,
we don't need to log COPY. The subsequent crash recovery will not
replay such problematic WAL records. So the example cases where
we can optimize COPY are:
BEGIN
CREATE TABLE
COPY
COPY -- subsequent COPY also can be optimized
BEGIN
CREATE TABLE
TRUNCATE
CHECKPOINT
COPY
BEGIN
CREATE TABLE
INSERT
CHECKPOINT
COPY
A crash recovery can start from previous REDO location (i.e., REDO
location of the last checkpoint record). So we might need to check
whether such problematic WAL records are generated since the previous
REDO location instead of current one.
Regards,
--
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts.
This seems not helpful for the case where TRUNCATE is executed
before COPY. No?
Huh? The heap file would be zero length in that case.
So, if COPY is executed multiple times in the same transaction,
only the first COPY can be optimized?
This is true, and I don't think we should care, especially not if we're
going to take risks of incorrect behavior in order to optimize that
third-order case. The fact that we're dealing with this bug at all should
remind us that this stuff is harder than it looks. I want a simple,
reliable, back-patchable fix, and I do not believe that what you are
suggesting would be any of those.
regards, tom lane
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts. That would still be
able to optimize in all the cases we care about making COPY fast for.
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
Andres
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction. On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.
I think you're worrying about exactly the wrong case.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Again, the only known field usage for the COPY optimization is the pg_dump
scenario; were that not so, we'd have noticed the problem long since.
So I don't have any faith that this is a well-tested area.
regards, tom lane
On 07/10/2015 02:06 AM, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.
Yeah, if we specifically made that case cheap, in response to a
complaint, it would be a regression to make it expensive again. We might
get away with it in a major version, but would hate to backpatch that.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Hmm. Perhaps that could be made to work, but it feels pretty fragile.
For example, you could have an insert trigger on the table that inserts
additional rows to the same table, and those inserts would be intermixed
with the rows inserted by COPY. You'll have to avoid that somehow.
Full-page images in general are a problem. If a checkpoint happens, and
a trigger modifies the page we're COPYing to in any way, you have the
same issue. Even reading a page can cause a full-page image of it to be
written: If you update a hint bit on the page while reading it, and
checksums are enabled, and a checkpoint happened since the page was last
updated, bang. I don't think that's a problem in this case because there
are no hint bits to be set on pages that we're COPYing to, but it's a
whole new subtle assumption.
I think we should
1. reliably and explicitly keep track of whether we've WAL-logged any
TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations
on the relation, and
2. make sure we never skip WAL-logging again if we have.
Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
when a new relfilenode is created, i.e. whenever rd_createSubid or
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
only skip WAL-logging if the flag is still set. To deal with the case
that the flag gets cleared in the middle of COPY, also check the flag
whenever we're about to skip WAL-logging in heap_insert, and if it's
been cleared, ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.
Compared to the status quo, that disables the WAL-skipping optimization
in the scenario where you CREATE, INSERT, then COPY to a table in the
same transaction. I think that's acceptable.
(Alternatively, to handle the case that the flag gets cleared in the
middle of COPY, add another flag to RelationData indicating that a
WAL-skipping COPY is in-progress, and refrain from WAL-logging any
FPW-writing operations on the table when it's set (or any operations
whatsoever). That'd be more efficient, but it's such a rare corner case
that it hardly matters.)
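For illustration, a minimal sketch of how that check could look in heap_insert() (rd_skip_wal_safe is the field proposed above, not something that exists today; all names here are only illustrative):

/*
 * Sketch only: honour HEAP_INSERT_SKIP_WAL only while the proposed
 * rd_skip_wal_safe flag is still set on the relation.
 */
static bool
skip_wal_still_ok(Relation relation, int options)
{
    if (!(options & HEAP_INSERT_SKIP_WAL))
        return false;       /* caller did not ask to skip WAL */
    if (!relation->rd_skip_wal_safe)
        return false;       /* a truncate or full-page write was already
                             * WAL-logged; keep logging from here on */
    return true;
}

/* ... and in heap_insert(), instead of testing the option bit directly: */
if (!skip_wal_still_ok(relation, options) && RelationNeedsWAL(relation))
{
    /* build and insert the xl_heap_insert record as today */
}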
- Heikki
On 2015-07-09 19:06:11 -0400, Tom Lane wrote:
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction. On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.
Well, you'll hardly have heard complaints about COPY, given that we
have behaved the current way for a long while.
I definitely know of ETL-like processes that have relied on subsequent
COPYs into truncated relations being cheaper.
for intermixed COPY and INSERT, but it'd not surprise me if somebody
mixed COPY and UPDATEs rather freely for ETL.
I think you're worrying about exactly the wrong case.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
And what reason is there to think that this would fix all the
problems?
Yea, that's the big problem.
Again, the only known field usage for the COPY optimization is the pg_dump
scenario; were that not so, we'd have noticed the problem long since.
So I don't have any faith that this is a well-tested area.
You need to crash at the right moment. I don't think that's
exercised that frequently...
On 2015-07-10 11:50:33 +0300, Heikki Linnakangas wrote:
On 07/10/2015 02:06 AM, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.
Yeah, if we specifically made that case cheap, in response to a complaint,
it would be a regression to make it expensive again. We might get away with
it in a major version, but would hate to backpatch that.
Sure. But making COPY slower would also be a regression, of a longer-standing
behaviour, with a massively bigger impact if somebody relies on it? I mean
a new relfilenode includes a couple heap and storage options. Missing
the skip-WAL optimization can easily double or triple COPY durations.
I generally find it to be very dubious to re-use a relfilenode after a
truncation. I bet most hackers didn't ever know we ever did that, and
the rest probably forgot it.
We can still retain a portion of the optimizations from cab9a0656c36739f
- there's no need to keep the old relfilenode's contents around after
all.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Hmm. Perhaps that could be made to work, but it feels pretty fragile.
It does. I'm not very happy about this mess.
For
example, you could have an insert trigger on the table that inserts
additional rows to the same table, and those inserts would be intermixed
with the rows inserted by COPY.
That should be fine? As long as COPY only uses new pages, INSERT can use
the same ones without problem. I think...
Full-page images in general are a problem.
With the above rules I don't think it'd be. They'd contain the previous
contents, and we'll not target them again with COPY.
I think we should
1. reliably and explicitly keep track of whether we've WAL-logged any
TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations on
the relation, and
2. make sure we never skip WAL-logging again if we have.
Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
when a new relfilenode is created, i.e. whenever rd_createSubid or
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
only skip WAL-logging if the flag is still set. To deal with the case that
the flag gets cleared in the middle of COPY, also check the flag whenever
we're about to skip WAL-logging in heap_insert, and if it's been cleared,
ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.
Am I missing something or will this break the BEGIN; TRUNCATE; COPY;
pattern we use ourselves and have suggested a number of times?
Andres
On Fri, Jul 10, 2015 at 2:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts.
This seems not helpful for the case where TRUNCATE is executed
before COPY. No?
Huh? The heap file would be zero length in that case.
So, if COPY is executed multiple times in the same transaction,
only the first COPY can be optimized?
This is true, and I don't think we should care, especially not if we're
going to take risks of incorrect behavior in order to optimize that
third-order case. The fact that we're dealing with this bug at all should
remind us that this stuff is harder than it looks. I want a simple,
reliable, back-patchable fix, and I do not believe that what you are
suggesting would be any of those.
Maybe I'm missing something. But I've started wondering why TRUNCATE
and INSERT (or even all the operations on a table created in
the current transaction) need to be WAL-logged while COPY can be
optimized. If no WAL records are generated on that table, the problem
we're talking about seems not to occur. Also this seems safe and
doesn't degrade the performance of data loading. Thought?
Regards,
--
Fujii Masao
On 2015-07-10 19:23:28 +0900, Fujii Masao wrote:
Maybe I'm missing something. But I've started wondering why TRUNCATE
and INSERT (or even all the operations on a table created in
the current transaction) need to be WAL-logged while COPY can be
optimized. If no WAL records are generated on that table, the problem
we're talking about seems not to occur. Also this seems safe and
doesn't degrade the performance of data loading. Thought?
Skipping WAL logging means that you need to scan through the whole of
shared buffers to write out dirty buffers and fsync the segments. A
single insert WAL record is a couple of orders of magnitude cheaper than
that. Essentially doing this just for COPY is a heuristic.
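To make that cost concrete: the end-of-load sync is roughly what heap_sync() already does today. A simplified sketch (the real function also syncs the TOAST relation):

/*
 * Simplified sketch of the sync that replaces WAL when the optimization
 * is used.  FlushRelationBuffers() has to walk the entire shared buffer
 * pool looking for dirty buffers of this relation, which is what makes
 * this so much more expensive than a single insert WAL record.
 */
static void
sync_instead_of_wal(Relation rel)
{
    if (!RelationNeedsWAL(rel))
        return;                 /* temp/unlogged relations need no fsync here */

    FlushRelationBuffers(rel);  /* write out every dirty buffer of rel */
    RelationOpenSmgr(rel);
    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);  /* fsync the data file */
}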
On 07/10/2015 12:14 PM, Andres Freund wrote:
On 2015-07-10 11:50:33 +0300, Heikki Linnakangas wrote:
On 07/10/2015 02:06 AM, Tom Lane wrote:
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.
Yeah, if we specifically made that case cheap, in response to a complaint,
it would be a regression to make it expensive again. We might get away with
it in a major version, but would hate to backpatch that.
Sure. But making COPY slower would also be one. Of a longer standing
behaviour, with massively bigger impact if somebody relies on it? I mean
a new relfilenode includes a couple heap and storage options. Missing
the skip wal optimization can easily double or triple COPY durations.
Completely disabling the skip-WAL optimization is not acceptable either,
IMO. It's a false dichotomy that we have to choose between those two
options. We'll have to consider the exact scenarios where we'd have to
disable the optimization vs. using a new relfilenode.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Hmm. Perhaps that could be made to work, but it feels pretty fragile.
It does. I'm not very happy about this mess.
For
example, you could have an insert trigger on the table that inserts
additional rows to the same table, and those inserts would be intermixed
with the rows inserted by COPY.
That should be fine? As long as copy only uses new pages INSERT can use
the same ones without problem. I think...
Full-page images in general are a problem.
With the above rules I don't think it'd be. They'd contain the previous
contents, and we'll not target them again with COPY.
Well, you really have to ensure that COPY never uses a page that any
other operation (INSERT, DELETE, UPDATE, hint-bit-update) has ever
touched and created an FPW for. The naive approach, where you just reset
the target block at the beginning of COPY and use the HEAP_INSERT_SKIP_FSM
option, is not enough. It's possible, but requires a lot more bookkeeping
than it might seem at first glance.
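For reference, the naive approach being dismissed here would amount to something like the following in copy.c (a sketch only, not a proposed patch):

/*
 * Naive sketch: force this COPY onto brand-new pages by clearing the
 * cached target block and skipping the FSM, so it never writes into a
 * previously used page.
 */
RelationSetTargetBlock(cstate->rel, InvalidBlockNumber);
hi_options |= HEAP_INSERT_SKIP_FSM;

/*
 * Not sufficient on its own: once any WAL record (e.g. a full-page image
 * caused by a trigger's insert or a hint-bit update) has been emitted for
 * one of the pages COPY fills, later un-WAL-logged additions to that same
 * page can be wiped out when the record is replayed after a crash.
 * Preventing that needs per-block bookkeeping along the lines of the
 * pending-sync patch posted later in this thread.
 */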
I think we should
1. reliably and explicitly keep track of whether we've WAL-logged any
TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations on
the relation, and
2. make sure we never skip WAL-logging again if we have.
Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
when a new relfilenode is created, i.e. whenever rd_createSubid or
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
only skip WAL-logging if the flag is still set. To deal with the case that
the flag gets cleared in the middle of COPY, also check the flag whenever
we're about to skip WAL-logging in heap_insert, and if it's been cleared,
ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.
Am I missing something or will this break the BEGIN; TRUNCATE; COPY;
pattern we use ourselves and have suggested a number of times?
Sorry, I was imprecise above. I meant "whenever an XLOG_SMGR_TRUNCATE
record is WAL-logged", rather than a "whenever a TRUNCATE [command] is
WAL-logged". TRUNCATE on a table that wasn't created in the same
transaction doesn't emit an XLOG_SMGR_TRUNCATE record, because it
creates a whole new relfilenode. So that's OK.
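To spell out the distinction (a rough summary of the two code paths, simplified; not code from any of the patches here):

/*
 * Rough summary of the two truncation paths, per ExecuteTruncate() in
 * tablecmds.c and RelationTruncate() in storage.c:
 *
 * 1. Table created, or already given a new relfilenode, in this same
 *    (sub)transaction: the existing file is truncated in place and an
 *    XLOG_SMGR_TRUNCATE record is emitted.  This is the case that
 *    interacts badly with WAL skipping.
 *
 * 2. Any other table: RelationSetNewRelfilenode() gives it a fresh file,
 *    the old one is scheduled for deletion at commit, and no
 *    XLOG_SMGR_TRUNCATE record is written for the old contents.
 */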
In the long-term, I'd like to refactor this whole thing so that we never
WAL-log any operations on a relation that's created in the same
transaction (when wal_level=minimal). Instead, at COMMIT, we'd fsync()
the relation, or if it's smaller than some threshold, WAL-log the
contents of the whole file at that point. That would move all that
more-difficult-than-it-seems-at-first-glance logic from COPY and
indexam's to a central location, and it would allow the same
optimization for all operations, not just COPY. But that probably isn't
feasible to backpatch.
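Very roughly, such a commit-time step could look like the sketch below (the threshold and function name are made up for illustration; this is not part of any posted patch):

/*
 * Hypothetical commit-time handling of a relation whose changes were not
 * WAL-logged (wal_level = minimal): small relations get their pages
 * WAL-logged wholesale, larger ones are flushed and fsync'd instead.
 */
#define SYNC_THRESHOLD_BLOCKS   64      /* made-up threshold */

static void
finish_skipped_rel_at_commit(Relation rel)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    if (nblocks <= SYNC_THRESHOLD_BLOCKS)
    {
        BlockNumber blkno;

        /* Small enough: just WAL-log every page as a full-page image. */
        for (blkno = 0; blkno < nblocks; blkno++)
        {
            Buffer      buf = ReadBuffer(rel, blkno);

            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            START_CRIT_SECTION();
            MarkBufferDirty(buf);
            log_newpage_buffer(buf, true);
            END_CRIT_SECTION();
            UnlockReleaseBuffer(buf);
        }
    }
    else
    {
        /* Too big for that: write out dirty buffers and fsync the file. */
        FlushRelationBuffers(rel);
        RelationOpenSmgr(rel);
        smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
    }
}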
- Heikki
On 2015-07-10 13:38:50 +0300, Heikki Linnakangas wrote:
In the long-term, I'd like to refactor this whole thing so that we never
WAL-log any operations on a relation that's created in the same transaction
(when wal_level=minimal). Instead, at COMMIT, we'd fsync() the relation, or
if it's smaller than some threshold, WAL-log the contents of the whole file
at that point. That would move all that
more-difficult-than-it-seems-at-first-glance logic from COPY and indexam's
to a central location, and it would allow the same optimization for all
operations, not just COPY. But that probably isn't feasible to backpatch.
I don't think that's really realistic until we have a buffer manager
that lets you efficiently scan for all pages of a relation :(
Hi,
This thread seemed to trail off without a resolution. Was anything done?
(See more below.)
On 07/09/15 19:06, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.
I'm the complainer mentioned in the cab9a0656c36739f commit message. :)
FWIW, we use a temp table to split a join across 4 largish tables
(10^8 rows or more each) and 2 small tables (10^6 rows each). We
write the results of joining the 2 largest tables into the temp
table, and then join that to the other 4. This gave significant
performance benefits because the planner would know the exact row
count of the 2-way join heading into the 4-way join. After commit
cab9a0656c36739f, we got another noticeable performance improvement
(I did timings before and after, but I can't seem to put my hands
on the numbers right now).
We do millions of these queries every day in batches. Each batch
reuses a single temp table (truncating it before each pair of joins)
so as to reduce the churn in the system catalogs. In case it matters,
the temp table is created with ON COMMIT DROP.
This was (and still is) done on 9.2.x.
HTH.
-- todd cook
-- tcook@blackducksoftware.com
On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.
I think you're worrying about exactly the wrong case.
My tentative guess is that the best course is to
a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.
b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.
And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Again, the only known field usage for the COPY optimization is the pg_dump
scenario; were that not so, we'd have noticed the problem long since.
So I don't have any faith that this is a well-tested area.
regards, tom lane
On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
Hi,
This thread seemed to trail off without a resolution. Was anything done?
Not that I can tell. I was the original poster of this thread. We've
worked around the issue by placing a CHECKPOINT command at the end of
the migration script. For us it's not a performance issue, more a
correctness one: tables were empty when they shouldn't have been.
I'm hoping a fix will appear in the 9.5 release, since we're intending
to release with that version. A forced checkpoint every now and then
probably won't be a serious problem though.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
On 2015-07-21 21:37:41 +0200, Martijn van Oosterhout wrote:
On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
Hi,
This thread seemed to trail off without a resolution. Was anything done?
Not that I can tell.
Heikki and I had some in-person conversation about it at a conference,
but we didn't really find anything we both liked...
I was the original poster of this thread. We've
worked around the issue by placing a CHECKPOINT command at the end of
the migration script. For us it's not a performance issue, more a
correctness one, tables were empty when they shouldn't have been.
If it's just correctness, you could just use wal_level = archive.
I'm hoping a fix will appear in the 9.5 release, since we're intending
to release with that version. A forced checkpoint every now and then
probably won't be a serious problem though.
We're imo going to have to fix this in the back branches.
Andres
On 10 July 2015 at 00:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction. On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.
We have to backpatch this fix, so it must be both simple and effective.
Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable.
Tom's suggestion looks good. So does Andres' suggestion. I have coded both.
And what reason is there to think that this would fix all the problems?
I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.
Forcing SKIP_FSM to always extend has no negative side effects in other
code paths, AFAICS.
Patches attached. Martijn, please verify.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
fix_wal_logging_copy_truncate.v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..40131ca 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -283,6 +283,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
}
else if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
+ else if (!use_fsm)
+ targetBlock = InvalidBlockNumber;
else
targetBlock = RelationGetTargetBlock(relation);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1c7eded..0d5171d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1169,76 +1169,62 @@ ExecuteTruncate(TruncateStmt *stmt)
/*
* OK, truncate each table.
+ *
+ * We used to call heap_truncate_one_rel() in some corner cases, but it
+ * is no longer safe to do so and that behavior is now fully removed.
*/
mySubid = GetCurrentSubTransactionId();
foreach(cell, rels)
{
Relation rel = (Relation) lfirst(cell);
+ Oid heap_relid;
+ Oid toast_relid;
+ MultiXactId minmulti;
/*
- * Normally, we need a transaction-safe truncation here. However, if
- * the table was either created in the current (sub)transaction or has
- * a new relfilenode in the current (sub)transaction, then we can just
- * truncate it in-place, because a rollback would cause the whole
- * table or the current physical file to be thrown away anyway.
+ * This effectively deletes all rows in the table, and may be done
+ * in a serializable transaction. In that case we must record a
+ * rw-conflict in to this transaction from each transaction
+ * holding a predicate lock on the table.
*/
- if (rel->rd_createSubid == mySubid ||
- rel->rd_newRelfilenodeSubid == mySubid)
- {
- /* Immediate, non-rollbackable truncation is OK */
- heap_truncate_one_rel(rel);
- }
- else
- {
- Oid heap_relid;
- Oid toast_relid;
- MultiXactId minmulti;
+ CheckTableForSerializableConflictIn(rel);
- /*
- * This effectively deletes all rows in the table, and may be done
- * in a serializable transaction. In that case we must record a
- * rw-conflict in to this transaction from each transaction
- * holding a predicate lock on the table.
- */
- CheckTableForSerializableConflictIn(rel);
+ minmulti = GetOldestMultiXactId();
- minmulti = GetOldestMultiXactId();
+ /*
+ * Need the full transaction-safe pushups.
+ *
+ * Create a new empty storage file for the relation, and assign it
+ * as the relfilenode value. The old storage file is scheduled for
+ * deletion at commit.
+ */
+ RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
+ RecentXmin, minmulti);
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
+ heap_create_init_fork(rel);
- /*
- * Need the full transaction-safe pushups.
- *
- * Create a new empty storage file for the relation, and assign it
- * as the relfilenode value. The old storage file is scheduled for
- * deletion at commit.
- */
+ heap_relid = RelationGetRelid(rel);
+ toast_relid = rel->rd_rel->reltoastrelid;
+
+ /*
+ * The same for the toast table, if any.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ rel = relation_open(toast_relid, AccessExclusiveLock);
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
RecentXmin, minmulti);
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
heap_create_init_fork(rel);
-
- heap_relid = RelationGetRelid(rel);
- toast_relid = rel->rd_rel->reltoastrelid;
-
- /*
- * The same for the toast table, if any.
- */
- if (OidIsValid(toast_relid))
- {
- rel = relation_open(toast_relid, AccessExclusiveLock);
- RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
- RecentXmin, minmulti);
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
- heap_create_init_fork(rel);
- heap_close(rel, NoLock);
- }
-
- /*
- * Reconstruct the indexes to match, and we're done.
- */
- reindex_relation(heap_relid, REINDEX_REL_PROCESS_TOAST, 0);
+ heap_close(rel, NoLock);
}
+ /*
+ * Reconstruct the indexes to match, and we're done.
+ */
+ reindex_relation(heap_relid, REINDEX_REL_PROCESS_TOAST, 0);
+
pgstat_count_truncate(rel);
}
fix_copy_zero_blocks.v1.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8904676..a91bd9a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2246,7 +2246,14 @@ CopyFrom(CopyState cstate)
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
+
+ /*
+ * We can skip writing WAL if there have been no actions that write an
+ * init block for any of the buffers we will touch during COPY. Since
+ * we have no way of knowing at present which ones that is, we must
+ * use a simple but effective heuristic to ensure safety in all cases.
+ */
+ if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0)
hi_options |= HEAP_INSERT_SKIP_WAL;
}
On 07/22/2015 11:18 AM, Simon Riggs wrote:
On 10 July 2015 at 00:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.
I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.
What evidence have you got to base that value judgement on?
cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction. On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.
We have to backpatch this fix, so it must be both simple and effective.
Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable.
Tom's suggestion looks good. So does Andres' suggestion. I have coded both.
Thanks. For comparison, I wrote a patch to implement what I had in mind.
When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap
that would normally be WAL-logged, we check if the relation is in the
hash table, and skip WAL-logging if so.
That was a simplified explanation. In reality, when WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions
on blocks greater than the old size are not WAL-logged, and any actions
on smaller-numbered blocks are. This ensures that if you did any INSERTs
on the table before the COPY, any new actions on the blocks that were
already WAL-logged by the INSERT are also WAL-logged. And likewise if
you perform any INSERTs after (or during, by trigger) the COPY, and they
modify the new pages, those actions are not WAL-logged. So starting a
WAL-skipping COPY splits the relation into two parts: the first part
that is WAL-logged as usual, and the later part that is not WAL-logged.
(there is one loose end marked with XXX in the patch on this, when one
of the pages involved in a cold UPDATE is before the watermark and the
other is after)
The actual fsync() has been moved to the end of transaction, as we are
now skipping WAL-logging of any actions after the COPY as well.
And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All operations on a relation that has been truncated must
be WAL-logged as usual, because replaying the truncate record will
destroy all data even if we fsync later. But we still optimize for
"BEGIN; CREATE; COPY; TRUNCATE; COPY;" style patterns, because if we
truncate a relation that has already been marked for fsync-at-COMMIT, we
don't need to WAL-log the truncation either.
This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.
And what reason is there to think that this would fix all the problems?
I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.
I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.
I wrote a little test script to test all these different scenarios
(attached). Both of your patches fail with the script.
- Heikki
Attachments:
fix-wal-level-minimal-heikki-1.patch (application/x-patch)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2e3b9d2..9ef688b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2021,12 +2021,6 @@ FreeBulkInsertState(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2115,7 +2109,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2315,12 +2309,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2335,7 +2327,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2401,7 +2393,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (needwal)
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
xl_heap_multi_insert *xlrec;
@@ -2888,7 +2880,7 @@ l1:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -3755,7 +3747,11 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ /* FIXME: what if the old page must be WAL-logged, but the new one
+ * must not?
+ */
+ if (HeapNeedsWAL(relation, buffer) ||
+ HeapNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -4626,7 +4622,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5258,12 +5254,12 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (HeapNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
+ Page page = BufferGetPage(buf);
XLogRecPtr recptr;
XLogRecData rdata[2];
- Page page = BufferGetPage(buf);
xlrec.target.node = rel->rd_node;
xlrec.target.tid = mytup.t_self;
@@ -5405,7 +5401,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -8555,3 +8551,71 @@ heap_sync(Relation rel)
heap_close(toastrel, AccessShareLock);
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. After calling this, any changes to
+ * the heap (including TOAST heap if any) in the same transaction will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * Like with heap_sync(), indexes are not touched.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ smgrRegisterPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ smgrRegisterPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous heap_register_sync() requests.
+ *
+ * Note that it is required to use this before creating any WAL records for
+ * heap pages - it is not merely an optimization. WAL-logging a record,
+ * when we have already skipped a previous WAL record for the same page could
+ * lead to failure at WAL replay, as the "before" state expected by the
+ * record might not match what's on disk (this should only be a problem
+ * with full_page_writes=off, though).
+ */
+bool
+HeapNeedsWAL(Relation rel, Buffer buf)
+{
+ /* Temporary relations never need WAL */
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * If we are going to fsync() the relation at COMMIT, and we have not
+ * truncated the block away previously, and we have not emitted any WAL
+ * records for this block yet, we can skip WAL-logging it.
+ */
+ if (smgrIsSyncPending(rel->rd_node, BufferGetBlockNumber(buf)))
+ {
+ /*
+ * If a pending fsync() will handle this page, its LSN should be
+ * invalid. If it's not, we've already emitted a WAL record for this
+ * block, and all subsequent changes to the block must be WAL-logged
+ * too.
+ */
+ Assert(PageGetLSN(BufferGetPage(buf)) == InvalidXLogRecPtr);
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 06b5488..e342cbb 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -250,7 +250,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index a0c0c7f..7832dee 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -278,6 +278,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (1 << mapBit);
MarkBufferDirty(vmBuf);
+ /* XXX: Should we use HeapNeedsWAL here? */
if (RelationNeedsWAL(rel))
{
if (XLogRecPtrIsInvalid(recptr))
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9f989f8..306c6c1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1921,6 +1921,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2126,6 +2129,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2412,6 +2418,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false);
/*
* Advertise the fact that we aborted in pg_clog (assuming that we got as
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e9754c..39306dd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -951,6 +951,15 @@ begin:;
}
if (dtbuf[i] == InvalidBuffer)
{
+ {
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blknum;
+ BufferGetTag(rdt->buffer, &rnode, &forknum, &blknum);
+ if (rnode.relNode >= FirstNormalObjectId && rmid != RM_BTREE_ID)
+ elog(LOG, "WAL-logging update to rel %u block %u (rmid %d info %X)", rnode.relNode, blknum, rmid, info);
+ }
+
/* OK, put it in this slot */
dtbuf[i] = rdt->buffer;
if (doPageWrites && XLogCheckBuffer(rdt, true,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c3b2f07..5162904 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
@@ -27,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +64,42 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action. When we are about to begin a large operation on the relation,
+ * a PendingRelSync entry is created, and 'sync_above' is set to the current
+ * size of the relation. Any operations on blocks < sync_above need to be
+ * WAL-logged as usual, but for operations on higher blocks, WAL-logging is
+ * skipped. It's important that after WAL-logging has been skipped for a
+ * block, we don't WAL log any subsequent actions on the same block either.
+ * Replaying the WAL record of the subsequent action might fail otherwise,
+ * as the "before" state of the block might not match, as the earlier actions
+ * were not WAL-logged.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we cannot skip WAL-logging for that
+ * relation anymore, as replaying the truncation record will destroy all the
+ * data inserted after that. But if we have already decided to skip WAL-logging
+ * changes to a relation, and the relation is truncated, we don't need to
+ * WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+/* Relations that need to be fsync'd at commit */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >= sync_above */
+ bool truncated; /* truncation WAL record was written */
+} PendingRelSync;
+
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -228,6 +266,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -251,6 +291,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
if (vm)
visibilitymap_truncate(rel, nblocks);
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated = false;
+ }
+
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
@@ -260,7 +311,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/
- if (RelationNeedsWAL(rel))
+ if (RelationNeedsWAL(rel) &&
+ (pending->sync_above == InvalidBlockNumber || pending->sync_above < nblocks))
{
/*
* Make an XLOG entry reporting the file truncation.
@@ -279,6 +331,9 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata);
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "WAL-logged truncation of rel %u to %u blocks", rel->rd_node.relNode, nblocks);
+
/*
* Flush, because otherwise the truncation of the main relation might
* hit the disk before the WAL record, and the truncation of the FSM
@@ -288,6 +343,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ pending->truncated = true;
}
/* Do the real work */
@@ -422,6 +479,142 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because
+ * we are going to skip WAL-logging subsequent actions to it.
+ */
+void
+smgrRegisterPendingSync(Relation rel)
+{
+ PendingRelSync *pending;
+ bool found;
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->truncated = false;
+ pending->sync_above = nblocks;
+
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "Registering new pending sync for rel %u at block %u", rel->rd_node.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "Registering pending sync for rel %u at block %u", rel->rd_node.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "Not updating pending sync for rel %u at block %u (was %u)", rel->rd_node.relNode, nblocks, pending->sync_above);
+}
+
+/*
+ * Are we going to fsync() this relation at COMMIT, and hence don't need to
+ * WAL-log changes to the given block?
+ */
+bool
+smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno)
+{
+ PendingRelSync *pending;
+ bool found;
+
+ if (!pendingSyncs)
+ return false;
+
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rnode,
+ HASH_FIND, &found);
+ if (!found)
+ return false;
+
+ /*
+ * We have no fsync() pending for this relation, or we have (possibly)
+ * already emitted WAL records for this block.
+ */
+ if (pending->sync_above == InvalidBlockNumber ||
+ pending->sync_above > blkno)
+ {
+ if (rnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Not skipping WAL-logging for rel %u block %u, because sync_above is %u", rnode.relNode, blkno, pending->sync_above);
+ return false;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending->truncated)
+ {
+ if (rnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Not skipping WAL-logging for rel %u block %u, because it was truncated", rnode.relNode, blkno);
+ return false;
+ }
+
+ if (rnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Skipping WAL-logging for rel %u block %u", rnode.relNode, blkno);
+
+ return true;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelFileNodeBuffers(pending->relnode, false);
+ /* FlushRelationBuffers will have opened rd_smgr */
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ if (pending->relnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Syncing rel %u", pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3eba9ef..c6aa608 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -26,6 +26,7 @@
#include "access/xact.h"
#include "catalog/namespace.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "commands/copy.h"
#include "commands/defrem.h"
#include "commands/trigger.h"
@@ -2152,7 +2153,10 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
+ {
+ heap_register_sync(cstate->rel);
hi_options |= HEAP_INSERT_SKIP_WAL;
+ }
}
/*
@@ -2400,11 +2404,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction (we used to do it here, but it was later found out
+ * that to be safe, we must avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 3778d9d..180f2d9 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -932,8 +932,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
- /* Now WAL-log freezing if neccessary */
- if (RelationNeedsWAL(onerel))
+ /* Now WAL-log freezing if necessary */
+ if (HeapNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1220,7 +1220,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (HeapNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 18013d5..011bab0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -108,6 +108,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushRelFileNodeBuffers_internal(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static int rnode_comparator(const void *p1, const void *p2);
@@ -2426,18 +2427,31 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- volatile BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelFileNodeBuffers_internal(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelFileNodeBuffers(RelFileNode rnode, bool islocal)
+{
+ FlushRelFileNodeBuffers_internal(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelFileNodeBuffers_internal(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ volatile BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
bufHdr = &LocalBufferDescriptors[i];
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY))
{
ErrorContextCallback errcallback;
@@ -2453,7 +2467,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -2480,16 +2494,16 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2662,6 +2676,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
XLogRecPtr lsn = InvalidXLogRecPtr;
bool dirtied = false;
bool delayChkpt = false;
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blknum;
/*
* If we need to protect hint bit updates from torn writes, WAL-log a
@@ -2672,7 +2689,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
* We don't check full_page_writes here because that logic is included
* when we call XLogInsert() since the value changes dynamically.
*/
- if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
+ if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
{
/*
* If we're in recovery we cannot dirty a page because of a hint.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..3fdd5a1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -24,7 +24,7 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
+#define HEAP_INSERT_SKIP_WAL 0x0001 /* obsolete, not used anymore */
#define HEAP_INSERT_SKIP_FSM 0x0002
#define HEAP_INSERT_FROZEN 0x0004
@@ -161,6 +161,7 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
extern void heap_markpos(HeapScanDesc scan);
extern void heap_restrpos(HeapScanDesc scan);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 9557486..70c5f75 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -16,6 +16,7 @@
#include "access/htup.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "storage/bufpage.h"
#include "storage/relfilenode.h"
#include "utils/relcache.h"
@@ -375,6 +376,8 @@ extern void heap2_redo(XLogRecPtr lsn, XLogRecord *rptr);
extern void heap2_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void heap_xlog_logical_rewrite(XLogRecPtr lsn, XLogRecord *r);
+extern bool HeapNeedsWAL(Relation rel, Buffer buf);
+
extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 4b87a36..e56d252 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrRegisterPendingSync(Relation rel);
+extern bool smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno);
+extern void smgrDoPendingSyncs(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 921e4ed..590ab08 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,6 +192,7 @@ extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelFileNodeBuffers(RelFileNode rel, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
On 22 July 2015 at 17:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash
table, and skip WAL-logging if so.
That was a simplified explanation. In reality, when WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions on
blocks greater than the old size are not WAL-logged, and any actions on
smaller-numbered blocks are. This ensures that if you did any INSERTs on
the table before the COPY, any new actions on the blocks that were already
WAL-logged by the INSERT are also WAL-logged. And likewise if you perform
any INSERTs after (or during, by trigger) the COPY, and they modify the new
pages, those actions are not WAL-logged. So starting a WAL-skipping COPY
splits the relation into two parts: the first part that is WAL-logged as
usual, and the later part that is not WAL-logged. (there is one loose end
marked with XXX in the patch on this, when one of the pages involved in a
cold UPDATE is before the watermark and the other is after)
The actual fsync() has been moved to the end of transaction, as we are now
skipping WAL-logging of any actions after the COPY as well.
And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All operations on a relation that has been truncated must be
WAL-logged as usual, because replaying the truncate record will destroy all
data even if we fsync later. But we still optimize for "BEGIN; CREATE;
COPY; TRUNCATE; COPY;" style patterns, because if we truncate a relation
that has already been marked for fsync-at-COMMIT, we don't need to WAL-log
the truncation either.
This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.
I didn't like it when I first read this, but I do now. As a by-product of
fixing the bug, it actually extends the optimization.
You can optimize this approach so we always write WAL unless one of the two
subid fields is set, so there is no need to call smgrIsSyncPending() every
time. I couldn't see where this depended upon wal_level, but I guess it's
there somewhere.
I'm unhappy about the call during MarkBufferDirtyHint() which is just too
costly. The only way to do this cheaply is to specifically mark buffers as
being BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be
removed when we flush the buffers for the relation.
And what reason is there to think that this would fix all the problems?
I don't think either suggested fix could be claimed to be a great
solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.
On further consideration, it seems obvious that Andres' suggestion would
not work for UPDATE or DELETE, so I now agree.
It does seem a big thing to backpatch; alternative suggestions?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
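
The per-block rule in the explanation quoted above boils down to one small decision, sketched below. The names block_needs_wal, sync_above and truncated_to are illustrative only; the posted patch exposes this logic through HeapNeedsWAL() and smgrIsSyncPending() instead.

    #include "postgres.h"
    #include "storage/block.h"      /* BlockNumber, InvalidBlockNumber */

    /*
     * Sketch only: must a change to block 'blkno' be WAL-logged, given the
     * relation size remembered when the WAL-skipping operation started
     * ('sync_above') and the target size of any WAL-logged truncation done
     * later in the same transaction ('truncated_to')?
     */
    static bool
    block_needs_wal(BlockNumber blkno, BlockNumber sync_above,
                    BlockNumber truncated_to)
    {
        /* No WAL-skipping operation registered: log as usual. */
        if (sync_above == InvalidBlockNumber)
            return true;

        /* Blocks that existed before the bulk load may already have WAL records. */
        if (blkno < sync_above)
            return true;

        /* A WAL-logged truncation would wipe these blocks at replay. */
        if (truncated_to != InvalidBlockNumber && blkno >= truncated_to)
            return true;

        /* Otherwise the fsync at commit covers this block; skip WAL. */
        return false;
    }
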
On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.
Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?
Should we be worried about that?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/23/2015 09:38 PM, Robert Haas wrote:
On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.

Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?
Yes. But it's still very cheap, especially in the common case that the
pending syncs hash table is empty.
Should we be worried about that?
It doesn't worry me.
- Heikki
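
The common case Heikki refers to is a single pointer test before any hash lookup. A minimal sketch, assuming the pendingSyncs dynahash from the posted patch (the real smgrIsSyncPending() additionally consults sync_above and truncated_to):

    #include "postgres.h"
    #include "storage/relfilenode.h"
    #include "utils/hsearch.h"

    /* Sketch only: the cheap fast path of the pending-sync check. */
    static bool
    sync_is_pending(HTAB *pendingSyncs, RelFileNode rnode)
    {
        bool        found;

        /* Common case: no WAL-skipping operation in this transaction. */
        if (pendingSyncs == NULL)
            return false;

        /* Otherwise a single hash lookup keyed by RelFileNode. */
        (void) hash_search(pendingSyncs, (void *) &rnode, HASH_FIND, &found);
        return found;
    }
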
Heikki Linnakangas wrote:
Thanks. For comparison, I wrote a patch to implement what I had in mind.
When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash table,
and skip WAL-logging if so.
I think this wasn't applied, was it?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Heikki Linnakangas wrote:
Thanks. For comparison, I wrote a patch to implement what I had in mind.
When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash table,
and skip WAL-logging if so.

I think this wasn't applied, was it?
No, it was not applied.
--
Michael
On 22/10/15 03:56, Michael Paquier wrote:
On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Heikki Linnakangas wrote:
Thanks. For comparison, I wrote a patch to implement what I had in mind.
When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash table,
and skip WAL-logging if so.

I think this wasn't applied, was it?
No, it was not applied.
I dropped the ball on this one back in July, so here's an attempt to
revive this thread.
I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.
Some review of that would be nice. If there are no major issues with it,
I'm going to create backpatchable versions of this for 9.4 and below.
- Heikki
Attachments:
0001-Fix-the-optimization-to-skip-WAL-logging-on-table-cr.patch (text/x-diff)
From 063e1aa258800873783190a9678d551b43c0e39e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Feb 2016 15:21:09 +0300
Subject: [PATCH 1/1] Fix the optimization to skip WAL-logging on table created
in same xact.
There were several bugs in the optimization to skip WAL-logging for a table
that was created (or truncated) in the same transaction, with
wal_level=minimal, leading to data loss if a crash happened after the
optimization was used:
* If the table was created, and then truncated, and then loaded with COPY,
we would replay the truncate record at commit, and the table would end
up being empty after replay.
* If there is a trigger on a table that modifies the same table, and you
COPY to the table in the transaction that created it, you might have some
WAL-logged operations on a page, performed by the trigger, intermixed with
the non-WAL-logged inserts done by the COPY. That can lead to crash at
recovery, because we might try to replay a WAL record that e.g. updates
a tuple, but insertion of the tuple was not WAL-logged.
---
src/backend/access/heap/heapam.c | 254 +++++++++++++++++++++++---------
src/backend/access/heap/pruneheap.c | 2 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 2 +-
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 250 ++++++++++++++++++++++++++++---
src/backend/commands/copy.c | 14 +-
src/backend/commands/createas.c | 9 +-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 5 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 47 ++++--
src/include/access/heapam.h | 8 +-
src/include/access/heapam_xlog.h | 2 +
src/include/catalog/storage.h | 3 +
src/include/storage/bufmgr.h | 2 +
16 files changed, 487 insertions(+), 134 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..79298e2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transacton, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -55,6 +77,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2332,12 +2355,6 @@ FreeBulkInsertState(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2441,7 +2458,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2640,12 +2657,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2660,7 +2675,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2695,6 +2710,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2706,6 +2722,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = HeapNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3262,7 +3279,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4130,7 +4147,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer) ||
+ HeapNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5048,7 +5066,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5691,7 +5709,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (HeapNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5831,7 +5849,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5963,7 +5981,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6069,7 +6087,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7122,7 +7140,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(HeapNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7170,7 +7188,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(HeapNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7254,7 +7272,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(HeapNeedsWAL(reln, newbuf) || HeapNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -7357,76 +7375,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
if (need_tuple_data)
bufflags |= REGBUF_KEEP_DATA;
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
- XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
/*
* Prepare WAL data for the new tuple.
*/
- if (prefixlen > 0 || suffixlen > 0)
+ if (HeapNeedsWAL(reln, newbuf))
{
- if (prefixlen > 0 && suffixlen > 0)
- {
- prefix_suffix[0] = prefixlen;
- prefix_suffix[1] = suffixlen;
- XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
- }
- else if (prefixlen > 0)
- {
- XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
- }
- else
+ XLogRegisterBuffer(0, newbuf, bufflags);
+
+ if ((prefixlen > 0 || suffixlen > 0))
{
- XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ if (prefixlen > 0 && suffixlen > 0)
+ {
+ prefix_suffix[0] = prefixlen;
+ prefix_suffix[1] = suffixlen;
+ XLogRegisterBufData(0, (char *) &prefix_suffix,
+ sizeof(uint16) * 2);
+ }
+ else if (prefixlen > 0)
+ {
+ XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+ }
+ else
+ {
+ XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ }
}
- }
- xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
- xlhdr.t_infomask = newtup->t_data->t_infomask;
- xlhdr.t_hoff = newtup->t_data->t_hoff;
- Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+ xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+ xlhdr.t_infomask = newtup->t_data->t_infomask;
+ xlhdr.t_hoff = newtup->t_data->t_hoff;
+ Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
- /*
- * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
- *
- * The 'data' doesn't include the common prefix or suffix.
- */
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- if (prefixlen == 0)
- {
- XLogRegisterBufData(0,
- ((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_len - SizeofHeapTupleHeader - suffixlen);
- }
- else
- {
/*
- * Have to write the null bitmap and data after the common prefix as
- * two separate rdata entries.
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+ *
+ * The 'data' doesn't include the common prefix or suffix.
*/
- /* bitmap [+ padding] [+ oid] */
- if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ if (prefixlen == 0)
{
XLogRegisterBufData(0,
((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ newtup->t_len - SizeofHeapTupleHeader - suffixlen);
}
+ else
+ {
+ /*
+ * Have to write the null bitmap and data after the common prefix
+ * as two separate rdata entries.
+ */
+ /* bitmap [+ padding] [+ oid] */
+ if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ {
+ XLogRegisterBufData(0,
+ ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+ newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ }
- /* data after common prefix */
- XLogRegisterBufData(0,
+ /* data after common prefix */
+ XLogRegisterBufData(0,
((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+ }
}
+ /*
+ * If the old and new tuple are on different pages, also register the old
+ * page, so that a full-page image is created for it if necessary. We
+ * don't need any extra information to replay changes to it.
+ */
+ if (oldbuf != newbuf && HeapNeedsWAL(reln, oldbuf))
+ XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
/* We need to log a tuple identity */
if (need_tuple_data && old_key_tuple)
{
@@ -8343,8 +8371,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8398,6 +8431,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -8788,9 +8823,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -8823,3 +8865,75 @@ heap_sync(Relation rel)
heap_close(toastrel, AccessShareLock);
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ smgrRegisterPendingSync(rel->rd_node, RelationGetNumberOfBlocks(rel));
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ smgrRegisterPendingSync(toastrel->rd_node,
+ RelationGetNumberOfBlocks(toastrel));
+ heap_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous heap_register_sync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records
+ * for heap pages - it is not merely an optimization! WAL-logging a record,
+ * when we have already skipped a previous WAL record for the same page
+ * could lead lead to failure at WAL replay, as the "before" state expected
+ * by the record might not match what's on disk. Also, if the heap was
+ * truncated earlier, we must WAL-log any changes to the once-truncated
+ * blocks, because replaying the truncation record will destroy them.
+ * (smgrIsSyncPending() figures out all that.)
+ */
+bool
+HeapNeedsWAL(Relation rel, Buffer buf)
+{
+ /* Temporary relations never need WAL */
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * If we are going to fsync() the relation at COMMIT, and we have not
+ * truncated the block away previously, and we have not emitted any WAL
+ * records for this block yet, we can skip WAL-logging it.
+ */
+ if (smgrIsSyncPending(rel->rd_node, BufferGetBlockNumber(buf)))
+ {
+ /*
+ * If a pending fsync() will handle this page, its LSN should be
+ * invalid. If it's not, we've already emitted a WAL record for this
+ * block, and all subsequent changes to the block must be WAL-logged
+ * too.
+ */
+ Assert(PageGetLSN(BufferGetPage(buf)) == InvalidXLogRecPtr);
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 59beadd..476e308 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -251,7 +251,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f9ce986..36ba62a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index fc28f3f..7663485 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -279,7 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (1 << mapBit);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (HeapNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b0d5440..5013145 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1989,6 +1989,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2219,6 +2222,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2519,6 +2525,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false);
/*
* Advertise the fact that we aborted in pg_clog (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fe68c99..3097d84 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
@@ -29,6 +30,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +66,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +271,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,30 +307,51 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (pending->sync_above == InvalidBlockNumber || pending->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ pending->truncated_to = nblocks;
+ }
}
/* Do the real work */
@@ -361,7 +429,9 @@ smgrDoPendingDeletes(bool isCommit)
smgrdounlinkall(srels, nrels, false);
for (i = 0; i < nrels; i++)
+ {
smgrclose(srels[i]);
+ }
pfree(srels);
}
@@ -418,6 +488,140 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because
+ * we are going to skip WAL-logging subsequent actions to it.
+ */
+void
+smgrRegisterPendingSync(RelFileNode rnode, BlockNumber nblocks)
+{
+ PendingRelSync *pending;
+ bool found;
+
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rnode,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->truncated_to = InvalidBlockNumber;
+ pending->sync_above = nblocks;
+
+ elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2, "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, pending->sync_above,
+ nblocks);
+}
+
+/*
+ * Are we going to fsync() this relation at COMMIT, and hence don't need to
+ * WAL-log changes to the given block?
+ */
+bool
+smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno)
+{
+ PendingRelSync *pending;
+ bool found;
+
+ if (!pendingSyncs)
+ return false;
+
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rnode,
+ HASH_FIND, &found);
+ if (!found)
+ return false;
+
+ /*
+ * We have no fsync() pending for this relation, or we have (possibly)
+ * already emitted WAL records for this block.
+ */
+ if (pending->sync_above == InvalidBlockNumber ||
+ pending->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, blkno, pending->sync_above);
+ return false;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending->truncated_to != InvalidBlockNumber &&
+ pending->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, blkno);
+ return false;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, blkno);
+
+ return true;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3201476..cc8cebd 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -26,6 +26,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "commands/copy.h"
#include "commands/defrem.h"
#include "commands/trigger.h"
@@ -2269,8 +2270,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2302,7 +2302,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2551,11 +2551,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index fcb0331..80713af 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -471,8 +471,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -519,9 +520,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 869c586..7be9f1f 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -412,7 +412,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -453,9 +453,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eeda3b4..adff984 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3983,8 +3983,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4235,8 +4236,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4f6f6e7..8410812 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -761,7 +761,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (HeapNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -981,7 +981,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (HeapNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1283,7 +1283,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (HeapNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7141eb8..e1061d7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -413,6 +413,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -2864,18 +2865,39 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between the FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache() functions.
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY))
{
ErrorContextCallback errcallback;
@@ -2891,7 +2913,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -2918,18 +2940,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3122,6 +3144,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
XLogRecPtr lsn = InvalidXLogRecPtr;
bool dirtied = false;
bool delayChkpt = false;
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blknum;
/*
* If we need to protect hint bit updates from torn writes, WAL-log a
@@ -3132,7 +3157,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
* We don't check full_page_writes here because that logic is included
* when we call XLogInsert() since the value changes dynamically.
*/
- if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
+ if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
{
/*
* If we're in recovery we cannot dirty a page because of a hint.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a427df5..b671210 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -176,6 +175,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f77489b..81b7c81 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -372,6 +372,8 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
+extern bool HeapNeedsWAL(Relation rel, Buffer buf);
+
extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef960da..e84dee2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrRegisterPendingSync(RelFileNode rnode, BlockNumber nblocks);
+extern bool smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno);
+extern void smgrDoPendingSyncs(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 92c4bc5..7a3daaa 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -178,6 +178,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
--
2.1.4
On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I dropped the ball on this one back in July, so here's an attempt to revive
this thread.

I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.

Some review of that would be nice. If there are no major issues with it, I'm
going to create backpatchable versions of this for 9.4 and below.
I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/
--
Michael
On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I dropped the ball on this one back in July, so here's an attempt to revive
this thread.

I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.

Some review of that would be nice. If there are no major issues with it, I'm
going to create backpatchable versions of this for 9.4 and below.

I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/
Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its
CREATE TABLE, take that for example when wal_level = minimal:
1) Run transaction
=# begin;
BEGIN
=# create table ab (a int primary key);
CREATE TABLE
=# truncate ab;
TRUNCATE TABLE
=# commit;
COMMIT
2) Restart server with immediate mode.
3) Failure:
=# table ab;
ERROR: XX001: could not read block 0 in file "base/16384/16388": read
only 0 of 8192 bytes
LOCATION: mdread, md.c:728
The case where a COPY is issued after TRUNCATE is fixed though, so
that's still an improvement.
Here are other comments.
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
"Flush updates to relations there were not WAL-logged"?
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
islocal is always set to false, so I'd rather remove this argument
from FlushRelationBuffersWithoutRelCache.
for (i = 0; i < nrels; i++)
+ {
smgrclose(srels[i]);
+ }
Looks like noise.
+ if (!found)
+ {
+ pending->truncated_to = InvalidBlockNumber;
+ pending->sync_above = nblocks;
+
+ elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
Here couldn't it be possible that when (sync_above !=
InvalidBlockNumber), nblocks can be higher than sync_above? In which
case we had better increase sync_above to nblocks, no?
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
This is lacking comments.
- if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
+ if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
Here as well, explaining in more detail why the buffer does not need
to go through XLogSaveBufferForHint would be nice.
--
Michael
On Fri, Feb 19, 2016 at 4:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I dropped the ball on this one back in July, so here's an attempt to revive
this thread.

I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.

Some review of that would be nice. If there are no major issues with it, I'm
going to create backpatchable versions of this for 9.4 and below.

I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/

Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its
CREATE TABLE, take that for example when wal_level = minimal:
1) Run transaction
=# begin;
BEGIN
=# create table ab (a int primary key);
CREATE TABLE
=# truncate ab;
TRUNCATE TABLE
=# commit;
COMMIT
2) Restart server with immediate mode.
3) Failure:
=# table ab;
ERROR: XX001: could not read block 0 in file "base/16384/16388": read
only 0 of 8192 bytes
LOCATION: mdread, md.c:728

The case where a COPY is issued after TRUNCATE is fixed though, so
that's still an improvement.

Here are other comments.

+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
"Flush updates to relations that were not WAL-logged"?

+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
islocal is always set to false, so I'd rather remove this argument
from FlushRelationBuffersWithoutRelCache.

for (i = 0; i < nrels; i++)
+ {
smgrclose(srels[i]);
+ }
Looks like noise.

+ if (!found)
+ {
+ pending->truncated_to = InvalidBlockNumber;
+ pending->sync_above = nblocks;
+
+ elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
Here couldn't it be possible that when (sync_above != InvalidBlockNumber),
nblocks can be higher than sync_above? In which case we had better increase
sync_above to nblocks, no?

+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
This is lacking comments.

- if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
+ if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
Here as well, explaining in more detail why the buffer does not need
to go through XLogSaveBufferForHint would be nice.

An additional one:
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
In log_heap_update, the new buffer is now conditionally logged,
depending on whether the heap needs WAL or not.
Now during replay the following thing is done:
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
Shouldn't we check for XLogRecHasBlockRef(record, 0) when the tuple is
updated on the same page?
--
Michael
Hello, I have given some thought to the original issue.
At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com>
Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its CREATE TABLE.
Focusing on this issue, what we should do is somehow build an empty
index just after an index truncation. The attached patch does the
following things to fix this.

- make index_build use ambuildempty when the relation on which
the index will be built is apparently empty, that is, when the
relation has no blocks.

- add one parameter "persistent" to ambuildempty(). It behaves as
before if the parameter is false. If it is true, it creates an empty
index on the main fork and emits WAL even if wal_level is minimal.

Creation of an index for an empty table can be safely done by
ambuildempty, since it creates the image for the init fork, which can
simply be copied as the main fork on initialization. And the heap is
always empty when RelationTruncateIndexes calls index_build.

For nonempty tables, ambuild properly initializes the new index.

The new parameter "persistent" might better be a forknum, because it
actually represents the persistency of the index to be created. But
I'm out of time now...
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
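
A hedged sketch of the first point above, the index_build() change Kyotaro describes: when the heap has no blocks, call the access method's ambuildempty() with the new flag instead of ambuild(). The helper name and the rd_amroutine-based calls below are assumptions about where the attached patch hooks in, not quotes from it.

    #include "postgres.h"
    #include "access/amapi.h"
    #include "nodes/execnodes.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Sketch only: choose between ambuild and ambuildempty based on heap size. */
    static void
    build_index_guts(Relation heapRelation, Relation indexRelation,
                     IndexInfo *indexInfo)
    {
        if (RelationGetNumberOfBlocks(heapRelation) == 0)
        {
            /*
             * Empty heap: build an empty index directly in the main fork and
             * WAL-log it even under wal_level = minimal (persistent = true,
             * per the new ambuildempty() parameter).
             */
            indexRelation->rd_amroutine->ambuildempty(indexRelation, true);
        }
        else
        {
            /* Nonempty heap: the regular build path initializes the index. */
            (void) indexRelation->rd_amroutine->ambuild(heapRelation,
                                                        indexRelation,
                                                        indexInfo);
        }
    }
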
Attachments:
Fix_wal_logging_problem_20160311.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c740952..7f0d3f9 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -675,13 +675,14 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
}
void
-brinbuildempty(Relation index)
+brinbuildempty(Relation index, bool persistent)
{
Buffer metabuf;
+ ForkNumber forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
/* An empty BRIN index has a metapage only. */
metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
/* Initialize and xlog metabuffer. */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..c041360 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -430,20 +430,23 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
}
/*
- * ginbuildempty() -- build an empty gin index in the initialization fork
+ * ginbuildempty() -- build an empty gin index
+ * the new index is built in the intialization fork or main fork according
+ * to the parameter persistent.
*/
void
-ginbuildempty(Relation index)
+ginbuildempty(Relation index, bool persistent)
{
Buffer RootBuffer,
MetaBuffer;
+ ForkNumber forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
/* An empty GIN index has two pages. */
MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
/* Initialize and xlog metabuffer and root buffer. */
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 996363c..3d73083 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -110,15 +110,18 @@ createTempGistContext(void)
}
/*
- * gistbuildempty() -- build an empty gist index in the initialization fork
+ * gistbuildempty() -- build an empty gist index.
+ * the new index is built in the intialization fork or main fork according
+ * to the parameter persistent.
*/
void
-gistbuildempty(Relation index)
+gistbuildempty(Relation index, bool persistent)
{
Buffer buffer;
+ ForkNumber forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ buffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/* Initialize and xlog buffer */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3d48c4f..3b9cd66 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -156,12 +156,14 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
}
/*
- * hashbuildempty() -- build an empty hash index in the initialization fork
+ * hashbuildempty() -- build an empty hash index
+ * the new index is built in the initialization fork or main fork according
+ * to the parameter persistent.
*/
void
-hashbuildempty(Relation index)
+hashbuildempty(Relation index, bool persistent)
{
- _hash_metapinit(index, 0, INIT_FORKNUM);
+ _hash_metapinit(index, 0, persistent ? MAIN_FORKNUM : INIT_FORKNUM);
}
/*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..c20377d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -230,12 +230,15 @@ btbuildCallback(Relation index,
}
/*
- * btbuildempty() -- build an empty btree index in the initialization fork
+ * btbuildempty() -- build an empty btree index
+ * the new index is built in the initialization fork or main fork according
+ * to the parameter persistent.
*/
void
-btbuildempty(Relation index)
+btbuildempty(Relation index, bool persistent)
{
Page metapage;
+ ForkNumber forknum = persistent ? MAIN_FORKNUM : INIT_FORKNUM;
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
@@ -243,10 +246,9 @@ btbuildempty(Relation index)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
- if (XLogIsNeeded())
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+ smgrwrite(index->rd_smgr, forknum, BTREE_METAPAGE, (char *) metapage, true);
+ if (XLogIsNeeded() || persistent)
+ log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,
BTREE_METAPAGE, metapage, false);
/*
@@ -254,7 +256,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, forknum);
}
/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 44fd644..3d5964b 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -152,12 +152,15 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
}
/*
- * Build an empty SPGiST index in the initialization fork
+ * Build an empty SPGiST index
+ * the new index is built in the initialization fork or main fork according
+ * to the parameter persistent.
*/
void
-spgbuildempty(Relation index)
+spgbuildempty(Relation index, bool persistent)
{
Page page;
+ ForkNumber forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);
/* Construct metapage. */
page = (Page) palloc(BLCKSZ);
@@ -165,30 +168,30 @@ spgbuildempty(Relation index)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
+ smgrwrite(index->rd_smgr, forknum, SPGIST_METAPAGE_BLKNO,
(char *) page, true);
- if (XLogIsNeeded())
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+ if (XLogIsNeeded() || persistent)
+ log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,
SPGIST_METAPAGE_BLKNO, page, false);
/* Likewise for the root page. */
SpGistInitPage(page, SPGIST_LEAF);
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
+ smgrwrite(index->rd_smgr, forknum, SPGIST_ROOT_BLKNO,
(char *) page, true);
- if (XLogIsNeeded())
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+ if (XLogIsNeeded() || persistent)
+ log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,
SPGIST_ROOT_BLKNO, page, true);
/* Likewise for the null-tuples root page. */
SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
+ smgrwrite(index->rd_smgr, forknum, SPGIST_NULL_BLKNO,
(char *) page, true);
- if (XLogIsNeeded())
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+ if (XLogIsNeeded() || persistent)
+ log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,
SPGIST_NULL_BLKNO, page, true);
/*
@@ -196,7 +199,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, forknum);
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 31a1438..ea8c623 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1987,7 +1987,8 @@ index_build(Relation heapRelation,
bool isprimary,
bool isreindex)
{
- IndexBuildResult *stats;
+ static IndexBuildResult defstats = {0, 0};
+ IndexBuildResult *stats = &defstats;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
@@ -2016,12 +2017,19 @@ index_build(Relation heapRelation,
save_nestlevel = NewGUCNestLevel();
/*
- * Call the access method's build procedure
+ * Call the access method's build procedure. Build an empty index for
+ * empty heaps.
*/
- stats = indexRelation->rd_amroutine->ambuild(heapRelation, indexRelation,
- indexInfo);
- Assert(PointerIsValid(stats));
-
+ if (RelationGetNumberOfBlocks(heapRelation) > 0)
+ stats = indexRelation->rd_amroutine->ambuild(heapRelation,
+ indexRelation,
+ indexInfo);
+ else
+ {
+ RelationOpenSmgr(indexRelation);
+ indexRelation->rd_amroutine->ambuildempty(indexRelation, true);
+ }
+
/*
* If this is an unlogged index, we may need to write out an init fork for
* it -- but we must first check whether one already exists. If, for
@@ -2032,9 +2040,8 @@ index_build(Relation heapRelation,
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
{
- RelationOpenSmgr(indexRelation);
smgrcreate(indexRelation->rd_smgr, INIT_FORKNUM, false);
- indexRelation->rd_amroutine->ambuildempty(indexRelation);
+ indexRelation->rd_amroutine->ambuildempty(indexRelation, false);
}
/*
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 35f1061..220494e 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -36,7 +36,7 @@ typedef IndexBuildResult *(*ambuild_function) (Relation heapRelation,
struct IndexInfo *indexInfo);
/* build empty index */
-typedef void (*ambuildempty_function) (Relation indexRelation);
+typedef void (*ambuildempty_function) (Relation indexRelation, bool persistent);
/* insert this tuple */
typedef bool (*aminsert_function) (Relation indexRelation,
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 47317af..f7e600a 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -86,7 +86,7 @@ extern BrinDesc *brin_build_desc(Relation rel);
extern void brin_free_desc(BrinDesc *bdesc);
extern IndexBuildResult *brinbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
-extern void brinbuildempty(Relation index);
+extern void brinbuildempty(Relation index, bool persistent);
extern bool brininsert(Relation idxRel, Datum *values, bool *nulls,
ItemPointer heaptid, Relation heapRel,
IndexUniqueCheck checkUnique);
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index d2ea588..91a2622 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -617,7 +617,7 @@ extern Datum gintuple_get_key(GinState *ginstate, IndexTuple tuple,
/* gininsert.c */
extern IndexBuildResult *ginbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
-extern void ginbuildempty(Relation index);
+extern void ginbuildempty(Relation index, bool persistent);
extern bool gininsert(Relation index, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index f9732ba..448044e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -428,7 +428,7 @@ typedef struct GiSTOptions
/* gist.c */
extern Datum gisthandler(PG_FUNCTION_ARGS);
-extern void gistbuildempty(Relation index);
+extern void gistbuildempty(Relation index, bool persistent);
extern bool gistinsert(Relation r, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3a68390..ab93e34 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -246,7 +246,7 @@ typedef HashMetaPageData *HashMetaPage;
extern Datum hashhandler(PG_FUNCTION_ARGS);
extern IndexBuildResult *hashbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
-extern void hashbuildempty(Relation index);
+extern void hashbuildempty(Relation index, bool persistent);
extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9046b16..64de387 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -656,7 +656,7 @@ typedef BTScanOpaqueData *BTScanOpaque;
extern Datum bthandler(PG_FUNCTION_ARGS);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
-extern void btbuildempty(Relation index);
+extern void btbuildempty(Relation index, bool persistent);
extern bool btinsert(Relation rel, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique);
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 1994f71..3c26cde 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -181,7 +181,7 @@ extern bytea *spgoptions(Datum reloptions, bool validate);
/* spginsert.c */
extern IndexBuildResult *spgbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
-extern void spgbuildempty(Relation index);
+extern void spgbuildempty(Relation index, bool persistent);
extern bool spginsert(Relation index, Datum *values, bool *isnull,
ItemPointer ht_ctid, Relation heapRel,
IndexUniqueCheck checkUnique);
On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com>
Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its

Focusing this issue, what we should do is somehow building empty
index just after a index truncation. The attached patch does the
following things to fix this.

- make index_build use ambuildempty when the relation on which
the index will be built is apparently empty. That is, when the
relation has no block.
- add one parameter "persistent" to ambuildempty(). It behaves as
before if the parameter is false. If not, it creates an empty
index on MAIN_FORK and emits logs even if wal_level is minimal.
Hm. It seems to me that this patch is just a bandaid for the real
problem which is that we should not TRUNCATE the underlying index
relations when the TRUNCATE optimization is running. In short I would
let the empty routines in AM code paths alone, and just continue using
them for the generation of INIT_FORKNUM with unlogged relations. Your
patch is not something backpatchable anyway I think.
The new parameter 'persistent' would be better be forknum because
it actually represents the persistency of the index to be
created. But I'm out of time now..
I actually have some users running with wal_level to minimal, even if
I don't think they use this optimization, we had better fix even index
relations at the same time as table relations.. I'll try to get some
time once the patch review storm goes down a little, except if someone
beats me to it first.
--
Michael
Thank you for the comment.
I understand that this is not an issue in a hurry so don't bother
to reply.
At Tue, 15 Mar 2016 18:21:34 +0100, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSVm-X1-w9i=U=DCyMxDxzfNT-41pqTSvh0DUmUgi8BQg@mail.gmail.com>
On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com>

Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its

Focusing this issue, what we should do is somehow building empty
index just after a index truncation. The attached patch does the
following things to fix this.

- make index_build use ambuildempty when the relation on which
the index will be built is apparently empty. That is, when the
relation has no block.
- add one parameter "persistent" to ambuildempty(). It behaves as
before if the parameter is false. If not, it creates an empty
index on MAIN_FORK and emits logs even if wal_level is minimal.

Hm. It seems to me that this patch is just a bandaid for the real
problem which is that we should not TRUNCATE the underlying index
relations when the TRUNCATE optimization is running.
The eventual problem is a 0-length index relation left just after
a relation truncation. We assume that an index with an empty
relation after recovery is not valid. However, just skipping
TRUNCATE of the index relation won't resolve it, since that in turn
leaves an index with garbage entries. Am I missing something?
Since the index relation should in any case be "validly emptied"
in place when the TRUNCATE optimization applies, I tried doing that
with TRUNCATE + ambuildempty, which can be redone properly too.
Repeated TRUNCATEs issue eventually-useless logs, but that is
inevitable since we cannot foretell any succeeding TRUNCATEs.
(TRUNCATE+)COPY+INSERT seems another kind of problem, which would
be fixed by Heikki's patch.
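(For illustration, a minimal sketch of that pattern; the table name, column
and COPY source here are made up, and it assumes wal_level = minimal with a
crash happening before the next checkpoint.)

BEGIN;
CREATE TABLE t1 (id int PRIMARY KEY);
TRUNCATE t1;                    -- truncated in the creating transaction
COPY t1 FROM '/tmp/t1.dat';     -- may skip WAL under wal_level = minimal
INSERT INTO t1 VALUES (100);    -- a later, WAL-logged change to the same heap
COMMIT;
-- if the server crashes before the next checkpoint, replay of the records
-- that were written can leave the heap and its index inconsistent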
In short I would
let the empty routines in AM code paths alone, and just continue using
them for the generation of INIT_FORKNUM with unlogged relations. Your
patch is not something backpatchable anyway I think.
It does seem un-backpatchable, if the change to the way
ambuildempty is called is what prevents that.
The new parameter 'persistent' would be better be forknum because
it actually represents the persistency of the index to be
created. But I'm out of time now..

I actually have some users running with wal_level to minimal, even if
I don't think they use this optimization, we had better fix even index
relations at the same time as table relations.. I'll try to get some
time once the patch review storm goes down a little, except if someone
beats me to it first.
Ok, I understand that this is not an issue in a hurry. I'll go to
another patch that needs review.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
Ok, I understand that this is not an issue in a hurry. I'll go to
another patch that needs review.
Since we're getting towards the end of the CF is it time to pick this up
again?
Thanks,
--
-David
david@pgmasters.net
On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david@pgmasters.net> wrote:
On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
Ok, I understand that this is not an issue in a hurry. I'll go to
another patch that needs review.

Since we're getting towards the end of the CF is it time to pick this up
again?
Perhaps not. This is a legit bug with an unfinished patch (see index
relation truncation) that is going to need a careful review. I don't
think that this should be impacted by the 4/8 feature freeze, so we
could still work on that after the embargo and we've had this bug for
months actually. FWIW, I am still planning to work on it once the CF
is done, in order to keep my manpower focused on actual patch reviews
as much as possible...
--
Michael
On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david@pgmasters.net> wrote:
On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
Ok, I understand that this is not an issue in a hurry. I'll go to
another patch that needs review.

Since we're getting towards the end of the CF is it time to pick this up
again?

Perhaps not. This is a legit bug with an unfinished patch (see index
relation truncation) that is going to need a careful review. I don't
think that this should be impacted by the 4/8 feature freeze, so we
could still work on that after the embargo and we've had this bug for
months actually. FWIW, I am still planning to work on it once the CF
is done, in order to keep my manpower focused on actual patch reviews
as much as possible...
In short, we may want to bump that to next CF... I have already marked
this ticket as something to work on soonish on my side, so it does not
change much seen from here if it's part of the next CF. What we should
just make sure of is not to lose track of its existence.
--
Michael
On 3/22/16 8:54 PM, Michael Paquier wrote:
On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david@pgmasters.net> wrote:
On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
Ok, I understand that this is not an issue in a hurry. I'll go to
another patch that needs review.

Since we're getting towards the end of the CF is it time to pick this up
again?

Perhaps not. This is a legit bug with an unfinished patch (see index
relation truncation) that is going to need a careful review. I don't
think that this should be impacted by the 4/8 feature freeze, so we
could still work on that after the embargo and we've had this bug for
months actually. FWIW, I am still planning to work on it once the CF
is done, in order to keep my manpower focused on actual patch reviews
as much as possible...

In short, we may want to bump that to next CF... I have already marked
this ticket as something to work on soonish on my side, so it does not
change much seen from here if it's part of the next CF. What we should
just be sure is not to lose track of its existence.
I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.
--
-David
david@pgmasters.net
On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:
I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.
It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.
--
Michael
On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:
I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.

It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.
So I have moved this patch to the next CF for now, and will work on
fixing it rather soonishly as an effort to stabilize 9.6 as well as
back-branches.
--
Michael
On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:

I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.

It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.

So I have moved this patch to the next CF for now, and will work on
fixing it rather soonishly as an effort to stabilize 9.6 as well as
back-branches.
Well, not that soon at the end, but I am back on that... I have not
completely reviewed all the code yet, and the case of index relation
referring to a relation optimized with truncate is still broken, but
for now here is a rebased patch if people are interested. I am also
going to get a TAP test out of my pocket to ease testing.
--
Michael
Attachments:
fix-wal-level-minimal-michael-1.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..bbc09cd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -55,6 +55,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2331,12 +2332,6 @@ FreeBulkInsertState(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2440,7 +2435,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2639,12 +2634,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2659,7 +2652,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2727,7 +2720,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* We don't use heap_multi_insert for catalog tuples yet, but
* better be prepared...
*/
- if (needwal && need_cids)
+ if (HeapNeedsWAL(relation, buffer) && need_cids)
log_heap_new_cid(relation, heaptup);
}
@@ -2747,7 +2740,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (needwal)
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
xl_heap_multi_insert *xlrec;
@@ -3261,7 +3254,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -3982,7 +3975,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -4194,7 +4187,7 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -5148,7 +5141,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5825,7 +5818,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (HeapNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5980,7 +5973,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6112,7 +6105,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6218,7 +6211,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -9081,3 +9074,71 @@ heap_sync(Relation rel)
heap_close(toastrel, AccessShareLock);
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. After calling this, any changes to
+ * the heap (including TOAST heap if any) in the same transaction will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * Like with heap_sync(), indexes are not touched.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ smgrRegisterPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ smgrRegisterPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous heap_register_sync() requests.
+ *
+ * Note that it is required to use this before creating any WAL records for
+ * heap pages - it is not merely an optimization. WAL-logging a record,
+ * when we have already skipped a previous WAL record for the same page could
+ * lead to failure at WAL replay, as the "before" state expected by the
+ * record might not match what's on disk (this should only be a problem
+ * with full_page_writes=off, though).
+ */
+bool
+HeapNeedsWAL(Relation rel, Buffer buf)
+{
+ /* Temporary relations never need WAL */
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * If we are going to fsync() the relation at COMMIT, and we have not
+ * truncated the block away previously, and we have not emitted any WAL
+ * records for this block yet, we can skip WAL-logging it.
+ */
+ if (smgrIsSyncPending(rel->rd_node, BufferGetBlockNumber(buf)))
+ {
+ /*
+ * If a pending fsync() will handle this page, its LSN should be
+ * invalid. If it's not, we've already emitted a WAL record for this
+ * block, and all subsequent changes to the block must be WAL-logged
+ * too.
+ */
+ Assert(PageGetLSN(BufferGetPage(buf)) == InvalidXLogRecPtr);
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..3207134 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -260,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..fb07795 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -307,6 +307,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
+ /* XXX: Should we use HeapNeedsWAL here? */
if (RelationNeedsWAL(rel))
{
if (XLogRecPtrIsInvalid(recptr))
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f36ea..f66d9ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2007,6 +2007,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2237,6 +2240,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2541,6 +2547,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false);
/*
* Advertise the fact that we aborted in pg_clog (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..54ff874 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
#include "postgres.h"
#include "access/visibilitymap.h"
+#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -29,6 +30,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +66,42 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action. When we are about to begin a large operation on the relation,
+ * a PendingRelSync entry is created, and 'sync_above' is set to the current
+ * size of the relation. Any operations on blocks < sync_above need to be
+ * WAL-logged as usual, but for operations on higher blocks, WAL-logging is
+ * skipped. It's important that after WAL-logging has been skipped for a
+ * block, we don't WAL log any subsequent actions on the same block either.
+ * Replaying the WAL record of the subsequent action might fail otherwise,
+ * as the "before" state of the block might not match, as the earlier actions
+ * were not WAL-logged.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we cannot skip WAL-logging for that
+ * relation anymore, as replaying the truncation record will destroy all the
+ * data inserted after that. But if we have already decided to skip WAL-logging
+ * changes to a relation, and the relation is truncated, we don't need to
+ * WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+/* Relations that need to be fsync'd at commit */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >= sync_above */
+ bool truncated; /* truncation WAL record was written */
+} PendingRelSync;
+
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +264,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -249,6 +289,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
if (vm)
visibilitymap_truncate(rel, nblocks);
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated = false;
+ }
+
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
@@ -258,7 +309,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/
- if (RelationNeedsWAL(rel))
+ if (RelationNeedsWAL(rel) &&
+ (pending->sync_above == InvalidBlockNumber || pending->sync_above < nblocks))
{
/*
* Make an XLOG entry reporting the file truncation.
@@ -276,6 +328,9 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
lsn = XLogInsert(RM_SMGR_ID,
XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "WAL-logged truncation of rel %u to %u blocks", rel->rd_node.relNode, nblocks);
+
/*
* Flush, because otherwise the truncation of the main relation might
* hit the disk before the WAL record, and the truncation of the FSM
@@ -285,6 +340,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ pending->truncated = true;
}
/* Do the real work */
@@ -419,6 +476,142 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because
+ * we are going to skip WAL-logging subsequent actions to it.
+ */
+void
+smgrRegisterPendingSync(Relation rel)
+{
+ PendingRelSync *pending;
+ bool found;
+ BlockNumber nblocks;
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->truncated = false;
+ pending->sync_above = nblocks;
+
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "Registering new pending sync for rel %u at block %u", rel->rd_node.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "Registering pending sync for rel %u at block %u", rel->rd_node.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
+ if (rel->rd_node.relNode >= FirstNormalObjectId)
+ elog(LOG, "Not updating pending sync for rel %u at block %u (was %u)", rel->rd_node.relNode, nblocks, pending->sync_above);
+}
+
+/*
+ * Are we going to fsync() this relation at COMMIT, and hence don't need to
+ * WAL-log changes to the given block?
+ */
+bool
+smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno)
+{
+ PendingRelSync *pending;
+ bool found;
+
+ if (!pendingSyncs)
+ return false;
+
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rnode,
+ HASH_FIND, &found);
+ if (!found)
+ return false;
+
+ /*
+ * We have no fsync() pending for this relation, or we have (possibly)
+ * already emitted WAL records for this block.
+ */
+ if (pending->sync_above == InvalidBlockNumber ||
+ pending->sync_above > blkno)
+ {
+ if (rnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Not skipping WAL-logging for rel %u block %u, because sync_above is %u", rnode.relNode, blkno, pending->sync_above);
+ return false;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending->truncated)
+ {
+ if (rnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Not skipping WAL-logging for rel %u block %u, because it was truncated", rnode.relNode, blkno);
+ return false;
+ }
+
+ if (rnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Skipping WAL-logging for rel %u block %u", rnode.relNode, blkno);
+
+ return true;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelFileNodeBuffers(pending->relnode, false);
+ /* FlushRelationBuffers will have opened rd_smgr */
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ if (pending->relnode.relNode >= FirstNormalObjectId)
+ elog(LOG, "Syncing rel %u", pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..01486da 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -26,6 +26,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "commands/copy.h"
#include "commands/defrem.h"
#include "commands/trigger.h"
@@ -2302,7 +2303,10 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
+ {
+ heap_register_sync(cstate->rel);
hi_options |= HEAP_INSERT_SKIP_WAL;
+ }
}
/*
@@ -2551,11 +2555,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction (we used to do it here, but it was later found out
+ * that to be safe, we must avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..1b1246f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1462,7 +1462,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (HeapNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..d1e7bc8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -452,6 +452,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
+static void FlushRelFileNodeBuffers_internal(SMgrRelation smgr, bool islocal);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
static int buffertag_comparator(const void *p1, const void *p2);
@@ -3136,14 +3137,30 @@ FlushRelationBuffers(Relation rel)
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelFileNodeBuffers_internal(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelFileNodeBuffers(RelFileNode rnode, bool islocal)
+{
+ FlushRelFileNodeBuffers_internal(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelFileNodeBuffers_internal(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3160,7 +3177,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3190,18 +3207,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3397,6 +3414,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
bool dirtied = false;
bool delayChkpt = false;
uint32 buf_state;
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blknum;
/*
* If we need to protect hint bit updates from torn writes, WAL-log a
@@ -3407,8 +3427,10 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
* We don't check full_page_writes here because that logic is included
* when we call XLogInsert() since the value changes dynamically.
*/
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
if (XLogHintBitIsNeeded() &&
- (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
+ (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
{
/*
* If we're in recovery we cannot dirty a page because of a hint.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..06082d9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,7 +25,7 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
+#define HEAP_INSERT_SKIP_WAL 0x0001 /* obsolete, not used anymore */
#define HEAP_INSERT_SKIP_FSM 0x0002
#define HEAP_INSERT_FROZEN 0x0004
#define HEAP_INSERT_SPECULATIVE 0x0008
@@ -177,6 +177,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5418d71 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -378,6 +378,8 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
+extern bool HeapNeedsWAL(Relation rel, Buffer buf);
+
extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef960da..c618c78 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrRegisterPendingSync(Relation rel);
+extern bool smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno);
+extern void smgrDoPendingSyncs(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..0622dee 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -202,6 +202,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelFileNodeBuffers(RelFileNode rel, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
On Thu, Jul 28, 2016 at 4:59 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:

I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.

It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.

So I have moved this patch to the next CF for now, and will work on
fixing it rather soonishly as an effort to stabilize 9.6 as well as
back-branches.

Well, not that soon at the end, but I am back on that... I have not
completely reviewed all the code yet, and the case of index relation
referring to a relation optimized with truncate is still broken, but
for now here is a rebased patch if people are interested. I am going
to get as well a TAP tests out of my pocket to ease testing.
The patch I sent yesterday was based on an incorrect version. Attached
is a slightly-modified version of the last one I found here
(/messages/by-id/56B342F5.1050502@iki.fi), which
is rebased on HEAD at ed0b228. I have also converted the test case
script from upthread into a TAP test in src/test/recovery that covers 3
cases, and I included it in the patch:
1) CREATE + INSERT + COPY => crash
2) CREATE + trigger + COPY => crash
3) CREATE + TRUNCATE + COPY => incorrect number of rows.
The first two tests make the system crash, the third one reports an
incorrect number of rows.
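For reference, the first scenario looks roughly like this (table name,
column and COPY source are illustrative; it assumes wal_level = minimal and
that the server is killed right after COMMIT, before any checkpoint):

BEGIN;
CREATE TABLE t1 (id int);
INSERT INTO t1 VALUES (1);      -- WAL-logged as usual
COPY t1 FROM '/tmp/t1.dat';     -- may skip WAL, since t1 was created in this transaction
COMMIT;
-- killing the server here is what the TAP test simulates; per the above,
-- this case makes the system crash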
This is registered in next CF by the way:
https://commitfest.postgresql.org/10/528/
Thoughts?
--
Michael
Attachments:
fix-wal-level-minimal-michael-2.patch (invalid/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..5d5c673 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -55,6 +77,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2331,12 +2354,6 @@ FreeBulkInsertState(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2440,7 +2457,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2639,12 +2656,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2659,7 +2674,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2694,6 +2709,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2705,6 +2721,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = HeapNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3261,7 +3278,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4194,7 +4211,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer) ||
+ HeapNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5148,7 +5166,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5825,7 +5843,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (HeapNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5980,7 +5998,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6112,7 +6130,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6218,7 +6236,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7331,7 +7349,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(HeapNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7379,7 +7397,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(HeapNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7464,7 +7482,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(HeapNeedsWAL(reln, newbuf) || HeapNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -7567,76 +7585,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
if (need_tuple_data)
bufflags |= REGBUF_KEEP_DATA;
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
- XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
/*
* Prepare WAL data for the new tuple.
*/
- if (prefixlen > 0 || suffixlen > 0)
+ if (HeapNeedsWAL(reln, newbuf))
{
- if (prefixlen > 0 && suffixlen > 0)
- {
- prefix_suffix[0] = prefixlen;
- prefix_suffix[1] = suffixlen;
- XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
- }
- else if (prefixlen > 0)
- {
- XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
- }
- else
+ XLogRegisterBuffer(0, newbuf, bufflags);
+
+ if ((prefixlen > 0 || suffixlen > 0))
{
- XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ if (prefixlen > 0 && suffixlen > 0)
+ {
+ prefix_suffix[0] = prefixlen;
+ prefix_suffix[1] = suffixlen;
+ XLogRegisterBufData(0, (char *) &prefix_suffix,
+ sizeof(uint16) * 2);
+ }
+ else if (prefixlen > 0)
+ {
+ XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+ }
+ else
+ {
+ XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ }
}
- }
- xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
- xlhdr.t_infomask = newtup->t_data->t_infomask;
- xlhdr.t_hoff = newtup->t_data->t_hoff;
- Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+ xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+ xlhdr.t_infomask = newtup->t_data->t_infomask;
+ xlhdr.t_hoff = newtup->t_data->t_hoff;
+ Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
- /*
- * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
- *
- * The 'data' doesn't include the common prefix or suffix.
- */
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- if (prefixlen == 0)
- {
- XLogRegisterBufData(0,
- ((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_len - SizeofHeapTupleHeader - suffixlen);
- }
- else
- {
/*
- * Have to write the null bitmap and data after the common prefix as
- * two separate rdata entries.
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+ *
+ * The 'data' doesn't include the common prefix or suffix.
*/
- /* bitmap [+ padding] [+ oid] */
- if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ if (prefixlen == 0)
{
XLogRegisterBufData(0,
((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ newtup->t_len - SizeofHeapTupleHeader - suffixlen);
}
+ else
+ {
+ /*
+ * Have to write the null bitmap and data after the common prefix
+ * as two separate rdata entries.
+ */
+ /* bitmap [+ padding] [+ oid] */
+ if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ {
+ XLogRegisterBufData(0,
+ ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+ newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ }
- /* data after common prefix */
- XLogRegisterBufData(0,
+ /* data after common prefix */
+ XLogRegisterBufData(0,
((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+ }
}
+ /*
+ * If the old and new tuple are on different pages, also register the old
+ * page, so that a full-page image is created for it if necessary. We
+ * don't need any extra information to replay changes to it.
+ */
+ if (oldbuf != newbuf && HeapNeedsWAL(reln, oldbuf))
+ XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
/* We need to log a tuple identity */
if (need_tuple_data && old_key_tuple)
{
@@ -8555,8 +8583,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8610,6 +8643,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9046,9 +9081,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we made any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9081,3 +9123,75 @@ heap_sync(Relation rel)
heap_close(toastrel, AccessShareLock);
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ smgrRegisterPendingSync(rel->rd_node, RelationGetNumberOfBlocks(rel));
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ smgrRegisterPendingSync(toastrel->rd_node,
+ RelationGetNumberOfBlocks(toastrel));
+ heap_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous heap_register_sync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records
+ * for heap pages - it is not merely an optimization! WAL-logging a record,
+ * when we have already skipped a previous WAL record for the same page
+ * could lead to failure at WAL replay, as the "before" state expected
+ * by the record might not match what's on disk. Also, if the heap was
+ * truncated earlier, we must WAL-log any changes to the once-truncated
+ * blocks, because replaying the truncation record will destroy them.
+ * (smgrIsSyncPending() figures out all that.)
+ */
+bool
+HeapNeedsWAL(Relation rel, Buffer buf)
+{
+ /* Temporary relations never need WAL */
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * If we are going to fsync() the relation at COMMIT, and we have not
+ * truncated the block away previously, and we have not emitted any WAL
+ * records for this block yet, we can skip WAL-logging it.
+ */
+ if (smgrIsSyncPending(rel->rd_node, BufferGetBlockNumber(buf)))
+ {
+ /*
+ * If a pending fsync() will handle this page, its LSN should be
+ * invalid. If it's not, we've already emitted a WAL record for this
+ * block, and all subsequent changes to the block must be WAL-logged
+ * too.
+ */
+ Assert(PageGetLSN(BufferGetPage(buf)) == InvalidXLogRecPtr);
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..3207134 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -260,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (HeapNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f9ce986..36ba62a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..4b82f3d 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -307,7 +307,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (HeapNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f36ea..f66d9ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2007,6 +2007,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2237,6 +2240,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2541,6 +2547,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false);
/*
* Advertise the fact that we aborted in pg_clog (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..0a685be 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
@@ -29,6 +30,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +66,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +271,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,31 +307,52 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (pending->sync_above == InvalidBlockNumber ||
+ pending->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ pending->truncated_to = nblocks;
+ }
}
/* Do the real work */
@@ -419,6 +487,140 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because
+ * we are going to skip WAL-logging subsequent actions to it.
+ */
+void
+smgrRegisterPendingSync(RelFileNode rnode, BlockNumber nblocks)
+{
+ PendingRelSync *pending;
+ bool found;
+
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rnode,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->truncated_to = InvalidBlockNumber;
+ pending->sync_above = nblocks;
+
+ elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2, "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, pending->sync_above,
+ nblocks);
+}
+
+/*
+ * Are we going to fsync() this relation at COMMIT, and hence don't need to
+ * WAL-log changes to the given block?
+ */
+bool
+smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno)
+{
+ PendingRelSync *pending;
+ bool found;
+
+ if (!pendingSyncs)
+ return false;
+
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rnode,
+ HASH_FIND, &found);
+ if (!found)
+ return false;
+
+ /*
+ * We have no fsync() pending for this relation, or we have (possibly)
+ * already emitted WAL records for this block.
+ */
+ if (pending->sync_above == InvalidBlockNumber ||
+ pending->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, blkno, pending->sync_above);
+ return false;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending->truncated_to != InvalidBlockNumber &&
+ pending->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, blkno);
+ return false;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, blkno);
+
+ return true;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..01b712d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -26,6 +26,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "commands/copy.h"
#include "commands/defrem.h"
#include "commands/trigger.h"
@@ -2269,8 +2270,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2302,7 +2302,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2551,11 +2551,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that, to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 5b4f6af..b64d52a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6cddcbd..dbef95b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -456,7 +456,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -499,9 +499,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 86e9814..ca892ea 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3984,8 +3984,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4236,8 +4237,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..7190000 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -879,7 +879,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (HeapNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1106,7 +1106,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (HeapNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1462,7 +1462,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (HeapNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..edc580f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3130,20 +3131,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3160,7 +3182,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3190,18 +3212,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3397,6 +3419,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
bool dirtied = false;
bool delayChkpt = false;
uint32 buf_state;
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blknum;
/*
* If we need to protect hint bit updates from torn writes, WAL-log a
@@ -3407,8 +3432,10 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
* We don't check full_page_writes here because that logic is included
* when we call XLogInsert() since the value changes dynamically.
*/
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
if (XLogHintBitIsNeeded() &&
- (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
+ (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
{
/*
* If we're in recovery we cannot dirty a page because of a hint.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..1c169ef 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -177,6 +176,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..5418d71 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -378,6 +378,8 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
+extern bool HeapNeedsWAL(Relation rel, Buffer buf);
+
extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef960da..e84dee2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrRegisterPendingSync(RelFileNode rnode, BlockNumber nblocks);
+extern bool smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno);
+extern void smgrDoPendingSyncs(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..f02ea93 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -202,6 +202,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/test/recovery/t/006_truncate_opt.pl b/src/test/recovery/t/006_truncate_opt.pl
new file mode 100644
index 0000000..baf5604
--- /dev/null
+++ b/src/test/recovery/t/006_truncate_opt.pl
@@ -0,0 +1,94 @@
+# Set of tests to check TRUNCATE optimizations with CREATE TABLE
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+my $node = get_new_node('master');
+$node->init;
+
+my $copy_file = $node->backup_dir . "copy_data.txt";
+
+$node->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = minimal
+});
+
+$node->start;
+
+# Create file containing data to COPY
+TestLib::append_to_file($copy_file, qq{copied row 1
+copied row 2
+copied row 3
+});
+
+# CREATE, INSERT, COPY, crash.
+#
+# If COPY inserts to the existing block, and is not WAL-logged, replaying
+# the implicit FPW of the INSERT record will destroy the COPY data.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+INSERT INTO test1 VALUES ('inserted row');
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of the table. There should be 4 rows.
+$node->stop('immediate');
+$node->start;
+my $ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '4', 'SELECT reports 4 rows');
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE test1;');
+
+# CREATE, COPY, crash. Trigger in COPY that inserts more to same table.
+#
+# If the INSERTS from the trigger go to the same block we're copying to,
+# and the INSERTs are WAL-logged, WAL replay will fail when it tries to
+# replay the WAL record but the "before" image doesn't match, because not
+# all changes were WAL-logged.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+CREATE FUNCTION test1_beforetrig() RETURNS trigger LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.t NOT LIKE 'triggered%' THEN
+ INSERT INTO test1 VALUES ('triggered ' || NEW.t);
+ END IF;
+ RETURN NEW;
+END;
+\$\$;
+CREATE TRIGGER test1_beforeinsert BEFORE INSERT ON test1
+FOR EACH ROW EXECUTE PROCEDURE test1_beforetrig();
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of the table. There should be 6
+# rows here.
+$node->stop('immediate');
+$node->start;
+$ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '6', 'SELECT returns 6 rows');
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE test1;');
+$node->safe_psql('postgres', 'DROP FUNCTION test1_beforetrig();');
+
+# CREATE, TRUNCATE, COPY, crash.
+#
+# If we skip WAL-logging of the COPY, replaying the TRUNCATE record destroys
+# the newly inserted data.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+TRUNCATE test1;
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of the table. There should be 3
+# rows here.
+$node->stop('immediate');
+$node->start;
+$ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '3', 'SELECT returns 3 rows');
Hello, I'm getting back to this before my other tasks :)
Though I haven't played with the patch yet...
At Fri, 29 Jul 2016 16:54:42 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com>
Well, not that soon in the end, but I am back on this... I have not
completely reviewed all the code yet, and the case of an index relation
referring to a relation optimized with truncate is still broken, but
for now here is a rebased patch if people are interested. I am going
to get a TAP test out of my pocket as well to ease testing.
The patch I sent yesterday was based on an incorrect version. Attached
is a slightly-modified version of the last one I found here
(/messages/by-id/56B342F5.1050502@iki.fi), which
is rebased on HEAD at ed0b228. I have also converted the test case
script from upthread into a TAP test in src/test/recovery that covers 3
cases, and I included that in the patch:
1) CREATE + INSERT + COPY => crash
2) CREATE + trigger + COPY => crash
3) CREATE + TRUNCATE + COPY => incorrect number of rows.
The first two tests make the system crash; the third one reports an
incorrect number of rows.
At first glance, managing sync_above and truncated_to is
workable for these cases, but it seems too complicated for the
problem being solved.
This provides smgr with a capability to manage pending page
syncs. But the decision whether to postpone page syncs seems to
be a matter for the users of smgr, who are responsible for
issuing WAL. Anyway, heap_register_sync doesn't use any internals
of smgr, so I think this approach binds smgr to Relation too
tightly.
With this patch, many calls to RelationNeedsWAL, which just reads a
local struct, are replaced with HeapNeedsWAL, which eventually
probes a hash table added by this patch. In log_heap_update in
particular, it is called for every update of a single tuple (on a
relation that needs WAL).
Though I don't know how it actually impacts performance, it seems
to me that we can live with truncated_to and sync_above in
RelationData, and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf). In any case, at most one entry per relation
seems to exist in the hash at a time.
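To make that concrete, here is a minimal sketch of the idea (it mirrors
BufferNeedsWAL() in the attached PoC patch, minus the debug elog calls;
the exact field names and their placement in RelationData are only a
suggestion):

/* Proposed extra fields in RelationData (utils/rel.h) */
BlockNumber	sync_above;		/* skip WAL for blocks >= this; fsync at commit */
BlockNumber	truncated_to;	/* truncation WAL record written; log blocks >= this */

/*
 * Does a change to the given buffer need to be WAL-logged?
 * Uses only per-relation state; no hash lookup.
 */
bool
BufferNeedsWAL(Relation rel, Buffer buf)
{
	BlockNumber	blkno;

	if (!RelationNeedsWAL(rel))
		return false;

	blkno = BufferGetBlockNumber(buf);

	/* No pending sync, or the block is below the registered size: log it. */
	if (rel->sync_above == InvalidBlockNumber || rel->sync_above > blkno)
		return true;

	/* A truncation record already covers this block: log it. */
	if (rel->truncated_to != InvalidBlockNumber && rel->truncated_to <= blkno)
		return true;

	/* Otherwise the at-commit sync covers this block: skip WAL. */
	return false;
}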
What do you think?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I'm getting back to this before my other tasks :)
Though I haven't played with the patch yet...
Be sure to run the test cases in the patch or base your tests on them then!
Though I don't know how it actually impacts performance, it seems
to me that we can live with truncated_to and sync_above in
RelationData, and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf). In any case, at most one entry per relation
seems to exist in the hash at a time.
TBH, I still think that the design of this patch as proposed is pretty
cool and easy to follow.
--
Michael
Hello,
At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com>
On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I'm getting back to this before my other tasks :)
Though I haven't played with the patch yet...
Be sure to run the test cases in the patch or base your tests on them then!
All items of 006_truncate_opt fail on ed0b228 and they are fixed
with the patch.
Though I don't know how it actually impacts performance, it seems
to me that we can live with truncated_to and sync_above in
RelationData, and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf). In any case, at most one entry per relation
seems to exist in the hash at a time.
TBH, I still think that the design of this patch as proposed is pretty
cool and easy to follow.
It is clean from a certain viewpoint, but the additional hash,
especially the hash search on every HeapNeedsWAL call, seems
unacceptable to me. Do you find it acceptable?
The attached patch is a quick-and-dirty hack of Michael's patch,
just as a PoC of my proposal quoted above. It also passes the
006 test. The major changes are the following.
- Moved sync_above and truncated_to into RelationData.
- Cleaning up is done in AtEOXact_cleanup instead of explicitly
calling smgrDoPendingSyncs().
* BufferNeedsWAL (the replacement for HeapNeedsWAL) no longer requires
hash_search. It just refers to the additional members in the
given Relation.
X I feel that I dropped one of the features of the original
patch during this hack, but I don't recall it clearly now :(
X I haven't considered relfilenode replacement, which didn't matter
for the original patch (but there are few places to consider).
What do you think about this?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix-wal-level-minimal-michael-horiguchi-1.patch (text/x-patch)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..02e33cc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than to fsync() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to the heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -55,6 +77,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2331,12 +2354,6 @@ FreeBulkInsertState(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2440,7 +2457,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2639,12 +2656,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2659,7 +2674,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2694,6 +2709,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2705,6 +2721,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3261,7 +3278,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4194,7 +4211,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5148,7 +5166,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5825,7 +5843,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5980,7 +5998,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6112,7 +6130,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6218,7 +6236,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7331,7 +7349,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7379,7 +7397,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7464,7 +7482,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -7567,76 +7585,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
if (need_tuple_data)
bufflags |= REGBUF_KEEP_DATA;
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
- XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
/*
* Prepare WAL data for the new tuple.
*/
- if (prefixlen > 0 || suffixlen > 0)
+ if (BufferNeedsWAL(reln, newbuf))
{
- if (prefixlen > 0 && suffixlen > 0)
- {
- prefix_suffix[0] = prefixlen;
- prefix_suffix[1] = suffixlen;
- XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
- }
- else if (prefixlen > 0)
- {
- XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
- }
- else
+ XLogRegisterBuffer(0, newbuf, bufflags);
+
+ if ((prefixlen > 0 || suffixlen > 0))
{
- XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ if (prefixlen > 0 && suffixlen > 0)
+ {
+ prefix_suffix[0] = prefixlen;
+ prefix_suffix[1] = suffixlen;
+ XLogRegisterBufData(0, (char *) &prefix_suffix,
+ sizeof(uint16) * 2);
+ }
+ else if (prefixlen > 0)
+ {
+ XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+ }
+ else
+ {
+ XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ }
}
- }
- xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
- xlhdr.t_infomask = newtup->t_data->t_infomask;
- xlhdr.t_hoff = newtup->t_data->t_hoff;
- Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+ xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+ xlhdr.t_infomask = newtup->t_data->t_infomask;
+ xlhdr.t_hoff = newtup->t_data->t_hoff;
+ Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
- /*
- * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
- *
- * The 'data' doesn't include the common prefix or suffix.
- */
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- if (prefixlen == 0)
- {
- XLogRegisterBufData(0,
- ((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_len - SizeofHeapTupleHeader - suffixlen);
- }
- else
- {
/*
- * Have to write the null bitmap and data after the common prefix as
- * two separate rdata entries.
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+ *
+ * The 'data' doesn't include the common prefix or suffix.
*/
- /* bitmap [+ padding] [+ oid] */
- if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ if (prefixlen == 0)
{
XLogRegisterBufData(0,
((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ newtup->t_len - SizeofHeapTupleHeader - suffixlen);
}
+ else
+ {
+ /*
+ * Have to write the null bitmap and data after the common prefix
+ * as two separate rdata entries.
+ */
+ /* bitmap [+ padding] [+ oid] */
+ if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ {
+ XLogRegisterBufData(0,
+ ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+ newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ }
- /* data after common prefix */
- XLogRegisterBufData(0,
+ /* data after common prefix */
+ XLogRegisterBufData(0,
((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+ }
}
+ /*
+ * If the old and new tuple are on different pages, also register the old
+ * page, so that a full-page image is created for it if necessary. We
+ * don't need any extra information to replay changes to it.
+ */
+ if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+ XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
/* We need to log a tuple identity */
if (need_tuple_data && old_key_tuple)
{
@@ -8555,8 +8583,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8610,6 +8643,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9046,9 +9081,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we made any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9081,3 +9123,33 @@ heap_sync(Relation rel)
heap_close(toastrel, AccessShareLock);
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..27a2447 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f9ce986..36ba62a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..e08623c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..a2f03a7 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -260,31 +260,41 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
-
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (rel->sync_above == InvalidBlockNumber ||
+ rel->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->truncated_to = nblocks;
+ }
}
/* Do the real work */
@@ -419,6 +429,59 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+void
+RecordPendingSync(Relation rel)
+{
+ Assert(RelationNeedsWAL(rel));
+
+ if (rel->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ RelationGetNumberOfBlocks(rel));
+ rel->sync_above = RelationGetNumberOfBlocks(rel);
+ }
+ else
+ elog(DEBUG2, "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->sync_above, RelationGetNumberOfBlocks(rel));
+}
+
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->sync_above == InvalidBlockNumber ||
+ rel->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->truncated_to != InvalidBlockNumber &&
+ rel->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+
+ return false;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..a0fe63f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2269,8 +2269,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2302,7 +2301,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2551,11 +2550,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that, to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 5b4f6af..b64d52a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6cddcbd..dbef95b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -456,7 +456,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -499,9 +499,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 86e9814..ca892ea 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3984,8 +3984,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4236,8 +4237,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..3662f7b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -879,7 +879,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1106,7 +1106,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1462,7 +1462,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..d128e63 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3130,20 +3131,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3160,7 +3182,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3190,18 +3212,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8d2ad01..31ae0f1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -66,6 +66,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -407,6 +408,9 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ relation->sync_above = InvalidBlockNumber;
+ relation->truncated_to = InvalidBlockNumber;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1731,6 +1735,9 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ relation->sync_above = InvalidBlockNumber;
+ relation->truncated_to = InvalidBlockNumber;
+
/*
* add new reldesc to relcache
*/
@@ -2055,6 +2062,22 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
pfree(relation);
}
+static void
+RelationDoPendingFlush(Relation relation)
+{
+ if (relation->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(relation->rd_node, false);
+ smgrimmedsync(smgropen(relation->rd_node, InvalidBackendId),
+ MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u",
+ relation->rd_node.spcNode,
+ relation->rd_node.dbNode, relation->rd_node.relNode);
+
+ }
+}
+
/*
* RelationClearRelation
*
@@ -2686,7 +2709,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
if (relation->rd_createSubid != InvalidSubTransactionId)
{
if (isCommit)
+ {
+ RelationDoPendingFlush(relation);
relation->rd_createSubid = InvalidSubTransactionId;
+ }
else if (RelationHasReferenceCountZero(relation))
{
RelationClearRelation(relation, false);
@@ -3019,6 +3045,9 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ rel->sync_above = InvalidBlockNumber;
+ rel->truncated_to = InvalidBlockNumber;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..1c169ef 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -177,6 +176,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef960da..235c2b4 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..f02ea93 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -202,6 +202,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ed14442..a8a2b23 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -172,6 +172,9 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ BlockNumber sync_above;
+ BlockNumber truncated_to;
} RelationData;
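To put the intended use in concrete terms, here is a minimal scenario the patch
is aimed at. This is only a sketch, assuming wal_level = minimal; the table
name and file path are invented for illustration:

BEGIN;
CREATE TABLE bulk_load (id int, payload text);
-- The relfilenode was created in this transaction, so CopyFrom() calls
-- heap_register_sync() and the pages filled by COPY are not WAL-logged.
COPY bulk_load FROM '/tmp/bulk_load.csv' (FORMAT csv);
-- At COMMIT the relation is flushed and fsync'd instead of relying on WAL.
COMMIT;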
On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com>
On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I return to this before my things :)
Though I haven't played with the patch yet..
Be sure to run the test cases in the patch or base your tests on them then!
All items of 006_truncate_opt fail on ed0b228 and they are fixed
with the patch.
Though I don't know how it actually impacts performance, it
seems to me that we can live with truncated_to and sync_above in
RelationData and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf). Anyway, at most one entry per relation
seems to exist in the hash at any time.
TBH, I still think that the design of this patch as proposed is pretty
cool and easy to follow.
It is clean from a certain viewpoint, but the additional hash,
especially the hash search on every HeapNeedsWAL call, seems
unacceptable to me. Do you find it acceptable?
The attached patch is a quiiiccck-and-dirty-hack of Michael's patch,
just as a PoC of my proposal quoted above. This also passes the
006 test. The major changes are the following.
- Moved sync_above and truncated_to into RelationData.
- Cleaning up is done in AtEOXact_cleanup instead of explicitly
  calling smgrDoPendingSyncs().
* BufferNeedsWAL (a replacement for HeapNeedsWAL) no longer requires
  hash_search. It just refers to the additional members in the
  given Relation.
X I feel that I have dropped one of the features of the original
  patch during the hack, but I don't recall it clearly now :(
X I haven't considered relfilenode replacement, which didn't matter
  for the original patch. (But there are a few places to consider.)
What do you think about this?
I have moved this patch to next CF. (I still need to look at your patch.)
--
Michael
Hi,
At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTKOyHkrBSxvvSBZCXvU9F8OT_uumXmST_awKsswQA5Sg@mail.gmail.com>
I have moved this patch to next CF. (I still need to look at your patch.)
Thanks for considering that.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I dropped the ball on this one back in July, so here's an attempt to revive
this thread.
I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.
Some review of that would be nice. If there are no major issues with it, I'm
going to create backpatchable versions of this for 9.4 and below.
Heikki:
Are you going to do commit something here? This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix. The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I dropped the ball on this one back in July, so here's an attempt to revive
this thread.
I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.
Some review of that would be nice. If there are no major issues with it, I'm
going to create backpatchable versions of this for 9.4 and below.
Are you going to do commit something here? This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix. The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.
"Needs Review" is definitely a better definition of its current state.
The last time I had a look at this patch I thought that it was in
pretty good shape (not Horiguchi-san's version, but the one in
/messages/by-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
With some of the recent changes, surely it needs a second look, things
related to heap handling tend to rot quickly.
I'll look into it once again by the end of this week if Heikki does
not show up, the rest will be on him I am afraid...
--
Michael
On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
I dropped the ball on this one back in July, so here's an attempt to
revive
this thread.
I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See
attached.
Some review of that would be nice. If there are no major issues with
it, I'm
going to create backpatchable versions of this for 9.4 and below.
Are you going to do commit something here? This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix. The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.
"Needs Review" is definitely a better definition of its current state.
The last time I had a look at this patch I thought that it was in
pretty good shape (not Horiguchi-san's version, but the one in
/messages/by-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com
).
With some of the recent changes, surely it needs a second look, things
related to heap handling tend to rot quickly.
I'll look into it once again by the end of this week if Heikki does
not show up, the rest will be on him I am afraid...
I have been able to hit a crash with recovery test 008:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff96d48f06
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x00007fff8e5cc6df libsystem_c.dylib`abort + 129
frame #3: 0x0000000106ef10f0
postgres`ExceptionalCondition(conditionName="!(( !( ((void) ((bool) (!
(!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) ||
(ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >=
-NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0)))), (buffer)
!= 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) :
(GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion",
fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) +
204 at bufmgr.c:2593
frame #5: 0x000000010694e6ad
postgres`HeapNeedsWAL(rel=0x00007f9454804118, buf=0) + 61 at heapam.c:9234
frame #6: 0x000000010696d8bd
postgres`visibilitymap_set(rel=0x00007f9454804118, heapBlk=1, heapBuf=0,
recptr=50841176, vmBuf=118, cutoff_xid=866, flags='\x01') + 989 at
visibilitymap.c:310
frame #7: 0x000000010695d020
postgres`heap_xlog_visible(record=0x00007f94520035d0) + 896 at heapam.c:8148
frame #8: 0x000000010695c582
postgres`heap2_redo(record=0x00007f94520035d0) + 242 at heapam.c:9107
frame #9: 0x00000001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
frame #10: 0x0000000106c9d783 postgres`StartupProcessMain + 339 at
startup.c:216
frame #11: 0x00000001069ee6ec postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff59316d80) + 1676 at bootstrap.c:420
frame #12: 0x0000000106c98002
postgres`StartChildProcess(type=StartupProcess) + 322 at postmaster.c:5221
frame #13: 0x0000000106c96031 postgres`PostmasterMain(argc=3,
argv=0x00007f9451c04210) + 6033 at postmaster.c:1301
frame #14: 0x0000000106bc30cf postgres`main(argc=3,
argv=0x00007f9451c04210) + 751 at main.c:228
(lldb) up 1
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204
at bufmgr.c:2593
2590 {
2591 BufferDesc *bufHdr;
2592
-> 2593 Assert(BufferIsPinned(buffer));
2594
2595 if (BufferIsLocal(buffer))
2596 bufHdr = GetLocalBufferDescriptor(-buffer - 1);
--
Michael
On Wed, Nov 9, 2016 at 5:55 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
I dropped the ball on this one back in July, so here's an attempt to
revive
this thread.
I spent some time fixing the remaining issues with the prototype patch
I
posted earlier, and rebased that on top of current git master. See
attached.
Some review of that would be nice. If there are no major issues with
it, I'm
going to create backpatchable versions of this for 9.4 and below.
Are you going to do commit something here? This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix. The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.
"Needs Review" is definitely a better definition of its current state.
The last time I had a look at this patch I thought that it was in
pretty good shape (not Horiguchi-san's version, but the one in
/messages/by-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
With some of the recent changes, surely it needs a second look, things
related to heap handling tend to rot quickly.
I'll look into it once again by the end of this week if Heikki does
not show up, the rest will be on him I am afraid...
I have been able to hit a crash with recovery test 008.
The latest proposed patch still having problems.
Closed in 2016-11 commitfest with "moved to next CF" status because of a
bug fix patch.
Please feel free to update the status once you submit the updated patch.
Regards,
Hari Babu
Fujitsu Australia
On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
The latest proposed patch still having problems.
Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
fix patch.
Please feel free to update the status once you submit the updated patch.
And moved to CF 2017-03...
--
Michael
On 1/30/17 11:33 PM, Michael Paquier wrote:
On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
The latest proposed patch still having problems.
Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
fix patch.
Please feel free to update the status once you submit the updated patch.
And moved to CF 2017-03...
Are there any plans to post a new patch? This thread is now 18 months
old and it would be good to get a resolution in this CF.
Thanks,
--
-David
david@pgmasters.net
Kyotaro HORIGUCHI wrote:
The attached patch is quiiiccck-and-dirty-hack of Michael's patch
just as a PoC of my proposal quoted above. This also passes the
006 test. The major changes are the following.
- Moved sync_above and truncated_to into RelationData.
Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.
I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable, does it? If we can get proof of that, then this
technique should be safe, I think.
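To make the worry concrete, here is a purely hypothetical sequence (table name
and file path invented). Whether a mid-transaction invalidation really rebuilds
the entry and resets these fields is exactly the question above, so this is a
sketch of the concern, not a demonstrated failure:

BEGIN;
CREATE TABLE t (id int);
COPY t FROM '/tmp/t.csv';            -- registers a pending sync; loaded pages skip WAL
ALTER TABLE t ADD COLUMN note text;  -- DDL invalidates the relcache entry for t
-- If rebuilding the entry reset sync_above and truncated_to, the pages written
-- by COPY would no longer be scheduled for the commit-time sync, even though
-- they were never WAL-logged.
COMMIT;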
In your version of the patch, which I spent some time skimming, I am
missing comments on various functions. I added some as I went along,
including one XXX indicating it must be filled.
RecordPendingSync() should really live in relcache.c (and probably get a
different name).
X I feel that I have dropped one of the features of the original
patch during the hack, but I don't recall it clearly now :(
Hah :-)
X I haven't considered relfilenode replacement, which didn't matter
for the original patch. (But there are a few places to consider.)
Hmm ... Please provide.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
fix-wal-level-minimal-michael-horiguchi-2.patch (text/plain)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..aa1b97d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transacton, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -56,6 +78,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2684,7 +2699,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2719,6 +2734,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2730,6 +2746,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3286,7 +3303,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4250,7 +4267,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5141,7 +5159,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5843,7 +5861,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5998,7 +6016,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6131,7 +6149,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6240,7 +6258,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7354,7 +7372,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7402,7 +7420,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7487,7 +7505,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -7590,76 +7608,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
if (need_tuple_data)
bufflags |= REGBUF_KEEP_DATA;
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
- XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
/*
* Prepare WAL data for the new tuple.
*/
- if (prefixlen > 0 || suffixlen > 0)
+ if (BufferNeedsWAL(reln, newbuf))
{
- if (prefixlen > 0 && suffixlen > 0)
- {
- prefix_suffix[0] = prefixlen;
- prefix_suffix[1] = suffixlen;
- XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
- }
- else if (prefixlen > 0)
- {
- XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
- }
- else
- {
- XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
- }
- }
+ XLogRegisterBuffer(0, newbuf, bufflags);
- xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
- xlhdr.t_infomask = newtup->t_data->t_infomask;
- xlhdr.t_hoff = newtup->t_data->t_hoff;
- Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+ if ((prefixlen > 0 || suffixlen > 0))
+ {
+ if (prefixlen > 0 && suffixlen > 0)
+ {
+ prefix_suffix[0] = prefixlen;
+ prefix_suffix[1] = suffixlen;
+ XLogRegisterBufData(0, (char *) &prefix_suffix,
+ sizeof(uint16) * 2);
+ }
+ else if (prefixlen > 0)
+ {
+ XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+ }
+ else
+ {
+ XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ }
+ }
+
+ xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+ xlhdr.t_infomask = newtup->t_data->t_infomask;
+ xlhdr.t_hoff = newtup->t_data->t_hoff;
+ Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
- /*
- * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
- *
- * The 'data' doesn't include the common prefix or suffix.
- */
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- if (prefixlen == 0)
- {
- XLogRegisterBufData(0,
- ((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_len - SizeofHeapTupleHeader - suffixlen);
- }
- else
- {
/*
- * Have to write the null bitmap and data after the common prefix as
- * two separate rdata entries.
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+ *
+ * The 'data' doesn't include the common prefix or suffix.
*/
- /* bitmap [+ padding] [+ oid] */
- if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ if (prefixlen == 0)
{
XLogRegisterBufData(0,
((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ newtup->t_len - SizeofHeapTupleHeader - suffixlen);
}
+ else
+ {
+ /*
+ * Have to write the null bitmap and data after the common prefix
+ * as two separate rdata entries.
+ */
+ /* bitmap [+ padding] [+ oid] */
+ if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ {
+ XLogRegisterBufData(0,
+ ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+ newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ }
- /* data after common prefix */
- XLogRegisterBufData(0,
+ /* data after common prefix */
+ XLogRegisterBufData(0,
((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+ }
}
+ /*
+ * If the old and new tuple are on different pages, also register the old
+ * page, so that a full-page image is created for it if necessary. We
+ * don't need any extra information to replay changes to it.
+ */
+ if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+ XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
/* We need to log a tuple identity */
if (need_tuple_data && old_key_tuple)
{
@@ -8578,8 +8606,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8633,6 +8666,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9069,9 +9104,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9181,3 +9223,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..4754278 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..6462f44 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..933fa9c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f677916..929b5a0 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -254,11 +254,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
- * would be worse than the disease. It would turn a usually harmless
+ * would be worse than the disease: it would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
+ *
+ * XXX Explain why we skip this sometimes.
*/
- if (RelationNeedsWAL(rel))
+ if (RelationNeedsWAL(rel) &&
+ (rel->sync_above == InvalidBlockNumber ||
+ rel->sync_above < nblocks))
{
/*
* Make an XLOG entry reporting the file truncation.
@@ -268,7 +272,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
xlrec.blkno = nblocks;
xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -276,6 +279,10 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
lsn = XLogInsert(RM_SMGR_ID,
XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
/*
* Flush, because otherwise the truncation of the main relation might
* hit the disk before the WAL record, and the truncation of the FSM
@@ -285,6 +292,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ rel->truncated_to = nblocks;
}
/* Do the real work */
@@ -420,6 +429,72 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
}
/*
+ * RecordPendingSync
+ * Make note that we need to sync buffers above the current relation size.
+ *
+ * (Thus, any operation that writes buffers above the current size can be
+ * optimized as not needing WAL; a relation sync will automatically be executed
+ * at transaction commit.)
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ Assert(RelationNeedsWAL(rel));
+
+ if (rel->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ RelationGetNumberOfBlocks(rel));
+ rel->sync_above = RelationGetNumberOfBlocks(rel);
+ }
+ else
+ elog(DEBUG2, "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->sync_above, RelationGetNumberOfBlocks(rel));
+}
+
+/*
+ * BufferNeedsWAL
+ * Return whether or not changes to the given buffer require to be
+ * WAL-logged.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->sync_above == InvalidBlockNumber ||
+ rel->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->truncated_to != InvalidBlockNumber &&
+ rel->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+
+ return false;
+}
+
+/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
* What we have to do here is throw away the in-memory state about pending
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8c58808..cb9df1b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2372,8 +2372,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2405,7 +2404,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2784,11 +2783,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index f49b391..7710f82 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 2f93328..514012b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 60f8b7f..9b14053 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4327,8 +4327,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4589,8 +4590,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..f3dcf6e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -893,7 +893,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1120,7 +1120,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1480,7 +1480,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2109cbf..f7c2b16 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelcache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ddb9485..11913f9 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -418,6 +419,9 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ relation->sync_above = InvalidBlockNumber;
+ relation->truncated_to = InvalidBlockNumber;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -2032,6 +2036,9 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ relation->sync_above = InvalidBlockNumber;
+ relation->truncated_to = InvalidBlockNumber;
+
/*
* add new reldesc to relcache
*/
@@ -2366,6 +2373,24 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
}
/*
+ * If this relation has a pending flush request, execute it.
+ */
+static void
+RelationDoPendingFlush(Relation relation)
+{
+ if (relation->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelcache(relation->rd_node, false);
+ smgrimmedsync(smgropen(relation->rd_node, InvalidBackendId),
+ MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u",
+ relation->rd_node.spcNode,
+ relation->rd_node.dbNode, relation->rd_node.relNode);
+ }
+}
+
+/*
* RelationClearRelation
*
* Physically blow away a relation cache entry, or reset it and rebuild
@@ -3015,7 +3040,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
if (relation->rd_createSubid != InvalidSubTransactionId)
{
if (isCommit)
+ {
+ RelationDoPendingFlush(relation);
relation->rd_createSubid = InvalidSubTransactionId;
+ }
else if (RelationHasReferenceCountZero(relation))
{
RelationClearRelation(relation, false);
@@ -3353,6 +3381,9 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ rel->sync_above = InvalidBlockNumber;
+ rel->truncated_to = InvalidBlockNumber;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510..aa069a5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,7 +25,7 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
+/* 0x0001 is free */
#define HEAP_INSERT_SKIP_FSM 0x0002
#define HEAP_INSERT_FROZEN 0x0004
#define HEAP_INSERT_SPECULATIVE 0x0008
@@ -178,6 +178,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index fea96de..415b98a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07a32d6..ac6f866 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ab875bb..03244be 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,10 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* support for WAL-flush-skipping */
+ BlockNumber sync_above;
+ BlockNumber truncated_to;
} RelationData;
I have claimed this patch as committer FWIW.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable, does it? If we can get proof of that, then this
technique should be safe, I think.
It occurs to me that in order to test this we could run the recovery
tests (including Michael's new 006 file, which you didn't include in
your patch) under -D CLOBBER_CACHE_ALWAYS. I think that'd be sufficient
proof that it is solid.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.
I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable,
It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
how strong a lock you hold.
regards, tom lane
Hello, thank you for looking at this.
At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.
I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable,
It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
how strong a lock you hold.
regards, tom lane
Ugh. Yes, relcache invalidation can happen at any time and it resets the
added values. pg_stat_info deceived me into thinking it could store
transient values. But I came up with another thought.
The reason I proposed it was that I thought a hash_search for every
buffer modification is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to Relation. This greatly reduces the
frequency of hash-searching.
I'll post new patch in this way soon.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
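To make the idea above concrete, here is a minimal sketch (not part of the
posted patch; PendingRelSync, the pendingSyncs hash and the new pending_sync
relcache field come from the patch attached further below, while the helper
name lookup_pending_sync is only illustrative): the relcache entry caches the
hash entry once it has been looked up, so subsequent checks on the same
relation cost a pointer test rather than a hash_search().

/*
 * Sketch only -- the attached patch is authoritative.  This would sit in
 * src/backend/catalog/storage.c next to the PendingRelSync struct and the
 * static pendingSyncs hash it relies on.
 */
static PendingRelSync *
lookup_pending_sync(Relation rel)
{
	bool		found;

	/* Fast path: entry already cached in the relcache entry. */
	if (rel->pending_sync != NULL)
		return rel->pending_sync;

	/* No relation in this transaction has registered a pending sync. */
	if (pendingSyncs == NULL)
		return NULL;

	/* First check for this relation: do the hash lookup and cache it. */
	rel->pending_sync = (PendingRelSync *)
		hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_FIND, &found);

	return found ? rel->pending_sync : NULL;
}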
At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp>
Hello, thank you for looking at this.
At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.
I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable,
It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
how strong a lock you hold.
regards, tom lane
Ugh. Yes, relcache invalidation can happen at any time and it resets the
added values. pg_stat_info deceived me into thinking it could store
transient values. But I came up with another thought.
The reason I proposed it was that I thought a hash_search for every
buffer modification is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to Relation. This greatly reduces the
frequency of hash-searching.
I'll post new patch in this way soon.
Here it is.
- Relation has new members no_pending_sync and pending_sync that
work as an instant cache of an entry in the pendingSyncs hash.
- Commit-time synchronizing is restored, as in Michael's patch
(sketched briefly just before the attached diff below).
- If the relfilenode is replaced, the pending_sync for the old node is
removed. Anyway, this is ignored on abort and is meaningless on
commit.
- The TAP test is renamed to 012 since some new files have been added.
Previously, the pending-sync hash was searched on every call of
HeapNeedsWAL() (once per insertion/update/freeze of a tuple) whenever any
of the accessed relations had a pending sync. Almost all of those
lookups are now eliminated as a result.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix-wal-level-minimal-michael-horiguchi-2.patch (text/x-patch; charset=us-ascii)
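Before the patch itself, a condensed sketch of the commit-time side mentioned
in the list above (the attached patch's smgrDoPendingSyncs() is the
authoritative version): every relation that registered a pending sync is
flushed out of shared buffers and fsync'ed at COMMIT, and the bookkeeping is
simply discarded on ABORT.

void
smgrDoPendingSyncs(bool isCommit)
{
	HASH_SEQ_STATUS status;
	PendingRelSync *pending;

	if (pendingSyncs == NULL)
		return;					/* nothing was registered */

	if (isCommit)
	{
		hash_seq_init(&status, pendingSyncs);
		while ((pending = hash_seq_search(&status)) != NULL)
		{
			/* Entries that never actually skipped any WAL need no sync. */
			if (pending->sync_above == InvalidBlockNumber)
				continue;

			/* Write back dirty buffers, then force the file to disk. */
			FlushRelationBuffersWithoutRelCache(pending->relnode, false);
			smgrimmedsync(smgropen(pending->relnode, InvalidBackendId),
						  MAIN_FORKNUM);
		}
	}

	/* On abort the skipped relations are being dropped anyway. */
	hash_destroy(pendingSyncs);
	pendingSyncs = NULL;
}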
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..aa1b97d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transacton, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -56,6 +78,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2684,7 +2699,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2719,6 +2734,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2730,6 +2746,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3286,7 +3303,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4250,7 +4267,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5141,7 +5159,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5843,7 +5861,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5998,7 +6016,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6131,7 +6149,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6240,7 +6258,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7354,7 +7372,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7402,7 +7420,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7487,7 +7505,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -7590,76 +7608,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
if (need_tuple_data)
bufflags |= REGBUF_KEEP_DATA;
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
- XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
/*
* Prepare WAL data for the new tuple.
*/
- if (prefixlen > 0 || suffixlen > 0)
+ if (BufferNeedsWAL(reln, newbuf))
{
- if (prefixlen > 0 && suffixlen > 0)
- {
- prefix_suffix[0] = prefixlen;
- prefix_suffix[1] = suffixlen;
- XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
- }
- else if (prefixlen > 0)
- {
- XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
- }
- else
+ XLogRegisterBuffer(0, newbuf, bufflags);
+
+ if ((prefixlen > 0 || suffixlen > 0))
{
- XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ if (prefixlen > 0 && suffixlen > 0)
+ {
+ prefix_suffix[0] = prefixlen;
+ prefix_suffix[1] = suffixlen;
+ XLogRegisterBufData(0, (char *) &prefix_suffix,
+ sizeof(uint16) * 2);
+ }
+ else if (prefixlen > 0)
+ {
+ XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+ }
+ else
+ {
+ XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ }
}
- }
- xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
- xlhdr.t_infomask = newtup->t_data->t_infomask;
- xlhdr.t_hoff = newtup->t_data->t_hoff;
- Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+ xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+ xlhdr.t_infomask = newtup->t_data->t_infomask;
+ xlhdr.t_hoff = newtup->t_data->t_hoff;
+ Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
- /*
- * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
- *
- * The 'data' doesn't include the common prefix or suffix.
- */
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- if (prefixlen == 0)
- {
- XLogRegisterBufData(0,
- ((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_len - SizeofHeapTupleHeader - suffixlen);
- }
- else
- {
/*
- * Have to write the null bitmap and data after the common prefix as
- * two separate rdata entries.
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+ *
+ * The 'data' doesn't include the common prefix or suffix.
*/
- /* bitmap [+ padding] [+ oid] */
- if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ if (prefixlen == 0)
{
XLogRegisterBufData(0,
((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ newtup->t_len - SizeofHeapTupleHeader - suffixlen);
}
+ else
+ {
+ /*
+ * Have to write the null bitmap and data after the common prefix
+ * as two separate rdata entries.
+ */
+ /* bitmap [+ padding] [+ oid] */
+ if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ {
+ XLogRegisterBufData(0,
+ ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+ newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ }
- /* data after common prefix */
- XLogRegisterBufData(0,
+ /* data after common prefix */
+ XLogRegisterBufData(0,
((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+ }
}
+ /*
+ * If the old and new tuple are on different pages, also register the old
+ * page, so that a full-page image is created for it if necessary. We
+ * don't need any extra information to replay changes to it.
+ */
+ if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+ XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
/* We need to log a tuple identity */
if (need_tuple_data && old_key_tuple)
{
@@ -8578,8 +8606,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8633,6 +8666,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9069,9 +9104,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9181,3 +9223,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..4754278 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..6462f44 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..933fa9c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92b263a..361b50d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2007,6 +2007,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2238,6 +2241,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2545,6 +2551,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f677916..1234325 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -116,6 +160,14 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+
+ /* pending sync on this file is no longer needed */
+ if (pendingSyncs)
+ {
+ bool found;
+
+ hash_search(pendingSyncs, (void *) &rnode, HASH_REMOVE, &found);
+ }
}
/*
@@ -226,6 +278,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,37 +314,78 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->pending_sync = pending;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -419,6 +514,156 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* WAL is needed if no pending syncs */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ if (!pendingSyncs)
+ return true;
+
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ if (!found)
+ return true;
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b5af2be..8aa7e7b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2372,8 +2372,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2405,7 +2404,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2782,11 +2781,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 06425cc..408495e 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 9ffd91e..8b127e3 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index abb262b..ae69954 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4327,8 +4327,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4589,8 +4590,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..f3dcf6e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -893,7 +893,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1120,7 +1120,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1480,7 +1480,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2109cbf..e991e9f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ddb9485..61ff7eb 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -418,6 +419,9 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* pending_sync is set as required later */
+ relation->pending_sync = NULL;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -3353,6 +3357,8 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510..3967641 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -178,6 +177,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index fea96de..e8e49f1 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07a32d6..6ec2d26 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ab875bb..f802cc1 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,8 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ struct PendingRelSync *pending_sync;
} RelationData;
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
deleted file mode 100644
index ccd5943..0000000
--- a/src/test/recovery/t/001_stream_rep.pl
+++ /dev/null
@@ -1,230 +0,0 @@
-# Minimal test testing streaming replication
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 28;
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-my $backup_name = 'my_backup';
-
-# Take backup
-$node_master->backup($backup_name);
-
-# Create streaming standby linking to master
-my $node_standby_1 = get_new_node('standby_1');
-$node_standby_1->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_1->start;
-
-# Take backup of standby 1 (not mandatory, but useful to check if
-# pg_basebackup works on a standby).
-$node_standby_1->backup($backup_name);
-
-# Take a second backup of the standby while the master is offline.
-$node_master->stop;
-$node_standby_1->backup('my_backup_2');
-$node_master->start;
-
-# Create second standby node linking to standby 1
-my $node_standby_2 = get_new_node('standby_2');
-$node_standby_2->init_from_backup($node_standby_1, $backup_name,
- has_streaming => 1);
-$node_standby_2->start;
-
-# Create some content on master and check its presence in standby 1
-$node_master->safe_psql('postgres',
- "CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
-
-# Wait for standbys to catch up
-$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
-$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
-
-my $result =
- $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-print "standby 1: $result\n";
-is($result, qq(1002), 'check streamed content on standby 1');
-
-$result =
- $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-print "standby 2: $result\n";
-is($result, qq(1002), 'check streamed content on standby 2');
-
-# Check that only READ-only queries can run on standbys
-is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
- 3, 'read-only queries on standby 1');
-is($node_standby_2->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
- 3, 'read-only queries on standby 2');
-
-# Tests for connection parameter target_session_attrs
-note "testing connection parameter \"target_session_attrs\"";
-
-# Routine designed to run tests on the connection parameter
-# target_session_attrs with multiple nodes.
-sub test_target_session_attrs
-{
- my $node1 = shift;
- my $node2 = shift;
- my $target_node = shift;
- my $mode = shift;
- my $status = shift;
-
- my $node1_host = $node1->host;
- my $node1_port = $node1->port;
- my $node1_name = $node1->name;
- my $node2_host = $node2->host;
- my $node2_port = $node2->port;
- my $node2_name = $node2->name;
-
- my $target_name = $target_node->name;
-
- # Build connection string for connection attempt.
- my $connstr = "host=$node1_host,$node2_host ";
- $connstr .= "port=$node1_port,$node2_port ";
- $connstr .= "target_session_attrs=$mode";
-
- # The client used for the connection does not matter, only the backend
- # point does.
- my ($ret, $stdout, $stderr) =
- $node1->psql('postgres', 'SHOW port;', extra_params => ['-d', $connstr]);
- is($status == $ret && $stdout eq $target_node->port, 1,
- "connect to node $target_name if mode \"$mode\" and $node1_name,$node2_name listed");
-}
-
-# Connect to master in "read-write" mode with master,standby1 list.
-test_target_session_attrs($node_master, $node_standby_1, $node_master,
- "read-write", 0);
-# Connect to master in "read-write" mode with standby1,master list.
-test_target_session_attrs($node_standby_1, $node_master, $node_master,
- "read-write", 0);
-# Connect to master in "any" mode with master,standby1 list.
-test_target_session_attrs($node_master, $node_standby_1, $node_master,
- "any", 0);
-# Connect to standby1 in "any" mode with standby1,master list.
-test_target_session_attrs($node_standby_1, $node_master, $node_standby_1,
- "any", 0);
-
-note "switching to physical replication slot";
-# Switch to using a physical replication slot. We can do this without a new
-# backup since physical slots can go backwards if needed. Do so on both
-# standbys. Since we're going to be testing things that affect the slot state,
-# also increase the standby feedback interval to ensure timely updates.
-my ($slotname_1, $slotname_2) = ('standby_1', 'standby_2');
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
-$node_master->restart;
-is($node_master->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_1');]), 0, 'physical slot created on master');
-$node_standby_1->append_conf('recovery.conf', "primary_slot_name = $slotname_1\n");
-$node_standby_1->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
-$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
-$node_standby_1->restart;
-is($node_standby_1->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_2');]), 0, 'physical slot created on intermediate replica');
-$node_standby_2->append_conf('recovery.conf', "primary_slot_name = $slotname_2\n");
-$node_standby_2->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
-$node_standby_2->restart;
-
-sub get_slot_xmins
-{
- my ($node, $slotname) = @_;
- my $slotinfo = $node->slot($slotname);
- return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
-}
-
-# There's no hot standby feedback and there are no logical slots on either peer
-# so xmin and catalog_xmin should be null on both slots.
-my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
-is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
-is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
-is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
-
-# Replication still works?
-$node_master->safe_psql('postgres', 'CREATE TABLE replayed(val integer);');
-
-sub replay_check
-{
- my $newval = $node_master->safe_psql('postgres', 'INSERT INTO replayed(val) SELECT coalesce(max(val),0) + 1 AS newval FROM replayed RETURNING val');
- $node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
- $node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
- $node_standby_1->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
- or die "standby_1 didn't replay master value $newval";
- $node_standby_2->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
- or die "standby_2 didn't replay standby_1 value $newval";
-}
-
-replay_check();
-
-note "enabling hot_standby_feedback";
-# Enable hs_feedback. The slot should gain an xmin. We set the status interval
-# so we'll see the results promptly.
-$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_1->reload;
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
-replay_check();
-sleep(2);
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
-isnt($xmin, '', 'non-cascaded slot xmin non-null with hs feedback');
-is($catalog_xmin, '', 'non-cascaded slot xmin still null with hs_feedback');
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-isnt($xmin, '', 'cascaded slot xmin non-null with hs feedback');
-is($catalog_xmin, '', 'cascaded slot xmin still null with hs_feedback');
-
-note "doing some work to advance xmin";
-for my $i (10000..11000) {
- $node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES ($i);]);
-}
-$node_master->safe_psql('postgres', 'VACUUM;');
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-
-my ($xmin2, $catalog_xmin2) = get_slot_xmins($node_master, $slotname_1);
-note "new xmin $xmin2, old xmin $xmin";
-isnt($xmin2, $xmin, 'non-cascaded slot xmin with hs feedback has changed');
-is($catalog_xmin2, '', 'non-cascaded slot xmin still null with hs_feedback unchanged');
-
-($xmin2, $catalog_xmin2) = get_slot_xmins($node_standby_1, $slotname_2);
-note "new xmin $xmin2, old xmin $xmin";
-isnt($xmin2, $xmin, 'cascaded slot xmin with hs feedback has changed');
-is($catalog_xmin2, '', 'cascaded slot xmin still null with hs_feedback unchanged');
-
-note "disabling hot_standby_feedback";
-# Disable hs_feedback. Xmin should be cleared.
-$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
-$node_standby_1->reload;
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
-$node_standby_2->reload;
-replay_check();
-sleep(2);
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
-is($xmin, '', 'non-cascaded slot xmin null with hs feedback reset');
-is($catalog_xmin, '', 'non-cascaded slot xmin still null with hs_feedback reset');
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
-is($catalog_xmin, '', 'cascaded slot xmin still null with hs_feedback reset');
-
-note "re-enabling hot_standby_feedback and disabling while stopped";
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
-
-$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
-replay_check();
-
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
-$node_standby_2->stop;
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-isnt($xmin, '', 'cascaded slot xmin non-null with postgres shut down');
-
-# Xmin from a previous run should be cleared on startup.
-$node_standby_2->start;
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-is($xmin, '', 'cascaded slot xmin reset after startup with hs feedback reset');
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
deleted file mode 100644
index 83b43bf..0000000
--- a/src/test/recovery/t/002_archiving.pl
+++ /dev/null
@@ -1,53 +0,0 @@
-# test for archiving with hot standby
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-use File::Copy;
-
-# Initialize master node, doing archives
-my $node_master = get_new_node('master');
-$node_master->init(
- has_archiving => 1,
- allows_streaming => 1);
-my $backup_name = 'my_backup';
-
-# Start it
-$node_master->start;
-
-# Take backup for slave
-$node_master->backup($backup_name);
-
-# Initialize standby node from backup, fetching WAL from archives
-my $node_standby = get_new_node('standby');
-$node_standby->init_from_backup($node_master, $backup_name,
- has_restoring => 1);
-$node_standby->append_conf(
- 'postgresql.conf', qq(
-wal_retrieve_retry_interval = '100ms'
-));
-$node_standby->start;
-
-# Create some content on master
-$node_master->safe_psql('postgres',
- "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-# Force archiving of WAL file to make it present on master
-$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
-
-# Add some more content, it should not be present on standby
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-
-# Wait until necessary replay has been done on standby
-my $caughtup_query =
- "SELECT '$current_lsn'::pg_lsn <= pg_last_wal_replay_location()";
-$node_standby->poll_query_until('postgres', $caughtup_query)
- or die "Timed out while waiting for standby to catch up";
-
-my $result =
- $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-is($result, qq(1000), 'check content from archives');
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
deleted file mode 100644
index b7b0caa..0000000
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ /dev/null
@@ -1,146 +0,0 @@
-# Test for recovery targets: name, timestamp, XID
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 9;
-
-# Create and test a standby from given backup, with a certain
-# recovery target.
-sub test_recovery_standby
-{
- my $test_name = shift;
- my $node_name = shift;
- my $node_master = shift;
- my $recovery_params = shift;
- my $num_rows = shift;
- my $until_lsn = shift;
-
- my $node_standby = get_new_node($node_name);
- $node_standby->init_from_backup($node_master, 'my_backup',
- has_restoring => 1);
-
- foreach my $param_item (@$recovery_params)
- {
- $node_standby->append_conf(
- 'recovery.conf',
- qq($param_item
-));
- }
-
- $node_standby->start;
-
- # Wait until standby has replayed enough data
- my $caughtup_query =
- "SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_location()";
- $node_standby->poll_query_until('postgres', $caughtup_query)
- or die "Timed out while waiting for standby to catch up";
-
- # Create some content on master and check its presence in standby
- my $result =
- $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
- is($result, qq($num_rows), "check standby content for $test_name");
-
- # Stop standby node
- $node_standby->teardown_node;
-}
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(has_archiving => 1, allows_streaming => 1);
-
-# Start it
-$node_master->start;
-
-# Create data before taking the backup, aimed at testing
-# recovery_target = 'immediate'
-$node_master->safe_psql('postgres',
- "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $lsn1 =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-# Take backup from which all operations will be run
-$node_master->backup('my_backup');
-
-# Insert some data with used as a replay reference, with a recovery
-# target TXID.
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-my $ret = $node_master->safe_psql('postgres',
- "SELECT pg_current_wal_location(), txid_current();");
-my ($lsn2, $recovery_txid) = split /\|/, $ret;
-
-# More data, with recovery target timestamp
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(2001,3000))");
-$ret = $node_master->safe_psql('postgres',
- "SELECT pg_current_wal_location(), now();");
-my ($lsn3, $recovery_time) = split /\|/, $ret;
-
-# Even more data, this time with a recovery target name
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(3001,4000))");
-my $recovery_name = "my_target";
-my $lsn4 =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-$node_master->safe_psql('postgres',
- "SELECT pg_create_restore_point('$recovery_name');");
-
-# And now for a recovery target LSN
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(4001,5000))");
-my $recovery_lsn = $node_master->safe_psql('postgres', "SELECT pg_current_wal_location()");
-my $lsn5 =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(5001,6000))");
-
-# Force archiving of WAL file
-$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
-
-# Test recovery targets
-my @recovery_params = ("recovery_target = 'immediate'");
-test_recovery_standby('immediate target',
- 'standby_1', $node_master, \@recovery_params, "1000", $lsn1);
-@recovery_params = ("recovery_target_xid = '$recovery_txid'");
-test_recovery_standby('XID', 'standby_2', $node_master, \@recovery_params,
- "2000", $lsn2);
-@recovery_params = ("recovery_target_time = '$recovery_time'");
-test_recovery_standby('time', 'standby_3', $node_master, \@recovery_params,
- "3000", $lsn3);
-@recovery_params = ("recovery_target_name = '$recovery_name'");
-test_recovery_standby('name', 'standby_4', $node_master, \@recovery_params,
- "4000", $lsn4);
-@recovery_params = ("recovery_target_lsn = '$recovery_lsn'");
-test_recovery_standby('LSN', 'standby_5', $node_master, \@recovery_params,
- "5000", $lsn5);
-
-# Multiple targets
-# Last entry has priority (note that an array respects the order of items
-# not hashes).
-@recovery_params = (
- "recovery_target_name = '$recovery_name'",
- "recovery_target_xid = '$recovery_txid'",
- "recovery_target_time = '$recovery_time'");
-test_recovery_standby('name + XID + time',
- 'standby_6', $node_master, \@recovery_params, "3000", $lsn3);
-@recovery_params = (
- "recovery_target_time = '$recovery_time'",
- "recovery_target_name = '$recovery_name'",
- "recovery_target_xid = '$recovery_txid'");
-test_recovery_standby('time + name + XID',
- 'standby_7', $node_master, \@recovery_params, "2000", $lsn2);
-@recovery_params = (
- "recovery_target_xid = '$recovery_txid'",
- "recovery_target_time = '$recovery_time'",
- "recovery_target_name = '$recovery_name'");
-test_recovery_standby('XID + time + name',
- 'standby_8', $node_master, \@recovery_params, "4000", $lsn4);
-@recovery_params = (
- "recovery_target_xid = '$recovery_txid'",
- "recovery_target_time = '$recovery_time'",
- "recovery_target_name = '$recovery_name'",
- "recovery_target_lsn = '$recovery_lsn'",);
-test_recovery_standby('XID + time + name + LSN',
- 'standby_9', $node_master, \@recovery_params, "5000", $lsn5);
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
deleted file mode 100644
index 7c6587a..0000000
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ /dev/null
@@ -1,62 +0,0 @@
-# Test for timeline switch
-# Ensure that a cascading standby is able to follow a newly-promoted standby
-# on a new timeline.
-use strict;
-use warnings;
-use File::Path qw(rmtree);
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-
-$ENV{PGDATABASE} = 'postgres';
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-
-# Take backup
-my $backup_name = 'my_backup';
-$node_master->backup($backup_name);
-
-# Create two standbys linking to it
-my $node_standby_1 = get_new_node('standby_1');
-$node_standby_1->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_1->start;
-my $node_standby_2 = get_new_node('standby_2');
-$node_standby_2->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_2->start;
-
-# Create some content on master
-$node_master->safe_psql('postgres',
- "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-
-# Wait until standby has replayed enough data on standby 1
-$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('write'));
-
-# Stop and remove master, and promote standby 1, switching it to a new timeline
-$node_master->teardown_node;
-$node_standby_1->promote;
-
-# Switch standby 2 to replay from standby 1
-rmtree($node_standby_2->data_dir . '/recovery.conf');
-my $connstr_1 = $node_standby_1->connstr;
-$node_standby_2->append_conf(
- 'recovery.conf', qq(
-primary_conninfo='$connstr_1 application_name=@{[$node_standby_2->name]}'
-standby_mode=on
-recovery_target_timeline='latest'
-));
-$node_standby_2->restart;
-
-# Insert some data in standby 1 and check its presence in standby 2
-# to ensure that the timeline switch has been done.
-$node_standby_1->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('write'));
-
-my $result =
- $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-is($result, qq(2000), 'check content of standby 2');
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
deleted file mode 100644
index cd9e8f5..0000000
--- a/src/test/recovery/t/005_replay_delay.pl
+++ /dev/null
@@ -1,69 +0,0 @@
-# Checks for recovery_min_apply_delay
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-
-# And some content
-$node_master->safe_psql('postgres',
- "CREATE TABLE tab_int AS SELECT generate_series(1, 10) AS a");
-
-# Take backup
-my $backup_name = 'my_backup';
-$node_master->backup($backup_name);
-
-# Create streaming standby from backup
-my $node_standby = get_new_node('standby');
-my $delay = 3;
-$node_standby->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby->append_conf(
- 'recovery.conf', qq(
-recovery_min_apply_delay = '${delay}s'
-));
-$node_standby->start;
-
-# Make new content on master and check its presence in standby depending
-# on the delay applied above. Before doing the insertion, get the
-# current timestamp that will be used as a comparison base. Even on slow
-# machines, this allows to have a predictable behavior when comparing the
-# delay between data insertion moment on master and replay time on standby.
-my $master_insert_time = time();
-$node_master->safe_psql('postgres',
- "INSERT INTO tab_int VALUES (generate_series(11, 20))");
-
-# Now wait for replay to complete on standby. We're done waiting when the
-# slave has replayed up to the previously saved master LSN.
-my $until_lsn =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_location()");
-
-my $remaining = 90;
-while ($remaining-- > 0)
-{
-
- # Done waiting?
- my $replay_status = $node_standby->safe_psql('postgres',
- "SELECT (pg_last_wal_replay_location() - '$until_lsn'::pg_lsn) >= 0"
- );
- last if $replay_status eq 't';
-
- # No, sleep some more.
- my $sleep = $master_insert_time + $delay - time();
- $sleep = 1 if $sleep < 1;
- sleep $sleep;
-}
-
-die "Maximum number of attempts reached ($remaining remain)"
- if $remaining < 0;
-
-# This test is successful if and only if the LSN has been applied with at least
-# the configured apply delay.
-ok(time() - $master_insert_time >= $delay,
- "standby applies WAL only after replication delay");
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
deleted file mode 100644
index bf9b50a..0000000
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ /dev/null
@@ -1,104 +0,0 @@
-# Testing of logical decoding using SQL interface and/or pg_recvlogical
-#
-# Most logical decoding tests are in contrib/test_decoding. This module
-# is for work that doesn't fit well there, like where server restarts
-# are required.
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 16;
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->append_conf(
- 'postgresql.conf', qq(
-wal_level = logical
-));
-$node_master->start;
-my $backup_name = 'master_backup';
-
-$node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
-
-$node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
-
-$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]);
-
-# Basic decoding works
-my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
-is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
-
-# If we immediately crash the server we might lose the progress we just made
-# and replay the same changes again. But a clean shutdown should never repeat
-# the same changes when we use the SQL decoding interface.
-$node_master->restart('fast');
-
-# There are no new writes, so the result should be empty.
-$result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
-chomp($result);
-is($result, '', 'Decoding after fast restart repeats no rows');
-
-# Insert some rows and verify that we get the same results from pg_recvlogical
-# and the SQL interface.
-$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4) s;]);
-
-my $expected = q{BEGIN
-table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
-table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
-table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
-table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
-COMMIT};
-
-my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
-is($stdout_sql, $expected, 'got expected output from SQL decoding session');
-
-my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
-print "waiting to replay $endpos\n";
-
-my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
-chomp($stdout_recv);
-is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
-
-$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0', 'skip-empty-xacts' => '1');
-chomp($stdout_recv);
-is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
-
-$node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
-
-is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;"), 3,
- 'replaying logical slot from another database fails');
-
-$node_master->safe_psql('otherdb', qq[SELECT pg_create_logical_replication_slot('otherdb_slot', 'test_decoding');]);
-
-# make sure you can't drop a slot while active
-my $pg_recvlogical = IPC::Run::start(['pg_recvlogical', '-d', $node_master->connstr('otherdb'), '-S', 'otherdb_slot', '-f', '-', '--start']);
-$node_master->poll_query_until('otherdb', "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'otherdb_slot' AND active_pid IS NOT NULL)");
-is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 3,
- 'dropping a DB with inactive logical slots fails');
-$pg_recvlogical->kill_kill;
-is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
- 'logical slot still exists');
-
-$node_master->poll_query_until('otherdb', "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'otherdb_slot' AND active_pid IS NULL)");
-is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 0,
- 'dropping a DB with inactive logical slots succeeds');
-is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
- 'logical slot was actually dropped with DB');
-
-# Restarting a node with wal_level = logical that has existing
-# slots must succeed, but decoding from those slots must fail.
-$node_master->safe_psql('postgres', 'ALTER SYSTEM SET wal_level = replica');
-is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'logical', 'wal_level is still logical before restart');
-$node_master->restart;
-is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'replica', 'wal_level is replica');
-isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
- 'restored slot catalog_xmin is nonzero');
-is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
- 'reading from slot with wal_level < logical fails');
-is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
- 'can drop logical slot while wal_level = replica');
-is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
-
-# done with the node
-$node_master->stop;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
deleted file mode 100644
index e11b428..0000000
--- a/src/test/recovery/t/007_sync_rep.pl
+++ /dev/null
@@ -1,205 +0,0 @@
-# Minimal test testing synchronous replication sync_state transition
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 11;
-
-# Query checking sync_priority and sync_state of each standby
-my $check_sql =
-"SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
-
-# Check that sync_state of each standby is expected.
-# If $setting is given, synchronous_standby_names is set to it and
-# the configuration file is reloaded before the test.
-sub test_sync_state
-{
- my ($self, $expected, $msg, $setting) = @_;
-
- if (defined($setting))
- {
- $self->psql('postgres',
- "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
- $self->reload;
- }
-
- my $timeout_max = 30;
- my $timeout = 0;
- my $result;
-
- # A reload may take some time to take effect on busy machines,
- # hence use a loop with a timeout to give some room for the test
- # to pass.
- while ($timeout < $timeout_max)
- {
- $result = $self->safe_psql('postgres', $check_sql);
-
- last if ($result eq $expected);
-
- $timeout++;
- sleep 1;
- }
-
- is($result, $expected, $msg);
-}
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-my $backup_name = 'master_backup';
-
-# Take backup
-$node_master->backup($backup_name);
-
-# Create standby1 linking to master
-my $node_standby_1 = get_new_node('standby1');
-$node_standby_1->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_1->start;
-
-# Create standby2 linking to master
-my $node_standby_2 = get_new_node('standby2');
-$node_standby_2->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_2->start;
-
-# Create standby3 linking to master
-my $node_standby_3 = get_new_node('standby3');
-$node_standby_3->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_3->start;
-
-# Check that sync_state is determined correctly when
-# synchronous_standby_names is specified in old syntax.
-test_sync_state(
- $node_master, qq(standby1|1|sync
-standby2|2|potential
-standby3|0|async),
- 'old syntax of synchronous_standby_names',
- 'standby1,standby2');
-
-# Check that all the standbys are considered as either sync or
-# potential when * is specified in synchronous_standby_names.
-# Note that standby1 is chosen as sync standby because
-# it's stored in the head of WalSnd array which manages
-# all the standbys though they have the same priority.
-test_sync_state(
- $node_master, qq(standby1|1|sync
-standby2|1|potential
-standby3|1|potential),
- 'asterisk in synchronous_standby_names',
- '*');
-
-# Stop and start standbys to rearrange the order of standbys
-# in WalSnd array. Now, if standbys have the same priority,
-# standby2 is selected preferentially and standby3 is next.
-$node_standby_1->stop;
-$node_standby_2->stop;
-$node_standby_3->stop;
-
-$node_standby_2->start;
-$node_standby_3->start;
-
-# Specify 2 as the number of sync standbys.
-# Check that two standbys are in 'sync' state.
-test_sync_state(
- $node_master, qq(standby2|2|sync
-standby3|3|sync),
- '2 synchronous standbys',
- '2(standby1,standby2,standby3)');
-
-# Start standby1
-$node_standby_1->start;
-
-# Create standby4 linking to master
-my $node_standby_4 = get_new_node('standby4');
-$node_standby_4->init_from_backup($node_master, $backup_name,
- has_streaming => 1);
-$node_standby_4->start;
-
-# Check that standby1 and standby2 whose names appear earlier in
-# synchronous_standby_names are considered as sync. Also check that
-# standby3 appearing later represents potential, and standby4 is
-# in 'async' state because it's not in the list.
-test_sync_state(
- $node_master, qq(standby1|1|sync
-standby2|2|sync
-standby3|3|potential
-standby4|0|async),
- '2 sync, 1 potential, and 1 async');
-
-# Check that sync_state of each standby is determined correctly
-# when num_sync exceeds the number of names of potential sync standbys
-# specified in synchronous_standby_names.
-test_sync_state(
- $node_master, qq(standby1|0|async
-standby2|4|sync
-standby3|3|sync
-standby4|1|sync),
- 'num_sync exceeds the num of potential sync standbys',
- '6(standby4,standby0,standby3,standby2)');
-
-# The setting that * comes before another standby name is acceptable
-# but does not make sense in most cases. Check that sync_state is
-# chosen properly even in case of that setting.
-# The priority of standby2 should be 2 because it matches * first.
-test_sync_state(
- $node_master, qq(standby1|1|sync
-standby2|2|sync
-standby3|2|potential
-standby4|2|potential),
- 'asterisk comes before another standby name',
- '2(standby1,*,standby2)');
-
-# Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
-# earlier in WalSnd array as sync standbys.
-test_sync_state(
- $node_master, qq(standby1|1|potential
-standby2|1|sync
-standby3|1|sync
-standby4|1|potential),
- 'multiple standbys having the same priority are chosen as sync',
- '2(*)');
-
-# Stop Standby3 which is considered in 'sync' state.
-$node_standby_3->stop;
-
-# Check that the state of standby1 stored earlier in WalSnd array than
-# standby4 is transited from potential to sync.
-test_sync_state(
- $node_master, qq(standby1|1|sync
-standby2|1|sync
-standby4|1|potential),
- 'potential standby found earlier in array is promoted to sync');
-
-# Check that standby1 and standby2 are chosen as sync standbys
-# based on their priorities.
-test_sync_state(
-$node_master, qq(standby1|1|sync
-standby2|2|sync
-standby4|0|async),
-'priority-based sync replication specified by FIRST keyword',
-'FIRST 2(standby1, standby2)');
-
-# Check that all the listed standbys are considered as candidates
-# for sync standbys in a quorum-based sync replication.
-test_sync_state(
-$node_master, qq(standby1|1|quorum
-standby2|2|quorum
-standby4|0|async),
-'2 quorum and 1 async',
-'ANY 2(standby1, standby2)');
-
-# Start Standby3 which will be considered in 'quorum' state.
-$node_standby_3->start;
-
-# Check that the setting of 'ANY 2(*)' chooses all standbys as
-# candidates for quorum sync standbys.
-test_sync_state(
-$node_master, qq(standby1|1|quorum
-standby2|1|quorum
-standby3|1|quorum
-standby4|1|quorum),
-'all standbys are considered as candidates for quorum sync standbys',
-'ANY 2(*)');
diff --git a/src/test/recovery/t/008_fsm_truncation.pl b/src/test/recovery/t/008_fsm_truncation.pl
deleted file mode 100644
index 8aa8a4f..0000000
--- a/src/test/recovery/t/008_fsm_truncation.pl
+++ /dev/null
@@ -1,92 +0,0 @@
-# Test WAL replay of FSM changes.
-#
-# FSM changes don't normally need to be WAL-logged, except for truncation.
-# The FSM mustn't return a page that doesn't exist (anymore).
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-
-$node_master->append_conf('postgresql.conf', qq{
-fsync = on
-wal_log_hints = on
-max_prepared_transactions = 5
-autovacuum = off
-});
-
-# Create a master node and its standby, initializing both with some data
-# at the same time.
-$node_master->start;
-
-$node_master->backup('master_backup');
-my $node_standby = get_new_node('standby');
-$node_standby->init_from_backup($node_master, 'master_backup',
- has_streaming => 1);
-$node_standby->start;
-
-$node_master->psql('postgres', qq{
-create table testtab (a int, b char(100));
-insert into testtab select generate_series(1,1000), 'foo';
-insert into testtab select generate_series(1,1000), 'foo';
-delete from testtab where ctid > '(8,0)';
-});
-
-# Take a lock on the table to prevent following vacuum from truncating it
-$node_master->psql('postgres', qq{
-begin;
-lock table testtab in row share mode;
-prepare transaction 'p1';
-});
-
-# Vacuum, update FSM without truncation
-$node_master->psql('postgres', 'vacuum verbose testtab');
-
-# Force a checkpoint
-$node_master->psql('postgres', 'checkpoint');
-
-# Now do some more insert/deletes, another vacuum to ensure full-page writes
-# are done
-$node_master->psql('postgres', qq{
-insert into testtab select generate_series(1,1000), 'foo';
-delete from testtab where ctid > '(8,0)';
-vacuum verbose testtab;
-});
-
-# Ensure all buffers are now clean on the standby
-$node_standby->psql('postgres', 'checkpoint');
-
-# Release the lock, vacuum again which should lead to truncation
-$node_master->psql('postgres', qq{
-rollback prepared 'p1';
-vacuum verbose testtab;
-});
-
-$node_master->psql('postgres', 'checkpoint');
-my $until_lsn =
- $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-# Wait long enough for standby to receive and apply all WAL
-my $caughtup_query =
- "SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_location()";
-$node_standby->poll_query_until('postgres', $caughtup_query)
- or die "Timed out while waiting for standby to catch up";
-
-# Promote the standby
-$node_standby->promote;
-$node_standby->poll_query_until('postgres',
- "SELECT NOT pg_is_in_recovery()")
- or die "Timed out while waiting for promotion of standby";
-$node_standby->psql('postgres', 'checkpoint');
-
-# Restart to discard in-memory copy of FSM
-$node_standby->restart;
-
-# Insert should work on standby
-is($node_standby->psql('postgres',
- qq{insert into testtab select generate_series(1,1000), 'foo';}),
- 0, 'INSERT succeeds with truncated relation FSM');
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
deleted file mode 100644
index be7f00b..0000000
--- a/src/test/recovery/t/009_twophase.pl
+++ /dev/null
@@ -1,322 +0,0 @@
-# Tests dedicated to two-phase commit in recovery
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 13;
-
-# Setup master node
-my $node_master = get_new_node("master");
-$node_master->init(allows_streaming => 1);
-$node_master->append_conf('postgresql.conf', qq(
- max_prepared_transactions = 10
- log_checkpoints = true
-));
-$node_master->start;
-$node_master->backup('master_backup');
-$node_master->psql('postgres', "CREATE TABLE t_009_tbl (id int)");
-
-# Setup slave node
-my $node_slave = get_new_node('slave');
-$node_slave->init_from_backup($node_master, 'master_backup', has_streaming => 1);
-$node_slave->start;
-
-# Switch to synchronous replication
-$node_master->append_conf('postgresql.conf', qq(
- synchronous_standby_names = '*'
-));
-$node_master->psql('postgres', "SELECT pg_reload_conf()");
-
-my $psql_out = '';
-my $psql_rc = '';
-
-###############################################################################
-# Check that we can commit and abort transaction after soft restart.
-# Here checkpoint happens before shutdown and no WAL replay will occur at next
-# startup. In this case postgres re-creates shared-memory state from twophase
-# files.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';
- BEGIN;
- INSERT INTO t_009_tbl VALUES (142);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (143);
- PREPARE TRANSACTION 'xact_009_2';");
-$node_master->stop;
-$node_master->start;
-
-$psql_rc = $node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', 'Commit prepared transaction after restart');
-
-$psql_rc = $node_master->psql('postgres', "ROLLBACK PREPARED 'xact_009_2'");
-is($psql_rc, '0', 'Rollback prepared transaction after restart');
-
-###############################################################################
-# Check that we can commit and abort after a hard restart.
-# At next startup, WAL replay will re-create shared memory state for prepared
-# transaction using dedicated WAL records.
-###############################################################################
-
-$node_master->psql('postgres', "
- CHECKPOINT;
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';
- BEGIN;
- INSERT INTO t_009_tbl VALUES (142);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (143);
- PREPARE TRANSACTION 'xact_009_2';");
-$node_master->teardown_node;
-$node_master->start;
-
-$psql_rc = $node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', 'Commit prepared transaction after teardown');
-
-$psql_rc = $node_master->psql('postgres', "ROLLBACK PREPARED 'xact_009_2'");
-is($psql_rc, '0', 'Rollback prepared transaction after teardown');
-
-###############################################################################
-# Check that WAL replay can handle several transactions with same GID name.
-###############################################################################
-
-$node_master->psql('postgres', "
- CHECKPOINT;
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';
- COMMIT PREPARED 'xact_009_1';
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';");
-$node_master->teardown_node;
-$node_master->start;
-
-$psql_rc = $node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', 'Replay several transactions with same GID');
-
-###############################################################################
-# Check that WAL replay cleans up its shared memory state and releases locks
-# while replaying transaction commits.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';
- COMMIT PREPARED 'xact_009_1';");
-$node_master->teardown_node;
-$node_master->start;
-$psql_rc = $node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- -- This prepare can fail due to conflicting GID or locks conflicts if
- -- replay did not fully cleanup its state on previous commit.
- PREPARE TRANSACTION 'xact_009_1';");
-is($psql_rc, '0', "Cleanup of shared memory state for 2PC commit");
-
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-
-###############################################################################
-# Check that WAL replay will cleanup its shared memory state on running slave.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';
- COMMIT PREPARED 'xact_009_1';");
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
- stdout => \$psql_out);
-is($psql_out, '0',
- "Cleanup of shared memory state on running standby without checkpoint");
-
-###############################################################################
-# Same as in previous case, but let's force checkpoint on slave between
-# prepare and commit to use on-disk twophase files.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';");
-$node_slave->psql('postgres', "CHECKPOINT");
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
- stdout => \$psql_out);
-is($psql_out, '0',
- "Cleanup of shared memory state on running standby after checkpoint");
-
-###############################################################################
-# Check that prepared transactions can be committed on promoted slave.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';");
-$node_master->teardown_node;
-$node_slave->promote;
-$node_slave->poll_query_until('postgres',
- "SELECT NOT pg_is_in_recovery()")
- or die "Timed out while waiting for promotion of standby";
-
-$psql_rc = $node_slave->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', "Restore of prepared transaction on promoted slave");
-
-# change roles
-($node_master, $node_slave) = ($node_slave, $node_master);
-$node_slave->enable_streaming($node_master);
-$node_slave->append_conf('recovery.conf', qq(
-recovery_target_timeline='latest'
-));
-$node_slave->start;
-
-###############################################################################
-# Check that prepared transactions are replayed after soft restart of standby
-# while master is down. Since standby knows that master is down it uses a
-# different code path on startup to ensure that the status of transactions is
-# consistent.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (42);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';");
-$node_master->stop;
-$node_slave->restart;
-$node_slave->promote;
-$node_slave->poll_query_until('postgres',
- "SELECT NOT pg_is_in_recovery()")
- or die "Timed out while waiting for promotion of standby";
-
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
- stdout => \$psql_out);
-is($psql_out, '1',
- "Restore prepared transactions from files with master down");
-
-# restore state
-($node_master, $node_slave) = ($node_slave, $node_master);
-$node_slave->enable_streaming($node_master);
-$node_slave->append_conf('recovery.conf', qq(
-recovery_target_timeline='latest'
-));
-$node_slave->start;
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-
-###############################################################################
-# Check that prepared transactions are correctly replayed after slave hard
-# restart while master is down.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- INSERT INTO t_009_tbl VALUES (242);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (243);
- PREPARE TRANSACTION 'xact_009_1';
- ");
-$node_master->stop;
-$node_slave->teardown_node;
-$node_slave->start;
-$node_slave->promote;
-$node_slave->poll_query_until('postgres',
- "SELECT NOT pg_is_in_recovery()")
- or die "Timed out while waiting for promotion of standby";
-
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
- stdout => \$psql_out);
-is($psql_out, '1',
- "Restore prepared transactions from records with master down");
-
-# restore state
-($node_master, $node_slave) = ($node_slave, $node_master);
-$node_slave->enable_streaming($node_master);
-$node_slave->append_conf('recovery.conf', qq(
-recovery_target_timeline='latest'
-));
-$node_slave->start;
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-
-
-###############################################################################
-# Check for a lock conflict between prepared transaction with DDL inside and replay of
-# XLOG_STANDBY_LOCK wal record.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- CREATE TABLE t_009_tbl2 (id int);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl2 VALUES (42);
- PREPARE TRANSACTION 'xact_009_1';
- -- checkpoint will issue XLOG_STANDBY_LOCK that can conflict with lock
- -- held by 'create table' statement
- CHECKPOINT;
- COMMIT PREPARED 'xact_009_1';");
-
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
- stdout => \$psql_out);
-is($psql_out, '0', "Replay prepared transaction with DDL");
-
-
-###############################################################################
-# Check that replay will correctly set SUBTRANS and properly advance nextXid
-# so that it won't conflict with savepoint xids.
-###############################################################################
-
-$node_master->psql('postgres', "
- BEGIN;
- DELETE FROM t_009_tbl;
- INSERT INTO t_009_tbl VALUES (43);
- SAVEPOINT s1;
- INSERT INTO t_009_tbl VALUES (43);
- SAVEPOINT s2;
- INSERT INTO t_009_tbl VALUES (43);
- SAVEPOINT s3;
- INSERT INTO t_009_tbl VALUES (43);
- SAVEPOINT s4;
- INSERT INTO t_009_tbl VALUES (43);
- SAVEPOINT s5;
- INSERT INTO t_009_tbl VALUES (43);
- PREPARE TRANSACTION 'xact_009_1';
- CHECKPOINT;");
-
-$node_master->stop;
-$node_master->start;
-$node_master->psql('postgres', "
- -- here we can get xid of previous savepoint if nextXid
- -- wasn't properly advanced
- BEGIN;
- INSERT INTO t_009_tbl VALUES (142);
- ROLLBACK;
- COMMIT PREPARED 'xact_009_1';");
-
-$node_master->psql('postgres', "SELECT count(*) FROM t_009_tbl",
- stdout => \$psql_out);
-is($psql_out, '6', "Check nextXid handling for prepared subtransactions");
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl b/src/test/recovery/t/010_logical_decoding_timelines.pl
deleted file mode 100644
index cdddb4d..0000000
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ /dev/null
@@ -1,184 +0,0 @@
-# Demonstrate that logical can follow timeline switches.
-#
-# Logical replication slots can follow timeline switches but it's
-# normally not possible to have a logical slot on a replica where
-# promotion and a timeline switch can occur. The only ways
-# we can create that circumstance are:
-#
-# * By doing a filesystem-level copy of the DB, since pg_basebackup
-# excludes pg_replslot but we can copy it directly; or
-#
-# * by creating a slot directly at the C level on the replica and
-# advancing it as we go using the low level APIs. It can't be done
-# from SQL since logical decoding isn't allowed on replicas.
-#
-# This module uses the first approach to show that timeline following
-# on a logical slot works.
-#
-# (For convenience, it also tests some recovery-related operations
-# on logical slots).
-#
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 13;
-use RecursiveCopy;
-use File::Copy;
-use IPC::Run ();
-use Scalar::Util qw(blessed);
-
-my ($stdout, $stderr, $ret);
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', q[
-wal_level = 'logical'
-max_replication_slots = 3
-max_wal_senders = 2
-log_min_messages = 'debug2'
-hot_standby_feedback = on
-wal_receiver_status_interval = 1
-]);
-$node_master->dump_info;
-$node_master->start;
-
-note "testing logical timeline following with a filesystem-level copy";
-
-$node_master->safe_psql('postgres',
-"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
-);
-$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
-$node_master->safe_psql('postgres',
- "INSERT INTO decoding(blah) VALUES ('beforebb');");
-
-# We also want to verify that DROP DATABASE on a standby with a logical
-# slot works. This isn't strictly related to timeline following, but
-# the only way to get a logical slot on a standby right now is to use
-# the same physical copy trick, so:
-$node_master->safe_psql('postgres', 'CREATE DATABASE dropme;');
-$node_master->safe_psql('dropme',
-"SELECT pg_create_logical_replication_slot('dropme_slot', 'test_decoding');"
-);
-
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-
-my $backup_name = 'b1';
-$node_master->backup_fs_hot($backup_name);
-
-$node_master->safe_psql('postgres',
- q[SELECT pg_create_physical_replication_slot('phys_slot');]);
-
-my $node_replica = get_new_node('replica');
-$node_replica->init_from_backup(
- $node_master, $backup_name,
- has_streaming => 1,
- has_restoring => 1);
-$node_replica->append_conf(
- 'recovery.conf', q[primary_slot_name = 'phys_slot']);
-
-$node_replica->start;
-
-# If we drop 'dropme' on the master, the standby should drop the
-# db and associated slot.
-is($node_master->psql('postgres', 'DROP DATABASE dropme'), 0,
- 'dropped DB with logical slot OK on master');
-$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
-is($node_replica->safe_psql('postgres', q[SELECT 1 FROM pg_database WHERE datname = 'dropme']), '',
- 'dropped DB dropme on standby');
-is($node_master->slot('dropme_slot')->{'slot_name'}, undef,
- 'logical slot was actually dropped on standby');
-
-# Back to testing failover...
-$node_master->safe_psql('postgres',
-"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
-);
-$node_master->safe_psql('postgres',
- "INSERT INTO decoding(blah) VALUES ('afterbb');");
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-
-# Verify that only the before base_backup slot is on the replica
-$stdout = $node_replica->safe_psql('postgres',
- 'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
-is($stdout, 'before_basebackup',
- 'Expected to find only slot before_basebackup on replica');
-
-# Examine the physical slot the replica uses to stream changes
-# from the master to make sure its hot_standby_feedback
-# has locked in a catalog_xmin on the physical slot, and that
-# any xmin is < the catalog_xmin
-$node_master->poll_query_until('postgres', q[
- SELECT catalog_xmin IS NOT NULL
- FROM pg_replication_slots
- WHERE slot_name = 'phys_slot'
- ]);
-my $phys_slot = $node_master->slot('phys_slot');
-isnt($phys_slot->{'xmin'}, '',
- 'xmin assigned on physical slot of master');
-isnt($phys_slot->{'catalog_xmin'}, '',
- 'catalog_xmin assigned on physical slot of master');
-# Ignore wrap-around here, we're on a new cluster:
-cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
- 'xmin on physical slot must not be lower than catalog_xmin');
-
-$node_master->safe_psql('postgres', 'CHECKPOINT');
-
-# Boom, crash
-$node_master->stop('immediate');
-
-$node_replica->promote;
-print "waiting for replica to come up\n";
-$node_replica->poll_query_until('postgres',
- "SELECT NOT pg_is_in_recovery();");
-
-$node_replica->safe_psql('postgres',
- "INSERT INTO decoding(blah) VALUES ('after failover');");
-
-# Shouldn't be able to read from slot created after base backup
-($ret, $stdout, $stderr) = $node_replica->psql('postgres',
-"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');"
-);
-is($ret, 3, 'replaying from after_basebackup slot fails');
-like(
- $stderr,
- qr/replication slot "after_basebackup" does not exist/,
- 'after_basebackup slot missing');
-
-# Should be able to read from slot created before base backup
-($ret, $stdout, $stderr) = $node_replica->psql(
- 'postgres',
-"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
- timeout => 30);
-is($ret, 0, 'replay from slot before_basebackup succeeds');
-
-my $final_expected_output_bb = q(BEGIN
-table public.decoding: INSERT: blah[text]:'beforebb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'afterbb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'after failover'
-COMMIT);
-is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
-is($stderr, '', 'replay from slot before_basebackup produces no stderr');
-
-# So far we've peeked the slots, so when we fetch the same info over
-# pg_recvlogical we should get complete results. First, find out the commit lsn
-# of the last transaction. There's no max(pg_lsn), so:
-
-my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL) ORDER BY location DESC LIMIT 1;");
-
-# now use the walsender protocol to peek the slot changes and make sure we see
-# the same results.
-
-$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
- $endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
-
-# walsender likes to add a newline
-chomp($stdout);
-is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
-
-$node_replica->teardown_node();
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
deleted file mode 100644
index 3c3718e..0000000
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ /dev/null
@@ -1,46 +0,0 @@
-#
-# Tests relating to PostgreSQL crash recovery and redo
-#
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 3;
-
-my $node = get_new_node('master');
-$node->init(allows_streaming => 1);
-$node->start;
-
-my ($stdin, $stdout, $stderr) = ('', '', '');
-
-# Ensure that txid_status reports 'aborted' for xacts
-# that were in-progress during crash. To do that, we need
-# an xact to be in-progress when we crash and we need to know
-# its xid.
-my $tx = IPC::Run::start(
- ['psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', $node->connstr('postgres')],
- '<', \$stdin, '>', \$stdout, '2>', \$stderr);
-$stdin .= q[
-BEGIN;
-CREATE TABLE mine(x integer);
-SELECT txid_current();
-];
-$tx->pump until $stdout =~ /[[:digit:]]+[\r\n]$/;
-
-# Status should be in-progress
-my $xid = $stdout;
-chomp($xid);
-
-is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]), 'in progress', 'own xid is in-progres');
-
-# Crash and restart the postmaster
-$node->stop('immediate');
-$node->start;
-
-# Make sure we really got a new xid
-cmp_ok($node->safe_psql('postgres', 'SELECT txid_current()'), '>', $xid,
- 'new xid after restart is greater');
-# and make sure we show the in-progress xact as aborted
-is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]), 'aborted', 'xid is aborted after crash');
-
-$tx->kill_kill;
Sorry, what I have just sent was broken.
At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170411.173341.257028732.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp>
Hello, thank you for looking at this.
At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.

I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable,

It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
how strong a lock you hold.

regards, tom lane
Ugh. Yes, relcache invalidation can happen at any time and it resets the
added values. pg_stat_info misled me into thinking that it could store
transient values. But I came up with another thought.

The reason I proposed it was that I thought a hash_search for every
buffer modification is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to Relation. This greatly reduces the
frequency of hash searching.

I'll post a new patch in this way soon.
Here it is.
It contained trailing spaces and was missing the test script. This is the
correct patch.
- Relation has new members no_pending_sync and pending_sync that
  work as an instant cache of an entry in the pendingSync hash.

- Commit-time synchronizing is restored as in Michael's patch.

- If the relfilenode is replaced, the pending_sync for the old node is
  removed. Anyway, this is ignored on abort and meaningless on commit.

- The TAP test is renamed to 012 since some new files have been added.
Previously, the pending-sync hash was accessed on every call of
HeapNeedsWAL() (that is, per insertion/update/freeze of a tuple) if any
of the accessed relations had a pending sync. Almost all of those
lookups are eliminated as a result.
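
As a rough illustration of that mechanism (not the patch code itself): the
two new Relation members act as a memoized lookup into the backend-local
pendingSyncs hash, and a relcache invalidation simply clears them so the
next caller repeats the hash_search. The PendingRelSync layout and the
helper name below are assumptions for illustration; only pendingSyncs,
no_pending_sync and pending_sync come from the description above.

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

/* Assumed shape of a pending-sync entry; the real struct may differ. */
typedef struct PendingRelSync
{
	RelFileNode relnode;		/* hash key: the relation's file node */
	BlockNumber sync_above;		/* changes above this block skip WAL */
	bool		truncated;		/* truncated since sync_above was set? */
} PendingRelSync;

extern HTAB *pendingSyncs;		/* backend-local hash, keyed by relnode */

/* Hypothetical helper: cache the hash lookup in the relcache entry. */
static PendingRelSync *
getPendingSyncEntry(Relation rel)
{
	PendingRelSync *entry;

	/* Fast path: reuse the answer cached on the relcache entry. */
	if (rel->no_pending_sync)
		return NULL;
	if (rel->pending_sync)
		return rel->pending_sync;

	/* Slow path: consult the hash once and remember the outcome. */
	entry = (PendingRelSync *) hash_search(pendingSyncs, &rel->rd_node,
										   HASH_FIND, NULL);
	rel->pending_sync = entry;
	rel->no_pending_sync = (entry == NULL);
	return entry;
}

With this, BufferNeedsWAL()-style checks called once per tuple only pay
for a pointer test, and the hash is consulted once per relation until the
relcache entry is invalidated.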
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix-wal-level-minimal-michael-horiguchi-2.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..23a6d56 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transacton, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -56,6 +78,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2392,6 +2409,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+extern HTAB *pendingSyncs;
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2465,7 +2483,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2664,12 +2682,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2684,7 +2700,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2719,6 +2735,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2730,6 +2747,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3286,7 +3304,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4250,7 +4268,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5141,7 +5160,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5843,7 +5862,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5998,7 +6017,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6131,7 +6150,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6240,7 +6259,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7354,7 +7373,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7402,7 +7421,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7487,7 +7506,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -7590,76 +7609,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
+ XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
bufflags = REGBUF_STANDARD;
if (init)
bufflags |= REGBUF_WILL_INIT;
if (need_tuple_data)
bufflags |= REGBUF_KEEP_DATA;
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
- XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
/*
* Prepare WAL data for the new tuple.
*/
- if (prefixlen > 0 || suffixlen > 0)
+ if (BufferNeedsWAL(reln, newbuf))
{
- if (prefixlen > 0 && suffixlen > 0)
- {
- prefix_suffix[0] = prefixlen;
- prefix_suffix[1] = suffixlen;
- XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
- }
- else if (prefixlen > 0)
- {
- XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
- }
- else
+ XLogRegisterBuffer(0, newbuf, bufflags);
+
+ if ((prefixlen > 0 || suffixlen > 0))
{
- XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ if (prefixlen > 0 && suffixlen > 0)
+ {
+ prefix_suffix[0] = prefixlen;
+ prefix_suffix[1] = suffixlen;
+ XLogRegisterBufData(0, (char *) &prefix_suffix,
+ sizeof(uint16) * 2);
+ }
+ else if (prefixlen > 0)
+ {
+ XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+ }
+ else
+ {
+ XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+ }
}
- }
- xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
- xlhdr.t_infomask = newtup->t_data->t_infomask;
- xlhdr.t_hoff = newtup->t_data->t_hoff;
- Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+ xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+ xlhdr.t_infomask = newtup->t_data->t_infomask;
+ xlhdr.t_hoff = newtup->t_data->t_hoff;
+ Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
- /*
- * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
- *
- * The 'data' doesn't include the common prefix or suffix.
- */
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- if (prefixlen == 0)
- {
- XLogRegisterBufData(0,
- ((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_len - SizeofHeapTupleHeader - suffixlen);
- }
- else
- {
/*
- * Have to write the null bitmap and data after the common prefix as
- * two separate rdata entries.
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+ *
+ * The 'data' doesn't include the common prefix or suffix.
*/
- /* bitmap [+ padding] [+ oid] */
- if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ if (prefixlen == 0)
{
XLogRegisterBufData(0,
((char *) newtup->t_data) + SizeofHeapTupleHeader,
- newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ newtup->t_len - SizeofHeapTupleHeader - suffixlen);
}
+ else
+ {
+ /*
+ * Have to write the null bitmap and data after the common prefix
+ * as two separate rdata entries.
+ */
+ /* bitmap [+ padding] [+ oid] */
+ if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+ {
+ XLogRegisterBufData(0,
+ ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+ newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+ }
- /* data after common prefix */
- XLogRegisterBufData(0,
+ /* data after common prefix */
+ XLogRegisterBufData(0,
((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+ }
}
+ /*
+ * If the old and new tuple are on different pages, also register the old
+ * page, so that a full-page image is created for it if necessary. We
+ * don't need any extra information to replay changes to it.
+ */
+ if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+ XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
/* We need to log a tuple identity */
if (need_tuple_data && old_key_tuple)
{
@@ -8578,8 +8607,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8633,6 +8667,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9069,9 +9105,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9181,3 +9224,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..4754278 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..6462f44 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..933fa9c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92b263a..313a03b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2007,6 +2007,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2238,6 +2241,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2545,6 +2551,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f677916..14df0b1 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ /* no_pending_sync is ignored since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ elog(LOG, "RelationTruncate: accessing hash");
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(LOG, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
+/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
* The return value is the number of relations scheduled for termination.
@@ -419,6 +527,166 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(LOG, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ elog(LOG, "BufferNeedsWAL: pendingSyncs = %p, no_pending_sync = %d", pendingSyncs, rel->no_pending_sync);
+ /* no further work if we know that we don't have pending sync */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ elog(LOG, "BufferNeedsWAL: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ if (!found)
+ {
+ /* no entry for this relation; skip hash lookups from now on */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b5af2be..8aa7e7b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2372,8 +2372,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2405,7 +2404,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2782,11 +2781,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 06425cc..408495e 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 9ffd91e..8b127e3 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index abb262b..2fd210b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4327,8 +4327,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4589,8 +4590,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
@@ -10510,11 +10509,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..f3dcf6e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -893,7 +893,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1120,7 +1120,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1480,7 +1480,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2109cbf..e991e9f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ddb9485..b6b0d78 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -418,6 +419,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -2032,6 +2037,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3353,6 +3362,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510..3967641 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -178,6 +177,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index fea96de..b9d485a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07a32d6..6ec2d26 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ab875bb..666273e 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise a search for a registered sync is required if
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
diff --git a/src/test/recovery/t/012_truncate_opt.pl b/src/test/recovery/t/012_truncate_opt.pl
new file mode 100644
index 0000000..baf5604
--- /dev/null
+++ b/src/test/recovery/t/012_truncate_opt.pl
@@ -0,0 +1,94 @@
+# Set of tests to check TRUNCATE optimizations with CREATE TABLE
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+my $node = get_new_node('master');
+$node->init;
+
+my $copy_file = $node->backup_dir . "copy_data.txt";
+
+$node->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = minimal
+});
+
+$node->start;
+
+# Create file containing data to COPY
+TestLib::append_to_file($copy_file, qq{copied row 1
+copied row 2
+copied row 3
+});
+
+# CREATE, INSERT, COPY, crash.
+#
+# If COPY inserts to the existing block, and is not WAL-logged, replaying
+# the implicit FPW of the INSERT record will destroy the COPY data.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+INSERT INTO test1 VALUES ('inserted row');
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of table. There should be 4 rows.
+$node->stop('immediate');
+$node->start;
+my $ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '4', 'SELECT reports 4 rows');
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE test1;');
+
+# CREATE, COPY, crash. Trigger in COPY that inserts more to same table.
+#
+# If the INSERTS from the trigger go to the same block we're copying to,
+# and the INSERTs are WAL-logged, WAL replay will fail when it tries to
+# replay the WAL record but the "before" image doesn't match, because not
+# all changes were WAL-logged.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+CREATE FUNCTION test1_beforetrig() RETURNS trigger LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.t NOT LIKE 'triggered%' THEN
+ INSERT INTO test1 VALUES ('triggered ' || NEW.t);
+ END IF;
+ RETURN NEW;
+END;
+\$\$;
+CREATE TRIGGER test1_beforeinsert BEFORE INSERT ON test1
+FOR EACH ROW EXECUTE PROCEDURE test1_beforetrig();
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of table. There should be 6
+# rows here.
+$node->stop('immediate');
+$node->start;
+$ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '6', 'SELECT returns 6 rows');
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE test1;');
+$node->safe_psql('postgres', 'DROP FUNCTION test1_beforetrig();');
+
+# CREATE, TRUNCATE, COPY, crash.
+#
+# If we skip WAL-logging of the COPY, replaying the TRUNCATE record destroys
+# the newly inserted data.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+TRUNCATE test1;
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of table. There should be 3
+# rows here.
+$node->stop('immediate');
+$node->start;
+$ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '3', 'SELECT returns 3 rows');
On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Sorry, what I have just sent was broken.
You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.
- Relation has new members no_pending_sync and pending_sync that
works as instant cache of an entry in pendingSync hash.
- Commit-time synchronizing is restored as Michael's patch.
- If relfilenode is replaced, pending_sync for the old node is
removed. Anyway this is ignored on abort and meaningless on
commit.
- TAP test is renamed to 012 since some new files have been added.
Accessing pending sync hash occurred on every calling of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
accessing relations has pending sync. Almost of them are
eliminated as the result.
Did you actually test this patch? One of the logs added makes the
tests a long time to run:
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...
+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs should be tracked out of the relation context.
Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.
Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
--
Michael
I'd like to add a supplementary explanation.
At Tue, 11 Apr 2017 17:38:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170411.173812.133964522.horiguchi.kyotaro@lab.ntt.co.jp>
Sorry, what I have just sent was broken.
At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170411.173341.257028732.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp>
Hello, thank you for looking this.
At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.
I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable,
It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
how strong a lock you hold.
regards, tom lane
Ugh. Yes, relcache invalidation happens anytime and it resets the
The pending locations are not stored in relcache hash so the
problem here is not invalidation but that Relation objects are
created as necessary, anywhere. Even if no invalidation happens,
the same thing will happen in a bit different form.
added values. pg_stat_info deceived me that it can store
transient values. But I came up with another thought.
The reason I proposed it was I thought that hash_search for every
buffer modification is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to Relation. This greatly reduces the
frequency of hash-searching.
I'll post a new patch in this way soon.
Here it is.
It contained trailing space and was missing the test script. This is the
correct patch.
- Relation has new members no_pending_sync and pending_sync that
works as instant cache of an entry in pendingSync hash.
- Commit-time synchronizing is restored as Michael's patch.
- If relfilenode is replaced, pending_sync for the old node is
removed. Anyway this is ignored on abort and meaningless on
commit.
- TAP test is renamed to 012 since some new files have been added.
Accessing pending sync hash occurred on every calling of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
accessing relations has pending sync. Almost of them are
eliminated as the result.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com>
On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Sorry, what I have just sent was broken.
You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.
Thank you for the information. Removing unwanted test scripts
from t/ directories was an annoyance. This makes me happy.
- Relation has new members no_pending_sync and pending_sync that
works as instant cache of an entry in pendingSync hash.
- Commit-time synchronizing is restored as Michael's patch.
- If relfilenode is replaced, pending_sync for the old node is
removed. Anyway this is ignored on abort and meaningless on
commit.
- TAP test is renamed to 012 since some new files have been added.
Accessing pending sync hash occurred on every calling of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
accessing relations has pending sync. Almost of them are
eliminated as the result.
Did you actually test this patch? One of the logs added makes the
tests a long time to run:
Maybe this patch requires make clean since it extends the
structure RelationData. (Perhaps I saw the same trouble.)
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...
I understand that the context of "backend process" means
storage.c local. I don't mind the context on which the data is,
but I found only there that can get rid of frequent hash
searching. For pending deletions, just appending to a list is
enough and costs almost nothing; on the other hand, pending syncs
are required to be referenced, sometimes very frequently.
+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs should be tracked out of the relation context.
Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.
Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.
Agreed.
Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
My point is that the hash-search on every tuple insertion should be avoided
even if it happens rarely. Once it was a bit apart from your
original patch, but in the latest patch the significant part
(pending-sync hash) is revived from the original one.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com>
On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Sorry, what I have just sent was broken.
You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.
Thank you for the information. Removing unwanted test scripts
from t/ directories was an annoyance. This makes me happy.
- Relation has new members no_pending_sync and pending_sync that
works as instant cache of an entry in pendingSync hash.
- Commit-time synchronizing is restored as Michael's patch.
- If relfilenode is replaced, pending_sync for the old node is
removed. Anyway this is ignored on abort and meaningless on
commit.
- TAP test is renamed to 012 since some new files have been added.
Accessing pending sync hash occurred on every calling of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
accessing relations has pending sync. Almost of them are
eliminated as the result.
Did you actually test this patch? One of the logs added makes the
tests a long time to run:
Maybe this patch requires make clean since it extends the
structure RelationData. (Perhaps I saw the same trouble.)
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...
I understand that the context of "backend process" means
storage.c local. I don't mind the context on which the data is,
but I found only there that can get rid of frequent hash
searching. For pending deletions, just appending to a list is
enough and costs almost nothing; on the other hand, pending syncs
are required to be referenced, sometimes very frequently.
+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs should be tracked out of the relation context.
Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.
Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.
Agreed.
Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
My point is that the hash-search on every tuple insertion should be avoided
even if it happens rarely. Once it was a bit apart from your
original patch, but in the latest patch the significant part
(pending-sync hash) is revived from the original one.
This patch has followed along since CF 2016-03, do we think we can reach a
conclusion in this CF? It was marked as "Waiting on Author”, based on
developments since in this thread, I’ve changed it back to “Needs Review”
again.
cheers ./daniel
Thank you for your notification.
At Tue, 5 Sep 2017 12:05:01 +0200, Daniel Gustafsson <daniel@yesql.se> wrote in <B3EC34FC-A48E-41AA-8598-BFC5D87CB383@yesql.se>
On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com>
On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Sorry, what I have just sent was broken.
You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.
Thank you for the information. Removing unwanted test scripts
from t/ directories was an annoyance. This makes me happy.
- Relation has new members no_pending_sync and pending_sync that
works as instant cache of an entry in pendingSync hash.
- Commit-time synchronizing is restored as Michael's patch.
- If relfilenode is replaced, pending_sync for the old node is
removed. Anyway this is ignored on abort and meaningless on
commit.
- TAP test is renamed to 012 since some new files have been added.
Accessing pending sync hash occurred on every calling of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
accessing relations has pending sync. Almost of them are
eliminated as the result.
Did you actually test this patch? One of the logs added makes the
tests a long time to run:
Maybe this patch requires make clean since it extends the
structure RelationData. (Perhaps I saw the same trouble.)
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...
I understand that the context of "backend process" means
storage.c local. I don't mind the context on which the data is,
but I found only there that can get rid of frequent hash
searching. For pending deletions, just appending to a list is
enough and costs almost nothing; on the other hand, pending syncs
are required to be referenced, sometimes very frequently.
+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs should be tracked out of the relation context.
Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.
Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.
Agreed.
Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
My point is that the hash-search on every tuple insertion should be avoided
even if it happens rarely. Once it was a bit apart from your
original patch, but in the latest patch the significant part
(pending-sync hash) is revived from the original one.
This patch has followed along since CF 2016-03, do we think we can reach a
conclusion in this CF? It was marked as "Waiting on Author”, based on
developments since in this thread, I’ve changed it back to “Needs Review”
again.
I managed to reload its context into my head. It doesn't apply on
the current master and needs some amendment. I'm going to work on
this.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,
At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170908.163001.53230385.horiguchi.kyotaro@lab.ntt.co.jp>
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...
I understand that the context of "backend process" means
storage.c local. I don't mind the context on which the data is,
but I found only there that can get rid of frequent hash
searching. For pending deletions, just appending to a list is
enough and costs almost nothing; on the other hand, pending syncs
are required to be referenced, sometimes very frequently.
+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs should be tracked out of the relation context.
Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.
Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.
Agreed.
Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
My point is that the hash-search on every tuple insertion should be avoided
even if it happens rarely. Once it was a bit apart from your
original patch, but in the latest patch the significant part
(pending-sync hash) is revived from the original one.
This patch has followed along since CF 2016-03, do we think we can reach a
conclusion in this CF? It was marked as "Waiting on Author”, based on
developments since in this thread, I’ve changed it back to “Needs Review”
again.
I managed to reload its context into my head. It doesn't apply on
the current master and needs some amendment. I'm going to work on
this.
Rebased and slightly modified.
Michael's latest patch, on which this patch is piggybacking, seems to
work perfectly. The motive of my addition is to avoid the frequent hash
access (I think specifically per tuple modification) that occurs while
pending syncs exist. The hash contains at least 6 entries.
The attached patch emits more log messages that will be removed
in the final shape to see how much the addition reduces the hash
access. As a basis of determining the worthiness of the
additional mechanism, I'll show an example of a set of queries
below.
In the log messages, "r" is relation oid, "b" is buffer number,
"hash" is the pointer to the backend-global hash table for
pending syncs. "ent" is the entry in the hash belongs to the
relation, "neg" is a flag indicates that the existing pending
sync hash doesn't have an entry for the relation.
=# set log_min_message to debug2;
=# begin;
=# create table test1(a text primary key);
DEBUG: BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created
=# insert into test1 values ('inserted row');
DEBUG: BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer, hash has not been created
=# copy test1 from '/<somewhere>/copy_data.txt';
DEBUG: BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash access
# (repeats for the number of buffers)
COPY 200
=# create table test3(a text primary key);
DEBUG: BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation, no longer needs hash access.
=# insert into test3 (select a from generate_series(0, 99) a);
DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
DEBUG: BufferNeedsWAL: accessing hash : not found
DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# This table no longer needs hash access, (repeats for the number of tuples)
=# truncate test3;
=# insert into test3 (select a from generate_series(0, 99) a);
DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
DEBUG: BufferNeedsWAL: accessing hash : found
DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=0x171f340, neg = 0
# This table has pending sync but no longer needs hash access,
# (repeats for the number of tuples)
The hash is required in the case of relcache invalidation. When
ent=(nil) and neg = 0 but hash != (nil), it does a hash search and
restores the previous state.
This mechanism avoids most of the hash accesses by replacing them
with just following a pointer. On the other hand, the hash access
occurs only after a relation truncation in the current
transaction. In other words, this won't be in effect unless
table truncation, copy, create table as, alter table or a matview
refresh occurs.
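To put the per-block rule in concrete terms, here is a minimal,
self-contained sketch (the types and struct fields are simplified
stand-ins, not the backend's; the real logic is BufferNeedsWAL() in the
attached storage.c changes):

/*
 * Standalone model of the per-block WAL decision used by the patch.
 * PendingRelSyncModel is a simplified stand-in for PendingRelSync.
 */
#include <stdio.h>
#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
	BlockNumber sync_above;		/* WAL-logging skipped for blocks >= this */
	BlockNumber truncated_to;	/* a truncation WAL record was written here */
} PendingRelSyncModel;

/* Returns true if a change to block 'blkno' still has to be WAL-logged. */
static bool
block_needs_wal(const PendingRelSyncModel *p, BlockNumber blkno)
{
	if (p == NULL || p->sync_above == InvalidBlockNumber ||
		blkno < p->sync_above)
		return true;			/* block existed before the bulk operation */
	if (p->truncated_to != InvalidBlockNumber && blkno >= p->truncated_to)
		return true;			/* replaying the truncation would destroy it */
	return false;				/* skip WAL; the relation is fsync'd at commit */
}

int
main(void)
{
	/* registered by heap_register_sync() when the relation had 10 blocks */
	PendingRelSyncModel bulk = {10, InvalidBlockNumber};
	/* registered when empty, then a WAL-logged truncation to 3 blocks */
	PendingRelSyncModel trunc = {0, 3};

	printf("%d %d %d %d\n",
		   block_needs_wal(&bulk, 5),	/* 1: pre-existing block */
		   block_needs_wal(&bulk, 12),	/* 0: new block */
		   block_needs_wal(&trunc, 2),	/* 0: below the truncation point */
		   block_needs_wal(&trunc, 5));	/* 1: once-truncated block */
	return 0;
}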
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix-wal-level-minimal-michael-horiguchi-3.patch (text/x-patch; charset=us-ascii)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 34,39 ****
--- 34,61 ----
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
***************
*** 56,61 ****
--- 78,84 ----
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+ #include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
***************
*** 2370,2381 **** ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
--- 2393,2398 ----
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 20,25 ****
--- 20,26 ----
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+ #include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
***************
*** 259,265 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
! if (RelationNeedsWAL(relation))
{
XLogRecPtr recptr;
--- 260,266 ----
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
! if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
***************
*** 649,657 **** raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
! HEAP_INSERT_SKIP_FSM |
! (state->rs_use_wal ?
! 0 : HEAP_INSERT_SKIP_WAL));
else
heaptup = tup;
--- 649,655 ----
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
! HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 88,93 ****
--- 88,94 ----
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+ #include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
***************
*** 307,313 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
! if (RelationNeedsWAL(rel))
{
if (XLogRecPtrIsInvalid(recptr))
{
--- 308,314 ----
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
! if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 2007,2012 **** CommitTransaction(void)
--- 2007,2015 ----
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
***************
*** 2235,2240 **** PrepareTransaction(void)
--- 2238,2246 ----
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
***************
*** 2548,2553 **** AbortTransaction(void)
--- 2554,2560 ----
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 29,34 ****
--- 29,35 ----
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+ #include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
***************
*** 64,69 **** typedef struct PendingRelDelete
--- 65,113 ----
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+ typedef struct PendingRelSync
+ {
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+ } PendingRelSync;
+
+ /* Relations that need to be fsync'd at commit */
+ static HTAB *pendingSyncs = NULL;
+
+ static void createPendingSyncsHash(void);
+
+ /*
* RelationCreateStorage
* Create physical storage for a relation.
*
***************
*** 226,231 **** RelationPreserveStorage(RelFileNode rnode, bool atCommit)
--- 270,277 ----
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
***************
*** 260,296 **** RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
! /*
! * Make an XLOG entry reporting the file truncation.
! */
! XLogRecPtr lsn;
! xl_smgr_truncate xlrec;
!
! xlrec.blkno = nblocks;
! xlrec.rnode = rel->rd_node;
! xlrec.flags = SMGR_TRUNCATE_ALL;
!
! XLogBeginInsert();
! XLogRegisterData((char *) &xlrec, sizeof(xlrec));
!
! lsn = XLogInsert(RM_SMGR_ID,
! XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
!
! /*
! * Flush, because otherwise the truncation of the main relation might
! * hit the disk before the WAL record, and the truncation of the FSM
! * or visibility map. If we crashed during that window, we'd be left
! * with a truncated heap, but the FSM or visibility map would still
! * contain entries for the non-existent heap pages.
! */
! if (fsm || vm)
! XLogFlush(lsn);
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
--- 306,386 ----
*/
if (RelationNeedsWAL(rel))
{
! /* no_pending_sync is ignored since new entry is created here */
! if (!rel->pending_sync)
! {
! if (!pendingSyncs)
! createPendingSyncsHash();
! elog(DEBUG2, "RelationTruncate: accessing hash");
! pending = (PendingRelSync *) hash_search(pendingSyncs,
! (void *) &rel->rd_node,
! HASH_ENTER, &found);
! if (!found)
! {
! pending->sync_above = InvalidBlockNumber;
! pending->truncated_to = InvalidBlockNumber;
! }
!
! rel->no_pending_sync= false;
! rel->pending_sync = pending;
! }
!
! if (rel->pending_sync->sync_above == InvalidBlockNumber ||
! rel->pending_sync->sync_above < nblocks)
! {
! /*
! * Make an XLOG entry reporting the file truncation.
! */
! XLogRecPtr lsn;
! xl_smgr_truncate xlrec;
!
! xlrec.blkno = nblocks;
! xlrec.rnode = rel->rd_node;
!
! XLogBeginInsert();
! XLogRegisterData((char *) &xlrec, sizeof(xlrec));
!
! lsn = XLogInsert(RM_SMGR_ID,
! XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
!
! elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
! rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
! nblocks);
!
! /*
! * Flush, because otherwise the truncation of the main relation
! * might hit the disk before the WAL record, and the truncation of
! * the FSM or visibility map. If we crashed during that window,
! * we'd be left with a truncated heap, but the FSM or visibility
! * map would still contain entries for the non-existent heap
! * pages.
! */
! if (fsm || vm)
! XLogFlush(lsn);
!
! rel->pending_sync->truncated_to = nblocks;
! }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+ /* create the hash table to track pending at-commit fsyncs */
+ static void
+ createPendingSyncsHash(void)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
***************
*** 369,374 **** smgrDoPendingDeletes(bool isCommit)
--- 459,482 ----
}
/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+ void
+ RelationRemovePendingSync(Relation rel)
+ {
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+ }
+
+
+ /*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
* The return value is the number of relations scheduled for termination.
***************
*** 419,424 **** smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
--- 527,696 ----
return nrels;
}
+
+ /*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+ void
+ RecordPendingSync(Relation rel)
+ {
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(DEBUG2, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+ }
+
+ /*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+ bool
+ BufferNeedsWAL(Relation rel, Buffer buf)
+ {
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+ /* no further work if we know that we don't have pending sync */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ /*
+ * Hold the entry in rel. This relies on the fact that hash entry
+ * never moves.
+ */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ if (!found)
+ {
+ /* no entry found; don't access the hash any longer */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+ }
+
+ /*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+ void
+ smgrDoPendingSyncs(bool isCommit)
+ {
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+ }
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 2347,2354 **** CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
! * If it does commit, we'll have done the heap_sync at the bottom of this
! * routine first.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
--- 2347,2353 ----
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
! * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
***************
*** 2380,2386 **** CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
! hi_options |= HEAP_INSERT_SKIP_WAL;
}
/*
--- 2379,2385 ----
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
! heap_register_sync(cstate->rel);
}
/*
***************
*** 2862,2872 **** CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
! * If we skipped writing WAL, then we need to sync the heap (but not
! * indexes since those use WAL anyway)
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
--- 2861,2871 ----
FreeExecutorState(estate);
/*
! * If we skipped writing WAL, then we will sync the heap at the end of
! * the transaction. (We used to do it here, but it was later found out
! * that to be safe, we must also avoid WAL-logging any subsequent
! * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
return processed;
}
*** a/src/backend/commands/createas.c
--- b/src/backend/commands/createas.c
***************
*** 567,574 **** intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
! myState->hi_options = HEAP_INSERT_SKIP_FSM |
! (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
--- 567,575 ----
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
! if (!XLogIsNeeded())
! heap_register_sync(intoRelationDesc);
! myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
***************
*** 617,625 **** intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
! /* If we skipped using WAL, must heap_sync before commit */
! if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
! heap_sync(myState->rel);
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
--- 618,624 ----
FreeBulkInsertState(myState->bistate);
! /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
*** a/src/backend/commands/matview.c
--- b/src/backend/commands/matview.c
***************
*** 477,483 **** transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
! myState->hi_options |= HEAP_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
--- 477,483 ----
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
! heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
***************
*** 520,528 **** transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
! /* If we skipped using WAL, must heap_sync before commit */
! if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
! heap_sync(myState->transientrel);
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
--- 520,526 ----
FreeBulkInsertState(myState->bistate);
! /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 4357,4364 **** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
! hi_options |= HEAP_INSERT_SKIP_WAL;
}
else
{
--- 4357,4365 ----
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
! heap_register_sync(newrel);
}
else
{
***************
*** 4624,4631 **** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
--- 4625,4630 ----
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 891,897 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
! if (RelationNeedsWAL(onerel) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
--- 891,897 ----
* page has been previously WAL-logged, and if not, do that
* now.
*/
! if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
***************
*** 1118,1124 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
! if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
--- 1118,1124 ----
}
/* Now WAL-log freezing if necessary */
! if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
***************
*** 1476,1482 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
! if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
--- 1476,1482 ----
MarkBufferDirty(buffer);
/* XLOG stuff */
! if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 451,456 **** static BufferDesc *BufferAlloc(SMgrRelation smgr,
--- 451,457 ----
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
***************
*** 3147,3166 **** PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
! if (RelationUsesLocalBuffers(rel))
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
! if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
--- 3148,3188 ----
void
FlushRelationBuffers(Relation rel)
{
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
! FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
! }
!
! /*
! * Like FlushRelationBuffers(), but the relation is specified by a
! * RelFileNode
! */
! void
! FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
! {
! FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
! }
!
! /*
! * Code shared between functions FlushRelationBuffers() and
! * FlushRelationBuffersWithoutRelCache().
! */
! static void
! FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
! {
! RelFileNode rnode = smgr->smgr_rnode.node;
! int i;
! BufferDesc *bufHdr;
!
! if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
! if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
***************
*** 3177,3183 **** FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
! smgrwrite(rel->rd_smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
--- 3199,3205 ----
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
! smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
***************
*** 3207,3224 **** FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
! if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
! if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
! FlushBuffer(bufHdr, rel->rd_smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
--- 3229,3246 ----
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
! if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
! if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
! FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 72,77 ****
--- 72,78 ----
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+ #include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
***************
*** 418,423 **** AllocateRelationDesc(Form_pg_class relp)
--- 419,428 ----
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
***************
*** 2040,2045 **** formrdesc(const char *relationName, Oid relationReltype,
--- 2045,2054 ----
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
***************
*** 3364,3369 **** RelationBuildLocalRelation(const char *relname,
--- 3373,3382 ----
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 25,34 ****
/* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_WAL 0x0001
! #define HEAP_INSERT_SKIP_FSM 0x0002
! #define HEAP_INSERT_FROZEN 0x0004
! #define HEAP_INSERT_SPECULATIVE 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
--- 25,33 ----
/* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_FSM 0x0001
! #define HEAP_INSERT_FROZEN 0x0002
! #define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
***************
*** 179,184 **** extern void simple_heap_delete(Relation relation, ItemPointer tid);
--- 178,184 ----
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+ extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
*** a/src/include/catalog/storage.h
--- b/src/include/catalog/storage.h
***************
*** 22,34 **** extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
!
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
--- 22,37 ----
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
! extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+ extern void smgrDoPendingSyncs(bool isCommit);
+ extern void RecordPendingSync(Relation rel);
+ bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 190,195 **** extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
--- 190,197 ----
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+ extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
***************
*** 216,221 **** typedef struct RelationData
--- 216,229 ----
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, searching for a registered sync is required if
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
Hello, (does this seem to be a top post?)
The CF status of this patch turned into "Waiting on Author" by
automated CI checking. However, I still don't get any error even
on the current master (69835bc) after make distclean. Also, the
"problematic" patch and my working branch have nothing different
other than line shifts from patching. (So I haven't posted a new one.)
I looked at the location heapam.c:2502 where the CI complains,
and in my working branch I found different code there.
https://travis-ci.org/postgresql-cfbot/postgresql/builds/274777750
1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
1364 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
heapam.c:2502@work branch
2502: /* XLOG stuff */
2503: if (BufferNeedsWAL(relation, buffer))
So I conclude that the CI machinery failed to apply the patch
correctly.
At Thu, 13 Apr 2017 15:29:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170413.152935.100104316.horiguchi.kyotaro@lab.ntt.co.jp>
I'll post a new patch in this way soon.
Here it is.
It contained trailing spaces and was missing the test script. This is the
correct patch.
- Relation has new members no_pending_sync and pending_sync that
work as an instant cache of an entry in the pendingSync hash.
- Commit-time synchronizing is restored as in Michael's patch.
- If relfilenode is replaced, pending_sync for the old node is
removed. In any case this is ignored on abort and meaningless on
commit.
- TAP test is renamed to 012 since some new files have been added.
Accessing the pending sync hash occurred on every call of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
the accessed relations had a pending sync. Almost all of those
accesses are eliminated as a result.
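As a rough illustration of that instant cache (simplified stand-in types,
not the actual RelationData fields or dynahash API), the lookup order
amounts to the following; the real code is BufferNeedsWAL() in the
attached patch:

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int dummy; } PendingSyncModel;

typedef struct
{
	bool		no_pending_sync;	/* known to have no pending sync */
	PendingSyncModel *pending_sync;	/* cached pointer to the hash entry */
} RelationModel;

static int	hash_lookups = 0;		/* counts simulated hash_search() calls */
static PendingSyncModel one_entry;

/* stand-in for hash_search(pendingSyncs, &rel->rd_node, HASH_FIND, ...) */
static PendingSyncModel *
hash_lookup(bool entry_exists)
{
	hash_lookups++;
	return entry_exists ? &one_entry : NULL;
}

static PendingSyncModel *
get_pending_sync(RelationModel *rel, bool entry_exists)
{
	if (rel->no_pending_sync)
		return NULL;				/* fast path: no hash access at all */

	if (rel->pending_sync == NULL)
	{
		/* slow path: taken once per relation, and again after a relcache
		 * invalidation resets both fields */
		rel->pending_sync = hash_lookup(entry_exists);
		if (rel->pending_sync == NULL)
			rel->no_pending_sync = true;	/* negative result is cached too */
	}
	return rel->pending_sync;		/* later calls just follow this pointer */
}

int
main(void)
{
	RelationModel rel = {false, NULL};
	int			i;

	for (i = 0; i < 1000; i++)		/* e.g. one check per inserted tuple */
		(void) get_pending_sync(&rel, true);
	printf("hash lookups for 1000 checks: %d\n", hash_lookups);	/* 1 */
	return 0;
}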
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Sep 13, 2017 at 1:04 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The CF status of this patch turned into "Waiting on Author" by
automated CI checking. However, I still don't get any error even
on the current master (69835bc) after make distclean. Also, the
"problematic" patch and my working branch have nothing different
other than line shifts from patching. (So I haven't posted a new one.)
I looked at the location heapam.c:2502 where the CI complains,
and in my working branch I found different code there.
https://travis-ci.org/postgresql-cfbot/postgresql/builds/274777750
1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
1364 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
heapam.c:2502@work branch
2502: /* XLOG stuff */
2503: if (BufferNeedsWAL(relation, buffer))
So I conclude that the CI machinery failed to apply the patch
correctly.
Hi Horiguchi-san,
Hmm. Here is that line in heapam.c in unpatched master:
It says:
2485 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
After applying fix-wal-level-minimal-michael-horiguchi-3.patch from
this message:
/messages/by-id/20170912.131441.20602611.horiguchi.kyotaro@lab.ntt.co.jp
... that line is unchanged, although it has moved to line number 2502.
It doesn't compile for me, because your patch removed the definition
of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.
I'm not sure what happened. Is it possible that your patch was not
created by diffing against master?
--
Thomas Munro
http://www.enterprisedb.com
Kyotaro HORIGUCHI wrote:
The CF status of this patch turned into "Waiting on Author" by
automated CI checking.
I object to automated turning of patches to waiting on author by
machinery. Sending occasional reminder messages to authors, letting them
know about outdated patches, seems acceptable to me at this stage.
It'll take some time for this machinery to get perfected; only when it
is beyond experimental mode will it be acceptable to change patches'
status in an automated fashion.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 13 Sep 2017 15:05:31 +1200, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=0x7CGYmNM5q7TKzz_KrD+Pr7jbFzD8UZad_+=4PG1PyA@mail.gmail.com>
It doesn't compile for me, because your patch removed the definition
of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.
I'm not sure what happened. Is it possible that your patch was not
created by diffing against master?
It was created using filterdiff.
git diff master --patience | grep options
...
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
but the line disappears from the output of the following command
git diff master --patience | filterdiff --format=context | grep options
filterdiff seems to did something wrong..
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
filterdiff seems to did something wrong..
# to did...
The patch was broken by filterdiff, so I am sending a new patch made
directly with git format-patch. I confirmed that a build completes
with this applied.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix-wal-level-minimal-michael-horiguchi-5.patch (text/x-patch; charset=us-ascii)
From 7086b5855080065f73de4d099cbaab09511f01fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem
---
src/backend/access/heap/heapam.c | 113 +++++++++---
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 318 +++++++++++++++++++++++++++++---
src/backend/commands/copy.c | 13 +-
src/backend/commands/createas.c | 9 +-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 8 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 40 +++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 8 +-
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
17 files changed, 476 insertions(+), 90 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..e40254d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -56,6 +78,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+extern HTAB *pendingSyncs;
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2681,12 +2699,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2701,7 +2717,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2736,6 +2752,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2747,6 +2764,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3303,7 +3321,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4269,7 +4287,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5160,7 +5179,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5894,7 +5913,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6050,7 +6069,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6183,7 +6202,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6292,7 +6311,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7406,7 +7425,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7454,7 +7473,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7539,7 +7558,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8630,8 +8649,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8685,6 +8709,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9121,9 +9147,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9233,3 +9266,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 52231ac..97edb99 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bd560e4..3c457db 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..971d469 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 93dca7a..7fba3df 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2008,6 +2008,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2236,6 +2239,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2549,6 +2555,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..6bc1088 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ /* no_pending_sync is ignored since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ elog(DEBUG2, "RelationTruncate: accessing hash");
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
+/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
* The return value is the number of relations scheduled for termination.
@@ -419,6 +527,170 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since a new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(DEBUG2, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+ /* no further work if we know that we don't have pending sync */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ /*
+ * Hold the entry in rel. This relies on the fact that hash entry
+ * never moves.
+ */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ if (!found)
+ {
+ /* no entry exists; don't access the hash any longer */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..6c0ffae 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2347,8 +2347,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the heap will be synced at commit.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2380,7 +2379,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2862,11 +2861,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found that,
+ * to be safe, we must also avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index e60210c..dbc2028 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index d2e0376..5645a6e 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 96354bd..3fdb99d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4401,8 +4401,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4675,8 +4676,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
@@ -10656,11 +10655,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Any pending sync for the old node is no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..757ed7f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -891,7 +891,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1118,7 +1118,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1476,7 +1476,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 15795b0..be57547 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b8e3780..3dff4ed 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -418,6 +419,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -2040,6 +2045,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3364,6 +3373,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024..79b964f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -179,6 +178,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index a3a97db..03964e2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 98b63fc..598d1a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 4bc61e5..c7610bd 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, if pending_sync is NULL, a search for a registered
+ * sync is required.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.9.2
On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
filterdiff seems to have done something wrong.
The patch was broken by filterdiff, so I am sending a new patch made
directly with git format-patch. I confirmed that a build completes
with this applied.
To my surprise this patch still applies, but it fails the recovery tests. I am
bumping it to the next CF (its 8th registration), as it is a bug fix, and
switching the status to "waiting on author".
--
Michael
At Tue, 28 Nov 2017 10:36:39 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg@mail.gmail.com>
On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
filterdiff seems to have done something wrong.
It's horrid to see that. :p
The patch was broken by filterdiff, so I am sending a new patch made
directly with git format-patch. I confirmed that a build completes
with this applied.
To my surprise this patch still applies, but it fails the recovery tests. I am
bumping it to the next CF (its 8th registration), as it is a bug fix, and
switching the status to "waiting on author".
Thank you for checking that. I saw what is probably the same failure. It
occurred when visibilitymap_set() is called with heapBuf =
InvalidBuffer during recovery. Checking pendingSyncs and
no_pending_sync before the elog fixes it. The DEBUG2 elogs are to be
removed before committing; they are only there to show how the
mechanism works.
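In short, the top of BufferNeedsWAL() in the attached patch now looks roughly like the sketch below; this is a simplified reading aid only, with the DEBUG2 elog and the hash lookup that follow omitted.

bool
BufferNeedsWAL(Relation rel, Buffer buf)
{
	/* Temporary and unlogged relations never need WAL or a pending sync */
	if (!RelationNeedsWAL(rel))
		return false;

	/*
	 * Return before dereferencing buf: when no pending sync can exist
	 * (notably during recovery, where visibilitymap_set() may pass
	 * heapBuf = InvalidBuffer), changes are WAL-logged as usual.
	 */
	if (!pendingSyncs || rel->no_pending_sync)
		return true;

	Assert(BufferIsValid(buf));

	/* ... hash lookup and sync_above / truncated_to checks follow ... */
}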
The attached patch applies on the current HEAD and passes all
recovery tests.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Fix-WAL-logging-problem.patch (text/x-patch; charset=us-ascii)
From af24850bf8ec5ea082d3affce9d0754daf1862ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem
---
src/backend/access/heap/heapam.c | 113 ++++++++---
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 324 +++++++++++++++++++++++++++++---
src/backend/commands/copy.c | 13 +-
src/backend/commands/createas.c | 9 +-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 8 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 40 +++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 8 +-
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
17 files changed, 482 insertions(+), 90 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3acef27..ecb9ad8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than to fsync the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -56,6 +78,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+extern HTAB *pendingSyncs;
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2683,12 +2701,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2703,7 +2719,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2738,6 +2754,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2749,6 +2766,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3305,7 +3323,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4271,7 +4289,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5162,7 +5181,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5896,7 +5915,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6052,7 +6071,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6185,7 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6294,7 +6313,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7408,7 +7427,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7456,7 +7475,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7541,7 +7560,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8632,8 +8651,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8687,6 +8711,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9123,9 +9149,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9235,3 +9268,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 9f33e0c..1f184c9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f93c194..899d7a5 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..971d469 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 046898c..24400e7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2000,6 +2000,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations for which we skipped WAL-logging */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2228,6 +2231,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations for which we skipped WAL-logging */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2541,6 +2547,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..722f740 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ /* no_pending_sync is ignored since a new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ elog(DEBUG2, "RelationTruncate: accessing hash");
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync = false;
+ rel->pending_sync = pending;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
+/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
* The return value is the number of relations scheduled for termination.
@@ -419,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since a new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(DEBUG2, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ /*
+ * Hold the entry in rel. This relies on the fact that hash entry
+ * never moves.
+ */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ if (!found)
+ {
+ /* no entry exists; don't access the hash any longer */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 254be28..1ba8cce 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2357,8 +2357,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the heap will be synced at commit.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2390,7 +2389,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2887,11 +2886,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found that,
+ * to be safe, we must also avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4d77411..01bbb51 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index d2e0376..5645a6e 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..594d7bf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4412,8 +4412,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4686,8 +4687,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
@@ -10727,11 +10726,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Any pending sync for the old node is no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431..82bbf05 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -902,7 +902,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1129,7 +1129,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1487,7 +1487,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 26df7cb..171b17b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 12a5f15..08711b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -73,6 +73,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -414,6 +415,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -2043,6 +2048,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3367,6 +3376,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024..79b964f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -179,6 +178,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index a3a97db..03964e2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 98b63fc..598d1a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 68fd6fb..507844f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise a search for a registered sync is required if
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.9.2
Greetings,
* Kyotaro HORIGUCHI (horiguchi.kyotaro@lab.ntt.co.jp) wrote:
At Tue, 28 Nov 2017 10:36:39 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg@mail.gmail.com>
On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
filterdiff seems to did something wrong..
# to did...
It's horrid to see that:p
The patch is broken by filterdiff, so I send a new patch made
directly by git format-patch. I confirmed that a build completes
with this applied.
To my surprise this patch still applies but fails recovery tests. I am
bumping it to next CF, for what will be its 8th registration, as it is
a bug fix, switching the status to "waiting on author".
Thank you for checking that. I saw what may be the same failure. It
occurred when visibilitymap_set() is called with heapBuf =
InvalidBuffer during recovery. Checking pendingSyncs and
no_pending_sync before the elog fixes it. Anyway, the DEBUG2 elogs
are to be removed before committing; they are just there to show how it
works.
The attached patch applies on the current HEAD and passes all
recovery tests.
This is currently marked as 'waiting on author' in the CF app, but it
sounds like it should be 'Needs review'. If that's the case, please
update the CF app accordingly. If you run into any issues with that,
let me know.
Thanks!
Stephen
Hello,
At Thu, 4 Jan 2018 23:10:40 -0500, Stephen Frost <sfrost@snowman.net> wrote in <20180105041040.GI2416@tamriel.snowman.net>
The attached patch applies on the current HEAD and passes all
recovery tests.
This is currently marked as 'waiting on author' in the CF app, but it
sounds like it should be 'Needs review'. If that's the case, please
update the CF app accordingly. If you run into any issues with that,
let me know.
Thanks!
Thank you for letting me know. The attached is the rebased patch
(though the previous version didn't actually conflict with the current
master), and I have changed the status to "Needs Review".
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
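For reference, the check ordering described above can be seen near the top of
BufferNeedsWAL() in the attached patch. Condensed excerpt (the comments here
are editorial, not verbatim from the patch):

    if (!RelationNeedsWAL(rel))
        return false;               /* temp/unlogged relations never need WAL */
    if (!pendingSyncs || rel->no_pending_sync)
        return true;                /* no pending sync registered; buf is not examined */
    Assert(BufferIsValid(buf));     /* buf may be InvalidBuffer when visibilitymap_set()
                                     * calls this during recovery, so it is only used
                                     * after the two cheap checks above */

The block-number comparisons against sync_above and truncated_to follow after
that point.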
Attachments:
0001-Fix-WAL-logging-problem.patch (text/x-patch; charset=us-ascii)
From 15e3d095b89e9a5bb8025008d1475107b340cbd4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem
---
src/backend/access/heap/heapam.c | 113 ++++++++---
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 324 +++++++++++++++++++++++++++++---
src/backend/commands/copy.c | 13 +-
src/backend/commands/createas.c | 9 +-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 8 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 40 +++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 8 +-
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
17 files changed, 482 insertions(+), 90 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dbc8f2d..df7e050 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -56,6 +78,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/namespace.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+extern HTAB *pendingSyncs;
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2683,12 +2701,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2703,7 +2719,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2738,6 +2754,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2749,6 +2766,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3305,7 +3323,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4271,7 +4289,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5162,7 +5181,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5896,7 +5915,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6052,7 +6071,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6185,7 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6294,7 +6313,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7480,7 +7499,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7528,7 +7547,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7613,7 +7632,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8704,8 +8723,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8759,6 +8783,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9195,9 +9221,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9307,3 +9340,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d1..6dd2ae5 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2..7471d74 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -652,9 +652,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b251e69..4a46444 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ea81f4b..8a0c3b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2001,6 +2001,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2229,6 +2232,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2542,6 +2548,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cff49ba..e9abd49 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
* RelationCreateStorage
* Create physical storage for a relation.
*
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
-
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ /* no_pending_sync is ignored since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ elog(DEBUG2, "RelationTruncate: accessing hash");
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync = false;
+ rel->pending_sync = pending;
+ }
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
+/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
* The return value is the number of relations scheduled for termination.
@@ -419,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(DEBUG2, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ /*
+ * Hold the entry in rel. This relies on the fact that hash entry
+ * never moves.
+ */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ if (!found)
+ {
+ /* no pending sync entry exists; don't consult the hash any longer */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..a7f0e5f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2354,8 +2354,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2387,7 +2386,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2841,11 +2840,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3d82edb..a3c3518 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index ab6a889..33a2167 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f2a928b..81e5ccf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4411,8 +4411,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4685,8 +4686,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
@@ -10668,11 +10667,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index cf7f5e1..bbb0215 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -904,7 +904,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1140,7 +1140,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1498,7 +1498,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e44336..f0f3ac2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 28a4483..ce9f361 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -73,6 +73,7 @@
#include "optimizer/var.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -413,6 +414,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1998,6 +2003,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3322,6 +3331,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b..fff3fd4 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -180,6 +179,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85..49d93cd 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce390..9fae7c6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index aa8add5..9fa06a5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise a search for a registered sync is required if
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.9.2
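As a concrete illustration of the case this patch targets (an editorial
example, not taken from the thread; the table name and file path are made
up): with wal_level = minimal, a bulk load into a table whose relfilenode is
created in the same transaction now goes through heap_register_sync(), so the
COPY skips WAL for the newly added blocks and the heap is flushed and fsync'd
at COMMIT instead.

-- assumes wal_level = minimal on a server with this patch applied
BEGIN;
CREATE TABLE bulk_target (id int, payload text);              -- new relfilenode in this transaction
COPY bulk_target FROM '/tmp/bulk_data.csv' WITH (FORMAT csv); -- hypothetical input file; WAL skipped
COMMIT;                                                       -- heap written out and fsync'd here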
Hello. I found that commit c203d6cf81 conflicts with this patch, so here
is the version rebased onto the current master.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Fix-WAL-logging-problem.patch (text/x-patch; charset=us-ascii)
From 3dac5baf787dc949cfb22a698a0d72b6eb48e75e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem
---
src/backend/access/heap/heapam.c | 113 ++++++++---
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 320 +++++++++++++++++++++++++++++---
src/backend/commands/copy.c | 13 +-
src/backend/commands/createas.c | 9 +-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 8 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 40 +++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 8 +-
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
17 files changed, 480 insertions(+), 88 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d7279248e7..8fd2c2948e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -57,6 +79,7 @@
#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/index.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -2400,12 +2423,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2436,6 +2453,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+extern HTAB *pendingSyncs;
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2509,7 +2527,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2710,12 +2728,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
char *scratch = NULL;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2730,7 +2746,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
* palloc() within a critical section is not safe, so we allocate this
* beforehand.
*/
- if (needwal)
+ if (RelationNeedsWAL(relation))
scratch = palloc(BLCKSZ);
/*
@@ -2765,6 +2781,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2776,6 +2793,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3332,7 +3350,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4307,7 +4325,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5276,7 +5295,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -6020,7 +6039,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6174,7 +6193,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6307,7 +6326,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6416,7 +6435,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7602,7 +7621,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7650,7 +7669,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7735,7 +7754,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8826,8 +8845,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
*/
/* Deal with old tuple version */
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
+
if (oldaction == BLK_NEEDS_REDO)
{
page = BufferGetPage(obuffer);
@@ -8881,6 +8905,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageInit(page, BufferGetPageSize(nbuffer), 0);
newaction = BLK_NEEDS_REDO;
}
+ else if (!XLogRecHasBlockRef(record, 0))
+ newaction = BLK_DONE;
else
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9317,9 +9343,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9429,3 +9462,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d15df..6dd2ae5254 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..7471d7461b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -652,9 +652,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b251e69703..4a46444f33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b88d4ccf74..976fbeb02f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2001,6 +2001,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2229,6 +2232,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2542,6 +2548,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cff49bae9e..e9abd49070 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -63,6 +64,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ /* no_pending_sync is ignored since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ elog(DEBUG2, "RelationTruncate: accessing hash");
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ rel->no_pending_sync = false;
+ rel->pending_sync = pending;
+ }
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -368,6 +458,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -419,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(DEBUG2, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ /*
+ * Hold the entry in rel. This relies on the fact that hash entry
+ * never moves.
+ */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ if (!found)
+ {
+ /* no pending-sync entry exists; skip the hash lookup from now on */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a42861da0d..de9fc12615 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2352,8 +2352,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the heap will be synced at commit.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2385,7 +2384,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -2821,11 +2820,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3d82edbf58..a3c3518c69 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 23892b1b81..f1b48583ba 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 83a881eff3..ee8c80f34f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4481,8 +4481,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4755,8 +4756,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
@@ -10811,11 +10810,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index f9da24c491..78909bc519 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -900,7 +900,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1170,7 +1170,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1531,7 +1531,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 48f92dc430..399390e6c1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -76,6 +76,7 @@
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -417,6 +418,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -2072,6 +2077,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3402,6 +3411,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..fff3fd42aa 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -180,6 +179,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c26c395b0b..040ae3a07a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -218,6 +218,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, if pending_sync is NULL, the hash table must be
+ * searched for a registered sync.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
Hello. I found that c203d6cf81 hit this and this is the rebased
version on the current master.
Okay, as this is visibly the oldest item in this commit fest, Andrew has
asked me to look at a solution which would allow us to definitely close
the loop for all maintained branches. In consequence, I have been
looking at this problem. Here are my thoughts:
- The set of errors reported on this thread are alarming, depending on
the scenarios used, we could have "could not read file" stuff, or even
data loss after WAL replay comes and wipes out everything.
- Disabling completely the TRUNCATE optimization is definitely not cool,
as there could be an impact for users.
- Removing wal_level = minimal is not acceptable as well, as some people
rely on this feature.
- Rewriting the sync handling of heap relation files in an invasive way
may be something to investigate and improve on HEAD (I am not really
convinced about that actually for the optimizations discussed on this
thread as this may result in more bugs than actual fixes), but that
would do nothing for back-branches.
Hence I propose the patch attached which disables the TRUNCATE and COPY
optimizations for two cases, which are the ones actually causing
problems. One solution has been presented by Simon here for COPY, which
is to disable the optimization when there are no blocks on a relation
with wal_level = minimal:
/messages/by-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
For back-patching, I find that really appealing.
The second thing that the patch attached does is to tweak
ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
wal_level = minimal.
Another thing that this patch adds is a set of regression tests to
stress all the various scenarios presented on this thread with table
creation, INSERT, COPY and TRUNCATE running in the same transactions for
both wal_level = minimal and replica, which make sure that there are no
failures and no actual data loss. The test is useful anyway, as none of
the patches presented so far offered an easy way to test all the
scenarios, except for a bash script posted upthread, and even that
skipped some of the cases.
I would propose that for a back-patch, except for the test which can go
down easily to 9.6 but I have not tested that yet.
Thoughts?
--
Michael
Attachments:
wal-minimal-copy-truncate.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..78f7db07f0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -42,6 +42,7 @@
#include "parser/parse_relation.h"
#include "port/pg_bswap.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -2404,7 +2405,16 @@ CopyFrom(CopyState cstate)
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
+
+ /*
+ * Skip writing WAL if there have been no actions that write an init
+ * block for any of the buffers that will be touched during COPY.
+ * Since there is no way of knowing at present which ones these are,
+ * we must use a simple but effective heuristic to ensure safety of
+ * the COPY operation for all cases, which is in this case to check
+ * that the relation copied to has zero blocks.
+ */
+ if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0)
hi_options |= HEAP_INSERT_SKIP_WAL;
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7c0cf0d7ee..90fe27fbf9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1562,10 +1562,15 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
* the table was either created in the current (sub)transaction or has
* a new relfilenode in the current (sub)transaction, then we can just
* truncate it in-place, because a rollback would cause the whole
- * table or the current physical file to be thrown away anyway.
+ * table or the current physical file to be thrown away anyway. This
+ * optimization is not safe with wal_level = minimal as there is no
+ * actual way to know which are the blocks that could have been
+ * touched by another operation done within this same transaction, be
+ * it INSERT or COPY.
*/
- if (rel->rd_createSubid == mySubid ||
- rel->rd_newRelfilenodeSubid == mySubid)
+ if (XLogIsNeeded() &&
+ (rel->rd_createSubid == mySubid ||
+ rel->rd_newRelfilenodeSubid == mySubid))
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);
diff --git a/src/test/recovery/t/015_wal_optimize.pl b/src/test/recovery/t/015_wal_optimize.pl
new file mode 100644
index 0000000000..98a410b125
--- /dev/null
+++ b/src/test/recovery/t/015_wal_optimize.pl
@@ -0,0 +1,120 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to run with the wal_level being tested
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ $node->teardown_node;
+ $node->clean_node;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
On Wed, Jul 4, 2018 at 12:59 AM, Michael Paquier <michael@paquier.xyz> wrote:
I would propose that for a back-patch, except for the test which can go
down easily to 9.6 but I have not tested that yet.
Many thanks for working on this.
+1 for these changes, even though the TRUNCATE fix looks perverse. If
anyone wants to propose further optimizations in this area this would
at least give us a startpoint which is correct.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 04, 2018 at 07:55:53AM -0400, Andrew Dunstan wrote:
Many thanks for working on this.
No problem. Thanks for the lookup.
+1 for these changes, even though the TRUNCATE fix looks perverse. If
anyone wants to propose further optimizations in this area this would
at least give us a startpoint which is correct.
Yes, that's exactly what I am coming at. The optimizations which are
currently broken just cannot and should not be used. If anybody wishes
to improve the current set of optimizations in place for wal_level =
minimal, let's also consider the other patch. Based on the tests I sent
in the previous patch, I have compiled five scenarios by the way:
1) BEGIN -> CREATE TABLE -> TRUNCATE -> COMMIT.
With wal_level = minimal, this fails hard with "could not read block 0
blah" when trying to read the data after commit..
2) BEGIN -> CREATE -> INSERT -> TRUNCATE -> INSERT -> COMMIT, and this
one reports an empty table, without failing, but there should be tuples
from the INSERT.
3) BEGIN -> CREATE -> INSERT -> TRUNCATE -> COPY -> COMMIT, which also
reports an empty table while there should be tuples from the COPY.
4) BEGIN -> CREATE -> INSERT -> TRUNCATE -> INSERT -> COPY -> INSERT ->
COMMIT, which fails at WAL replay with a PANIC: invalid max offset
number.
5) BEGIN -> CREATE -> INSERT -> COPY -> COMMIT, which sees only the
tuple inserted, causing an incorrect number of tuples. If you reverse
the COPY and INSERT, then this is able to pass.
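For anyone who wants to reproduce one of these by hand, scenario 2 comes
down to only a few statements. This is just a sketch mirroring the second
case in the TAP test above; it assumes wal_level = minimal and an immediate
(crash) shutdown right after the commit, and the table name is illustrative:
BEGIN;
CREATE TABLE test2 (id serial PRIMARY KEY);
INSERT INTO test2 VALUES (DEFAULT);
TRUNCATE test2;
INSERT INTO test2 VALUES (DEFAULT);  -- this row should survive the crash
COMMIT;
-- crash the server here (e.g. pg_ctl stop -m immediate), restart, then:
SELECT count(*) FROM test2;          -- expected 1, returns 0 on affected versions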
This stuff really generates a good number of different failures. There
have been so many people participating on this thread that discussing
this approach further would surely be a good step forward, and this
summarizes quite nicely the set of failures discussed here so far. I
would be happy to push forward with this patch to close all the holes
mentioned.
--
Michael
Thanks for picking this up!
(I hope this gets through the email filters this time, sending a shell
script seems to be difficult. I also trimmed the CC list, if that helps.)
On 04/07/18 07:59, Michael Paquier wrote:
Hence I propose the patch attached which disables the TRUNCATE and COPY
optimizations for two cases, which are the ones actually causing
problems. One solution has been presented by Simon here for COPY, which
is to disable the optimization when there are no blocks on a relation
with wal_level = minimal:
/messages/by-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
For back-patching, I find that really appealing.
This fails in the case that there are any WAL-logged changes to the
table while the COPY is running. That can happen at least if the table
has an INSERT trigger, that performs operations on the same table, and
the COPY fires the trigger. That scenario is covered by the little bash
script I posted earlier in this thread
(/messages/by-id/55AFC302.1060805@iki.fi).
Attached is a new version of that script, updated to make it work with v11.
The second thing that the patch attached does is to tweak
ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
wal_level = minimal.
If we go down that route, let's at least keep the TRUNCATE optimization
for temporary and unlogged tables.
- Heikki
Attachments:
On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
This fails in the case that there are any WAL-logged changes to the table
while the COPY is running. That can happen at least if the table has an
INSERT trigger, that performs operations on the same table, and the COPY
fires the trigger. That scenario is covered by the little bash script I
posted earlier in this thread
(/messages/by-id/55AFC302.1060805@iki.fi). Attached
is a new version of that script, updated to make it work with v11.
Thanks for the pointer. My tap test has been covering two out of the
three scenarios you have in your script. I have been able to convert
the extra as the attached, and I have added as well an extra test with
TRUNCATE triggers. So it seems to me that we want to disable the
optimization if any type of trigger are defined on the relation copied
to as it could be possible that these triggers work on the blocks copied
as well, for any BEFORE/AFTER and STATEMENT/ROW triggers. What do you
think?
The second thing that the patch attached does is to tweak
ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
wal_level = minimal.If we go down that route, let's at least keep the TRUNCATE optimization for
temporary and unlogged tables.
Yes, that sounds right. Fixed as well. I have additionally done more
work on the comments.
Thoughts?
--
Michael
Attachments:
wal-minimal-copy-truncate-v2.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..7674369613 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -42,6 +42,7 @@
#include "parser/parse_relation.h"
#include "port/pg_bswap.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -2367,13 +2368,23 @@ CopyFrom(CopyState cstate)
/*----------
* Check to see if we can avoid writing WAL
*
- * If archive logging/streaming is not enabled *and* either
- * - table was created in same transaction as this COPY
+ * WAL can be skipped if all the following conditions are satisfied:
+ * - table was created in same transaction as this COPY.
+ * - archive logging/streaming is not enabled.
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
* If it does commit, we'll have done the heap_sync at the bottom of this
* routine first.
+ * - No triggers are defined on the relation, particularly BEFORE/AFTER
+ * ROW INSERT triggers could try to write data to the same block copied
+ * to when the INSERT are WAL-logged.
+ * - No actions which write an init block for any of the buffers that
+ * will be touched during COPY have happened. Since there is no way of
+ * knowing at present which ones these are, we must use a simple but
+ * effective heuristic to ensure safety of the COPY operation for all
+ * cases, which is in this case to check that the relation copied to has
+ * zero blocks.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2404,7 +2415,10 @@ CopyFrom(CopyState cstate)
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
+
+ if (!XLogIsNeeded() &&
+ cstate->rel->trigdesc == NULL &&
+ RelationGetNumberOfBlocks(cstate->rel) == 0)
hi_options |= HEAP_INSERT_SKIP_WAL;
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7c0cf0d7ee..150f8c1fd2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1562,10 +1562,16 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
* the table was either created in the current (sub)transaction or has
* a new relfilenode in the current (sub)transaction, then we can just
* truncate it in-place, because a rollback would cause the whole
- * table or the current physical file to be thrown away anyway.
+ * table or the current physical file to be thrown away anyway. This
+ * optimization is not safe with wal_level = minimal as there is no
+ * actual way to know which are the blocks that could have been
+ * touched by another operation done within this same transaction, be
+ * it INSERT or COPY. Non-permanent relations can also safely use
+ * this optimization as they don't rely on WAL at recovery.
*/
- if (rel->rd_createSubid == mySubid ||
- rel->rd_newRelfilenodeSubid == mySubid)
+ if ((XLogIsNeeded() || !RelationNeedsWAL(rel)) &&
+ (rel->rd_createSubid == mySubid ||
+ rel->rd_newRelfilenodeSubid == mySubid))
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to run with the wal_level being tested
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
On 07/10/2018 11:32 PM, Michael Paquier wrote:
Thanks for the pointer. My tap test has been covering two out of the
three scenarios you have in your script. I have been able to convert
the extra as the attached, and I have added as well an extra test with
TRUNCATE triggers. So it seems to me that we want to disable the
optimization if any type of trigger are defined on the relation copied
to as it could be possible that these triggers work on the blocks copied
as well, for any BEFORE/AFTER and STATEMENT/ROW triggers. What do you
think?
Yeah, this seems like the only sane approach.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/07/18 16:51, Andrew Dunstan wrote:
Yeah, this seems like the only sane approach.
Doesn't have to be a trigger, could be a CHECK constraint, datatype
input function, etc. Admittedly, having a datatype input function that
inserts to the table is worth a "huh?", but I'm feeling very confident
that we can catch all such cases, and some of them might even be sensible.
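As a concrete sketch of such a non-trigger case (the names, the guard, and
the data file are purely illustrative), a CHECK constraint can reach a
function that writes back into the table being copied into, and that write
takes the ordinary WAL-logged path:
BEGIN;
CREATE FUNCTION check_and_log(v text) RETURNS boolean LANGUAGE plpgsql AS $$
BEGIN
    -- side-effect insert into the same table; guarded to stop recursion
    IF v NOT LIKE 'logged:%' THEN
        INSERT INTO audited VALUES (0, 'logged:' || v);
    END IF;
    RETURN true;
END; $$;
CREATE TABLE audited (id int, val text CHECK (check_and_log(val)));
COPY audited FROM '/tmp/audited.dat';  -- each row's CHECK does a WAL-logged INSERT
COMMIT;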
- Heikki
On Thu, Jul 12, 2018 at 05:12:21PM +0300, Heikki Linnakangas wrote:
Doesn't have to be a trigger, could be a CHECK constraint, datatype input
function, etc. Admittedly, having a datatype input function that inserts to
the table is worth a "huh?", but I'm feeling very confident that we can
catch all such cases, and some of them might even be sensible.
Sure, but do we want to be that invasive? Triggers are easy enough to
block because those are available directly within cstate so you would
know if those are triggered. CHECK constraint can be also easily looked
after by looking at the Relation information, and actually as DEFAULT
values could have an expression we'd want to block them, no? The input
datatype is well, more tricky to deal with as there is no actual way to
know if the INSERT is happening within the context of a COPY and this
could be just C code. One way to tackle that would be to enforce the
optimization to not be used if a non-system data type is used when doing
COPY...
Disabling entirely the optimization for any relation which has a CHECK
constraint or DEFAULT expression basically applies to a hell lot of
them, which makes the optimization, at least it seems to me, useless
because it is never going to apply to most real-world cases.
--
Michael
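As a sketch of the non-trigger case discussed here (names and the file path are invented, and the example is admittedly contrived), a DEFAULT expression calling a PL/pgSQL function is enough to produce WAL-logged inserts in the middle of a WAL-skipping COPY, and a check on cstate->rel->trigdesc alone would not catch it:

BEGIN;
CREATE FUNCTION note_default() RETURNS text LANGUAGE plpgsql VOLATILE AS $$
BEGIN
    -- Side effect: write an extra row into the table being loaded.
    -- Column values are given explicitly so the DEFAULT is not re-evaluated.
    INSERT INTO audit_me (val, note) VALUES ('from default', 'n/a');
    RETURN 'defaulted';
END; $$;
CREATE TABLE audit_me (id serial PRIMARY KEY,
                       val text,
                       note text DEFAULT note_default());
-- "note" is not in the column list, so the DEFAULT fires for every row.
COPY audit_me (val) FROM '/tmp/copy_data.txt';
COMMIT;

The same kind of side effect could hide behind a CHECK constraint that calls a function, or behind a datatype input function written in C.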
On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Doesn't have to be a trigger, could be a CHECK constraint, datatype input
function, etc. Admittedly, having a datatype input function that inserts to
the table is worth a "huh?", but I'm feeling very confident that we can
catch all such cases, and some of them might even be sensible.
Is this sentence missing a "not"? i.e. "I'm not feeling very confident"?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16 July 2018 21:38:39 EEST, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
Doesn't have to be a trigger, could be a CHECK constraint, datatype
input
function, etc. Admittedly, having a datatype input function that
inserts to
the table is worth a "huh?", but I'm feeling very confident that we
can
catch all such cases, and some of them might even be sensible.
Is this sentence missing a "not"? i.e. "I'm not feeling very
confident"?
Yes, sorry.
- Heikki
On 2018-Jul-12, Heikki Linnakangas wrote:
Thanks for the pointer. My tap test has been covering two out of
the three scenarios you have in your script. I have been able to
convert the extra as the attached, and I have added as well an
extra test with TRUNCATE triggers. So it seems to me that we want
to disable the optimization if any type of trigger is defined on
the relation copied to, as it could be possible that these triggers
work on the blocks copied as well, for any BEFORE/AFTER and
STATEMENT/ROW triggers. What do you think?
Yeah, this seems like the only sane approach.
Doesn't have to be a trigger, could be a CHECK constraint, datatype
input function, etc. Admittedly, having a datatype input function that
inserts to the table is worth a "huh?", but I'm feeling very confident
that we can catch all such cases, and some of them might even be
sensible.
A counterexample could be a JSON compression scheme that uses a catalog
for a dictionary of keys. Hasn't this been described already? Also not
completely out of the question for GIS data, I think (Not sure if
PostGIS does this already.)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 09:41:51PM +0300, Heikki Linnakangas wrote:
On 16 July 2018 21:38:39 EEST, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
Doesn't have to be a trigger, could be a CHECK constraint, datatype
input
function, etc. Admittedly, having a datatype input function that
inserts to
the table is worth a "huh?", but I'm feeling very confident that we
can
catch all such cases, and some of them might even be sensible.
Is this sentence missing a "not"? i.e. "I'm not feeling very
confident"?Yes, sorry.
This explains a lot :p
I doubt as well that we'd be able to catch all the holes, as the
conditions where the optimization could be run safely are basically
impossible to check beforehand. I'd like to vote for getting rid of
this optimization for COPY; it can hurt more than it is helpful. Per
the lack of complaints, this could happen only in HEAD?
--
Michael
Hello.
At Mon, 16 Jul 2018 16:14:09 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20180716201409.2qfcneo4qkdwjvpv@alvherre.pgsql>
On 2018-Jul-12, Heikki Linnakangas wrote:
Thanks for the pointer. My tap test has been covering two out of
the three scenarios you have in your script. I have been able to
convert the extra as the attached, and I have added as well an
extra test with TRUNCATE triggers. So it seems to me that we want
to disable the optimization if any type of trigger is defined on
the relation copied to, as it could be possible that these triggers
work on the blocks copied as well, for any BEFORE/AFTER and
STATEMENT/ROW triggers. What do you think?
Yeah, this seems like the only sane approach.
Doesn't have to be a trigger, could be a CHECK constraint, datatype
input function, etc. Admittedly, having a datatype input function that
inserts to the table is worth a "huh?", but I'm feeling very confident
that we can catch all such cases, and some of them might even be
sensible.
A counterexample could be a JSON compression scheme that uses a catalog
for a dictionary of keys. Hasn't this been described already? Also not
completely out of the question for GIS data, I think (Not sure if
PostGIS does this already.)
In the third case, IIUC, disabling bulk-insertion after any
WAL-logged insertion has happened seems to work. The attached diff to
the v2 patch makes the three TAP tests pass. It uses the relcache to
store the last insertion XID, but it will not be invalidated
during a COPY operation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
wal-minimal-copy-truncate-v2-v3.diff (text/x-patch)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 72395a50b8..e5c651b498 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2509,6 +2509,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
+ /*
+ * Bulk insertion is not safe after a WAL-logged insertion in the same
+ * transaction. We don't start bulk insertion under inhibiting conditions,
+ * but we also need to cancel WAL-skipping in the case where a WAL-logged
+ * insertion happens during a bulk insertion. That can be caused by anything
+ * that can insert a tuple during bulk insertion, such as triggers,
+ * constraints or type conversions. We need not worry about a relcache flush
+ * happening while a bulk insertion is running.
+ */
+ if (relation->last_logged_insert_xid == xid)
+ options &= ~HEAP_INSERT_SKIP_WAL;
+
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
@@ -2582,6 +2594,12 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
+
+ /*
+ * If this happens during a bulk insertion, stop WAL skipping for the
+ * rest of the current command.
+ */
+ relation->last_logged_insert_xid = xid;
}
END_CRIT_SECTION();
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 7674369613..7b9a7af2d2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2416,10 +2416,8 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
- if (!XLogIsNeeded() &&
- cstate->rel->trigdesc == NULL &&
- RelationGetNumberOfBlocks(cstate->rel) == 0)
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0)
+ hi_options |= HEAP_INSERT_SKIP_WAL;
}
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 30a956822f..34a692a497 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1575,6 +1575,9 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);
+
+ /* Allow bulk-insert */
+ rel->last_logged_insert_xid = InvalidTransactionId;
}
else
{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6125421d39..99fb7e1dd8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1243,6 +1243,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
/* It's fully valid */
relation->rd_isvalid = true;
+ relation->last_logged_insert_xid = InvalidTransactionId;
+
return relation;
}
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c97f9d1b43..6ee575ad14 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,9 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* XID of the last transaction on which WAL-logged insertion happened */
+ TransactionId last_logged_insert_xid;
} RelationData;
On 07/16/2018 08:01 PM, Michael Paquier wrote:
I doubt as well that we'd be able to catch all the holes, as the
conditions where the optimization could be run safely are basically
impossible to check beforehand. I'd like to vote for getting rid of
this optimization for COPY; it can hurt more than it is helpful. Per
the lack of complaints, this could happen only in HEAD?
Well, we'd be getting rid of it because of a danger of data loss which
we can't otherwise mitigate. Maybe it does need to be backpatched, even
if we haven't had complaints.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 17, 2018 at 8:28 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
Well, we'd be getting rid of it because of a danger of data loss which we
can't otherwise mitigate. Maybe it does need to be backpatched, even if we
haven't had complaints.
What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 18, 2018 at 06:42:10AM -0400, Robert Haas wrote:
On Tue, Jul 17, 2018 at 8:28 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:Well, we'd be getting rid of it because of a danger of data loss which we
can't otherwise mitigate. Maybe it does need to be backpatched, even if we
haven't had complaints.
What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?
For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.
--
Michael
On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz> wrote:
What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?
For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.
Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 18/07/18 16:29, Robert Haas wrote:
On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz> wrote:
What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?
For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.
Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.
Yeah. I'm not happy about backpatching a big patch like what I proposed,
and Kyotaro developed further. But I think it's the least bad option we
have, the other options discussed seem even worse.
One way to review the patch is to look at what it changes, when
wal_level is *not* set to minimal, i.e. what risk or overhead does it
pose to users who are not affected by this bug? It seems pretty safe to me.
The other aspect is, how confident are we that this actually fixes the
bug, with least impact to users using wal_level='minimal'? I think it's
the best shot we have so far. All the other proposals either don't fully
fix the bug, or hurt performance in some legit cases.
I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.
- Heikki
On Wed, Jul 18, 2018 at 05:58:16PM +0300, Heikki Linnakangas wrote:
I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.
Whatever happens here, perhaps one way to move on would be to commit
first the TAP test that I proposed upthread. That would not work for
wal_level=minimal so this part should be commented out, but that's
easier this way to test basically all the cases we talked about with any
approach taken.
--
Michael
Hello.
At Wed, 25 Jul 2018 23:08:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180725140833.GC6660@paquier.xyz>
On Wed, Jul 18, 2018 at 05:58:16PM +0300, Heikki Linnakangas wrote:
I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.
Whatever happens here, perhaps one way to move on would be to commit
first the TAP test that I proposed upthread. That would not work for
wal_level=minimal so this part should be commented out, but that's
easier this way to test basically all the cases we talked about with any
approach taken.
/messages/by-id/20180704045912.GG1672@paquier.xyz
However, I'm not sure the policy (if any) allows us to add a test
that is supposed to just succeed; that said, I'm not opposed to doing
it. But even if we did, it won't be visible to anyone other than us
in this thread. It seems to me more or less similar to pasting a
boilerplate that points to the above message in this thread, or just
writing "this patch passes the test".
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
On 18/07/18 16:29, Robert Haas wrote:
On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
<michael@paquier.xyz> wrote:What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.
Yeah. I'm not happy about backpatching a big patch like what I
proposed, and Kyotaro developed further. But I think it's the least
bad option we have, the other options discussed seem even worse.
One way to review the patch is to look at what it changes, when
wal_level is *not* set to minimal, i.e. what risk or overhead does it
pose to users who are not affected by this bug? It seems pretty safe
to me.
The other aspect is, how confident are we that this actually fixes the
bug, with least impact to users using wal_level='minimal'? I think
it's the best shot we have so far. All the other proposals either
don't fully fix the bug, or hurt performance in some legit cases.
I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.
I have just spent some time reviewing Kyotaro's patch. I'm a bit
nervous, too, given the size. But I'm also nervous about leaving things
as they are. I suspect the reason we haven't heard more about this is
that these days use of "wal_level = minimal" is relatively rare.
I like the fact that this is closer to being a real fix rather than just
throwing out the optimization. Like Heikki I've come round to the view
that something like this is the least bad option.
The code looks good to me - some comments might be helpful in
heap_xlog_update()
Do we want to try this on HEAD and then backpatch it? Do we want to add
some testing along the lines Michael suggested?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello.
At Fri, 27 Jul 2018 15:26:24 -0400, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote in <d0c9e197-5219-c094-418a-e5a6fbd8cdda@2ndQuadrant.com>
On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
On 18/07/18 16:29, Robert Haas wrote:
On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz>
wrote:
What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?
For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.
Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.
Yeah. I'm not happy about backpatching a big patch like what I
proposed, and Kyotaro developed further. But I think it's the least
bad option we have, the other options discussed seem even worse.
One way to review the patch is to look at what it changes, when
wal_level is *not* set to minimal, i.e. what risk or overhead does it
pose to users who are not affected by this bug? It seems pretty safe
to me.
The other aspect is, how confident are we that this actually fixes the
bug, with least impact to users using wal_level='minimal'? I think
it's the best shot we have so far. All the other proposals either
don't fully fix the bug, or hurt performance in some legit cases.
I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.
I have just spent some time reviewing Kyotaro's patch. I'm a bit
nervous, too, given the size. But I'm also nervous about leaving
things as they are. I suspect the reason we haven't heard more about
this is that these days use of "wal_level = minimal" is relatively
rare.
Thank you for looking at this (and sorry for the late response).
I like the fact that this is closer to being a real fix rather than
just throwing out the optimization. Like Heikki I've come round to the
view that something like this is the least bad option.
The code looks good to me - some comments might be helpful in
heap_xlog_update()
Thanks. That part was intended to avoid a PANIC on a broken record. I
have reverted it, since a PANIC would be preferable in that case.
Do we want to try this on HEAD and then backpatch it? Do we want to
add some testing along the lines Michael suggested?
44cac93464 hit this, so I have rebased. I have also added Michael's TAP test
contained in [1] as patch 0001.
I regard [2] as an orthogonal issue.
The previous patch didn't handle the BEGIN; CREATE; TRUNCATE; COMMIT
case. This version contains a "fix" for nbtree (patch 0003) so that an
FPI of the metapage is always emitted when building an empty index. On
the other hand this emits one or two useless FPIs (136 bytes each) on
TRUNCATE in a separate transaction, but that won't matter much. Other
index methods don't have this problem. Some other AMs emit
initialization WAL even in minimal mode.
This still has a few too many elog(DEBUG2)s, left in to show how it is
working. I'm going to remove most of them in the final version.
I have started to prefix the file names with version 2.
regards.
[1]: /messages/by-id/20180711033241.GQ1661@paquier.xyz
[2]: /messages/by-id/CAKJS1f9iF55cwx-LUOreRokyi9UZESXOLHuFDkt0wksZN+KqWw@mail.gmail.com
or
https://commitfest.postgresql.org/20/1811/
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v2-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From 092e7412f361c39530911d4592fb46653ca027ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.
---
src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)
create mode 100644 src/test/recovery/t/016_wal_optimize.pl
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is skipped in some cases for TRUNCATE and COPY queries,
+# which can interact badly with other optimizations depending on the
+# value of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt with here, and should never result in any type of failure or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v2-0002-Fix-WAL-logging-problem.patch (text/x-patch)
From 76d5e5ed12ef510bf7ea43a948979b052bc26aee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH 2/3] Fix WAL logging problem
We skip WAL logging for some bulk insertion operations, but this can
cause corruption when such operations are mixed with truncation. This
patch fixes the issue by making WAL emission decisions at buffer
granularity. With this patch, in minimal mode we still skip WAL logging
for extended pages and fsync them at commit time, but we write WAL for
existing pages and for pages re-extended after a WAL-logged truncation.
---
src/backend/access/heap/heapam.c | 100 +++++++---
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 321 +++++++++++++++++++++++++++++---
src/backend/commands/copy.c | 13 +-
src/backend/commands/createas.c | 9 +-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 8 +-
src/backend/commands/vacuumlazy.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 40 +++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 8 +-
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
17 files changed, 471 insertions(+), 85 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5f1a69ca53..97b4159362 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -57,6 +79,7 @@
#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/index.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -2413,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2449,6 +2466,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+extern HTAB *pendingSyncs;
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2522,7 +2540,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2723,12 +2741,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2770,6 +2786,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2781,6 +2798,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3343,7 +3361,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4322,7 +4340,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5294,7 +5313,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -6038,7 +6057,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6198,7 +6217,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6331,7 +6350,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6440,7 +6459,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7636,7 +7655,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7684,7 +7703,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7769,7 +7788,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -9390,9 +9409,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
@@ -9509,3 +9535,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c9..ec9d1b3113 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -653,9 +653,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6cd00d9aaa..e0ba2aff29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2016,6 +2016,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2245,6 +2248,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2559,6 +2565,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..ef0b75d288 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -225,6 +269,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
void
RelationTruncate(Relation rel, BlockNumber nblocks)
{
+ PendingRelSync *pending = NULL;
+ bool found;
bool fsm;
bool vm;
@@ -259,37 +305,82 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ /* no_pending_sync is ignored since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ elog(DEBUG2, "RelationTruncate: accessing hash");
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ if (!found)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above < nblocks)
+ {
+ /*
+ * Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +458,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ bool found = true;
+ BlockNumber nblocks;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* ignore no_pending_sync since new entry is created here */
+ if (!rel->pending_sync)
+ {
+ if (!pendingSyncs)
+ createPendingSyncsHash();
+
+ /* Look up or create an entry */
+ rel->no_pending_sync = false;
+ elog(DEBUG2, "RecordPendingSync: accessing hash");
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
+ }
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+ if (!found)
+ {
+ rel->pending_sync->truncated_to = InvalidBlockNumber;
+ rel->pending_sync->sync_above = nblocks;
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ }
+ else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ rel->pending_sync->sync_above = nblocks;
+ }
+ else
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pendingSyncs || rel->no_pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+
+ /* do the real work */
+ if (!rel->pending_sync)
+ {
+ bool found = false;
+
+ /*
+ * Hold the entry in rel. This relies on the fact that hash entry
+ * never moves.
+ */
+ rel->pending_sync =
+ (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_FIND, &found);
+ elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ if (!found)
+ {
+ /* no entry for this relation; don't access the hash any longer */
+ rel->no_pending_sync = true;
+ return true;
+ }
+ }
+
+ blkno = BufferGetBlockNumber(buf);
+ if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ rel->pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ rel->pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 86b0fb300f..07f96fde56 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2424,7 +2423,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3079,11 +3078,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d5cb62da15..0f58da40c6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -568,8 +568,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -618,9 +619,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e1eb7c374b..986f7baf39 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e10d3dbf3d..715718450d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4611,8 +4611,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4885,8 +4886,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
@@ -11019,11 +11018,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node is no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..72849a9a94 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1194,7 +1194,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a4fc001103..20ba6fc989 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -419,6 +420,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1872,6 +1877,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3271,6 +3280,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ca5cad7497..c5e5e9a8b2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
@@ -180,6 +179,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 6ecbdb6294..ea44e0e15f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Elsewise searching for registered sync is required if
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
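
To make the caller-side changes in the patch above easier to follow, here is a
minimal standalone sketch of the call ordering they establish for bulk loads
under wal_level = minimal. The three functions are simplified stand-ins for
heap_register_sync(), BufferNeedsWAL() and the at-commit sync of pending
relations; it is an illustration only, not PostgreSQL code.

/*
 * Minimal sketch of the call ordering the patch establishes for bulk loads
 * under wal_level=minimal.  The three functions below are simplified
 * stand-ins (not the real PostgreSQL routines) for heap_register_sync(),
 * BufferNeedsWAL() and the at-commit sync of pending relations.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NO_WATERMARK UINT32_MAX

static uint32_t sync_above = NO_WATERMARK;  /* blocks >= this skip WAL */

static void
register_sync(uint32_t nblocks_at_registration)
{
    /* remember the relation size; only blocks added later may skip WAL */
    sync_above = nblocks_at_registration;
}

static bool
page_needs_wal(uint32_t blkno)
{
    /* pre-existing blocks keep normal WAL-logging */
    return sync_above == NO_WATERMARK || blkno < sync_above;
}

static void
sync_at_commit(void)
{
    if (sync_above != NO_WATERMARK)
        printf("commit: flush buffers and fsync blocks >= %u\n", sync_above);
    sync_above = NO_WATERMARK;
}

int
main(void)
{
    /* e.g. COPY into a table created earlier in the same transaction */
    register_sync(0);

    for (uint32_t blkno = 0; blkno < 3; blkno++)
        printf("block %u: %s\n", blkno,
               page_needs_wal(blkno) ? "WAL-logged" : "WAL skipped");

    sync_at_commit();
    return 0;
}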
v2-0003-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch; charset=us-ascii)
From c57c30c911031ac3257dd58935486fde7d4ddef0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 3/3] Write WAL for empty nbtree index build
After a relation truncation, its indexes are also rebuilt. The rebuild
doesn't emit WAL in minimal mode, and if the truncation happened within
the creation transaction, crash recovery leaves an empty index heap,
which is considered broken. This patch forces WAL to be emitted when an
index_build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even if minimal mode, WAL is required here if truncation happened after
+ * being created in the same transaction. It is not needed otherwise but
+ * we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
+ /*
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. We need to emit WAL for the metapage in the case. However it
+ * is not required elsewise,
+ */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
--
2.16.3
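
The condition added in _bt_blwritepage() above is compact but easy to misread
in diff form: even when btws_use_wal is false, the metapage of an index whose
root is still unset (an empty index) is force-logged. Below is a standalone
model of just that predicate; the names are stand-ins and the block is an
illustration, not the patch's code.

/*
 * Standalone model of the page-logging predicate the nbtree patch adds in
 * _bt_blwritepage(): normally WAL is emitted only when btws_use_wal is set,
 * but the metapage of an index with no root yet (an empty index) is always
 * logged.  Names are stand-ins; this is not PostgreSQL code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BTREE_METAPAGE 0        /* block number of the nbtree metapage */

static bool
must_log_page(bool use_wal, uint32_t blkno, uint32_t root_blkno)
{
    /* root_blkno == 0 models btm_root == 0, i.e. no root page assigned */
    return use_wal || (blkno == BTREE_METAPAGE && root_blkno == 0);
}

int
main(void)
{
    /* wal_level=minimal build of an empty index: metapage must be logged */
    printf("empty index metapage logged: %d\n", must_log_page(false, 0, 0));
    /* same build, a later page of a non-empty index: not force-logged */
    printf("page of non-empty index logged: %d\n", must_log_page(false, 3, 1));
    return 0;
}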
At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181011.134235.218062184.horiguchi.kyotaro@lab.ntt.co.jp>
Hello.
At Fri, 27 Jul 2018 15:26:24 -0400, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote in <d0c9e197-5219-c094-418a-e5a6fbd8cdda@2ndQuadrant.com>
On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
On 18/07/18 16:29, Robert Haas wrote:
On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz>
wrote:

What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?

For back-branches that's very invasive so that seems risky to me,
particularly seeing the low number of complaints on the matter.

Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.

Yeah. I'm not happy about backpatching a big patch like what I
proposed, and Kyotaro developed further. But I think it's the least
bad option we have; the other options discussed seem even worse.

One way to review the patch is to look at what it changes when
wal_level is *not* set to minimal, i.e. what risk or overhead does it
pose to users who are not affected by this bug? It seems pretty safe
to me.

The other aspect is, how confident are we that this actually fixes the
bug, with the least impact on users using wal_level='minimal'? I think
it's the best shot we have so far. All the other proposals either
don't fully fix the bug, or hurt performance in some legitimate cases.

I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.

I have just spent some time reviewing Kyotaro's patch. I'm a bit
nervous, too, given the size. But I'm also nervous about leaving
things as they are. I suspect the reason we haven't heard more about
this is that these days use of "wal_level = minimal" is relatively
rare.

Thank you for looking at this (and sorry for the late response).

I like the fact that this is closer to being a real fix rather than
just throwing out the optimization. Like Heikki, I've come round to the
view that something like this is the least bad option.

The code looks good to me - some comments might be helpful in
heap_xlog_update().

Thanks. That code is intended to avoid a PANIC on a broken record. I
have reverted that part, since a PANIC would be preferable in that case.

Do we want to try this on HEAD and then backpatch it? Do we want to
add some testing along the lines Michael suggested?

44cac93464 hit this, so I rebased, and added Michael's TAP test
contained in [1] as patch 0001. I regard [2] as an orthogonal issue.

The previous patch didn't handle the BEGIN; CREATE; TRUNCATE; COMMIT
case. This version contains a "fix" for nbtree (patch 0003) so that an
FPI of the metapage is always emitted when building an empty index. On
the other hand, this emits one or two useless FPIs (136 bytes each) on
a TRUNCATE in a separate transaction, but that shouldn't matter much.
Other index methods don't have this problem; some other AMs emit
initialization WAL even in minimal mode.

This still has a few too many elog(DEBUG2) calls, left in so that we
can see how it is working. I'm going to remove most of them in the
final version.

I started to prefix the file names with version 2.
regards.
[1] /messages/by-id/20180711033241.GQ1661@paquier.xyz
[2] /messages/by-id/CAKJS1f9iF55cwx-LUOreRokyi9UZESXOLHuFDkt0wksZN+KqWw@mail.gmail.com
I refactored getPendingSyncEntry out of RecordPendingSync,
BufferNeedsWAL and RelationTruncate, and split the second patch into an
infrastructure-side patch and a user-side patch. I expect this makes
reviewing far easier.

I also replaced RelationNeedsWAL in a part of the code added to
heap_update() by bfa2ab56bb.
- v3-0001-TAP-test-for-copy-truncation-optimization.patch
TAP test
- v3-0002-Write-WAL-for-empty-nbtree-index-build.patch
nbtree "fix"
- v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch
Pending-sync infrastructure.
- v3-0004-Fix-WAL-skipping-feature.patch
Actual fix of WAL skipping feature.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v3-0004-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From cdbb6f3af2b66f3b2fefd374e0bcf2bc7096a17a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.
This patch replaces the HEAP_INSERT_SKIP_WAL-based means of skipping
WAL with the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 71 ++++++++++++++++++++++-----------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 4 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 +++---
src/backend/commands/createas.c | 9 ++---
src/backend/commands/matview.c | 6 +--
src/backend/commands/tablecmds.c | 5 +--
src/backend/commands/vacuumlazy.c | 6 +--
src/include/access/heapam.h | 7 ++--
10 files changed, 73 insertions(+), 54 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4e823b6e39..46a3dda09f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transacton, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2450,6 +2466,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2523,7 +2540,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2724,12 +2741,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2771,6 +2786,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2782,6 +2798,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3344,7 +3361,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4101,7 +4118,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -4323,7 +4340,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5295,7 +5313,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -6039,7 +6057,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6199,7 +6217,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6332,7 +6350,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6441,7 +6459,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7637,7 +7655,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7685,7 +7703,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7770,7 +7788,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -9391,9 +9409,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c9..ec9d1b3113 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -653,9 +653,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}
else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
- HEAP_INSERT_SKIP_FSM |
- (state->rs_use_wal ?
- 0 : HEAP_INSERT_SKIP_WAL));
+ HEAP_INSERT_SKIP_FSM);
else
heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 86b0fb300f..07f96fde56 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2424,7 +2423,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3079,11 +3078,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d5cb62da15..0f58da40c6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -568,8 +568,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -618,9 +619,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e1eb7c374b..986f7baf39 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6dff2c696b..715718450d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4611,8 +4611,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4885,8 +4886,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..72849a9a94 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1194,7 +1194,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 169c2f730e..c5e5e9a8b2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
typedef struct BulkInsertStateData *BulkInsertState;
--
2.16.3
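
The per-buffer test that replaces RelationNeedsWAL() throughout this patch
reduces to two block-number comparisons against the bookkeeping (sync_above
and truncated_to) added by the infrastructure patch that follows. The
standalone sketch below models only that rule; InvalidBlockNumber is
approximated by UINT32_MAX, and the relation-level preconditions
(wal_level = minimal, relation created in the current transaction) are
assumed to hold. It is not PostgreSQL code.

/*
 * Standalone model of the BufferNeedsWAL() decision from the patch set.
 * sync_above / truncated_to mirror the PendingRelSync fields; this is an
 * illustration, not PostgreSQL code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INVALID_BLOCK UINT32_MAX

typedef struct
{
    uint32_t sync_above;    /* WAL skipped for blocks >= sync_above */
    uint32_t truncated_to;  /* a truncation WAL record covers blocks >= this */
} pending_sync_model;

static bool
block_needs_wal(const pending_sync_model *ps, uint32_t blkno)
{
    if (ps == NULL)
        return true;    /* nothing registered: WAL-log as usual */

    /* blocks below the registered watermark were never skipped */
    if (ps->sync_above == INVALID_BLOCK || blkno < ps->sync_above)
        return true;

    /*
     * Once a truncation has been WAL-logged, later changes to the truncated
     * range must be WAL-logged too, or replaying the truncate record would
     * wipe them out; that is the corner case behind this thread.
     */
    if (ps->truncated_to != INVALID_BLOCK && blkno >= ps->truncated_to)
        return true;

    return false;       /* safe to skip; the relation is fsync'd at commit */
}

int
main(void)
{
    /* CREATE + COPY only: everything above the watermark skips WAL */
    pending_sync_model copy_only = {0, INVALID_BLOCK};
    /* CREATE + INSERT + WAL-logged TRUNCATE, then COPY: nothing is skipped */
    pending_sync_model truncated = {0, 0};

    printf("copy-only, block 0 needs WAL: %d\n", block_needs_wal(&copy_only, 0));
    printf("truncated, block 0 needs WAL: %d\n", block_needs_wal(&truncated, 0));
    return 0;
}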
v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch; charset=us-ascii)
From 521064b509f640388e5c0d3fca12d5538d212635 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in
the same transaction in minimal mode just by signaling it with the
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that result in a corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and BufferNeedsWAL() should be used
instead of RelationNeedsWAL() at the places that decide on WAL-logging
of heap-modifying operations.
---
src/backend/access/heap/heapam.c | 31 ++++
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 317 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 40 ++++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 395 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5f1a69ca53..4e823b6e39 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/index.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -9509,3 +9510,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6cd00d9aaa..e0ba2aff29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2016,6 +2016,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2245,6 +2248,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2559,6 +2565,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandone pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..e14ce64fc4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ PendingRelSync *pending_sync;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above < nblocks)
+ {
+ /*
+ * This is the first time truncation of this relation in this
+ * transaction or truncation that leaves pages that need at-commit
+ * fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * getPendingSyncEntry: get pendig sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation. Creates one if needed when create is
+ * true.
+ */
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+ PendingRelSync *pendsync_entry = NULL;
+ bool found;
+
+ if (rel->pending_sync)
+ return rel->pending_sync;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->no_pending_sync)
+ return NULL;
+
+ if (!pendingSyncs)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+ rel->rd_node.relNode);
+ pendsync_entry = (PendingRelSync *)
+ hash_search(pendingSyncs, (void *) &rel->rd_node,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!pendsync_entry)
+ {
+ rel->no_pending_sync = true;
+ return NULL;
+ }
+
+ /* new entry created */
+ if (!found)
+ {
+ pendsync_entry->truncated_to = InvalidBlockNumber;
+ pendsync_entry->sync_above = InvalidBlockNumber;
+ }
+
+ /* hold shortcut in Relation */
+ rel->no_pending_sync = false;
+ rel->pending_sync = pendsync_entry;
+
+ return pendsync_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ BlockNumber nblocks;
+ PendingRelSync *pending_sync;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (pending_sync->sync_above != InvalidBlockNumber)
+ {
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+
+ return;
+ }
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ PendingRelSync *pending_sync;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch exising pending sync entry */
+ pending_sync = getPendingSyncEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /* we don't skip WAL-logging for pages that already done */
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending_sync->truncated_to != InvalidBlockNumber &&
+ pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e10d3dbf3d..6dff2c696b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11019,11 +11019,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node is no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a4fc001103..20ba6fc989 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -419,6 +420,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1872,6 +1877,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3271,6 +3280,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ca5cad7497..169c2f730e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,6 +180,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 6ecbdb6294..ea44e0e15f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Elsewise searching for registered sync is required if
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
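
At transaction end the infrastructure above walks the pendingSyncs hash and
either flushes or forgets each entry. The following standalone model
compresses that end-of-transaction pass: a small array stands in for the
RelFileNode-keyed hash, and the second entry illustrates that
heap_register_sync() registers a TOAST relation alongside its parent. It
sketches the behaviour only and is not the patch's code.

/*
 * Standalone model of the end-of-transaction pass over pending syncs
 * (cf. smgrDoPendingSyncs in the patch): on commit every registered
 * relation is flushed and fsync'd, on abort the registrations are simply
 * discarded.  A fixed array stands in for the RelFileNode-keyed hash.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INVALID_BLOCK UINT32_MAX

typedef struct
{
    const char *name;        /* stand-in for the RelFileNode key */
    uint32_t    sync_above;  /* watermark; INVALID_BLOCK if never registered */
} pending_entry;

static void
do_pending_syncs(pending_entry *entries, int n, bool is_commit)
{
    for (int i = 0; i < n; i++)
    {
        if (is_commit && entries[i].sync_above != INVALID_BLOCK)
            printf("commit: flush + fsync %s (WAL skipped from block %u)\n",
                   entries[i].name, entries[i].sync_above);
        /* on abort there is nothing to do: the new file is discarded anyway */
        entries[i].sync_above = INVALID_BLOCK;
    }
}

int
main(void)
{
    /* heap_register_sync() registers the heap and its TOAST relation */
    pending_entry pending[] = {
        {"main heap", 0},
        {"toast heap", 0},
    };

    do_pending_syncs(pending, 2, true);   /* CommitTransaction path */
    return 0;
}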
v3-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch; charset=us-ascii)
From 19d9f2ec8868df606eabf3987140b7a305449536 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build
After a relation truncation, its indexes are also rebuilt. The rebuild
doesn't emit WAL in minimal mode, and if the truncation happened within
the creation transaction, crash recovery leaves an empty index heap,
which is considered broken. This patch forces WAL to be emitted when an
index_build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even if minimal mode, WAL is required here if truncation happened after
+ * being created in the same transaction. It is not needed otherwise but
+ * we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
+ /*
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. We need to emit WAL for the metapage in the case. However it
+ * is not required elsewise,
+ */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
--
2.16.3
v3-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 092e7412f361c39530911d4592fb46653ca027ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.
---
src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)
create mode 100644 src/test/recovery/t/016_wal_optimize.pl
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is optimized away in some cases for TRUNCATE and COPY, and
+# those optimizations can interact badly with one another depending on the
+# wal_level setting, particularly "minimal" as opposed to "replica". The
+# optimization may be enabled or disabled depending on the scenario dealt
+# with here, but either way replay should never result in any kind of
+# failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Set up a node with the wal_level being tested
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
Hello.
At Thu, 11 Oct 2018 17:04:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181011.170453.123148806.horiguchi.kyotaro@lab.ntt.co.jp>
At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181011.134235.218062184.horiguchi.kyotaro@lab.ntt.co.jp>
I refactored getPendingSyncEntry out of RecordPendingSync,
BufferNeedsWAL and RelationTruncate, and split the second patch into
infrastructure-side and user-side ones. I expect that makes reviewing
far easier. I also replaced RelationNeedsWAL in the part of
heap_update() added by bfa2ab56bb.
- v3-0001-TAP-test-for-copy-truncation-optimization.patch
TAP test
- v3-0002-Write-WAL-for-empty-nbtree-index-build.patch
nbtree "fix"
- v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch
Pending-sync infrastructure.
- v3-0004-Fix-WAL-skipping-feature.patch
Actual fix of WAL skipping feature.
0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
Successfully built and passed all regression/recovery tests
including additional recovery/t/016_wal_optimize.pl.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v4-0004-Fix-WAL-skipping-feature.patch (text/x-patch)
From 666d27dbc47c9963e5098904ffb9b173effaf853 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on the
HEAP_INSERT_SKIP_WAL flag with the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 71 ++++++++++++++++++++++-----------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 3 --
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 +++---
src/backend/commands/createas.c | 9 ++---
src/backend/commands/matview.c | 6 +--
src/backend/commands/tablecmds.c | 5 +--
src/backend/commands/vacuumlazy.c | 6 +--
src/include/access/heapam.h | 9 ++---
10 files changed, 73 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7caa3ec248..a68eae9b11 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2455,6 +2471,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* TID where the tuple was stored. But note that any toasting of fields
* within the tuple data is NOT reflected into *tup.
*/
+
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
@@ -2528,7 +2545,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2730,7 +2747,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2738,7 +2754,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2780,6 +2795,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2791,6 +2807,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3353,7 +3370,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4110,7 +4127,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -4332,7 +4349,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5304,7 +5322,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -6048,7 +6066,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6208,7 +6226,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6341,7 +6359,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6450,7 +6468,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7646,7 +7664,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7694,7 +7712,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7779,7 +7797,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -9400,9 +9418,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1..d2f78199ee 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -655,9 +655,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* The new relfilenode's relcache entrye doesn't have the necessary
* information to determine whether a relation should emit data for
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..f54f80777b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the commit processing will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2424,7 +2423,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3078,11 +3077,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found that,
+ * to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d5cb62da15..0f58da40c6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -568,8 +568,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -618,9 +619,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e1eb7c374b..986f7baf39 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c507a1ab34..98084ad98c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4617,8 +4617,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4891,8 +4892,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..72849a9a94 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1194,7 +1194,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f1d4a803ae..708cdd6cc5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,11 +25,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL 0x0010
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
--
2.16.3
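Taken together, the series replaces the old per-call HEAP_INSERT_SKIP_WAL
flag with a two-step convention: register the relation once before the
bulk operation, then let every heap modification consult BufferNeedsWAL()
for its buffer (heap_insert() and friends do that internally after this
patch). The following is a minimal caller-side sketch under that
assumption; the bulk-load helper itself is hypothetical.

#include "postgres.h"
#include "access/heapam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "utils/rel.h"

/* hypothetical bulk-load helper illustrating the calling convention */
static void
bulk_load_sketch(Relation rel, HeapTuple *tuples, int ntuples)
{
	int			i;

	/*
	 * Register the pending sync once, up front, when WAL is not needed for
	 * archival or replication; the heap is then fsync'd at commit instead.
	 */
	if (!XLogIsNeeded())
		heap_register_sync(rel);

	for (i = 0; i < ntuples; i++)
		heap_insert(rel, tuples[i], GetCurrentCommandId(true),
					HEAP_INSERT_SKIP_FSM, NULL);

	/* no explicit heap_sync() at the end any more: commit takes care of it */
}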
v4-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch)
From ec791053430111c5ec62d659b9104c8163b95916 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in the
same transaction in minimal mode simply by signaling with the
HEAP_INSERT_SKIP_WAL option on heap operations. That mechanism can emit
WAL records that result in a corrupt state for certain sequences of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and BufferNeedsWAL() should be used
instead of RelationNeedsWAL() wherever WAL-logging of heap-modifying
operations is decided.
---
src/backend/access/heap/heapam.c | 31 ++++
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 317 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 40 ++++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 395 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fb63471a0e..7caa3ec248 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/index.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -9518,3 +9519,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e07b..2a77f7daa3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2016,6 +2016,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations for which we skipped WAL-logging */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2245,6 +2248,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations for which we skipped WAL-logging */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2559,6 +2565,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..e14ce64fc4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ PendingRelSync *pending_sync;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get the pending sync entry, creating it if none exists yet */
+ pending_sync = getPendingSyncEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above < nblocks)
+ {
+ /*
+ * This is the first time truncation of this relation in this
+ * transaction or truncation that leaves pages that need at-commit
+ * fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * getPendingSyncEntry: get pending sync entry.
+ *
+ * Returns the pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation. Creates one if none exists yet and
+ * 'create' is true.
+ */
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+ PendingRelSync *pendsync_entry = NULL;
+ bool found;
+
+ if (rel->pending_sync)
+ return rel->pending_sync;
+
+ /* we know we don't have a pending sync entry */
+ if (!create && rel->no_pending_sync)
+ return NULL;
+
+ if (!pendingSyncs)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+ rel->rd_node.relNode);
+ pendsync_entry = (PendingRelSync *)
+ hash_search(pendingSyncs, (void *) &rel->rd_node,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!pendsync_entry)
+ {
+ rel->no_pending_sync = true;
+ return NULL;
+ }
+
+ /* new entry created */
+ if (!found)
+ {
+ pendsync_entry->truncated_to = InvalidBlockNumber;
+ pendsync_entry->sync_above = InvalidBlockNumber;
+ }
+
+ /* hold shortcut in Relation */
+ rel->no_pending_sync = false;
+ rel->pending_sync = pendsync_entry;
+
+ return pendsync_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ BlockNumber nblocks;
+ PendingRelSync *pending_sync;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get the pending sync entry, creating it if none exists yet */
+ pending_sync = getPendingSyncEntry(rel, true);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (pending_sync->sync_above != InvalidBlockNumber)
+ {
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+
+ return;
+ }
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ PendingRelSync *pending_sync;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch the existing pending sync entry, if any */
+ pending_sync = getPendingSyncEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /* never skip WAL-logging for blocks below the registration point */
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending_sync->truncated_to != InvalidBlockNumber &&
+ pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 946119fa86..c507a1ab34 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11025,11 +11025,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index aecbd4a943..280b481e88 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -419,6 +420,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know yet whether a pending sync exists for this relation */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1872,6 +1877,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know yet whether a pending sync exists for this relation */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3271,6 +3280,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 40e153f71a..f1d4a803ae 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -181,6 +181,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 84469f5715..55af2aa6bc 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, a search for a registered sync is required when
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
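For review purposes, the per-block decision that BufferNeedsWAL() makes
can be restated compactly. The following is only a sketch of the rule,
assuming the sync_above and truncated_to fields of PendingRelSync from
the patch; the real function additionally handles the relcache shortcut
and the pendingSyncs hash lookup.

#include "postgres.h"
#include "storage/block.h"

/* sketch: must a change to block 'blkno' be WAL-logged? */
static bool
block_needs_wal_sketch(BlockNumber blkno,
					   BlockNumber sync_above,
					   BlockNumber truncated_to)
{
	/* blocks that existed before heap_register_sync() keep normal WAL */
	if (sync_above == InvalidBlockNumber || blkno < sync_above)
		return true;

	/* a WAL-logged truncation forces WAL again at and above its cutoff */
	if (truncated_to != InvalidBlockNumber && blkno >= truncated_to)
		return true;

	/* otherwise skip WAL; the relation is fsync'd at commit */
	return false;
}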
v4-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch)
From c6e5f68e7b0e6036ff96c7789f9f4314e449a990 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build
After relation truncation, indexes are also rebuilt. The rebuild does not
emit WAL in minimal mode, and if the truncation happened within the
index's creation transaction, crash recovery leaves an empty index heap,
which is considered broken. This patch forces WAL to be emitted when an
index build results in an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even in minimal mode, WAL is required here if a truncation happened
+ * after the relation was created in the same transaction. It is not
+ * needed otherwise, but we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
+ /*
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. We need to emit WAL for the metapage in that case, but it
+ * is not required otherwise.
+ */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
--
2.16.3
v4-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From ee1624fe2f3d556da2ce9b41c32576fedef686fa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.
---
src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)
create mode 100644 src/test/recovery/t/016_wal_optimize.pl
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is optimized away in some cases for TRUNCATE and COPY, and
+# those optimizations can interact badly with one another depending on the
+# wal_level setting, particularly "minimal" as opposed to "replica". The
+# optimization may be enabled or disabled depending on the scenario dealt
+# with here, but either way replay should never result in any kind of
+# failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Set up a node with the wal_level being tested
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
Successfully built and passed all regression/recovery tests
including additional recovery/t/016_wal_optimize.pl.
Thank you for working on this patch. Unfortunately, cfbot complains that
v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
Could you please post a rebased version one more time?
On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
On 18/07/18 16:29, Robert Haas wrote:
On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
<michael@paquier.xyz> wrote:

What's wrong with the approach proposed in
/messages/by-id/55AFC302.1060805@iki.fi ?

For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.

Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.

Yeah. I'm not happy about backpatching a big patch like what I
proposed, and Kyotaro developed further. But I think it's the least
bad option we have, the other options discussed seem even worse.

One way to review the patch is to look at what it changes, when
wal_level is *not* set to minimal, i.e. what risk or overhead does it
pose to users who are not affected by this bug? It seems pretty safe
to me.

The other aspect is, how confident are we that this actually fixes the
bug, with least impact to users using wal_level='minimal'? I think
it's the best shot we have so far. All the other proposals either
don't fully fix the bug, or hurt performance in some legit cases.

I'd suggest that we continue based on the patch that Kyotaro posted at
/messages/by-id/20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp.

I have just spent some time reviewing Kyotaro's patch. I'm a bit
nervous, too, given the size. But I'm also nervous about leaving things
as they are. I suspect the reason we haven't heard more about this is
that these days use of "wal_level = minimal" is relatively rare.

I'm totally out of context of this patch, but reading this makes me nervous
too. Taking into account that the problem now is lack of review, do you have
plans to spend more time reviewing this patch?
Hello.
At Fri, 30 Nov 2018 18:27:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcV6MUg1BEoQUywX917Oiz6JoMdoZ1Vu3RT5GgBb-yPszg@mail.gmail.com>
On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
Successfully built and passed all regression/recovery tests
including additional recovery/t/016_wal_optimize.pl.
Thank you for working on this patch. Unfortunately, cfbot complains that
v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
Could you please post a rebased version one more time?
Thanks. Here's the rebased version. I found no other amendment
required other than the apparent conflict.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v5-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 120f3f1d4dc47eb74a6ad7fde3c116e31b8eab3e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.
---
src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)
create mode 100644 src/test/recovery/t/016_wal_optimize.pl
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL-logging of TRUNCATE and COPY is optimized in some cases, and those
+# optimizations can interact badly with one another depending on the
+# wal_level setting, particularly "minimal" or "replica". Whether or not
+# the optimization applies in the scenarios exercised here, replay must
+# never result in failures or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v5-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch; charset=us-ascii)
From 7b29c2c9b3d19fd6230bc5663df9d6953197479a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build
After relation truncation, indexes are also rebuilt. The rebuild does not
emit WAL in minimal mode, and if the truncation happened within the
index's creation transaction, crash recovery leaves an empty index
relation, which is considered broken. This patch forces WAL to be emitted
when an index build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even in minimal mode, WAL is required here if the relation was
+ * truncated after being created in the same transaction. It is not
+ * needed otherwise, but we don't bother identifying that case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
+ /*
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. We need to emit WAL for the metapage in that case; it is
+ * not required otherwise.
+ */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
--
2.16.3
v5-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch; charset=us-ascii)
From 92d023071580e3f211a82b191b1afe9afbe824b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in the
same transaction in minimal mode just by signaling it with the
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can emit
WAL records that result in a corrupt state for certain sequences of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and BufferNeedsWAL() should be used
instead of RelationNeedsWAL() at the places that decide on WAL-logging
of heap-modifying operations.
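Condensed from the copy.c and heapam.c hunks in this patch and the
following one, the intended calling convention looks roughly like this
(a sketch for orientation only, not a compilable excerpt):

    /* A bulk loader (e.g. COPY) opts in once, when WAL is not needed
     * for archiving or replication: */
    if (!XLogIsNeeded())
        heap_register_sync(rel);   /* remember current size; fsync at commit */

    /* Every later heap modification then asks per buffer, instead of
     * testing RelationNeedsWAL(rel) per relation: */
    if (BufferNeedsWAL(rel, buffer))
    {
        /* build and insert the WAL record as before */
    }
    /* otherwise the change is not WAL-logged; smgrDoPendingSyncs(true)
     * flushes and fsyncs the relation at commit instead */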
---
src/backend/access/heap/heapam.c | 31 ++++
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 317 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 40 ++++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 395 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9650145642..8f1ea73541 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
#include "catalog/catalog.h"
#include "catalog/namespace.h"
#include "catalog/index.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -9460,3 +9461,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400384..d79b2a94dc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2020,6 +2020,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2249,6 +2252,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2563,6 +2569,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..e14ce64fc4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ PendingRelSync *pending_sync;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above < nblocks)
+ {
+ /*
+ * This is the first truncation of this relation in this
+ * transaction, or a truncation that leaves pages needing an
+ * at-commit fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * getPendingSyncEntry: get pending sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation. Creates one if needed when create is
+ * true.
+ */
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+ PendingRelSync *pendsync_entry = NULL;
+ bool found;
+
+ if (rel->pending_sync)
+ return rel->pending_sync;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->no_pending_sync)
+ return NULL;
+
+ if (!pendingSyncs)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+ rel->rd_node.relNode);
+ pendsync_entry = (PendingRelSync *)
+ hash_search(pendingSyncs, (void *) &rel->rd_node,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!pendsync_entry)
+ {
+ rel->no_pending_sync = true;
+ return NULL;
+ }
+
+ /* new entry created */
+ if (!found)
+ {
+ pendsync_entry->truncated_to = InvalidBlockNumber;
+ pendsync_entry->sync_above = InvalidBlockNumber;
+ }
+
+ /* hold shortcut in Relation */
+ rel->no_pending_sync = false;
+ rel->pending_sync = pendsync_entry;
+
+ return pendsync_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ BlockNumber nblocks;
+ PendingRelSync *pending_sync;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (pending_sync->sync_above != InvalidBlockNumber)
+ {
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+
+ return;
+ }
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ PendingRelSync *pending_sync;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ pending_sync = getPendingSyncEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /* we don't skip WAL-logging for blocks that were already WAL-logged */
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending_sync->truncated_to != InvalidBlockNumber &&
+ pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ad8c176793..879c3d981e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10905,11 +10905,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9817770aff..1cb93ca486 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index c3071db1cd..40b00e1275 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
#include "pgstat.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -417,6 +418,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1868,6 +1873,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3264,6 +3273,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 64cfdbd2f0..4baa287c8c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -181,6 +181,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 2217081dcc..db60eddea0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -187,6 +187,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, a search for a registered sync is required when
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
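As an aside (not part of any patch): the decision rule encoded by
sync_above and truncated_to can be illustrated with a small self-contained
model. The names ToySync and needs_wal and the block numbers below are
invented for illustration; the two tests mirror the ones in
BufferNeedsWAL() in the storage.c hunk above, minus the relcache and hash
lookups.

    #include <stdio.h>
    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    /* Invented stand-in for the patch's PendingRelSync entry. */
    typedef struct
    {
        BlockNumber sync_above;   /* WAL-logging skipped for blocks >= this */
        BlockNumber truncated_to; /* a WAL truncate record covers blocks >= this */
    } ToySync;

    static bool
    needs_wal(const ToySync *s, BlockNumber blkno)
    {
        if (s->sync_above == InvalidBlockNumber || blkno < s->sync_above)
            return true;   /* block predates the registered sync */
        if (s->truncated_to != InvalidBlockNumber && s->truncated_to <= blkno)
            return true;   /* replaying the truncation would destroy the block */
        return false;      /* safe to skip WAL; the at-commit fsync covers it */
    }

    int
    main(void)
    {
        /* Hypothetical: the table had 3 blocks when heap_register_sync() ran. */
        ToySync s = {3, InvalidBlockNumber};

        printf("block 1: %s\n", needs_wal(&s, 1) ? "WAL" : "skip"); /* WAL  */
        printf("block 5: %s\n", needs_wal(&s, 5) ? "WAL" : "skip"); /* skip */

        /* Hypothetical: a later in-transaction truncation back to 5 blocks
         * emitted a WAL truncate record, so truncated_to = 5. */
        s.truncated_to = 5;
        printf("block 4: %s\n", needs_wal(&s, 4) ? "WAL" : "skip"); /* skip */
        printf("block 7: %s\n", needs_wal(&s, 7) ? "WAL" : "skip"); /* WAL  */
        return 0;
    }

For blocks below sync_above, or when no sync has been registered, the
answer is always to WAL-log, which matches the pre-patch behaviour.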
v5-0004-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From 5b4cb2ba0065bf40f6eedca35e6c262e4f5d7050 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on
HEAP_INSERT_SKIP_WAL with the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 70 ++++++++++++++++++++++-----------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 3 --
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 +++---
src/backend/commands/createas.c | 9 ++---
src/backend/commands/matview.c | 6 +--
src/backend/commands/tablecmds.c | 5 +--
src/backend/commands/vacuumlazy.c | 6 +--
src/include/access/heapam.h | 9 ++---
10 files changed, 72 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8f1ea73541..c9c254a032 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2528,7 +2544,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2704,7 +2720,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2712,7 +2727,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2754,6 +2768,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2765,6 +2780,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3327,7 +3343,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -4069,7 +4085,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -4291,7 +4307,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -5263,7 +5280,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -6007,7 +6024,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -6167,7 +6184,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -6300,7 +6317,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6409,7 +6426,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7605,7 +7622,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7653,7 +7670,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7738,7 +7755,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -9342,9 +9359,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 44caeca336..ecddc40329 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -655,9 +655,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4311e16007..d583b5a8a3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2364,8 +2364,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit will do heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2406,7 +2405,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3036,11 +3035,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found that,
+ * to be safe, we must also avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d01b258b65..3d32d07d69 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -555,8 +555,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -599,9 +600,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index a171ebabf8..174aa3376a 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -461,7 +461,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 879c3d981e..ce8f7cd881 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4591,8 +4591,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4857,8 +4858,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
heap_close(newrel, NoLock);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8134c52253..28caf92073 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1188,7 +1188,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1569,7 +1569,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4baa287c8c..d2fbc1ad47 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,11 +25,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL 0x0010
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
--
2.16.3
Rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v6-0004-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From 7b52c9dd2d6bb76f0264bfd0f17d034001351b6f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on
HEAP_INSERT_SKIP_WAL with the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 70 ++++++++++++++++++++++-----------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 3 --
src/backend/access/heap/vacuumlazy.c | 6 +--
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 +++---
src/backend/commands/createas.c | 9 ++---
src/backend/commands/matview.c | 6 +--
src/backend/commands/tablecmds.c | 5 +--
src/include/access/heapam.h | 9 ++---
10 files changed, 72 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5972e9d190..a2d8aefa28 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -2127,12 +2149,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2239,7 +2255,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2414,7 +2430,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2422,7 +2437,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2464,6 +2478,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2475,6 +2490,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3037,7 +3053,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -3777,7 +3793,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -3992,7 +4008,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -4882,7 +4899,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5626,7 +5643,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5786,7 +5803,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5919,7 +5936,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6028,7 +6045,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7225,7 +7242,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7273,7 +7290,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7358,7 +7375,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8962,9 +8979,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we made any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..1e9c07c9b2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 37aa484ec3..3309c93bce 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -923,7 +923,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1187,7 +1187,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1568,7 +1568,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 931ae81fd6..53da0da68f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dbb06397e6..b42bfbfd47 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2390,8 +2390,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the heap will be synced at commit (see heap_register_sync()).
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2437,7 +2436,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3092,11 +3091,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, the heap will be synced at the end of
+ * the transaction. (We used to do it here, but it turned out that to
+ * be safe we must also avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bc8f928ea..5eb45a4a65 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -556,8 +556,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -600,9 +601,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5a47be4b33..5f447c6d94 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,9 +509,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e15296e373..65be3c2869 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4616,8 +4616,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4882,8 +4883,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
table_close(newrel, NoLock);
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fab5052868..32a365021a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -27,11 +27,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL 0x0010
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
--
2.16.3
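
For reference, here is a minimal sketch of how the API above is intended to
be used by a bulk-loading code path. heap_register_sync() and
BufferNeedsWAL() are the functions this patch set adds; the surrounding
routine, its name, and the tuple loop are illustrative only and not part of
the patch:

#include "postgres.h"

#include "access/heapam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "utils/rel.h"

/*
 * Illustrative bulk-load routine (not from the patch).  The relation is
 * assumed to use a relfilenode created earlier in the same transaction,
 * so skipping WAL is safe when wal_level = minimal.
 */
static void
bulk_load_new_table(Relation rel, HeapTuple *tuples, int ntuples)
{
	int		options = HEAP_INSERT_SKIP_FSM;
	int		i;

	/*
	 * Instead of setting the removed HEAP_INSERT_SKIP_WAL flag, register
	 * the relation for an at-commit sync.  heap_insert() then consults
	 * BufferNeedsWAL() internally and skips WAL only for blocks at or
	 * above the registered sync point.
	 */
	if (!XLogIsNeeded())
		heap_register_sync(rel);

	for (i = 0; i < ntuples; i++)
		heap_insert(rel, tuples[i], GetCurrentCommandId(true), options, NULL);

	/*
	 * No explicit heap_sync() here: smgrDoPendingSyncs() flushes the
	 * registered relations at commit.
	 */
}
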
v6-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From d5f2b47b6ba191d0ad1673f9bd9c5851d91a1b59 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.
---
src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)
create mode 100644 src/test/recovery/t/016_wal_optimize.pl
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is skipped in some cases for TRUNCATE and COPY, and those
+# optimizations can interact badly with other operations in the same
+# transaction, depending on the wal_level setting, particularly with
+# "minimal" or "replica". The optimization may or may not kick in for the
+# scenarios exercised here, but it should never cause failures or data
+# loss after crash recovery.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have the wal_level under test here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v6-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch)
From 5613d41deca5a5691d18457db6bfd177ee2febe1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build
After a relation is truncated, its indexes are also rebuilt. The rebuild
emits no WAL in minimal mode, so if the truncation happened within the
index's creation transaction, crash recovery leaves an empty index
relation, which is considered broken. This patch forces WAL to be emitted
when an index build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..70d4380533 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even in minimal mode, WAL is required here if a truncation happened
+ * after the relation was created in the same transaction. It is not
+ * needed otherwise, but we don't bother identifying that case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1056,6 +1062,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
+ /*
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. In that case we need to emit WAL for the metapage; it is
+ * not required otherwise.
+ */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
--
2.16.3
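
A note on the new condition in _bt_blwritepage(): an index build that
inserted no tuples initializes the metapage with P_NONE (0) as its root, so
the case the patch forces to WAL-log can be restated as below. This is an
illustrative restatement, not code from the patch:

#include "postgres.h"

#include "access/nbtree.h"

/*
 * Illustration only: true for the metapage of an nbtree index that ended
 * up with no root page, i.e. an empty index.  This is the case that must
 * be WAL-logged even under wal_level = minimal.
 */
static bool
is_empty_index_metapage(BlockNumber blkno, Page page)
{
	return blkno == BTREE_METAPAGE &&
		BTPageGetMeta(page)->btm_root == P_NONE;
}
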
v6-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch)
From ec2f481feb39247584e06b92aaee42c21c9dec2c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in the
same transaction under wal_level=minimal simply by signaling with the
HEAP_INSERT_SKIP_WAL option on heap operations. That mechanism can emit
WAL records that result in a corrupt state for certain sequences of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and BufferNeedsWAL() should be used
instead of RelationNeedsWAL() at the places that decide whether to
WAL-log heap-modifying operations.
---
src/backend/access/heap/heapam.c | 31 ++++
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 317 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 40 ++++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 395 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4406a69ef2..5972e9d190 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -50,6 +50,7 @@
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -9080,3 +9081,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0181976964..fa845bfd45 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2020,6 +2020,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we did not WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2249,6 +2252,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we did not WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2568,6 +2574,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..68947b017f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ PendingRelSync *pending_sync;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above < nblocks)
+ {
+ /*
+ * This is the first truncation of this relation in this
+ * transaction, or a truncation that leaves behind pages that need an
+ * at-commit fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * getPendingSyncEntry: get pending sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation. Creates one if needed when create is
+ * true.
+ */
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+ PendingRelSync *pendsync_entry = NULL;
+ bool found;
+
+ if (rel->pending_sync)
+ return rel->pending_sync;
+
+ /* we know we don't have a pending sync entry */
+ if (!create && rel->no_pending_sync)
+ return NULL;
+
+ if (!pendingSyncs)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+ rel->rd_node.relNode);
+ pendsync_entry = (PendingRelSync *)
+ hash_search(pendingSyncs, (void *) &rel->rd_node,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!pendsync_entry)
+ {
+ rel->no_pending_sync = true;
+ return NULL;
+ }
+
+ /* new entry created */
+ if (!found)
+ {
+ pendsync_entry->truncated_to = InvalidBlockNumber;
+ pendsync_entry->sync_above = InvalidBlockNumber;
+ }
+
+ /* hold shortcut in Relation */
+ rel->no_pending_sync = false;
+ rel->pending_sync = pendsync_entry;
+
+ return pendsync_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ BlockNumber nblocks;
+ PendingRelSync *pending_sync;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (pending_sync->sync_above != InvalidBlockNumber)
+ {
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+
+ return;
+ }
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ PendingRelSync *pending_sync;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ pending_sync = getPendingSyncEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /* we don't skip WAL-logging for blocks below the registered sync point */
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending_sync->truncated_to != InvalidBlockNumber &&
+ pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 434be403fe..e15296e373 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11387,11 +11387,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old relfilenode are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..a9741f138c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index af96a03338..66e7d5a301 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
#include "partitioning/partbounds.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -414,6 +415,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't yet know whether a pending sync exists for this relation */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1869,6 +1874,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't yet know whether a pending sync exists for this relation */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3263,6 +3272,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ab0879138f..fab5052868 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -163,6 +163,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..95d7898e25 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1d05465303..0f39f209d3 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -185,6 +185,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, a lookup for a registered sync is required when
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
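
The bookkeeping in storage.c may be easier to follow as a pure function of
the block number and the two tracked boundaries. The following is a
simplified illustration of the decision BufferNeedsWAL() makes from a
PendingRelSync entry; the real function additionally checks
RelationNeedsWAL() and looks the entry up in the pendingSyncs hash table:

#include "postgres.h"

#include "storage/block.h"

/*
 * Simplified restatement of the BufferNeedsWAL() decision.  'sync_above'
 * and 'truncated_to' carry the meanings documented in storage.c above;
 * InvalidBlockNumber means "not set".
 */
static bool
block_needs_wal(BlockNumber blkno,
				BlockNumber sync_above,
				BlockNumber truncated_to)
{
	/* No pending sync registered, or the block predates it: WAL as usual. */
	if (sync_above == InvalidBlockNumber || blkno < sync_above)
		return true;

	/* A truncation record was already emitted covering this block. */
	if (truncated_to != InvalidBlockNumber && blkno >= truncated_to)
		return true;

	/* The block is covered by the at-commit fsync; skip WAL-logging. */
	return false;
}
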
Rebased. No committed change conflicted with this, but I fixed one
whitespace error.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v7-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From d048aedbee48a1a0d91ae6e009b7a7903f272720 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.
---
src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)
create mode 100644 src/test/recovery/t/016_wal_optimize.pl
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is skipped in some cases for TRUNCATE and COPY, and those
+# optimizations can interact badly with other operations in the same
+# transaction, depending on the wal_level setting, particularly with
+# "minimal" or "replica". The optimization may or may not kick in for the
+# scenarios exercised here, but it should never cause failures or data
+# loss after crash recovery.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have the wal_level under test here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ $node->teardown_node;
+ $node->clean_node;
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v7-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch)
From 5a435c9c82155204484f31601a12821cf1e5e96e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build
After a relation is truncated, its indexes are also rebuilt. The rebuild
emits no WAL in minimal mode, so if the truncation happened within the
index's creation transaction, crash recovery leaves an empty index
relation, which is considered broken. This patch forces WAL to be emitted
when an index build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..70d4380533 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even in minimal mode, WAL is required here if a truncation happened
+ * after the relation was created in the same transaction. It is not
+ * needed otherwise, but we don't bother identifying that case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1056,6 +1062,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
+ /*
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. In that case we need to emit WAL for the metapage; it is
+ * not required otherwise.
+ */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
--
2.16.3
v7-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch)
From 1123bd8ce20ff177673f614722d3fe092a2bcbeb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in the
same transaction under wal_level=minimal simply by signaling with the
HEAP_INSERT_SKIP_WAL option on heap operations. That mechanism can emit
WAL records that result in a corrupt state for certain sequences of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and BufferNeedsWAL() should be used
instead of RelationNeedsWAL() at the places that decide whether to
WAL-log heap-modifying operations.
---
src/backend/access/heap/heapam.c | 31 ++++
src/backend/access/transam/xact.c | 7 +
src/backend/catalog/storage.c | 317 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 40 ++++-
src/backend/utils/cache/relcache.c | 13 ++
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 395 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..5ea5ff5848 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -50,6 +50,7 @@
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -9079,3 +9080,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordPendingSync(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordPendingSync(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..6d62d6e34f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2021,6 +2021,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we did not WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2250,6 +2253,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations for which we skipped WAL-logging */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2575,6 +2581,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..26dc3ddb1b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber sync_above; /* WAL-logging skipped for blocks >=
+ * sync_above */
+ BlockNumber truncated_to; /* truncation WAL record was written */
+} PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ PendingRelSync *pending_sync;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above < nblocks)
+ {
+ /*
+ * This is the first time truncation of this relation in this
+ * transaction or truncation that leaves pages that need at-commit
+ * fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ rel->pending_sync->truncated_to = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * getPendingSyncEntry: get pending sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation. Creates one if needed when create is
+ * true.
+ */
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+ PendingRelSync *pendsync_entry = NULL;
+ bool found;
+
+ if (rel->pending_sync)
+ return rel->pending_sync;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->no_pending_sync)
+ return NULL;
+
+ if (!pendingSyncs)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(PendingRelSync);
+ ctl.hash = tag_hash;
+ pendingSyncs = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+ rel->rd_node.relNode);
+ pendsync_entry = (PendingRelSync *)
+ hash_search(pendingSyncs, (void *) &rel->rd_node,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!pendsync_entry)
+ {
+ rel->no_pending_sync = true;
+ return NULL;
+ }
+
+ /* new entry created */
+ if (!found)
+ {
+ pendsync_entry->truncated_to = InvalidBlockNumber;
+ pendsync_entry->sync_above = InvalidBlockNumber;
+ }
+
+ /* hold shortcut in Relation */
+ rel->no_pending_sync = false;
+ rel->pending_sync = pendsync_entry;
+
+ return pendsync_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+ BlockNumber nblocks;
+ PendingRelSync *pending_sync;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ pending_sync = getPendingSyncEntry(rel, true);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (pending_sync->sync_above != InvalidBlockNumber)
+ {
+ elog(DEBUG2,
+ "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ rel->pending_sync->sync_above, nblocks);
+
+ return;
+ }
+
+ elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ nblocks);
+ pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ PendingRelSync *pending_sync;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ pending_sync = getPendingSyncEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have pending
+ * sync
+ */
+ if (!pending_sync)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /* we don't skip WAL-logging for blocks below the registered sync point */
+ if (pending_sync->sync_above == InvalidBlockNumber ||
+ pending_sync->sync_above > blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno, rel->pending_sync->sync_above);
+ return true;
+ }
+
+ /*
+ * We have emitted a truncation record for this block.
+ */
+ if (pending_sync->truncated_to != InvalidBlockNumber &&
+ pending_sync->truncated_to <= blkno)
+ {
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+ return true;
+ }
+
+ elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ blkno);
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+ elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ pending->relnode.dbNode, pending->relnode.relNode);
+ }
+ }
+ }
+
+ hash_destroy(pendingSyncs);
+ pendingSyncs = NULL;
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a93b13c2fe..6190b3f605 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11412,11 +11412,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node are no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationRemovePendingSync(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..a9741f138c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 54a40ef00b..b5baa430db 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
MemoryContextSwitchTo(oldcxt);
return relation;
@@ -1813,6 +1818,10 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relhasindex = true;
}
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
+
/*
* add new reldesc to relcache
*/
@@ -3207,6 +3216,10 @@ RelationBuildLocalRelation(const char *relname,
else
rel->rd_rel->relfilenode = relfilenode;
+ /* newly built relation has no pending sync */
+ rel->no_pending_sync = true;
+ rel->pending_sync = NULL;
+
RelationInitLockInfo(rel); /* see lmgr.c */
RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ab0879138f..fab5052868 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -163,6 +163,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..95d7898e25 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1d05465303..0f39f209d3 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -185,6 +185,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * no_pending_sync is true if this relation is known not to have pending
+ * syncs. Otherwise, a lookup for a registered sync is required when
+ * pending_sync is NULL.
+ */
+ bool no_pending_sync;
+ struct PendingRelSync *pending_sync;
} RelationData;
--
2.16.3
v7-0004-Fix-WAL-skipping-feature.patch
From 256a04a64ffad9f280577e14683113d33a6633e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on HEAP_INSERT_SKIP_WAL
with the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 70 ++++++++++++++++++++++-----------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 3 --
src/backend/access/heap/vacuumlazy.c | 6 +--
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 +++---
src/backend/commands/createas.c | 9 ++---
src/backend/commands/matview.c | 6 +--
src/backend/commands/tablecmds.c | 5 +--
src/include/access/heapam.h | 9 ++---
10 files changed, 72 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ea5ff5848..c66a468335 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than to fsync() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -2127,12 +2149,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* The new tuple is stamped with current transaction ID and the specified
* command ID.
*
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
* The HEAP_INSERT_SKIP_FSM option is passed directly to
* RelationGetBufferForTuple, which see for more info.
*
@@ -2239,7 +2255,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2414,7 +2430,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2422,7 +2437,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2464,6 +2478,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2475,6 +2490,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -3037,7 +3053,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
xl_heap_header xlhdr;
@@ -3776,7 +3792,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -3991,7 +4007,8 @@ l2:
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
{
XLogRecPtr recptr;
@@ -4881,7 +4898,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5625,7 +5642,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5785,7 +5802,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
htup->t_ctid = tuple->t_self;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5918,7 +5935,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -6027,7 +6044,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7224,7 +7241,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7272,7 +7289,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7357,7 +7374,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..1e9c07c9b2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9416c31889..1f66685c88 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -929,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1193,7 +1193,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5dd6fe02c6..db7a94ff6e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2390,8 +2390,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the relation will be synced at commit.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2437,7 +2436,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3087,11 +3086,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found that,
+ * to be safe, we must also avoid WAL-logging any subsequent actions on
+ * the pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6517ecb738..17fb78ba78 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -556,8 +556,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -603,9 +604,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5a47be4b33..5f447c6d94 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,9 +509,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6190b3f605..94d7876b8c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4617,8 +4617,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4886,8 +4887,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
table_close(newrel, NoLock);
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fab5052868..32a365021a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -27,11 +27,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL 0x0001
-#define HEAP_INSERT_SKIP_FSM 0x0002
-#define HEAP_INSERT_FROZEN 0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL 0x0010
+#define HEAP_INSERT_SKIP_FSM 0x0001
+#define HEAP_INSERT_FROZEN 0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
--
2.16.3
This has been waiting for a review since October, so I reviewed it. The code
comment at PendingRelSync summarizes the design well, and I like that design.
I also liked the design in the last paragraph of
/messages/by-id/559FA0BA.3080808@iki.fi, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that. Let's keep pursuing your current design.
This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
to COMMIT. Users setting a timeout on COMMIT may need to adjust, and
log_min_duration_statement analysis will reflect the change. I feel that's
fine. (There already exist ways for COMMIT to be slow.)
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even if minimal mode, WAL is required here if truncation happened after
+ * being created in the same transaction. It is not needed otherwise but
+ * we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
We initialized "btws_use_wal" like this:
#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
#define RelationNeedsWAL(relation) \
((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode. I've attached a test case for this; it is a patch that applies
on top of your v7 patches. The test checks for orphaned files after redo.
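In SQL terms, the orphaned-file scenario is roughly this (a minimal sketch
mirroring the attached test; it assumes wal_level = minimal and an immediate
shutdown before the next checkpoint, and "<dboid>" stands for the database's
OID):
CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);
-- crash-restart the server here; per the above, redo of the temp index's
-- metapage record writes to a permanent relfilenode, leaving a file under
-- base/<dboid>/ that no pg_class entry references.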
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. We need to emit WAL for the metapage in the case. However it
+ * is not required elsewise,
Did you mean to write more words after that comma?
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
Good point. To participate in WAL redo properly, each "before" state must
have a distinct pd_lsn. In CREATE INDEX USING btree, the initial index build
skips WAL, but an INSERT later in the same transaction writes WAL. There,
however, each "before" state does have a distinct pd_lsn; the initial build
has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
doesn't conform to this code comment.
I think this restriction applies only to full_page_writes=off. Otherwise, the
first WAL-logged change will find pd_lsn==0 and emit a full-page image. With
a full-page image in the record, the block's "before" state doesn't matter.
Also, one could make it safe to write WAL for a particular block by issuing
heap_sync() for the block's relation.
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
What is the coding rule for deciding when to call this? Currently, only
ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
We'd need a mechanism to un-remove the sync at subtransaction abort. My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.
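Spelled out in SQL, the subtransaction case is roughly the following (a
sketch of what the attached test runs; it assumes wal_level = minimal, a
pre-created tablespace "other", an illustrative data file path, and a
crash-restart right after COMMIT):
BEGIN;
CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,10000));
TRUNCATE test3a;                          -- truncation is WAL-logged
SAVEPOINT s;
ALTER TABLE test3a SET TABLESPACE other;  -- drops the pending sync entry
ROLLBACK TO s;                            -- nothing puts the entry back
COPY test3a FROM '/path/to/data.csv';     -- skips WAL, no longer guarded by
                                          -- the recorded truncation
COMMIT;
After a crash, replaying the earlier truncation record can wipe out the rows
the COPY loaded.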
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
As you mention upthread, you have many debugging elog()s. These are too
detailed to include in every binary, but I do want them in the code. See
CACHE_elog() for a good example of achieving that.
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all. Those just record the fact
of a RelationTruncate() happening. If you can think of a way to improve that,
please do so. If not, it's okay.
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
prefix to these fields.
This is a nonstandard place to clear fields. Clear them in
load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
will then rely on palloc0() for implicit initialization.)
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
This is fine if both buffers need WAL or neither buffer needs WAL. It is not
fine when one buffer needs WAL and the other buffer does not. My attachment
includes a test case. Of the bugs I'm reporting, this one seems most
difficult to solve well.
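For concreteness, the mixed-buffer case in the attached test is roughly this
(wal_level = minimal, illustrative data file path, crash-restart after
COMMIT; the annotations are my reading of v7, so treat them as assumptions):
BEGIN;
CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
COPY test3b FROM '/path/to/data.csv';  -- sets sync_above; new pages skip WAL
UPDATE test3b SET id2 = id2 + 1;       -- each update can touch one page that
                                       -- was WAL-logged and one that was not
DELETE FROM test3b;
COMMIT;
After recovery the table should be empty; replay can fail to reproduce that
when an update record spans a WAL-logged page and a WAL-skipped page.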
@@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
TABLESPACE, CLUSTER). It would make the system simpler to understand if we
eliminated the old way. If that creates more problems than it solves, please
at least write down a coding rule to explain why certain commands shouldn't
use the old way.
Thanks,
nm
Attachments:
wal-optimize-noah-tests-v1.patch
diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
index 310772a..988ccaf 100644
--- a/src/test/recovery/t/016_wal_optimize.pl
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -11,7 +11,24 @@ use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 14;
+use Test::More tests => 20;
+
+sub check_orphan_relfilenodes
+{
+ my($node) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced]);
+ return;
+}
# Wrapper routine tunable for wal_level.
sub run_wal_optimize
@@ -26,6 +43,13 @@ wal_level = $wal_level
));
$node->start;
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
# Test direct truncation optimization. No tuples
$node->safe_psql('postgres', "
BEGIN;
@@ -79,6 +103,36 @@ wal_level = $wal_level
is($result, qq(3),
"wal_level = $wal_level, optimized truncation with copied table");
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
# Test truncation with inserted tuples using both INSERT and COPY. Tuples
# inserted after the truncation should be seen.
$node->safe_psql('postgres', "
@@ -182,8 +236,14 @@ wal_level = $wal_level
is($result, qq(4),
"wal_level = $wal_level, replay of optimized copy with before trigger");
- $node->teardown_node;
- $node->clean_node;
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes $node;
+
return;
}
Thank you for reviewing!
At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in <20190311022708.GA2189728@rfd.leadboat.com>
This has been waiting for a review since October, so I reviewed it. The code
comment at PendingRelSync summarizes the design well, and I like that design.
It is Michael's work.
I also liked the design in the last paragraph of
/messages/by-id/559FA0BA.3080808@iki.fi, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that. Let's keep pursuing your current design.
I must admit that this is complex.
This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
to COMMIT. Users setting a timeout on COMMIT may need to adjust, and
log_min_duration_statement analysis will reflect the change. I feel that's
fine. (There already exist ways for COMMIT to be slow.)
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even if minimal mode, WAL is required here if truncation happened after
+ * being created in the same transaction. It is not needed otherwise but
+ * we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
We initialized "btws_use_wal" like this:
#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
#define RelationNeedsWAL(relation) \
((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode. I've attached a test case for this; it is a patch that applies
on top of your v7 patches. The test checks for orphaned files after redo.
Oops! I added RelationNeedsWAL(index) there. (Attached as the 1st patch
on top of this patchset.)
+ * If no tuple was inserted, it's possible that we are truncating a
+ * relation. We need to emit WAL for the metapage in the case. However it
+ * is not required elsewise,
Did you mean to write more words after that comma?
Sorry, that is just leftover garbage. The required work is done in
_bt_blwritepage.
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
Good point. To participate in WAL redo properly, each "before" state must
have a distinct pd_lsn. In CREATE INDEX USING btree, the initial index build
skips WAL, but an INSERT later in the same transaction writes WAL. There,
however, each "before" state does have a distinct pd_lsn; the initial build
has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
doesn't conform to this code comment.
(The NB is Michael's work.)
Yes. Btree works differently from heap. Thank you for the confirmation.
I think this restriction applies only to full_page_writes=off. Otherwise, the
first WAL-logged change will find pd_lsn==0 and emit a full-page image. With
a full-page image in the record, the block's "before" state doesn't matter.
Also, one could make it safe to write WAL for a particular block by issuing
heap_sync() for the block's relation.
Umm... Once a truncation happens, WAL is emitted for all pages. If we
decide to skip WAL for COPY or similar bulk operations, no WAL is emitted
at all, including XLOG_HEAP_INIT_PAGE, so that situation doesn't arise.
The unlogged data is synced at commit time.
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
What is the coding rule for deciding when to call this? Currently, only
ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.
+{
+ bool found;
+
+ rel->pending_sync = NULL;
+ rel->no_pending_sync = true;
+ if (pendingSyncs)
+ {
+ elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ }
+}
We'd need a mechanism to un-remove the sync at subtransaction abort. My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.
Agreed; it forgets about subtransaction rollbacks. I'll make
RelationRemovePendingSync() just mark the entry as "removed" and have
ROLLBACK TO and RELEASE process that flag. (Attached as the 2nd patch on
top of this patchset.)
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
As you mention upthread, you have many debugging elog()s. These are too
detailed to include in every binary, but I do want them in the code. See
CACHE_elog() for a good example of achieving that.
Agreed, will do. They were needed to check the behavior precisely, but
are not usually needed.
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!pendingSyncs)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->sync_above != InvalidBlockNumber)
I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all. Those just record the fact
of a RelationTruncate() happening. If you can think of a way to improve that,
please do so. If not, it's okay.
After a truncation, the required WAL records are emitted for the
truncated pages, so there is no need to sync them. Does this make sense
to you? (Maybe a comment is needed there.)
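To spell that out, here is a minimal sketch of such an entry (my reading of
the v7 behavior, so take the annotations as assumptions; wal_level =
minimal):
BEGIN;
CREATE TABLE t (id int);
INSERT INTO t VALUES (1);   -- WAL-logged as usual; no sync registered
TRUNCATE t;                 -- truncation WAL record; pending entry created
                            -- with sync_above = InvalidBlockNumber
INSERT INTO t VALUES (2);   -- still WAL-logged, since sync_above is invalid
COMMIT;
The entry only remembers the truncation; at commit, smgrDoPendingSyncs()
sees sync_above == InvalidBlockNumber and has nothing to flush.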
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
/* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount = 1;
+ /* We don't know if pending sync for this relation exists so far */
+ relation->pending_sync = NULL;
+ relation->no_pending_sync = false;
RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
prefix to these fields.
This is a nonstandard place to clear fields. Clear them in
load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
will then rely on palloc0() for implicit initialization.)
Agreed, will do in the next version.
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer) ||
+ BufferNeedsWAL(relation, newbuf))
This is fine if both buffers need WAL or neither buffer needs WAL. It is not
fine when one buffer needs WAL and the other buffer does not. My attachment
includes a test case. Of the bugs I'm reporting, this one seems most
difficult to solve well.
Yeah, that's right (and it's rather silly). Thank you for pointing it
out. Will fix.
@@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
TABLESPACE, CLUSTER). It would make the system simpler to understand if we
eliminated the old way. If that creates more problems than it solves, please
at least write down a coding rule to explain why certain commands shouldn't
use the old way.
Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
CREATE INDEX. I'll consider them.
I don't have enough time right now, so the new version will be posted
early next week.
Thank you for the review!
regards.
Attachments:
pending_sync_nbtsort_fix.patch
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fb4a80bf1d..060e0171a5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -627,7 +627,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
* we don't bother identifying the case precisely.
*/
if (wstate->btws_use_wal ||
- (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
+ (RelationNeedsWAL(wstate->index) &&
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1071,11 +1072,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* set to point to "P_NONE"). This changes the index to the "valid" state
* by filling in a valid magic number in the metapage.
*/
- /*
- * If no tuple was inserted, it's possible that we are truncating a
- * relation. We need to emit WAL for the metapage in the case. However it
- * is not required elsewise,
- */
metapage = (Page) palloc(BLCKSZ);
_bt_initmetapage(metapage, rootblkno, rootlevel);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
pending_sync_fix_tblsp_subxact.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d1210de8f4..3ce69b7a40 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4037,6 +4037,8 @@ ReleaseSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
+ smgrProcessPendingSyncRemoval(s->subTransactionId, true);
+
/*
* Mark "commit pending" all subtransactions up to the target
* subtransaction. The actual commits will happen when control gets to
@@ -4146,6 +4148,8 @@ RollbackToSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
+ smgrProcessPendingSyncRemoval(s->subTransactionId, false);
+
/*
* Mark "abort pending" all subtransactions up to the target
* subtransaction. The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 26dc3ddb1b..ad4a1e5127 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -99,6 +99,7 @@ typedef struct PendingRelSync
BlockNumber sync_above; /* WAL-logging skipped for blocks >=
* sync_above */
BlockNumber truncated_to; /* truncation WAL record was written */
+ SubTransactionId removed_xid; /* subxid where this is removed */
} PendingRelSync;
/* Relations that need to be fsync'd at commit */
@@ -405,6 +406,7 @@ getPendingSyncEntry(Relation rel, bool create)
{
pendsync_entry->truncated_to = InvalidBlockNumber;
pendsync_entry->sync_above = InvalidBlockNumber;
+ pendsync_entry->removed_xid = InvalidSubTransactionId;
}
/* hold shortcut in Relation */
@@ -498,14 +500,17 @@ void
RelationRemovePendingSync(Relation rel)
{
bool found;
+ PendingRelSync *pending_sync;
- rel->pending_sync = NULL;
- rel->no_pending_sync = true;
- if (pendingSyncs)
- {
- elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
- hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
- }
+ if (rel->no_pending_sync)
+ return;
+
+ pending_sync = getPendingSyncEntry(rel, false);
+
+ if (pending_sync)
+ return;
+
+ rel->pending_sync->removed_xid = GetCurrentSubTransactionId();
}
@@ -693,6 +698,31 @@ smgrDoPendingSyncs(bool isCommit)
pendingSyncs = NULL;
}
+void
+smgrProcessPendingSyncRemoval(SubTransactionId sxid, bool isCommit)
+{
+ HASH_SEQ_STATUS status;
+ PendingRelSync *pending;
+
+ if (!pendingSyncs)
+ return;
+
+ hash_seq_init(&status, pendingSyncs);
+
+ while ((pending = hash_seq_search(&status)) != NULL)
+ {
+ if (pending->removed_xid == sxid)
+ {
+ pending->removed_xid = InvalidSubTransactionId;
+ if (isCommit)
+ {
+ pending->sync_above = InvalidBlockNumber;
+ pending->truncated_to = InvalidBlockNumber;
+ }
+ }
+ }
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in <20190311022708.GA2189728@rfd.leadboat.com>
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+	if (!pendingSyncs)
+		return;
+
+	if (isCommit)
+	{
+		HASH_SEQ_STATUS status;
+		PendingRelSync *pending;
+
+		hash_seq_init(&status, pendingSyncs);
+
+		while ((pending = hash_seq_search(&status)) != NULL)
+		{
+			if (pending->sync_above != InvalidBlockNumber)

I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all. Those just record the fact
of a RelationTruncate() happening. If you can think of a way to improve that,
please do so. If not, it's okay.

After a truncation, required WAL records are emitted for the
truncated pages, so no need to sync. Does this make sense for
you? (Maybe commit is needed there)
Yes, the behavior makes sense. I wasn't saying the quoted code had the wrong
behavior. I was saying that the data structure called "pendingSyncs" is
actually "pending syncs and past truncates". It's not ideal that the variable
name differs from the variable purpose in this way. However, it's okay if you
don't find a way to improve that.
I don't have enough time for now so the new version will be
posted early next week.
I'll wait for that version.
Hello. This is a revised version.
At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah@leadboat.com> wrote in <20190321054835.GB3842129@rfd.leadboat.com>
On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in <20190311022708.GA2189728@rfd.leadboat.com>
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all. Those just record the fact
of a RelationTruncate() happening. If you can think of a way to improve that,
please do so. If not, it's okay.

After a truncation, required WAL records are emitted for the
truncated pages, so no need to sync. Does this make sense for
you? (Maybe commit is needed there)

Yes, the behavior makes sense. I wasn't saying the quoted code had the wrong
behavior. I was saying that the data structure called "pendingSyncs" is
actually "pending syncs and past truncates". It's not ideal that the variable
name differs from the variable purpose in this way. However, it's okay if you
don't find a way to improve that.
That is convincing. The current member names "sync_above" and
"truncated_to" describe the operations that have happened on the
relation. I changed them to names that describe what is to be done on
the relation: skip_wal_min_blk and wal_log_min_blk.
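For reference, the renamed entry in the attached v8-0004 patch looks
roughly like the following (comments paraphrased here; the patch below
has the authoritative definition):

typedef struct RelWalRequirement
{
	RelFileNode		relnode;			/* relation created in this xact */
	BlockNumber		skip_wal_min_blk;	/* skip WAL-logging for blocks >= this;
										 * fsync them at commit instead */
	BlockNumber		wal_log_min_blk;	/* but still WAL-log blocks >= this,
										 * because a truncation record covering
										 * them has already been emitted */
	SubTransactionId invalidate_sxid;	/* subxact that invalidated this entry,
										 * if any */
} RelWalRequirement;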
I don't have enough time for now so the new version will be
posted early next week.

I'll wait for that version.
At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp>
Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode. I've attached a test case for this; it is a patch that applies
on top of your v7 patches. The test checks for orphaned files after redo.

Oops! Added RelationNeedsWAL(index) there. (Attached as the 1st patch on
top of this patchset.)
Done in the attached patch. But the orphan-file check in the TAP
diff was wrong: it detected orphaned pg_class entries for temporary
tables, which disappear after the first autovacuum. The revised
TAP test (check_orphan_relfilenodes) no longer fails falsely and
catches the bug in the previous patch.
+	 * If no tuple was inserted, it's possible that we are truncating a
+	 * relation. We need to emit WAL for the metapage in the case. However it
+	 * is not required elsewise,

Did you mean to write more words after that comma?

Sorry, it is just garbage; the required work is done in
_bt_blwritepage.
Removed.
We'd need a mechanism to un-remove the sync at subtransaction abort. My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.
Added. Passed the new tests.
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
As you mention upthread, you have many debugging elog()s. These are too
detailed to include in every binary, but I do want them in the code. See
CACHE_elog() for a good example of achieving that.

Agreed, will do. They were needed to check the behavior precisely,
but are not usually needed.
I removed all such elog()s.
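(For context, CACHE_elog() is a variadic macro that expands to nothing
unless a debug symbol is defined. A storage.c equivalent could be as
small as the sketch below; STORAGE_elog and STORAGEDEBUG are
hypothetical names, not part of any patch in this thread.)

#ifdef STORAGEDEBUG
#define STORAGE_elog(...)	elog(__VA_ARGS__)
#else
#define STORAGE_elog(...)
#endif

/* Usage: compiled out entirely unless built with -DSTORAGEDEBUG. */
STORAGE_elog(DEBUG2, "not skipping WAL for rel %u/%u/%u block %u",
			 rel->rd_node.spcNode, rel->rd_node.dbNode,
			 rel->rd_node.relNode, blkno);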
RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
prefix to these fields.
This is a nonstandard place to clear fields. Clear them in
load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
will then rely on palloc0() for implicit initialization.)
Both are done.
-	if (RelationNeedsWAL(relation))
+	if (BufferNeedsWAL(relation, buffer) ||
+		BufferNeedsWAL(relation, newbuf))

This is fine if both buffers need WAL or neither buffer needs WAL. It is not
fine when one buffer needs WAL and the other buffer does not. My attachment
includes a test case. Of the bugs I'm reporting, this one seems most
difficult to solve well.
I refactored heap_insert/delete so that their XLOG code can be
used from heap_update, and then modified heap_update so that it emits
insert and delete records (XLOG_HEAP_INSERT / XLOG_HEAP_DELETE) in
addition to XLOG_HEAP_UPDATE where appropriate.
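To make the idea concrete, here is a simplified sketch of the resulting
branching; it is not the actual patch hunk. The function names
(log_heap_update, log_heap_insert, log_heap_delete, BufferNeedsWAL) come
from the attached v8-0003/0004/0005 patches, while the wrapper itself is
only illustrative:

static void
heap_update_log_sketch(Relation relation, Buffer oldbuf, Buffer newbuf,
					   HeapTuple oldtup, HeapTuple newtup,
					   HeapTuple old_key_tuple, TransactionId new_xmax,
					   bool all_visible_cleared, bool new_all_visible_cleared)
{
	XLogRecPtr	recptr;

	if (BufferNeedsWAL(relation, oldbuf) && BufferNeedsWAL(relation, newbuf))
	{
		/* Both pages are WAL-logged: a single update record suffices. */
		recptr = log_heap_update(relation, oldbuf, newbuf, oldtup, newtup,
								 old_key_tuple, all_visible_cleared,
								 new_all_visible_cleared);
		PageSetLSN(BufferGetPage(newbuf), recptr);
		if (newbuf != oldbuf)
			PageSetLSN(BufferGetPage(oldbuf), recptr);
	}
	else
	{
		/* Mixed case: log each page separately, and only if it needs WAL. */
		if (BufferNeedsWAL(relation, oldbuf))
		{
			recptr = log_heap_delete(relation, oldbuf, oldtup, old_key_tuple,
									 new_xmax, false, all_visible_cleared);
			PageSetLSN(BufferGetPage(oldbuf), recptr);
		}
		if (BufferNeedsWAL(relation, newbuf))
		{
			recptr = log_heap_insert(relation, newbuf, newtup, 0,
									 new_all_visible_cleared);
			PageSetLSN(BufferGetPage(newbuf), recptr);
		}
	}
}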
We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
TABLESPACE, CLUSTER). It would make the system simpler to understand if we
eliminated the old way. If that creates more problems than it solves, please
at least write down a coding rule to explain why certain commands shouldn't
use the old way.

Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
CREATE INDEX. I'll consider them.
I added the CLUSTER case in the new patchset. For the SET
TABLESPACE case, it works at the smgr layer and manipulates fork
files explicitly, whereas this infrastructure is Relation-based and
doesn't distinguish forks. We could make it work on smgr and be
fork-aware, but I don't think that is worth doing.
CREATE INDEX is not changed in this version; I will continue to
consider it.
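For what it's worth, adopting the new way in a batch command is mostly a
one-line change; the following is only a hedged sketch (bulk_load_relation
and load_rows are hypothetical names, while heap_register_sync,
BufferNeedsWAL and smgrDoPendingSyncs come from the attached patches):

void
bulk_load_relation(Relation rel)
{
	/*
	 * Old way: set HEAP_INSERT_SKIP_WAL on the inserts and call heap_sync()
	 * at the end.  New way: register the relation once up front; subsequent
	 * heap operations consult BufferNeedsWAL() per block, and the skipped
	 * blocks are fsync'd automatically at commit by smgrDoPendingSyncs().
	 */
	if (!XLogIsNeeded())
		heap_register_sync(rel);

	load_rows(rel);				/* heap_insert() etc. as usual */

	/* No explicit heap_sync() call is needed here anymore. */
}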
The attached is the new patchset.
v8-0001-TAP-test-for-copy-truncation-optimization.patch
- Revised version of test.
v8-0002-Write-WAL-for-empty-nbtree-index-build.patch
- Fixed version of v7
v8-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch
- New file, moves xlog stuff of heap_insert and heap_delete out
of the functions so that heap_update can use them.
v8-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch
- Renamed variables, functions. Removed elogs.
v8-0005-Fix-WAL-skipping-feature.patch
- Fixed heap_update.
v8-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch
- New file, modifies CLUSTER to use this feature.
v8-0007-Add-a-comment-to-ATExecSetTableSpace.patch
- New file, adds a comment explaining why ATExecSetTableSpace does not use this infrastructure.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v8-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 885e9ac73434aa8d5fe80393dc64746c36148acd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/7] TAP test for copy-truncation optimization.
---
src/test/recovery/t/017_wal_optimize.pl | 254 ++++++++++++++++++++++++++++++++
1 file changed, 254 insertions(+)
create mode 100644 src/test/recovery/t/017_wal_optimize.pl
diff --git a/src/test/recovery/t/017_wal_optimize.pl b/src/test/recovery/t/017_wal_optimize.pl
new file mode 100644
index 0000000000..5d67548b54
--- /dev/null
+++ b/src/test/recovery/t/017_wal_optimize.pl
@@ -0,0 +1,254 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v8-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch; charset=us-ascii)
From a28a1e9a87d4cc2135fbaf079a16e7487de8d357 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/7] Write WAL for empty nbtree index build
After relation truncation, indexes are also rebuilt. The rebuild doesn't
emit WAL in minimal mode, and if the truncation happened within the
index's creation transaction, crash recovery leaves an empty index heap,
which is considered broken. This patch forces WAL to be emitted when an
index build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2762a2d548..70fe3bec32 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -622,8 +622,15 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+	 * Even in minimal mode, WAL is required here if the relation was truncated
+	 * after being created in the same transaction. It is not needed otherwise, but
+ * we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (RelationNeedsWAL(wstate->index) &&
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
--
2.16.3
v8-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch (text/x-patch; charset=us-ascii)
From a070ada24a7f448a449435e38a57209725a8c914 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/7] Move XLOG stuff from heap_insert and heap_delete
A succeeding commit makes heap_update emit insert and delete WAL
records. Move the XLOG code out of heap_insert and heap_delete so that
heap_update can use it.
---
src/backend/access/heap/heapam.c | 277 ++++++++++++++++++++++-----------------
1 file changed, 157 insertions(+), 120 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 65536c7214..fe5d939c45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,11 @@
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
@@ -1889,6 +1894,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
+ Page page;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
@@ -1925,16 +1931,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ page = BufferGetPage(buffer);
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
(options & HEAP_INSERT_SPECULATIVE) != 0);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllVisible(page);
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1956,76 +1964,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
- xl_heap_insert xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- Page page = BufferGetPage(buffer);
- uint8 info = XLOG_HEAP_INSERT;
- int bufflags = 0;
-
- /*
- * If this is a catalog, we need to transmit combocids to properly
- * decode, so log that as well.
- */
- if (RelationIsAccessibleInLogicalDecoding(relation))
- log_heap_new_cid(relation, heaptup);
-
- /*
- * If this is the single and first tuple on page, we can reinit the
- * page instead of restoring the whole thing. Set flag, and hide
- * buffer references from XLogInsert.
- */
- if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
- PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
- {
- info |= XLOG_HEAP_INIT_PAGE;
- bufflags |= REGBUF_WILL_INIT;
- }
-
- xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
- if (options & HEAP_INSERT_SPECULATIVE)
- xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
- Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
- /*
- * For logical decoding, we need the tuple even if we're doing a full
- * page write, so make sure it's included even if we take a full-page
- * image. (XXX We could alternatively store a pointer into the FPW).
- */
- if (RelationIsLogicallyLogged(relation) &&
- !(options & HEAP_INSERT_NO_LOGICAL))
- {
- xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
- bufflags |= REGBUF_KEEP_DATA;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
- xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
- xlhdr.t_infomask = heaptup->t_data->t_infomask;
- xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
- /*
- * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
- * write the whole page to the xlog, we don't need to store
- * xl_heap_header in the xlog.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- XLogRegisterBufData(0,
- (char *) heaptup->t_data + SizeofHeapTupleHeader,
- heaptup->t_len - SizeofHeapTupleHeader);
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, info);
+ recptr = log_heap_insert(relation, buffer, heaptup,
+ options, all_visible_cleared);
+
PageSetLSN(page, recptr);
}
@@ -2744,58 +2687,10 @@ l1:
*/
if (RelationNeedsWAL(relation))
{
- xl_heap_delete xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- /* For logical decode we need combocids to properly decode the catalog */
- if (RelationIsAccessibleInLogicalDecoding(relation))
- log_heap_new_cid(relation, &tp);
-
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
- if (changingPart)
- xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
- xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
- tp.t_data->t_infomask2);
- xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
- xlrec.xmax = new_xmax;
-
- if (old_key_tuple != NULL)
- {
- if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
- else
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
- /*
- * Log replica identity of the deleted tuple if there is one
- */
- if (old_key_tuple != NULL)
- {
- xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
- xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
- xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
- XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
- XLogRegisterData((char *) old_key_tuple->t_data
- + SizeofHeapTupleHeader,
- old_key_tuple->t_len
- - SizeofHeapTupleHeader);
- }
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
-
+ recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+ changingPart, all_visible_cleared);
PageSetLSN(page, recptr);
}
@@ -7045,6 +6940,148 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
return recptr;
}
+/*
+ * Perform XLogInsert for a heap-insert operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+ xl_heap_insert xlrec;
+ xl_heap_header xlhdr;
+ uint8 info = XLOG_HEAP_INSERT;
+ int bufflags = 0;
+ Page page = BufferGetPage(buffer);
+
+ /*
+ * If this is a catalog, we need to transmit combocids to properly
+ * decode, so log that as well.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, heaptup);
+
+ /*
+ * If this is the single and first tuple on page, we can reinit the
+ * page instead of restoring the whole thing. Set flag, and hide
+ * buffer references from XLogInsert.
+ */
+ if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+ PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+ {
+ info |= XLOG_HEAP_INIT_PAGE;
+ bufflags |= REGBUF_WILL_INIT;
+ }
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (options & HEAP_INSERT_SPECULATIVE)
+ xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+ Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+ /*
+ * For logical decoding, we need the tuple even if we're doing a full
+ * page write, so make sure it's included even if we take a full-page
+ * image. (XXX We could alternatively store a pointer into the FPW).
+ */
+ if (RelationIsLogicallyLogged(relation) &&
+ !(options & HEAP_INSERT_NO_LOGICAL))
+ {
+ xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+ bufflags |= REGBUF_KEEP_DATA;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+ xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+ xlhdr.t_infomask = heaptup->t_data->t_infomask;
+ xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+ /*
+ * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+ * write the whole page to the xlog, we don't need to store
+ * xl_heap_header in the xlog.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+ XLogRegisterBufData(0,
+ (char *) heaptup->t_data + SizeofHeapTupleHeader,
+ heaptup->t_len - SizeofHeapTupleHeader);
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-insert operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared)
+{
+ xl_heap_delete xlrec;
+ xl_heap_header xlhdr;
+
+ /* For logical decode we need combocids to properly decode the catalog */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, tp);
+
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changingPart)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+ xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+ tp->t_data->t_infomask2);
+ xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+ xlrec.xmax = new_xmax;
+
+ if (old_key_tuple != NULL)
+ {
+ if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+ else
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ /*
+ * Log replica identity of the deleted tuple if there is one
+ */
+ if (old_key_tuple != NULL)
+ {
+ xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+ xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+ xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+ XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+ XLogRegisterData((char *) old_key_tuple->t_data
+ + SizeofHeapTupleHeader,
+ old_key_tuple->t_len
+ - SizeofHeapTupleHeader);
+ }
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
/*
* Perform XLogInsert for a heap-update operation. Caller must already
* have modified the buffer(s) and marked them dirty.
--
2.16.3
v8-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch; charset=us-ascii)
From 2778e0aa67ccaf58f03da59e9c31706907c2b7e6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 4/7] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in the
same transaction in minimal mode simply by passing the
HEAP_INSERT_SKIP_WAL option to heap operations. This mechanism can emit
WAL records that result in a corrupt state for certain sequences of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and BufferNeedsWAL() should be used
instead of RelationNeedsWAL() at the places that decide whether to
WAL-log heap-modifying operations.
---
src/backend/access/heap/heapam.c | 31 ++++
src/backend/access/transam/xact.c | 11 ++
src/backend/catalog/storage.c | 344 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 40 ++++-
src/backend/utils/cache/relcache.c | 3 +
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 5 +
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 417 insertions(+), 31 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fe5d939c45..024620ddc1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -8829,3 +8830,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordWALSkipping(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordWALSkipping(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c3214d4f4d..32a6a877f3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2022,6 +2022,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2254,6 +2257,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2579,6 +2585,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
+ smgrProcessWALRequirementInval(s->subTransactionId, true);
+
/*
* Mark "commit pending" all subtransactions up to the target
* subtransaction. The actual commits will happen when control gets to
@@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
+ smgrProcessWALRequirementInval(s->subTransactionId, false);
+
/*
* Mark "abort pending" all subtransactions up to the target
* subtransaction. The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..a0cf8d3e27 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +63,54 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any operations
+ * on blocks < skip_wal_min_blk need to be WAL-logged as usual, but for
+ * operations on higher blocks, WAL-logging is skipped.
+
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalRequirement
+{
+ RelFileNode relnode; /* relation created in same xact */
+ BlockNumber skip_wal_min_blk;/* WAL-logging skipped for blocks >=
+ * skip_wal_min_blk */
+ BlockNumber wal_log_min_blk; /* The minimum blk number that requires
+ * WAL-logging even if skipped by the above*/
+ SubTransactionId invalidate_sxid; /* subxid where this entry is
+ * invalidated */
+} RelWalRequirement;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *relWalRequirements = NULL;
+static int walreq_pending_invals = 0;
+
+static RelWalRequirement *getWalRequirementEntry(Relation rel, bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +308,114 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ RelWalRequirement *walreq;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get pending sync entry, create if not yet */
+ walreq = getWalRequirementEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+ walreq->skip_wal_min_blk < nblocks)
+ {
+ /*
+ * This is the first time truncation of this relation in this
+ * transaction or truncation that leaves pages that need at-commit
+ * fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ /* no longer skip WAL-logging for the blocks */
+ rel->rd_walrequirement->wal_log_min_blk = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * getWalRequirementEntry: get WAL requirement entry.
+ *
+ * Returns WAL requirement entry for the relation. The entry tracks
+ * WAL-skipping blocks for the relation. The WAL-skipped blocks need fsync at
+ * commit time. Creates one if needed when create is true.
+ */
+static RelWalRequirement *
+getWalRequirementEntry(Relation rel, bool create)
+{
+ RelWalRequirement *walreq_entry = NULL;
+ bool found;
+
+ if (rel->rd_walrequirement)
+ return rel->rd_walrequirement;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->rd_nowalrequirement)
+ return NULL;
+
+ if (!relWalRequirements)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(RelWalRequirement);
+ ctl.hash = tag_hash;
+ relWalRequirements = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ walreq_entry = (RelWalRequirement *)
+ hash_search(relWalRequirements, (void *) &rel->rd_node,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!walreq_entry)
+ {
+ /* prevent further hash lookup */
+ rel->rd_nowalrequirement = true;
+ return NULL;
+ }
+
+ /* new entry created */
+ if (!found)
+ {
+ walreq_entry->wal_log_min_blk = InvalidBlockNumber;
+ walreq_entry->skip_wal_min_blk = InvalidBlockNumber;
+ walreq_entry->invalidate_sxid = InvalidSubTransactionId;
+ }
+
+ /* hold shortcut in Relation */
+ rel->rd_nowalrequirement = false;
+ rel->rd_walrequirement = walreq_entry;
+
+ return walreq_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -367,6 +493,34 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * RelationInvalidateWALRequirements() -- invalidate wal requirement entry
+ */
+void
+RelationInvalidateWALRequirements(Relation rel)
+{
+ RelWalRequirement *walreq;
+
+ /* we know we don't have one */
+ if (rel->rd_nowalrequirement)
+ return;
+
+ walreq = getWalRequirementEntry(rel, false);
+
+ if (!walreq)
+ return;
+
+ /*
+ * The state is reset at subtransaction commit/abort. An invalidation
+ * request must not come twice for the same relation in the same subtransaction.
+ */
+ Assert(walreq->invalidate_sxid == InvalidSubTransactionId);
+
+ walreq_pending_invals++;
+ walreq->invalidate_sxid = GetCurrentSubTransactionId();
+}
+
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
@@ -418,6 +572,154 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for the blocks
+ * after the current block size and the blocks are going to be sync'd at
+ * commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+ BlockNumber nblocks;
+ RelWalRequirement *walreq;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ walreq = getWalRequirementEntry(rel, true);
+
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ /*
+ * Record only the first registration.
+ */
+ if (walreq->skip_wal_min_blk != InvalidBlockNumber)
+ return;
+
+ walreq->skip_wal_min_blk = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ RelWalRequirement *walreq;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walreq = getWalRequirementEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have special
+ * WAL requirement
+ */
+ if (!walreq)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /*
+ * We don't skip WAL-logging for pages that once done.
+ */
+ if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+ walreq->skip_wal_min_blk > blkno)
+ return true;
+
+ /*
+ * we don't skip WAL-logging for blocks that have got WAL-logged
+ * truncation
+ */
+ if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+ walreq->wal_log_min_blk <= blkno)
+ return true;
+
+ return false;
+}
+
+/*
+ * Sync to disk any relations that we have skipped WAL-logging earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!relWalRequirements)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ RelWalRequirement *walreq;
+
+ hash_seq_init(&status, relWalRequirements);
+
+ while ((walreq = hash_seq_search(&status)) != NULL)
+ {
+ if (walreq->skip_wal_min_blk != InvalidBlockNumber)
+ {
+ FlushRelationBuffersWithoutRelCache(walreq->relnode, false);
+ smgrimmedsync(smgropen(walreq->relnode, InvalidBackendId),
+ MAIN_FORKNUM);
+ }
+ }
+ }
+
+ hash_destroy(relWalRequirements);
+ relWalRequirements = NULL;
+}
+
+/*
+ * Process pending invalidation of WAL requirements happened in the
+ * subtransaction
+ */
+void
+smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
+{
+ HASH_SEQ_STATUS status;
+ RelWalRequirement *walreq;
+
+ if (!relWalRequirements || walreq_pending_invals == 0)
+ return;
+
+ /*
+ * It may take some time when there're many relWalRequirements entries. We
+ * expect that we don't have relWalRequirements in almost all cases.
+ */
+ hash_seq_init(&status, relWalRequirements);
+
+ while ((walreq = hash_seq_search(&status)) != NULL)
+ {
+ if (walreq->invalidate_sxid == sxid)
+ {
+ Assert(walreq_pending_invals > 0);
+ walreq->invalidate_sxid = InvalidSubTransactionId;
+ walreq_pending_invals--;
+ if (isCommit)
+ {
+ walreq->skip_wal_min_blk = InvalidBlockNumber;
+ walreq->wal_log_min_blk = InvalidBlockNumber;
+ }
+ }
+ }
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3183b2aaa1..45bb0b5614 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11587,11 +11587,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. Pending syncs for the old node is no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationInvalidateWALRequirements(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..a9741f138c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 84609e0725..95e834d45e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -5625,6 +5626,8 @@ load_relcache_init_file(bool shared)
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+ rel->rd_nowalrequirement = false;
+ rel->rd_walrequirement = NULL;
/*
* Recompute lock and physical addressing info. This is needed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 945ca50616..509394bb35 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -174,6 +174,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..76178b87f2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,6 +22,7 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
+extern void RelationInvalidateWALRequirements(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
@@ -29,6 +30,10 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit);
+extern void RecordWALSkipping(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..30f0d5bd83 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * rd_nowalrequirement is true if this relation is known not to have
+ * special WAL requirements. Otherwise we need to ask smgr for an entry
+ * if rd_walrequirement is NULL.
+ */
+ bool rd_nowalrequirement;
+ struct RelWalRequirement *rd_walrequirement;
} RelationData;
--
2.16.3
v8-0005-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From b7fd4d56f808f98d39861b8d04d2be7839c28202 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 5/7] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on the
HEAP_INSERT_SKIP_WAL flag with the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 104 ++++++++++++++++++++++++--------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 3 -
src/backend/access/heap/vacuumlazy.c | 6 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 ++--
src/backend/commands/createas.c | 9 ++-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 5 +-
src/include/access/heapam.h | 3 +-
src/include/access/tableam.h | 11 +---
11 files changed, 104 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 024620ddc1..96f2cde3ce 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,28 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than to fsync() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -1963,7 +1985,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2073,7 +2095,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2081,7 +2102,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2123,6 +2143,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2134,6 +2155,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2686,7 +2708,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2820,6 +2842,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
vmbuffer = InvalidBuffer,
vmbuffer_new = InvalidBuffer;
bool need_toast;
+ bool oldbuf_needs_wal,
+ newbuf_needs_wal;
Size newtupsize,
pagefree;
bool have_tuple_lock = false;
@@ -3371,7 +3395,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -3585,8 +3609,20 @@ l2:
MarkBufferDirty(newbuf);
MarkBufferDirty(buffer);
- /* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ /*
+ * XLOG stuff
+ *
+ * Emit a heap-update record. Under wal_level = minimal we may instead emit
+ * an insert or a delete record, depending on which buffers still need WAL.
+ */
+ oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+ if (newbuf == buffer)
+ newbuf_needs_wal = oldbuf_needs_wal;
+ else
+ newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+ if (oldbuf_needs_wal || newbuf_needs_wal)
{
XLogRecPtr recptr;
@@ -3596,15 +3632,26 @@ l2:
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
{
- log_heap_new_cid(relation, &oldtup);
- log_heap_new_cid(relation, heaptup);
+ if (oldbuf_needs_wal)
+ log_heap_new_cid(relation, &oldtup);
+ if (newbuf_needs_wal)
+ log_heap_new_cid(relation, heaptup);
}
- recptr = log_heap_update(relation, buffer,
- newbuf, &oldtup, heaptup,
- old_key_tuple,
- all_visible_cleared,
- all_visible_cleared_new);
+ if (oldbuf_needs_wal && newbuf_needs_wal)
+ recptr = log_heap_update(relation, buffer, newbuf,
+ &oldtup, heaptup,
+ old_key_tuple,
+ all_visible_cleared,
+ all_visible_cleared_new);
+ else if (oldbuf_needs_wal)
+ recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+ xmax_old_tuple, false,
+ all_visible_cleared);
+ else
+ recptr = log_heap_insert(relation, buffer, newtup,
+ 0, all_visible_cleared_new);
+
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4482,7 +4529,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5234,7 +5281,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5394,7 +5441,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
htup->t_ctid = *tid;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5526,7 +5573,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -5635,7 +5682,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -6832,7 +6879,7 @@ log_heap_clean(Relation reln, Buffer buffer,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -6880,7 +6927,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
XLogRecPtr recptr;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7107,7 +7154,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
int bufflags;
/* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8711,9 +8758,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we made any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5c554f9465..3f5df63df8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -929,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1193,7 +1193,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 705df8900b..1074320a5a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, commit processing will do the heap_sync().
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2438,7 +2437,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3091,11 +3090,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it turned out that to be
+ * safe, we must also avoid WAL-logging any subsequent actions on the
+ * pages we skipped WAL for.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 36e3d44aad..8cba15fd3c 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -557,8 +557,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,9 +605,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5a47be4b33..5f447c6d94 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,9 +509,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 45bb0b5614..242311b0d7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4666,8 +4666,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4958,8 +4959,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
table_close(newrel, NoLock);
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 509394bb35..a9aec90e86 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c2baa9d7a8..268e672470 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -94,10 +94,9 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
-#define TABLE_INSERT_SKIP_FSM 0x0002
-#define TABLE_INSERT_FROZEN 0x0004
-#define TABLE_INSERT_NO_LOGICAL 0x0008
+#define TABLE_INSERT_SKIP_FSM 0x0001
+#define TABLE_INSERT_FROZEN 0x0002
+#define TABLE_INSERT_NO_LOGICAL 0x0004
/* flag bits fortable_lock_tuple */
/* Follow tuples whose update is in progress if lock modes don't conflict */
@@ -634,10 +633,6 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot, Snapshot snap
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
--
2.16.3
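To make the intended call pattern concrete, here is a minimal sketch (illustration only, not part of any patch; bulk_load_example and log_change_example are made-up names) of how a bulk operation is expected to use heap_register_sync() and BufferNeedsWAL():

#include "postgres.h"

#include "access/heapam.h"
#include "access/xlog.h"
#include "catalog/storage.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical bulk-loading path: register the relation once, before the
 * batch operation, when wal_level = minimal makes WAL unnecessary.
 */
static void
bulk_load_example(Relation rel)
{
	if (!XLogIsNeeded())
		heap_register_sync(rel);	/* fsync at commit instead of WAL */

	/* ... heap_insert()/heap_multi_insert() calls follow ... */
}

/*
 * Hypothetical heap-modifying routine: the per-buffer check replaces the
 * old RelationNeedsWAL() / HEAP_INSERT_SKIP_WAL test.
 */
static void
log_change_example(Relation rel, Buffer buf)
{
	if (BufferNeedsWAL(rel, buf))
	{
		/* XLogBeginInsert(), XLogRegisterBuffer(), XLogInsert(), ... */
	}

	/* No heap_sync() at command end; commit flushes the pending sync. */
}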
v8-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch (text/x-patch; charset=us-ascii)
From b703b1287f9cb6ab1c556909f90473fa3fe25877 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 18:29:37 +0900
Subject: [PATCH 6/7] Change cluster to use the new pending sync infrastructure
When wal_level is minimal, CLUSTER benefits from moving the file sync
from command end to transaction end, using the pending-sync infrastructure
so that the file sync is performed at commit time.
---
src/backend/access/heap/rewriteheap.c | 25 +++++-------------------
src/backend/catalog/storage.c | 36 +++++++++++++++++++++++++++++++++++
src/backend/commands/cluster.c | 13 +++++--------
src/include/access/rewriteheap.h | 2 +-
src/include/catalog/storage.h | 3 ++-
5 files changed, 49 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 1ac77f7c14..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "access/xloginsert.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "lib/ilist.h"
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
(char *) state->rs_buffer, true);
}
- /*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
- * reason is the same as in tablecmds.c's copy_relation_data(): we're
- * writing data that's not in shared buffers, and so a CHECKPOINT
- * occurring during the rewriteheap operation won't have fsync'd data we
- * wrote before the checkpoint.
- */
- if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
logical_end_heap_rewrite(state);
@@ -692,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a0cf8d3e27..cd623eb3bb 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -613,6 +613,42 @@ RecordWALSkipping(Relation rel)
* must WAL-log any changes to the once-truncated blocks, because replaying
* the truncation record will destroy them.
*/
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+ RelWalRequirement *walreq;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walreq = getWalRequirementEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that this relation has no
+ * special WAL requirement
+ */
+ if (!walreq)
+ return true;
+
+ /*
+ * We don't skip WAL-logging for blocks that existed before WAL skipping started.
+ */
+ if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+ walreq->skip_wal_min_blk > blkno)
+ return true;
+
+ /*
+ * we don't skip WAL-logging for blocks at or beyond a WAL-logged
+ * truncation point
+ */
+ if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+ walreq->wal_log_min_blk <= blkno)
+ return true;
+
+ return false;
+}
+
bool
BufferNeedsWAL(Relation rel, Buffer buf)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3e2a807640..e2c4897d07 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,7 +767,6 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
TransactionId OldestXmin;
TransactionId FreezeXid;
@@ -826,13 +825,11 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * If wal_level is minimal, register the new heap to be synced at commit,
+ * which lets us skip WAL-logging it even for a WAL-requiring relation.
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
- Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
+ if (!XLogIsNeeded())
+ heap_register_sync(NewHeap);
/*
* If both tables have TOAST tables, perform toast swap by content. It is
@@ -899,7 +896,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
- MultiXactCutoff, use_wal);
+ MultiXactCutoff);
/*
* Decide whether to use an indexscan or seqscan-and-optional-sort to scan
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 76178b87f2..e8edbe5d71 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -33,7 +33,8 @@ extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void smgrDoPendingSyncs(bool isCommit);
extern void smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit);
extern void RecordWALSkipping(Relation rel);
-bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
--
2.16.3
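To illustrate how the two watermarks in RelWalRequirement interact, here is a small standalone sketch (illustration only, not part of any patch) that mirrors the BlockNeedsWAL() rules above and exercises them with example values:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)

/*
 * Mirror of the BlockNeedsWAL() decision for a relation that has a
 * RelWalRequirement entry: skip_wal_min_blk is the block from which
 * WAL-logging is skipped, wal_log_min_blk is the lowest block that
 * requires WAL-logging again even though it is above skip_wal_min_blk.
 */
static int
block_needs_wal(BlockNumber skip_wal_min_blk, BlockNumber wal_log_min_blk,
				BlockNumber blkno)
{
	if (skip_wal_min_blk == InvalidBlockNumber || skip_wal_min_blk > blkno)
		return 1;				/* block predates WAL skipping */
	if (wal_log_min_blk != InvalidBlockNumber && wal_log_min_blk <= blkno)
		return 1;				/* block requires WAL again */
	return 0;					/* skipped; relation is fsync'd at commit */
}

int
main(void)
{
	BlockNumber skip = 10;		/* WAL skipping starts at block 10 */
	BlockNumber wal_log = 30;	/* blocks >= 30 need WAL regardless */

	printf("block  5 needs WAL: %d\n", block_needs_wal(skip, wal_log, 5));	/* 1 */
	printf("block 15 needs WAL: %d\n", block_needs_wal(skip, wal_log, 15));	/* 0 */
	printf("block 35 needs WAL: %d\n", block_needs_wal(skip, wal_log, 35));	/* 1 */
	return 0;
}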
v8-0007-Add-a-comment-to-ATExecSetTableSpace.patch (text/x-patch; charset=us-ascii)
From 945ab5fa80089d489c204a010f2f5d551e6bec79 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 20:39:21 +0900
Subject: [PATCH 7/7] Add a comment to ATExecSetTableSpace.
We use the heap_register_sync() infrastructure to control WAL-logging and
file sync for bulk insertion, but ATExecSetTableSpace cannot use it because
the function lacks the ability to handle individual forks. Add a comment to
explain that.
---
src/backend/commands/tablecmds.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 242311b0d7..c7c7bcb308 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11594,7 +11594,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
RelationInvalidateWALRequirements(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
- /* copy main fork */
+ /*
+ * copy main fork
+ *
+ * You might think that we could use heap_register_sync() to control file
+ * sync and WAL-logging, but we cannot because that mechanism lacks the
+ * ability to handle each fork explicitly.
+ */
copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM,
rel->rd_rel->relpersistence);
--
2.16.3
Hello. I revised the patch; I think it addresses all your comments.
Differences from v7 patch are:
v9-0001:
- Renamed the script from 016_ to 017_.
- Added some additional tests.
v9-0002:
- Fixed _bt_blwritepage().
It is re-modified by v9-0008.
v9-0003: New patch.
- Refactors out xlog stuff from heap_insert/delete.
(log_heap_insert(), log_heap_delete())
v9-0004: (v7-0003, v8-0004)
- Renamed some struct names and member names.
(PendingRelSync -> RelWalRequirement
.sync_above -> skip_wal_min_blk, .truncated_to -> wal_log_min_blk)
- Renamed the additional members in RelationData to rd_*.
- Explicitly initialize the additional members only in
load_relcache_init_file().
- Added new interface functions that accept block number and
SMgrRelation.
(BlockNeedsWAL(), RecordPendingSync())
- Support subtransactions (or invalidation).
(RelWalRequirement.create_sxid, invalidate_sxid,
RelationInvalidateWALRequirements(), smgrDoPendingSyncs())
- Support forks.
(RelWalRequirement.forks, smgrDoPendingSyncs(), RecordPendingSync())
- Removed elog(LOG)s and a leftover comment.
v9-0005: (v7-0004, v8-0005)
- Fixed heap_update().
(heap_update())
v9-0006: New patch.
- Modifies CLUSTER to skip WAL logging.
v9-0007: New patch.
- Modifies ALTER TABLE SET TABLESPACE to skip WAL logging.
v9-0008: New patch.
- Modifies btbuild to skip WAL logging.
- Modifies btinsertonpg to skip WAL logging after truncation.
- Overwrites v9-0002's change.
ALL:
- Rebased.
- Fixed typos and mistakes in comments.
At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp>

> > We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
> > heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
> > CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> > commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> > TABLESPACE, CLUSTER). It would make the system simpler to understand if we
> > eliminated the old way. If that creates more problems than it solves, please
> > at least write down a coding rule to explain why certain commands shouldn't
> > use the old way.
>
> Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
> CREATE INDEX. I'll consider them.

I added the CLUSTER case in the new patchset. For the SET
TABLESPACE case, it works at the SMGR layer and manipulates fork
files explicitly, but this stuff is Relation-based and doesn't
distinguish forks. We can modify this stuff to work on smgr and
make it fork-aware, but I don't think it is worth doing.

CREATE INDEX is not changed in this version. I continue to
consider it.

I managed to simplify the change. Please look at v9-0008.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v9-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 13fe16c4527273426d93429986700ac66810945d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/8] TAP test for copy-truncation optimization.
---
src/test/recovery/t/017_wal_optimize.pl | 254 ++++++++++++++++++++++++++++++++
1 file changed, 254 insertions(+)
create mode 100644 src/test/recovery/t/017_wal_optimize.pl
diff --git a/src/test/recovery/t/017_wal_optimize.pl b/src/test/recovery/t/017_wal_optimize.pl
new file mode 100644
index 0000000000..5d67548b54
--- /dev/null
+++ b/src/test/recovery/t/017_wal_optimize.pl
@@ -0,0 +1,254 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is optimized away in some cases for TRUNCATE and COPY, and
+# these optimizations can interact badly with other operations, depending
+# on the wal_level setting, particularly when using "minimal" or
+# "replica". The optimizations may be enabled or disabled depending on the
+# scenarios dealt with here, and should never result in any kind of
+# failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary runs with the wal_level under test
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set skip_wal_min_blk
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more rows into the same table
+ # using triggers. If the INSERTs from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v9-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch; charset=us-ascii)
From 01691f5cf36e3bc75952b630088788c0da36b594 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/8] Write WAL for empty nbtree index build
After relation truncation, indexes are also rebuilt. The rebuild emits no
WAL in minimal mode, and if the truncation happened within the index's
creation transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces WAL to be emitted when an index build
produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 46e0831834..e65d4aab0f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -622,8 +622,15 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even in minimal mode, WAL is required here if a truncation happened
+ * after the index was created in the same transaction. It is not needed
+ * otherwise, but we don't bother identifying the case precisely.
+ */
+ if (wstate->btws_use_wal ||
+ (RelationNeedsWAL(wstate->index) &&
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
--
2.16.3
v9-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch (text/x-patch; charset=us-ascii)
From 09ecd87dee4187d1266799c8cc68e2ea9f700c9b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/8] Move XLOG stuff from heap_insert and heap_delete
A succeeding commit makes heap_update emit insert and delete WAL
records. Move the XLOG stuff for insert and delete out into separate
functions so that heap_update can use them.
---
src/backend/access/heap/heapam.c | 277 ++++++++++++++++++++++-----------------
1 file changed, 157 insertions(+), 120 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 137cc9257d..c6e71dba6b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,11 @@
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
@@ -1860,6 +1865,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
+ Page page;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
@@ -1896,16 +1902,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ page = BufferGetPage(buffer);
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
(options & HEAP_INSERT_SPECULATIVE) != 0);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllVisible(page);
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1927,76 +1935,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
- xl_heap_insert xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- Page page = BufferGetPage(buffer);
- uint8 info = XLOG_HEAP_INSERT;
- int bufflags = 0;
-
- /*
- * If this is a catalog, we need to transmit combocids to properly
- * decode, so log that as well.
- */
- if (RelationIsAccessibleInLogicalDecoding(relation))
- log_heap_new_cid(relation, heaptup);
-
- /*
- * If this is the single and first tuple on page, we can reinit the
- * page instead of restoring the whole thing. Set flag, and hide
- * buffer references from XLogInsert.
- */
- if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
- PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
- {
- info |= XLOG_HEAP_INIT_PAGE;
- bufflags |= REGBUF_WILL_INIT;
- }
-
- xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
- if (options & HEAP_INSERT_SPECULATIVE)
- xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
- Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
- /*
- * For logical decoding, we need the tuple even if we're doing a full
- * page write, so make sure it's included even if we take a full-page
- * image. (XXX We could alternatively store a pointer into the FPW).
- */
- if (RelationIsLogicallyLogged(relation) &&
- !(options & HEAP_INSERT_NO_LOGICAL))
- {
- xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
- bufflags |= REGBUF_KEEP_DATA;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
- xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
- xlhdr.t_infomask = heaptup->t_data->t_infomask;
- xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
- /*
- * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
- * write the whole page to the xlog, we don't need to store
- * xl_heap_header in the xlog.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- XLogRegisterBufData(0,
- (char *) heaptup->t_data + SizeofHeapTupleHeader,
- heaptup->t_len - SizeofHeapTupleHeader);
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, info);
+ recptr = log_heap_insert(relation, buffer, heaptup,
+ options, all_visible_cleared);
+
PageSetLSN(page, recptr);
}
@@ -2715,58 +2658,10 @@ l1:
*/
if (RelationNeedsWAL(relation))
{
- xl_heap_delete xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- /* For logical decode we need combocids to properly decode the catalog */
- if (RelationIsAccessibleInLogicalDecoding(relation))
- log_heap_new_cid(relation, &tp);
-
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
- if (changingPart)
- xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
- xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
- tp.t_data->t_infomask2);
- xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
- xlrec.xmax = new_xmax;
-
- if (old_key_tuple != NULL)
- {
- if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
- else
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
- /*
- * Log replica identity of the deleted tuple if there is one
- */
- if (old_key_tuple != NULL)
- {
- xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
- xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
- xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
- XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
- XLogRegisterData((char *) old_key_tuple->t_data
- + SizeofHeapTupleHeader,
- old_key_tuple->t_len
- - SizeofHeapTupleHeader);
- }
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
-
+ recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+ changingPart, all_visible_cleared);
PageSetLSN(page, recptr);
}
@@ -7016,6 +6911,148 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
return recptr;
}
+/*
+ * Perform XLogInsert for a heap-insert operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+ xl_heap_insert xlrec;
+ xl_heap_header xlhdr;
+ uint8 info = XLOG_HEAP_INSERT;
+ int bufflags = 0;
+ Page page = BufferGetPage(buffer);
+
+ /*
+ * If this is a catalog, we need to transmit combocids to properly
+ * decode, so log that as well.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, heaptup);
+
+ /*
+ * If this is the single and first tuple on page, we can reinit the
+ * page instead of restoring the whole thing. Set flag, and hide
+ * buffer references from XLogInsert.
+ */
+ if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+ PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+ {
+ info |= XLOG_HEAP_INIT_PAGE;
+ bufflags |= REGBUF_WILL_INIT;
+ }
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (options & HEAP_INSERT_SPECULATIVE)
+ xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+ Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+ /*
+ * For logical decoding, we need the tuple even if we're doing a full
+ * page write, so make sure it's included even if we take a full-page
+ * image. (XXX We could alternatively store a pointer into the FPW).
+ */
+ if (RelationIsLogicallyLogged(relation) &&
+ !(options & HEAP_INSERT_NO_LOGICAL))
+ {
+ xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+ bufflags |= REGBUF_KEEP_DATA;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+ xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+ xlhdr.t_infomask = heaptup->t_data->t_infomask;
+ xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+ /*
+ * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+ * write the whole page to the xlog, we don't need to store
+ * xl_heap_header in the xlog.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+ XLogRegisterBufData(0,
+ (char *) heaptup->t_data + SizeofHeapTupleHeader,
+ heaptup->t_len - SizeofHeapTupleHeader);
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-delete operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared)
+{
+ xl_heap_delete xlrec;
+ xl_heap_header xlhdr;
+
+ /* For logical decode we need combocids to properly decode the catalog */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, tp);
+
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changingPart)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+ xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+ tp->t_data->t_infomask2);
+ xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+ xlrec.xmax = new_xmax;
+
+ if (old_key_tuple != NULL)
+ {
+ if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+ else
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ /*
+ * Log replica identity of the deleted tuple if there is one
+ */
+ if (old_key_tuple != NULL)
+ {
+ xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+ xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+ xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+ XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+ XLogRegisterData((char *) old_key_tuple->t_data
+ + SizeofHeapTupleHeader,
+ old_key_tuple->t_len
+ - SizeofHeapTupleHeader);
+ }
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
/*
* Perform XLogInsert for a heap-update operation. Caller must already
* have modified the buffer(s) and marked them dirty.
--
2.16.3
v9-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch; charset=us-ascii)
From ed3a737a571b268503804372ebac3a31247493be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Mar 2019 15:34:48 +0900
Subject: [PATCH 4/8] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created
in-transaction in minimal mode by just signaling with the
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that result in a corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction
truncations. heap_register_sync() should be used to start tracking
before batch operations like COPY and CLUSTER, and BufferNeedsWAL()
should be used instead of RelationNeedsWAL() at the places that decide
on WAL-logging of heap-modifying operations.
---
src/backend/access/heap/heapam.c | 31 +++
src/backend/access/transam/xact.c | 11 +
src/backend/catalog/storage.c | 418 ++++++++++++++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 4 +-
src/backend/storage/buffer/bufmgr.c | 39 +++-
src/backend/utils/cache/relcache.c | 3 +
src/include/access/heapam.h | 1 +
src/include/catalog/storage.h | 8 +
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 8 +
10 files changed, 493 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c6e71dba6b..5a8627507f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -8800,3 +8801,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
}
}
}
+
+/*
+ * heap_register_sync - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordWALSkipping(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordWALSkipping(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c3214d4f4d..ad7cb3bcb9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2022,6 +2022,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2254,6 +2257,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrDoPendingSyncs(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2579,6 +2585,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrDoPendingSyncs(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
+ smgrProcessWALRequirementInval(s->subTransactionId, true);
+
/*
* Mark "commit pending" all subtransactions up to the target
* subtransaction. The actual commits will happen when control gets to
@@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
+ smgrProcessWALRequirementInval(s->subTransactionId, false);
+
/*
* Mark "abort pending" all subtransactions up to the target
* subtransaction. The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..be37174ef2 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -27,7 +27,7 @@
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
-#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,6 +62,58 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a RelWalRequirement entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any operations
+ * on blocks < skip_wal_min_blk need to be WAL-logged as usual, but for
+ * operations on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalRequirement
+{
+ RelFileNode relnode; /* relation created in same xact */
+ bool forks[MAX_FORKNUM + 1]; /* target forknums */
+ BlockNumber skip_wal_min_blk; /* WAL-logging skipped for blocks >=
+ * skip_wal_min_blk */
+ BlockNumber wal_log_min_blk; /* The minimum blk number that requires
+ * WAL-logging even if skipped by the
+ * above*/
+ SubTransactionId create_sxid; /* subxid where this entry is created */
+ SubTransactionId invalidate_sxid; /* subxid where this entry is
+ * invalidated */
+} RelWalRequirement;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *walRequirements = NULL;
+
+static RelWalRequirement *getWalRequirementEntry(Relation rel, bool create);
+static RelWalRequirement *getWalRequirementEntryRNode(RelFileNode *node,
+ bool create);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -259,37 +311,290 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ RelWalRequirement *walreq;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
+ /* get pending sync entry, create if not yet */
+ walreq = getWalRequirementEntry(rel, true);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+ walreq->skip_wal_min_blk < nblocks)
+ {
+ /*
+ * This is the first time truncation of this relation in this
+ * transaction or truncation that leaves pages that need at-commit
+ * fsync. Make an XLOG entry reporting the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
- /*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
- */
- if (fsm || vm)
- XLogFlush(lsn);
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ /* blocks at or after the truncation point must now be WAL-logged */
+ rel->rd_walrequirement->wal_log_min_blk = nblocks;
+ }
}
/* Do the real work */
smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
}
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ RelWalRequirement *walreq;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walreq = getWalRequirementEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have special
+ * WAL requirement
+ */
+ if (!walreq)
+ return true;
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /*
+ * Never skip WAL-logging for blocks that existed before WAL skipping
+ * was registered; they have been WAL-logged already.
+ */
+ if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+ walreq->skip_wal_min_blk > blkno)
+ return true;
+
+ /*
+ * we don't skip WAL-logging for blocks covered by a WAL-logged
+ * truncation
+ */
+ if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+ walreq->wal_log_min_blk <= blkno)
+ return true;
+
+ return false;
+}
+
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+ RelWalRequirement *walreq;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walreq = getWalRequirementEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't have special
+ * WAL requirement
+ */
+ if (!walreq)
+ return true;
+
+ /*
+ * Never skip WAL-logging for blocks that existed before WAL skipping
+ * was registered; they have been WAL-logged already.
+ */
+ if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+ walreq->skip_wal_min_blk > blkno)
+ return true;
+
+ /*
+ * we don't skip WAL-logging for blocks covered by a WAL-logged
+ * truncation
+ */
+ if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+ walreq->wal_log_min_blk <= blkno)
+ return true;
+
+ return false;
+}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for blocks at or
+ * after its current size; those blocks will instead be synced to disk at
+ * commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+ RelWalRequirement *walreq;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ walreq = getWalRequirementEntry(rel, true);
+
+ /*
+ * Record only the first registration.
+ */
+ if (walreq->skip_wal_min_blk != InvalidBlockNumber)
+ return;
+
+ walreq->skip_wal_min_blk = RelationGetNumberOfBlocks(rel);
+}
+
+/*
+ * Record a commit-time file sync. This should not be mixed with
+ * RecordWALSkipping().
+ */
+void
+RecordPendingSync(SMgrRelation rel, ForkNumber forknum)
+{
+ RelWalRequirement *walreq;
+
+ walreq = getWalRequirementEntryRNode(&rel->smgr_rnode.node, true);
+ walreq->forks[forknum] = true;
+ walreq->skip_wal_min_blk = 0;
+}
+
+/*
+ * RelationInvalidateWALRequirements() -- invalidate wal requirement entry
+ */
+void
+RelationInvalidateWALRequirements(Relation rel)
+{
+ RelWalRequirement *walreq;
+
+ /* we know we don't have one */
+ if (rel->rd_nowalrequirement)
+ return;
+
+ walreq = getWalRequirementEntry(rel, false);
+
+ if (!walreq)
+ return;
+
+ /*
+ * The state is reset at subtransaction commit/abort. An invalidation
+ * request must not arrive twice for the same relation within the same
+ * subtransaction.
+ */
+ Assert(walreq->invalidate_sxid == InvalidSubTransactionId);
+
+ walreq->invalidate_sxid = GetCurrentSubTransactionId();
+}
+
+/*
+ * getWalRequirementEntry: get WAL requirement entry.
+ *
+ * Returns WAL requirement entry for the relation. The entry tracks
+ * WAL-skipping blocks for the relation. The WAL-skipped blocks need fsync at
+ * commit time. Creates one if needed when create is true.
+ */
+static RelWalRequirement *
+getWalRequirementEntry(Relation rel, bool create)
+{
+ RelWalRequirement *walreq_entry = NULL;
+
+ if (rel->rd_walrequirement)
+ return rel->rd_walrequirement;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->rd_nowalrequirement)
+ return NULL;
+
+ walreq_entry = getWalRequirementEntryRNode(&rel->rd_node, create);
+
+ if (!walreq_entry)
+ {
+ /* prevent further hash lookup */
+ rel->rd_nowalrequirement = true;
+ return NULL;
+ }
+
+ walreq_entry->forks[MAIN_FORKNUM] = true;
+
+ /* hold shortcut in Relation */
+ rel->rd_nowalrequirement = false;
+ rel->rd_walrequirement = walreq_entry;
+
+ return walreq_entry;
+}
+
+/*
+ * getWalRequirementEntryRNode: get WAL requirement entry by rnode
+ *
+ * Returns WAL requirement entry for the RelFileNode.
+ */
+static RelWalRequirement *
+getWalRequirementEntryRNode(RelFileNode *rnode, bool create)
+{
+ RelWalRequirement *walreq_entry = NULL;
+ bool found;
+
+ if (!walRequirements)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(RelWalRequirement);
+ ctl.hash = tag_hash;
+ walRequirements = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ walreq_entry = (RelWalRequirement *)
+ hash_search(walRequirements, (void *) rnode,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!walreq_entry)
+ return NULL;
+
+ /* new entry created */
+ if (!found)
+ {
+ memset(walreq_entry->forks, 0, sizeof(walreq_entry->forks));
+ walreq_entry->wal_log_min_blk = InvalidBlockNumber;
+ walreq_entry->skip_wal_min_blk = InvalidBlockNumber;
+ walreq_entry->create_sxid = GetCurrentSubTransactionId();
+ walreq_entry->invalidate_sxid = InvalidSubTransactionId;
+ }
+
+ return walreq_entry;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -418,6 +723,75 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/*
+ * Sync to disk any relations that we have skipped WAL-logging earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+ if (!walRequirements)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ RelWalRequirement *walreq;
+
+ hash_seq_init(&status, walRequirements);
+
+ while ((walreq = hash_seq_search(&status)) != NULL)
+ {
+ if (walreq->skip_wal_min_blk != InvalidBlockNumber &&
+ walreq->invalidate_sxid == InvalidSubTransactionId)
+ {
+ int f;
+
+ FlushRelationBuffersWithoutRelCache(walreq->relnode, false);
+
+ /* flush all requested forks */
+ for (f = MAIN_FORKNUM ; f <= MAX_FORKNUM ; f++)
+ {
+ if (walreq->forks[f])
+ smgrimmedsync(smgropen(walreq->relnode,
+ InvalidBackendId), f);
+ }
+ }
+ }
+ }
+
+ hash_destroy(walRequirements);
+ walRequirements = NULL;
+}
+
+/*
+ * Process pending invalidation of WAL requirements happened in the
+ * subtransaction
+ */
+void
+smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
+{
+ HASH_SEQ_STATUS status;
+ RelWalRequirement *walreq;
+
+ if (!walRequirements)
+ return;
+
+ /* We expect that we don't have walRequirements in almost all cases */
+ hash_seq_init(&status, walRequirements);
+
+ while ((walreq = hash_seq_search(&status)) != NULL)
+ {
+ /* remove useless entry */
+ if (isCommit ?
+ walreq->invalidate_sxid == sxid :
+ walreq->create_sxid == sxid)
+ hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);
+ /* or cancel invalidation */
+ else if (!isCommit && walreq->invalidate_sxid == sxid)
+ walreq->invalidate_sxid = InvalidSubTransactionId;
+ }
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3183b2aaa1..c9a0e02168 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11587,11 +11587,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
/*
* Create and copy all forks of the relation, and schedule unlinking of
- * old physical files.
+ * old physical files. The WAL requirement entry for the old relfilenode
+ * is no longer needed.
*
* NOTE: any conflict in relfilenode value will be caught in
* RelationCreateStorage().
*/
+ RelationInvalidateWALRequirements(rel);
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..f00826712a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,40 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3204,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3234,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 84609e0725..95e834d45e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -5625,6 +5626,8 @@ load_relcache_init_file(bool shared)
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+ rel->rd_nowalrequirement = false;
+ rel->rd_walrequirement = NULL;
/*
* Recompute lock and physical addressing info. This is needed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3773a4df85..3d4fb7f3c3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,6 +172,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
+extern void heap_register_sync(Relation relation);
extern void heap_sync(Relation relation);
/* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..9034465001 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -16,12 +16,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
#include "utils/relcache.h"
extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern void RecordWALSkipping(Relation rel);
+extern void RecordPendingSync(SMgrRelation rel, ForkNumber forknum);
+extern void RelationInvalidateWALRequirements(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
@@ -29,6 +35,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..30f0d5bd83 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,14 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * rd_nowalrequirement is true if this relation is known not to have
+ * special WAL requirements. Otherwise we need to ask smgr for an entry
+ * if rd_walrequirement is NULL.
+ */
+ bool rd_nowalrequirement;
+ struct RelWalRequirement *rd_walrequirement;
} RelationData;
--
2.16.3
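To restate the rule the patch encodes in BufferNeedsWAL()/BlockNeedsWAL(), here
is a minimal standalone sketch of the per-block decision (simplified; the real
functions first check RelationNeedsWAL() and consult the per-Relation shortcut
fields):

/* Simplified sketch of the skip/force decision implemented above. */
static bool
block_needs_wal_sketch(const RelWalRequirement *walreq, BlockNumber blkno)
{
	/* no pending-sync entry: WAL-log as usual */
	if (walreq == NULL)
		return true;

	/* blocks that existed before registration keep normal WAL-logging */
	if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
		walreq->skip_wal_min_blk > blkno)
		return true;

	/* blocks covered by a WAL-logged truncation must be WAL-logged again */
	if (walreq->wal_log_min_blk != InvalidBlockNumber &&
		walreq->wal_log_min_blk <= blkno)
		return true;

	/* otherwise skip WAL; the file is fsync'd by smgrDoPendingSyncs() */
	return false;
}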
v9-0005-Fix-WAL-skipping-feature.patch
From 24e392a8423c1bed350b58e5cda56a140d2730ce Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 5/8] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism, moving from the
HEAP_INSERT_SKIP_WAL flag to the pending-sync tracking infrastructure.
---
src/backend/access/heap/heapam.c | 109 ++++++++++++++++++++++++--------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 3 -
src/backend/access/heap/vacuumlazy.c | 6 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/copy.c | 13 ++--
src/backend/commands/createas.c | 9 ++-
src/backend/commands/matview.c | 6 +-
src/backend/commands/tablecmds.c | 6 +-
src/include/access/heapam.h | 3 +-
src/include/access/tableam.h | 11 +---
11 files changed, 106 insertions(+), 66 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a8627507f..00416c4a99 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,27 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or WAL
+ * archival purposes (i.e. if wal_level=minimal), and we fsync() the file
+ * to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because for
+ * a small number of changes, it's cheaper to just create the WAL records
+ * than fsync()ing the whole relation at COMMIT. It is only worthwhile for
+ * (presumably) large operations like COPY, CLUSTER, or VACUUM FULL. Use
+ * heap_register_sync() to initiate such an operation; it will cause any
+ * subsequent updates to the table to skip WAL-logging, if possible, and
+ * cause the heap to be synced to disk at COMMIT.
+ *
+ * To make that work, all modifications to the heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -1934,7 +1955,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2044,7 +2065,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2052,7 +2072,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2094,6 +2113,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2105,6 +2125,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2657,7 +2678,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2791,6 +2812,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
vmbuffer = InvalidBuffer,
vmbuffer_new = InvalidBuffer;
bool need_toast;
+ bool oldbuf_needs_wal,
+ newbuf_needs_wal;
Size newtupsize,
pagefree;
bool have_tuple_lock = false;
@@ -3342,7 +3365,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -3556,8 +3579,20 @@ l2:
MarkBufferDirty(newbuf);
MarkBufferDirty(buffer);
- /* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ /*
+ * XLOG stuff
+ *
+ * Emit a heap-update record. When wal_level = minimal, we may instead emit
+ * an insert or a delete record, depending on which buffers need WAL.
+ */
+ oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+ if (newbuf == buffer)
+ newbuf_needs_wal = oldbuf_needs_wal;
+ else
+ newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+ if (oldbuf_needs_wal || newbuf_needs_wal)
{
XLogRecPtr recptr;
@@ -3567,15 +3602,26 @@ l2:
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
{
- log_heap_new_cid(relation, &oldtup);
- log_heap_new_cid(relation, heaptup);
+ if (oldbuf_needs_wal)
+ log_heap_new_cid(relation, &oldtup);
+ if (newbuf_needs_wal)
+ log_heap_new_cid(relation, heaptup);
}
- recptr = log_heap_update(relation, buffer,
- newbuf, &oldtup, heaptup,
- old_key_tuple,
- all_visible_cleared,
- all_visible_cleared_new);
+ if (oldbuf_needs_wal && newbuf_needs_wal)
+ recptr = log_heap_update(relation, buffer, newbuf,
+ &oldtup, heaptup,
+ old_key_tuple,
+ all_visible_cleared,
+ all_visible_cleared_new);
+ else if (oldbuf_needs_wal)
+ recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+ xmax_old_tuple, false,
+ all_visible_cleared);
+ else
+ recptr = log_heap_insert(relation, buffer, newtup,
+ 0, all_visible_cleared_new);
+
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4453,7 +4499,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5205,7 +5251,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5365,7 +5411,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
htup->t_ctid = *tid;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5497,7 +5543,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -5606,7 +5652,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -6802,8 +6848,8 @@ log_heap_clean(Relation reln, Buffer buffer,
xl_heap_clean xlrec;
XLogRecPtr recptr;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me on non-WAL-logged buffers */
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -6850,8 +6896,8 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me on non-WAL-logged buffers */
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7077,8 +7123,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool init;
int bufflags;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me when no buffer needs WAL-logging */
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8682,9 +8728,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5c554f9465..3f5df63df8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -929,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1193,7 +1193,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 705df8900b..1074320a5a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the heap_sync at the bottom of this
- * routine first.
+ * If it does commit, the heap will be synced at commit.
*
* As mentioned in comments in utils/rel.h, the in-same-transaction test
* is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2438,7 +2437,7 @@ CopyFrom(CopyState cstate)
{
hi_options |= HEAP_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(cstate->rel);
}
/*
@@ -3091,11 +3090,11 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
/*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway)
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found that,
+ * to be safe, we must also avoid WAL-logging any subsequent actions on
+ * the pages for which we skipped WAL.) Indexes always use WAL.
*/
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(cstate->rel);
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3bdb67c697..b4431f2af3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->hi_options = HEAP_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ heap_register_sync(intoRelationDesc);
+ myState->hi_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,9 +605,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5b2cbc7c89..45e693129d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(transientrel);
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -508,9 +508,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- /* If we skipped using WAL, must heap_sync before commit */
- if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(myState->transientrel);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c9a0e02168..54ce52eaae 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4664,10 +4664,10 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
mycid = GetCurrentCommandId(true);
bistate = GetBulkInsertState();
-
hi_options = HEAP_INSERT_SKIP_FSM;
+
if (!XLogIsNeeded())
- hi_options |= HEAP_INSERT_SKIP_WAL;
+ heap_register_sync(newrel);
}
else
{
@@ -4958,8 +4958,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);
/* If we skipped writing WAL, then we need to sync the heap. */
- if (hi_options & HEAP_INSERT_SKIP_WAL)
- heap_sync(newrel);
table_close(newrel, NoLock);
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3d4fb7f3c3..97114aed3e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4699335cdf..cf7f8e7da0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -94,10 +94,9 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
-#define TABLE_INSERT_SKIP_FSM 0x0002
-#define TABLE_INSERT_FROZEN 0x0004
-#define TABLE_INSERT_NO_LOGICAL 0x0008
+#define TABLE_INSERT_SKIP_FSM 0x0001
+#define TABLE_INSERT_FROZEN 0x0002
+#define TABLE_INSERT_NO_LOGICAL 0x0004
/* flag bits for table_lock_tuple */
/* Follow tuples whose update is in progress if lock modes don't conflict */
@@ -702,10 +701,6 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot, Snapshot snap
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
--
2.16.3
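The caller-side pattern this patch establishes is the same in copy.c,
createas.c and matview.c. Roughly, as a simplified sketch (not the exact
patched code):

	/* bulk load with wal_level = minimal */
	if (!XLogIsNeeded())
		heap_register_sync(rel);	/* skip WAL for new blocks; fsync at commit */

	hi_options = HEAP_INSERT_SKIP_FSM;	/* HEAP_INSERT_SKIP_WAL no longer exists */

	/* ... per-row loop ... */
	heap_insert(rel, tup, mycid, hi_options, bistate);
	/* heap_insert() itself now consults BufferNeedsWAL(rel, buffer) */

	/* no trailing heap_sync(); smgrDoPendingSyncs(true) runs at commit */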
v9-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch
From 5b047a9514613c42c9ef1fb395ca401b55d7e2de Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Mar 2019 15:32:28 +0900
Subject: [PATCH 6/8] Change cluster to use the new pending sync infrastructure
Apply the pending-sync infrastructure to the CLUSTER command. This moves
the file sync from command end to transaction end when wal_level is
minimal.
---
src/backend/access/heap/rewriteheap.c | 25 +++++--------------------
src/backend/commands/cluster.c | 13 +++++--------
src/include/access/rewriteheap.h | 2 +-
3 files changed, 11 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 1ac77f7c14..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "access/xloginsert.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "lib/ilist.h"
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
(char *) state->rs_buffer, true);
}
- /*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
- * reason is the same as in tablecmds.c's copy_relation_data(): we're
- * writing data that's not in shared buffers, and so a CHECKPOINT
- * occurring during the rewriteheap operation won't have fsync'd data we
- * wrote before the checkpoint.
- */
- if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
logical_end_heap_rewrite(state);
@@ -692,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 205070b83d..34c1a5e96c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -788,7 +788,6 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
TransactionId OldestXmin;
TransactionId FreezeXid;
@@ -847,13 +846,11 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * If wal_level is minimal, we skip WAL-logging even for WAL-logged
+ * relations. The heap will be synced at commit.
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
- Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
+ if (!XLogIsNeeded())
+ heap_register_sync(NewHeap);
/*
* If both tables have TOAST tables, perform toast swap by content. It is
@@ -920,7 +917,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
- MultiXactCutoff, use_wal);
+ MultiXactCutoff);
/*
* Decide whether to use an indexscan or seqscan-and-optional-sort to scan
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
--
2.16.3
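In other words, the rewritten CLUSTER flow becomes roughly the following
(simplified sketch of the calls as changed above):

	if (!XLogIsNeeded())
		heap_register_sync(NewHeap);	/* replaces the old use_wal flag */

	rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
								 MultiXactCutoff);	/* no use_wal argument */

	/* ... rewrite tuples ... */

	end_heap_rewrite(rwstate);
	/* pages are WAL-logged only where BlockNeedsWAL() says so, and the
	 * new file is synced at commit rather than here */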
v9-0007-Change-ALTER-TABLESPACE-to-use-the-pending-sync-infr.patch
From 50740e21bcb34b89334f7e5756d757b469a087c9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 20:39:21 +0900
Subject: [PATCH 7/8] Change ALTER TABLESPACE to use the pending-sync
infrastructure
Apply the pending-sync infrastructure (RecordPendingSync()) to the ALTER
TABLESPACE code.
---
src/backend/commands/tablecmds.c | 54 +++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 26 deletions(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54ce52eaae..aabb3806f6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -84,7 +84,6 @@
#include "storage/lmgr.h"
#include "storage/lock.h"
#include "storage/predicate.h"
-#include "storage/smgr.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
@@ -11891,7 +11890,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
{
PGAlignedBlock buf;
Page page;
- bool use_wal;
+ bool use_wal = false;
bool copying_initfork;
BlockNumber nblocks;
BlockNumber blkno;
@@ -11906,12 +11905,33 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM;
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a permanent relation.
- */
- use_wal = XLogIsNeeded() &&
- (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+ if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+ {
+ /*
+ * We need to log the copied data in WAL iff WAL archiving/streaming
+ * is enabled AND it's a permanent relation.
+ */
+ if (XLogIsNeeded())
+ use_wal = true;
+
+ /*
+ * If the rel is WAL-logged, must fsync at commit. We do the same to
+ * ensure that the toast table gets fsync'd too. (For a temp or
+ * unlogged rel we don't care since the data will be gone after a
+ * crash anyway.)
+ *
+ * It's obvious that we must do this when not WAL-logging the
+ * copy. It's less obvious that we have to do it even if we did
+ * WAL-log the copied pages. The reason is that since we're copying
+ * outside shared buffers, a CHECKPOINT occurring during the copy has
+ * no way to flush the previously written data to disk (indeed it
+ * won't know the new rel even exists). A crash later on would replay
+ * WAL from the checkpoint, therefore it wouldn't replay our earlier
+ * WAL entries. If we do not fsync those pages here, they might still
+ * not be on disk when the crash occurs.
+ */
+ RecordPendingSync(dst, forkNum);
+ }
nblocks = smgrnblocks(src, forkNum);
@@ -11948,24 +11968,6 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
*/
smgrextend(dst, forkNum, blkno, buf.data, true);
}
-
- /*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too. (For a temp or
- * unlogged rel we don't care since the data will be gone after a crash
- * anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the copy. It's
- * less obvious that we have to do it even if we did WAL-log the copied
- * pages. The reason is that since we're copying outside shared buffers, a
- * CHECKPOINT occurring during the copy has no way to flush the previously
- * written data to disk (indeed it won't know the new rel even exists). A
- * crash later on would replay WAL from the checkpoint, therefore it
- * wouldn't replay our earlier WAL entries. If we do not fsync those pages
- * here, they might still not be on disk when the crash occurs.
- */
- if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
}
/*
--
2.16.3
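The net effect on copy_relation_data() is that the commit-time sync is
requested before the copy loop instead of an immediate sync after it.
Roughly, as a simplified sketch:

	if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
	{
		if (XLogIsNeeded())
			use_wal = true;			/* archiving/streaming: log every page */
		RecordPendingSync(dst, forkNum);	/* fsync'd by smgrDoPendingSyncs() */
	}

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		smgrread(src, forkNum, blkno, buf.data);
		if (use_wal)
			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
		smgrextend(dst, forkNum, blkno, buf.data, true);
	}
	/* the old trailing smgrimmedsync(dst, forkNum) is gone */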
v9-0008-Optimize-WAL-logging-on-btree-bulk-insertion.patch
From 2928ccd4197d237294215e4b9f0c9a6e8aa42eae Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Mar 2019 14:48:26 +0900
Subject: [PATCH 8/8] Optimize WAL-logging on btree bulk insertion
As in the heap case, bulk insertion into a btree can be optimized to
omit WAL-logging under certain conditions.
---
src/backend/access/heap/heapam.c | 13 +++++++++++++
src/backend/access/nbtree/nbtinsert.c | 5 ++++-
src/backend/access/nbtree/nbtsort.c | 23 +++++++----------------
3 files changed, 24 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 00416c4a99..c28b479141 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8870,6 +8870,8 @@ heap_mask(char *pagedata, BlockNumber blkno)
void
heap_register_sync(Relation rel)
{
+ ListCell *indlist;
+
/* non-WAL-logged tables never need fsync */
if (!RelationNeedsWAL(rel))
return;
@@ -8883,4 +8885,15 @@ heap_register_sync(Relation rel)
RecordWALSkipping(toastrel);
heap_close(toastrel, AccessShareLock);
}
+
+ /* Do the same to all index relations */
+ foreach(indlist, RelationGetIndexList(rel))
+ {
+ Oid indexId = lfirst_oid(indlist);
+ Relation indexRel;
+
+ indexRel = index_open(indexId, AccessShareLock);
+ RecordWALSkipping(indexRel);
+ index_close(indexRel, NoLock);
+ }
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 96b7593fc1..fadcc09cb1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
cachedBlock = BufferGetBlockNumber(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf) ||
+ (!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
+ (BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))
{
xl_btree_insert xlrec;
xl_btree_metadata xlmeta;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e65d4aab0f..90a5d6ae13 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,6 +66,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "catalog/index.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/smgr.h"
@@ -264,7 +265,6 @@ typedef struct BTWriteState
Relation heap;
Relation index;
BTScanInsert inskey; /* generic insertion scankey */
- bool btws_use_wal; /* dump pages to WAL? */
BlockNumber btws_pages_alloced; /* # pages allocated */
BlockNumber btws_pages_written; /* # pages written out */
Page btws_zeropage; /* workspace for filling zeroes */
@@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+ /* Skip WAL-logging if wal_level = minimal */
+ if (!XLogIsNeeded())
+ RecordWALSkipping(index);
+
/*
* Finish the build by (1) completing the sort of the spool file, (2)
* inserting the sorted tuples into btree pages and (3) building the upper
@@ -543,12 +547,6 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.index = btspool->index;
wstate.inskey = _bt_mkscankey(wstate.index, NULL);
- /*
- * We need to log index creation in WAL iff WAL archiving/streaming is
- * enabled UNLESS the index isn't WAL-logged anyway.
- */
- wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
-
/* reserve the metapage */
wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
wstate.btws_pages_written = 0;
@@ -622,15 +620,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff
- *
- * Even if minimal mode, WAL is required here if truncation happened after
- * being created in the same transaction. It is not needed otherwise but
- * we don't bother identifying the case precisely.
- */
- if (wstate->btws_use_wal ||
- (RelationNeedsWAL(wstate->index) &&
- (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
+ /* XLOG stuff */
+ if (BlockNeedsWAL(wstate->index, blkno))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
--
2.16.3
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
I also liked the design in the last paragraph of
/messages/by-id/559FA0BA.3080808@iki.fi, and I suspect it would have been no
harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.
Now I am asking for that. Would anyone like to try implementing that other
design, to see how much simpler it would be? I now expect the already-drafted
design to need several more iterations before it reaches a finished patch.
Separately, I reviewed v9 of the already-drafted design:
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)

What is the coding rule for deciding when to call this? Currently, only
ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.
This question still applies. (The function name did change from
RelationRemovePendingSync() to RelationInvalidateWALRequirements().)
On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah@leadboat.com> wrote in <20190321054835.GB3842129@rfd.leadboat.com>
On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in <20190311022708.GA2189728@rfd.leadboat.com>
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
+ elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",

As you mention upthread, you have many debugging elog()s. These are too
detailed to include in every binary, but I do want them in the code. See
CACHE_elog() for a good example of achieving that.

Agreed, will do. They were needed to check the behavior precisely
but usually not needed.

I removed all such elog()s.
Again, I do want them in the code. Please restore them, but use a mechanism
like CACHE_elog() so they're built only if one defines a preprocessor symbol.
On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
@@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));+ smgrProcessWALRequirementInval(s->subTransactionId, true); + /* * Mark "commit pending" all subtransactions up to the target * subtransaction. The actual commits will happen when control gets to @@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name) (errcode(ERRCODE_S_E_INVALID_SPECIFICATION), errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));+ smgrProcessWALRequirementInval(s->subTransactionId, false);
The smgrProcessWALRequirementInval() calls almost certainly belong in
CommitSubTransaction() and AbortSubTransaction(), not in these functions. By
doing it here, you'd get the wrong behavior in a subtransaction created via a
plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.
+/*
+ * Process pending invalidation of WAL requirements happened in the
+ * subtransaction
+ */
+void
+smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
+{
+	HASH_SEQ_STATUS status;
+	RelWalRequirement *walreq;
+
+	if (!walRequirements)
+		return;
+
+	/* We expect that we don't have walRequirements in almost all cases */
+	hash_seq_init(&status, walRequirements);
+
+	while ((walreq = hash_seq_search(&status)) != NULL)
+	{
+		/* remove useless entry */
+		if (isCommit ?
+			walreq->invalidate_sxid == sxid :
+			walreq->create_sxid == sxid)
+			hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);

Do not remove entries during subtransaction commit, because a parent
subtransaction might still abort. See other CommitSubTransaction() callees
for examples of correct subtransaction handling. AtEOSubXact_Files() is one
simple example.
@@ -3567,15 +3602,26 @@ heap_update
 		 */
 		if (RelationIsAccessibleInLogicalDecoding(relation))
 		{
-			log_heap_new_cid(relation, &oldtup);
-			log_heap_new_cid(relation, heaptup);
+			if (oldbuf_needs_wal)
+				log_heap_new_cid(relation, &oldtup);
+			if (newbuf_needs_wal)
+				log_heap_new_cid(relation, heaptup);
These if(...) conditions are always true, since they're redundant with
RelationIsAccessibleInLogicalDecoding(relation). Remove the conditions or
replace them with asserts.
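
For illustration, the assert form suggested here could look roughly like the
following (variable names taken from the quoted hunk; this is a sketch, not
the fix that lands in 0006):

	if (RelationIsAccessibleInLogicalDecoding(relation))
	{
		/*
		 * Logical decoding implies wal_level >= logical, so both buffers
		 * must be WAL-logged here.
		 */
		Assert(oldbuf_needs_wal && newbuf_needs_wal);
		log_heap_new_cid(relation, &oldtup);
		log_heap_new_cid(relation, heaptup);
	}
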
}
-		recptr = log_heap_update(relation, buffer,
-								 newbuf, &oldtup, heaptup,
-								 old_key_tuple,
-								 all_visible_cleared,
-								 all_visible_cleared_new);
+		if (oldbuf_needs_wal && newbuf_needs_wal)
+			recptr = log_heap_update(relation, buffer, newbuf,
+									 &oldtup, heaptup,
+									 old_key_tuple,
+									 all_visible_cleared,
+									 all_visible_cleared_new);
+		else if (oldbuf_needs_wal)
+			recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+									 xmax_old_tuple, false,
+									 all_visible_cleared);
+		else
+			recptr = log_heap_insert(relation, buffer, newtup,
+									 0, all_visible_cleared_new);
By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
chain and infomask bits that were present before crash recovery. If that's
okay in these circumstances, please write a comment explaining why.
@@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
 		cachedBlock = BufferGetBlockNumber(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (BufferNeedsWAL(rel, buf) ||
+		(!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
+		(BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))
This appears to have the same problem that heap_update() had in v7; if
BufferNeedsWAL(rel, buf) is false and BufferNeedsWAL(rel, metabuf) is true, we
emit WAL for both buffers. If that can't actually happen today, use asserts.
I don't want the btree code to get significantly more complicated in order to
participate in the RelWalRequirement system. If btree code would get more
complicated, it's better to have btree continue using the old system. If
btree's complexity would be essentially unchanged, it's still good to use the
new system.
@@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	/* Skip WAL-logging if wal_level = minimal */
+	if (!XLogIsNeeded())
+		RecordWALSkipping(index);
_bt_load() still has an smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM),
which should be unnecessary after you add this end-of-transaction sync. Also,
this code can reach an assertion failure at wal_level=minimal:
910024 2019-03-31 19:12:13.728 GMT LOG: statement: create temp table x (c int primary key)
910024 2019-03-31 19:12:13.729 GMT DEBUG: CREATE TABLE / PRIMARY KEY will create implicit index "x_pkey" for table "x"
910024 2019-03-31 19:12:13.730 GMT DEBUG: building index "x_pkey" on table "x" serially
TRAP: FailedAssertion("!(((rel)->rd_rel->relpersistence == 'p'))", File: "storage.c", Line: 460)
Also, please fix whitespace problems that "git diff --check master" reports.
nm
Thank you for reviewing.
At Sun, 31 Mar 2019 15:31:58 -0700, Noah Misch <noah@leadboat.com> wrote in <20190331223158.GB891537@rfd.leadboat.com>
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)

What is the coding rule for deciding when to call this?  Currently, only
ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.

This question still applies.  (The function name did change from
RelationRemovePendingSync() to RelationInvalidateWALRequirements().)
It is called for heap_register_sync()'ed relations to avoid syncing
files that are no longer needed, or trying to sync nonexistent files. I
modified CLUSTER, COPY FROM, CREATE TABLE AS, REFRESH MATERIALIZED VIEW
and SET TABLESPACE so that all of them use the function. (The function
is renamed to table_relation_invalidate_walskip().)
I noticed that heap_register_sync and friends are now a kind of
Table-AM function. So I added .relation_register_walskip and
.relation_invalidate_walskip in TableAMRoutine and moved the
heap_register_sync stuff into heapam_relation_register_walskip and
friends. .finish_bulk_insert() is modified to be used only when
WAL-skip is active on the relation. (0004, 0005) But I'm not sure
that is the right direction.
(RelWALRequirements is renamed to RelWALSkip)
This change made smgrFinishBulkInsert (formerly smgrDoPendingSync)
need to call a tableam interface. Calling it in the designed way
requires a Relation, but the relcache entry cannot be relied on to
survive until that point. In the attached patch 0005, a new member
TableAmRoutine *tableam is therefore added to RelWalSkip, and
finish_bulk_insert() is called via that pointer. But I'm quite uneasy
with that...
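
In other words, the pending-sync entry carries the access-method routine
table so that the commit-time smgr code can call back into the AM without a
Relation.  A condensed sketch of that arrangement (fields abbreviated from
0005; the commit-time helper shown here is illustrative, not part of the
patch):

typedef struct RelWalSkip
{
	RelFileNode	relnode;				/* target relation file */
	bool		forks[MAX_FORKNUM + 1];	/* forks to sync at commit */
	BlockNumber	skip_wal_min_blk;		/* WAL skipped for blocks >= this */
	const TableAmRoutine *tableam;		/* AM owning the pending sync */
} RelWalSkip;

/*
 * At commit time no Relation (and possibly no relcache entry) is available,
 * so the stored routine table is used to finish the bulk insert.
 */
static void
FinishOnePendingSync(RelWalSkip *walskip, ForkNumber forknum)
{
	walskip->tableam->finish_bulk_insert(walskip->relnode, forknum);
}
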
On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah@leadboat.com> wrote in <20190321054835.GB3842129@rfd.leadboat.com>
Again, I do want them in the code. Please restore them, but use a mechanism
like CACHE_elog() so they're built only if one defines a preprocessor symbol.
Ah, sorry. I restored the messages using STORAGE_elog(). I also
needed this. (SMGR_ might be better but I'm not sure.)
On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
+ smgrProcessWALRequirementInval(s->subTransactionId, false);
The smgrProcessWALRequirementInval() calls almost certainly belong in
CommitSubTransaction() and AbortSubTransaction(), not in these functions. By
doing it here, you'd get the wrong behavior in a subtransaction created via a
plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.
Thanks. Moved it to AtSubAbort_smgr() and AtSubCommit_smgr(). (0005)
+/*
+ * Process pending invalidation of WAL requirements happened in the
+ * subtransaction
+ */
+void
+smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
+{
+	HASH_SEQ_STATUS status;
+	RelWalRequirement *walreq;
+
+	if (!walRequirements)
+		return;
+
+	/* We expect that we don't have walRequirements in almost all cases */
+	hash_seq_init(&status, walRequirements);
+
+	while ((walreq = hash_seq_search(&status)) != NULL)
+	{
+		/* remove useless entry */
+		if (isCommit ?
+			walreq->invalidate_sxid == sxid :
+			walreq->create_sxid == sxid)
+			hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);

Do not remove entries during subtransaction commit, because a parent
subtransaction might still abort. See other CommitSubTransaction() callees
for examples of correct subtransaction handling. AtEOSubXact_Files() is one
simple example.
Thanks. smgrProcessWALSkipInval() (0005) is changed so that (a condensed
sketch follows this list):
- If a RelWalSkip entry is created in aborted subtransaction,
remove it.
- If a RelWalSkip entry is created then invalidated in committed
subtransaction, remove it.
- If a RelWalSkip entry is created and committed, change the
creator subtransaction to the parent subtransaction.
- If a RelWalSkip entry is created elsewhere and invalidated in
committed subtransaction, move the invalidation to the parent
subtransaction.
- If a RelWalSkip entry is created elsewhere and invalidated in
aborted subtransaction, cancel the invalidation.
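
A condensed sketch of those rules, using the entry fields from 0005 (the body
below is an illustration of the logic, not the text of the patch):

static void
smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
						SubTransactionId parentSubid)
{
	HASH_SEQ_STATUS status;
	RelWalSkip *walskip;

	if (!walSkipHash)
		return;

	hash_seq_init(&status, walSkipHash);
	while ((walskip = hash_seq_search(&status)) != NULL)
	{
		if (!isCommit)
		{
			/* aborted subxact: drop what it created, cancel its invalidations */
			if (walskip->create_sxid == mySubid)
				hash_search(walSkipHash, &walskip->relnode, HASH_REMOVE, NULL);
			else if (walskip->invalidate_sxid == mySubid)
				walskip->invalidate_sxid = InvalidSubTransactionId;
		}
		else
		{
			/* committed subxact: created-and-invalidated entries go away ... */
			if (walskip->create_sxid == mySubid &&
				walskip->invalidate_sxid == mySubid)
				hash_search(walSkipHash, &walskip->relnode, HASH_REMOVE, NULL);
			else
			{
				/* ... otherwise ownership moves up to the parent */
				if (walskip->create_sxid == mySubid)
					walskip->create_sxid = parentSubid;
				if (walskip->invalidate_sxid == mySubid)
					walskip->invalidate_sxid = parentSubid;
			}
		}
	}
}
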
Test is added as test3a2 and test3a3. (0001)
@@ -3567,15 +3602,26 @@ heap_update
 		 */
 		if (RelationIsAccessibleInLogicalDecoding(relation))
 		{
-			log_heap_new_cid(relation, &oldtup);
-			log_heap_new_cid(relation, heaptup);
+			if (oldbuf_needs_wal)
+				log_heap_new_cid(relation, &oldtup);
+			if (newbuf_needs_wal)
+				log_heap_new_cid(relation, heaptup);

These if(...) conditions are always true, since they're redundant with
RelationIsAccessibleInLogicalDecoding(relation). Remove the conditions or
replace them with asserts.
Ah.. I see; logical decoding is never active in the wal_level=minimal
case, so those conditions are indeed always true there. Added a comment
and an assertion. (0006)
+ * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+ * when logical decoding is active.
By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
chain and infomask bits that were present before crash recovery. If that's
okay in these circumstances, please write a comment explaining why.
Sounds reasonable. Added a comment. (Honestly I completely forgot
about that.. Thanks!) (0006)
+ * Insert log record. Using delete or insert log loses HOT chain
+ * information but that happens only when newbuf is different from
+ * buffer, where HOT cannot happen.
@@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
 		cachedBlock = BufferGetBlockNumber(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (BufferNeedsWAL(rel, buf) ||
+		(!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
+		(BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))

This appears to have the same problem that heap_update() had in v7; if
BufferNeedsWAL(rel, buf) is false and BufferNeedsWAL(rel, metabuf) is true, we
emit WAL for both buffers.  If that can't actually happen today, use asserts.

I don't want the btree code to get significantly more complicated in order to
participate in the RelWalRequirement system. If btree code would get more
complicated, it's better to have btree continue using the old system. If
btree's complexity would be essentially unchanged, it's still good to use the
new system.
It was broken. I tried to fix it but page split baffled me. I
reverted it and added a comment there explaining the reason for
not applying the BufferNeedsWAL stuff to nbtree. The WAL-logging skip
feature is now restricted to non-index heaps. (getWalSkipEntry and
RecordPendingSync in 0005)
@@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
 
+	/* Skip WAL-logging if wal_level = minimal */
+	if (!XLogIsNeeded())
+		RecordWALSkipping(index);

_bt_load() still has an smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM),
which should be unnecessary after you add this end-of-transaction sync.  Also,
this code can reach an assertion failure at wal_level=minimal:

910024 2019-03-31 19:12:13.728 GMT LOG:  statement: create temp table x (c int primary key)
910024 2019-03-31 19:12:13.729 GMT DEBUG: CREATE TABLE / PRIMARY KEY will create implicit index "x_pkey" for table "x"
910024 2019-03-31 19:12:13.730 GMT DEBUG: building index "x_pkey" on table "x" serially
TRAP: FailedAssertion("!(((rel)->rd_rel->relpersistence == 'p'))", File: "storage.c", Line: 460)
This is what I mentioned as "broken" above. Sorry for the
silly mistake.
Also, please fix whitespace problems that "git diff --check master" reports.
Thanks. Good to know the command.
After all, this patch set contains the following files.
v10-0001-TAP-test-for-copy-truncation-optimization.patch
TAP test script. A multi-level subtransaction case is added.
v10-0002-Write-WAL-for-empty-nbtree-index-build.patch
As mentioned above, the nbtree patch has been shrunk back to its
initial workaround state. The comment is rewritten. (v9-0002 +
v9-0008)
v10-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch
Not substantially changed.
v10-0004-Add-new-interface-to-TableAmRoutine.patch
New file. Adds two new interfaces to TableAmRoutine and modifies
one existing interface.
v10-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch
Heavily revised version of v9-0004.
Some functions are renamed.
Fixed subtransaction handling.
Added STORAGE_elog() stuff.
Uses table-am functions.
Changes heapam stuff.
v10-0006-Fix-WAL-skipping-feature.patch
Revised version of v9-0005 + v9-0006 + v9-0007.
Added comment and assertion in heap_insert().
v10-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch
Separated from v9-0005 so that subsequent patches are sane.
Removes TABLE/HEAP_INSERT_SKIP_WAL.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v10-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From 55c85f06a9dc0a77f4cc6b02d4538b2e7169b3dc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/7] TAP test for copy-truncation optimization.
---
src/test/recovery/t/017_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/017_wal_optimize.pl
diff --git a/src/test/recovery/t/017_wal_optimize.pl b/src/test/recovery/t/017_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/017_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL-logging of TRUNCATE and COPY is optimized in some cases, and those
+# optimizations can interact badly with each other depending on the value
+# of wal_level, particularly "minimal" or "replica". Whether or not the
+# optimizations kick in for the scenarios exercised here, crash recovery
+# should never fail or lose data.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # Like the previous test, but with a different subtransaction pattern.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in released subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in nested subtransactions");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v10-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch)
From fda405f0f0f9a5fa816c426adc5eb8850f20f6eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/7] Write WAL for empty nbtree index build
After relation truncation, indexes are also rebuilt. With wal_level =
minimal, if the truncation happened within the index's creation
transaction, no WAL is emitted and crash recovery leaves an empty index
file, which is considered broken. This patch forces WAL to be emitted
when an index build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 14d9545768..5551a9c227 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -622,8 +622,16 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even when wal_level is minimal, WAL is required here if truncation
+ * happened after being created in the same transaction. This is hacky but
+ * we cannot use BufferNeedsWAL() stuff for nbtree since it can emit
+ * atomic WAL records on multiple buffers.
+ */
+ if (wstate->btws_use_wal ||
+ (RelationNeedsWAL(wstate->index) &&
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
--
2.16.3
v10-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch (text/x-patch)
From d15655d7bfe0b44c3b027ccdcc36fe0087f823c1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/7] Move XLOG stuff from heap_insert and heap_delete
A succeeding commit makes heap_update emit insert and delete WAL
records. Move the XLOG stuff out of heap_insert and heap_delete so that
heap_update can reuse it.
---
src/backend/access/heap/heapam.c | 275 ++++++++++++++++++++++-----------------
1 file changed, 156 insertions(+), 119 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 05ceb6550d..267570b461 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -72,6 +72,11 @@
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
@@ -1875,6 +1880,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
+ Page page;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
@@ -1911,16 +1917,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ page = BufferGetPage(buffer);
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
(options & HEAP_INSERT_SPECULATIVE) != 0);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllVisible(page);
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1942,75 +1950,10 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
- xl_heap_insert xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- Page page = BufferGetPage(buffer);
- uint8 info = XLOG_HEAP_INSERT;
- int bufflags = 0;
- /*
- * If this is a catalog, we need to transmit combocids to properly
- * decode, so log that as well.
- */
- if (RelationIsAccessibleInLogicalDecoding(relation))
- log_heap_new_cid(relation, heaptup);
-
- /*
- * If this is the single and first tuple on page, we can reinit the
- * page instead of restoring the whole thing. Set flag, and hide
- * buffer references from XLogInsert.
- */
- if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
- PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
- {
- info |= XLOG_HEAP_INIT_PAGE;
- bufflags |= REGBUF_WILL_INIT;
- }
-
- xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
- if (options & HEAP_INSERT_SPECULATIVE)
- xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
- Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
- /*
- * For logical decoding, we need the tuple even if we're doing a full
- * page write, so make sure it's included even if we take a full-page
- * image. (XXX We could alternatively store a pointer into the FPW).
- */
- if (RelationIsLogicallyLogged(relation) &&
- !(options & HEAP_INSERT_NO_LOGICAL))
- {
- xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
- bufflags |= REGBUF_KEEP_DATA;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
- xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
- xlhdr.t_infomask = heaptup->t_data->t_infomask;
- xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
- /*
- * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
- * write the whole page to the xlog, we don't need to store
- * xl_heap_header in the xlog.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- XLogRegisterBufData(0,
- (char *) heaptup->t_data + SizeofHeapTupleHeader,
- heaptup->t_len - SizeofHeapTupleHeader);
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, info);
+ recptr = log_heap_insert(relation, buffer, heaptup,
+ options, all_visible_cleared);
PageSetLSN(page, recptr);
}
@@ -2730,58 +2673,10 @@ l1:
*/
if (RelationNeedsWAL(relation))
{
- xl_heap_delete xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- /* For logical decode we need combocids to properly decode the catalog */
- if (RelationIsAccessibleInLogicalDecoding(relation))
- log_heap_new_cid(relation, &tp);
-
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
- if (changingPart)
- xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
- xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
- tp.t_data->t_infomask2);
- xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
- xlrec.xmax = new_xmax;
-
- if (old_key_tuple != NULL)
- {
- if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
- else
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
- /*
- * Log replica identity of the deleted tuple if there is one
- */
- if (old_key_tuple != NULL)
- {
- xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
- xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
- xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
- XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
- XLogRegisterData((char *) old_key_tuple->t_data
- + SizeofHeapTupleHeader,
- old_key_tuple->t_len
- - SizeofHeapTupleHeader);
- }
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
-
+ recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+ changingPart, all_visible_cleared);
PageSetLSN(page, recptr);
}
@@ -7245,6 +7140,148 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
return recptr;
}
+/*
+ * Perform XLogInsert for a heap-insert operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+ xl_heap_insert xlrec;
+ xl_heap_header xlhdr;
+ uint8 info = XLOG_HEAP_INSERT;
+ int bufflags = 0;
+ Page page = BufferGetPage(buffer);
+
+ /*
+ * If this is a catalog, we need to transmit combocids to properly
+ * decode, so log that as well.
+ */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, heaptup);
+
+ /*
+ * If this is the single and first tuple on page, we can reinit the
+ * page instead of restoring the whole thing. Set flag, and hide
+ * buffer references from XLogInsert.
+ */
+ if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+ PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+ {
+ info |= XLOG_HEAP_INIT_PAGE;
+ bufflags |= REGBUF_WILL_INIT;
+ }
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (options & HEAP_INSERT_SPECULATIVE)
+ xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+ Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+ /*
+ * For logical decoding, we need the tuple even if we're doing a full
+ * page write, so make sure it's included even if we take a full-page
+ * image. (XXX We could alternatively store a pointer into the FPW).
+ */
+ if (RelationIsLogicallyLogged(relation) &&
+ !(options & HEAP_INSERT_NO_LOGICAL))
+ {
+ xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+ bufflags |= REGBUF_KEEP_DATA;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+ xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+ xlhdr.t_infomask = heaptup->t_data->t_infomask;
+ xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+ /*
+ * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+ * write the whole page to the xlog, we don't need to store
+ * xl_heap_header in the xlog.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+ XLogRegisterBufData(0,
+ (char *) heaptup->t_data + SizeofHeapTupleHeader,
+ heaptup->t_len - SizeofHeapTupleHeader);
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-delete operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared)
+{
+ xl_heap_delete xlrec;
+ xl_heap_header xlhdr;
+
+ /* For logical decode we need combocids to properly decode the catalog */
+ if (RelationIsAccessibleInLogicalDecoding(relation))
+ log_heap_new_cid(relation, tp);
+
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changingPart)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+ xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+ tp->t_data->t_infomask2);
+ xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+ xlrec.xmax = new_xmax;
+
+ if (old_key_tuple != NULL)
+ {
+ if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+ else
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ /*
+ * Log replica identity of the deleted tuple if there is one
+ */
+ if (old_key_tuple != NULL)
+ {
+ xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+ xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+ xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+ XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+ XLogRegisterData((char *) old_key_tuple->t_data
+ + SizeofHeapTupleHeader,
+ old_key_tuple->t_len
+ - SizeofHeapTupleHeader);
+ }
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
/*
* Perform XLogInsert for a heap-update operation. Caller must already
* have modified the buffer(s) and marked them dirty.
--
2.16.3
v10-0004-Add-new-interface-to-TableAmRoutine.patch (text/x-patch)
From 255e3b3d5998318a9aa7abd0d3f9dab67dd0053a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 11:53:36 +0900
Subject: [PATCH 4/7] Add new interface to TableAmRoutine
Add two interface functions to TableAmRoutine, related to the
WAL-skipping feature, and modify one existing interface.
---
src/backend/access/table/tableamapi.c | 4 ++
src/include/access/tableam.h | 79 +++++++++++++++++++++++------------
2 files changed, 56 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 51c0deaaf2..fef4e523e8 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -94,6 +94,10 @@ GetTableAmRoutine(Oid amhandler)
(routine->scan_bitmap_next_tuple == NULL));
Assert(routine->scan_sample_next_block != NULL);
Assert(routine->scan_sample_next_tuple != NULL);
+ Assert((routine->relation_register_walskip == NULL) ==
+ (routine->relation_invalidate_walskip == NULL) &&
+ (routine->relation_register_walskip == NULL) ==
+ (routine->finish_bulk_insert == NULL));
return routine;
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4efe178ed1..1a3a3c6711 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -382,19 +382,15 @@ typedef struct TableAmRoutine
/*
* Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert or page-level copying performed by ALTER
+ * TABLE rewrite. This is called at commit time if WAL-skipping is
+ * activated and the caller decided that any finish work is required to
+ * the file.
*
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags the apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert
- * that make sense for a specific AM.
- *
- * Optional callback.
+ * Optional callback. Must be provided when relation_register_walskip is
+ * provided.
*/
- void (*finish_bulk_insert) (Relation rel, int options);
-
+ void (*finish_bulk_insert) (RelFileNode rnode, ForkNumber forkNum);
/* ------------------------------------------------------------------------
* DDL related functionality.
@@ -447,6 +443,26 @@ typedef struct TableAmRoutine
double *tups_vacuumed,
double *tups_recently_dead);
+ /*
+ * Register WAL-skipping on the current storage of rel. WAL-logging on the
+ * relation is skipped and the storage will be synced at commit. If the
+ * registration succeeds, finish_bulk_insert() is called at commit.
+ *
+ * Optional callback.
+ */
+ void (*relation_register_walskip) (Relation rel);
+
+ /*
+ * Invalidate registered WAL skipping on the current storage of rel. The
+ * function is called when the storage of the relation is going to be
+ * out-of-use after commit.
+ *
+ * Optional callback. Must be provided when relation_register_walskip is
+ * provided.
+ */
+ void (*relation_invalidate_walskip) (Relation rel);
+
/*
* React to VACUUM command on the relation. The VACUUM might be user
* triggered or by autovacuum. The specific actions performed by the AM
@@ -1026,8 +1042,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
*
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1201,20 +1216,6 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
/* ------------------------------------------------------------------------
* DDL related functionality.
@@ -1298,6 +1299,30 @@ table_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tups_recently_dead);
}
+/*
+ * Register WAL-skipping to the relation. WAL-logging is skipped for the new
+ * pages after this call and the relation file is going to be synced at
+ * commit.
+ */
+static inline void
+table_relation_register_walskip(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->relation_register_walskip)
+ rel->rd_tableam->relation_register_walskip(rel);
+}
+
+/*
+ * Unregister WAL-skipping to the relation. Call this when the relation is
+ * going to be out-of-use after commit. WAL-skipping continues but the
+ * relation won't be synced at commit.
+ */
+static inline void
+table_relation_invalidate_walskip(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->relation_invalidate_walskip)
+ rel->rd_tableam->relation_invalidate_walskip(rel);
+}
+
/*
* Perform VACUUM on the relation. The VACUUM can be user triggered or by
* autovacuum. The specific actions performed by the AM will depend heavily on
--
2.16.3
v10-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch)
From 24c9b0b9b9698d86fce3ad129400e3042a2e0afd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 18:05:10 +0900
Subject: [PATCH 5/7] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of in-transaction
created tables in minimal mode just by signaling with the
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that result in a corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction
truncations. table_relation_register_walskip() should be used to start
tracking before batch operations like COPY and CLUSTER, and
BufferNeedsWAL() should be used instead of RelationNeedsWAL() at the
places related to WAL-logging of heap-modifying operations; the call
to table_finish_bulk_insert() and that tableam interface are removed.
---
src/backend/access/transam/xact.c | 12 +-
src/backend/catalog/storage.c | 612 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 39 ++-
src/backend/utils/cache/relcache.c | 3 +
src/include/catalog/storage.h | 17 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 7 +
8 files changed, 631 insertions(+), 67 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e9ed92b70b..33a83dc784 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2102,6 +2102,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrFinishBulkInsert(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2334,6 +2337,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrFinishBulkInsert(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2659,6 +2665,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrFinishBulkInsert(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4792,8 +4799,7 @@ CommitSubTransaction(void)
AtEOSubXact_RelationCache(true, s->subTransactionId,
s->parent->subTransactionId);
AtEOSubXact_Inval(true);
- AtSubCommit_smgr();
-
+ AtSubCommit_smgr(s->subTransactionId, s->parent->subTransactionId);
/*
* The only lock we actually release here is the subtransaction XID lock.
*/
@@ -4970,7 +4976,7 @@ AbortSubTransaction(void)
ResourceOwnerRelease(s->curTransactionOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
- AtSubAbort_smgr();
+ AtSubAbort_smgr(s->subTransactionId, s->parent->subTransactionId);
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 72242b2476..4cd112f86c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -21,6 +21,7 @@
#include "miscadmin.h"
+#include "access/tableam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
@@ -29,10 +30,18 @@
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
-#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+ /* #define STORAGEDEBUG */ /* turns DEBUG elogs on */
+
+#ifdef STORAGEDEBUG
+#define STORAGE_elog(...) elog(__VA_ARGS__)
+#else
+#define STORAGE_elog(...)
+#endif
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -64,6 +73,61 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a RelWalSkip entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any
+ * operations on blocks < skip_wal_min_blk need to be WAL-logged as usual, but
+ * for operations on higher blocks, WAL-logging is skipped.
+
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalSkip
+{
+ RelFileNode relnode; /* relation created in same xact */
+ bool forks[MAX_FORKNUM + 1]; /* target forknums */
+ BlockNumber skip_wal_min_blk; /* WAL-logging skipped for blocks >=
+ * skip_wal_min_blk */
+ BlockNumber wal_log_min_blk; /* The minimum blk number that requires
+ * WAL-logging even if skipped by the
+ * above*/
+ SubTransactionId create_sxid; /* subxid where this entry is created */
+ SubTransactionId invalidate_sxid; /* subxid where this entry is
+ * invalidated */
+ const TableAmRoutine *tableam; /* Table access routine */
+} RelWalSkip;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *walSkipHash = NULL;
+
+static RelWalSkip *getWalSkipEntry(Relation rel, bool create);
+static RelWalSkip *getWalSkipEntryRNode(RelFileNode *node,
+ bool create);
+static void smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+ SubTransactionId parentSubid);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -261,31 +325,59 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ RelWalSkip *walskip;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ /* get pending sync entry, create if not yet */
+ walskip = getWalSkipEntry(rel, true);
/*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
+ * walskip is null here if rel doesn't support WAL-logging skip,
+ * otherwise check for WAL-skipping status.
*/
- if (fsm || vm)
- XLogFlush(lsn);
+ if (walskip == NULL ||
+ walskip->skip_wal_min_blk == InvalidBlockNumber ||
+ walskip->skip_wal_min_blk < nblocks)
+ {
+ /*
+ * If WAL-skipping is enabled, this is the first time truncation
+ * of this relation in this transaction or truncation that leaves
+ * pages that need at-commit fsync. Make an XLOG entry reporting
+ * the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ STORAGE_elog(DEBUG2,
+ "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, nblocks);
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ if (walskip)
+ {
+ /* no longer skip WAL-logging for the blocks */
+ walskip->wal_log_min_blk = nblocks;
+ }
+ }
}
/* Do the real work */
@@ -296,8 +388,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Copy a fork's data, block by block.
*/
void
-RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
- ForkNumber forkNum, char relpersistence)
+RelationCopyStorage(Relation srcrel, SMgrRelation dst, ForkNumber forkNum)
{
PGAlignedBlock buf;
Page page;
@@ -305,6 +396,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
bool copying_initfork;
BlockNumber nblocks;
BlockNumber blkno;
+ SMgrRelation src = srcrel->rd_smgr;
+ char relpersistence = srcrel->rd_rel->relpersistence;
page = (Page) buf.data;
@@ -316,12 +409,33 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM;
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a permanent relation.
- */
- use_wal = XLogIsNeeded() &&
- (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+ if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+ {
+ /*
+ * We need to log the copied data in WAL iff WAL archiving/streaming
+ * is enabled AND it's a permanent relation.
+ */
+ if (XLogIsNeeded())
+ use_wal = true;
+
+ /*
+ * If the rel is WAL-logged, must fsync before commit. We use
+ * heap_sync to ensure that the toast table gets fsync'd too. (For a
+ * temp or unlogged rel we don't care since the data will be gone
+ * after a crash anyway.)
+ *
+ * It's obvious that we must do this when not WAL-logging the
+ * copy. It's less obvious that we have to do it even if we did
+ * WAL-log the copied pages. The reason is that since we're copying
+ * outside shared buffers, a CHECKPOINT occurring during the copy has
+ * no way to flush the previously written data to disk (indeed it
+ * won't know the new rel even exists). A crash later on would replay
+ * WAL from the checkpoint, therefore it wouldn't replay our earlier
+ * WAL entries. If we do not fsync those pages here, they might still
+ * not be on disk when the crash occurs.
+ */
+ RecordPendingSync(srcrel, dst, forkNum);
+ }
nblocks = smgrnblocks(src, forkNum);
@@ -358,24 +472,321 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
*/
smgrextend(dst, forkNum, blkno, buf.data, true);
}
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ RelWalSkip *walskip;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walskip = getWalSkipEntry(rel, false);
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too. (For a temp or
- * unlogged rel we don't care since the data will be gone after a crash
- * anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the copy. It's
- * less obvious that we have to do it even if we did WAL-log the copied
- * pages. The reason is that since we're copying outside shared buffers, a
- * CHECKPOINT occurring during the copy has no way to flush the previously
- * written data to disk (indeed it won't know the new rel even exists). A
- * crash later on would replay WAL from the checkpoint, therefore it
- * wouldn't replay our earlier WAL entries. If we do not fsync those pages
- * here, they might still not be on disk when the crash occurs.
+ * no point in doing further work if we know that we don't skip
+ * WAL-logging.
*/
- if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ if (!walskip)
+ {
+ STORAGE_elog(DEBUG2,
+ "not skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, BufferGetBlockNumber(buf));
+ return true;
+ }
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /*
+ * We don't skip WAL-logging for pages that once done.
+ */
+ if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+ walskip->skip_wal_min_blk > blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+ return true;
+ }
+
+ /*
+	 * We don't skip WAL-logging for blocks at or beyond a WAL-logged
+	 * truncation point; replaying the truncation record would destroy them.
+ */
+ if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+ walskip->wal_log_min_blk <= blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+ return true;
+ }
+
+ STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno);
+
+ return false;
+}
+
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+ RelWalSkip *walskip;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+	/* fetch existing pending sync entry */
+ walskip = getWalSkipEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't skip
+ * WAL-logging.
+ */
+ if (!walskip)
+ return true;
+
+ /*
+	 * We don't skip WAL-logging for blocks that existed before WAL-skipping
+	 * was registered for this relation.
+ */
+ if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+ walskip->skip_wal_min_blk > blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+ return true;
+ }
+
+ /*
+	 * We don't skip WAL-logging for blocks at or beyond a WAL-logged
+	 * truncation point; replaying the truncation record would destroy them.
+ */
+ if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+ walskip->wal_log_min_blk <= blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+
+ return true;
+ }
+
+ STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno);
+
+ return false;
+}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for blocks
+ * beyond its current size; those blocks are instead synced to disk at
+ * commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+ RelWalSkip *walskip;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ walskip = getWalSkipEntry(rel, true);
+
+ if (walskip == NULL)
+ return;
+
+ /*
+ * Record only the first registration.
+ */
+ if (walskip->skip_wal_min_blk != InvalidBlockNumber)
+ {
+ STORAGE_elog(DEBUG2, "WAL skipping for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, walskip->skip_wal_min_blk,
+ RelationGetNumberOfBlocks(rel));
+ return;
+ }
+
+ STORAGE_elog(DEBUG2, "registering new WAL skipping rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, RelationGetNumberOfBlocks(rel));
+
+ walskip->skip_wal_min_blk = RelationGetNumberOfBlocks(rel);
+}
+
+/*
+ * Record a commit-time file sync. This shouldn't be mixed with
+ * RecordWALSkipping.
+ */
+void
+RecordPendingSync(Relation rel, SMgrRelation targetsrel, ForkNumber forknum)
+{
+ RelWalSkip *walskip;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* check for support for this feature */
+ if (rel->rd_tableam == NULL ||
+ rel->rd_tableam->relation_register_walskip == NULL)
+ return;
+
+ walskip = getWalSkipEntryRNode(&targetsrel->smgr_rnode.node, true);
+ walskip->forks[forknum] = true;
+ walskip->skip_wal_min_blk = 0;
+ walskip->tableam = rel->rd_tableam;
+
+ STORAGE_elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ walskip->relnode.spcNode, walskip->relnode.dbNode,
+ walskip->relnode.relNode, 0);
+}
+
+/*
+ * RelationInvalidateWALSkip() -- invalidate WAL-skip entry
+ */
+void
+RelationInvalidateWALSkip(Relation rel)
+{
+ RelWalSkip *walskip;
+
+ /* we know we don't have one */
+ if (rel->rd_nowalskip)
+ return;
+
+ walskip = getWalSkipEntry(rel, false);
+
+ if (!walskip)
+ return;
+
+ /*
+	 * The state is reset at subtransaction commit/abort. A second
+	 * invalidation request must not come for the same relation in the same
+	 * subtransaction.
+ */
+ Assert(walskip->invalidate_sxid == InvalidSubTransactionId);
+
+ walskip->invalidate_sxid = GetCurrentSubTransactionId();
+
+ STORAGE_elog(DEBUG2,
+ "WAL skip of rel %u/%u/%u invalidated by sxid %d",
+ walskip->relnode.spcNode, walskip->relnode.dbNode,
+ walskip->relnode.relNode, walskip->invalidate_sxid);
+}
+
+/*
+ * getWalSkipEntry: get WAL skip entry.
+ *
+ * Returns WAL skip entry for the relation. The entry tracks WAL-skipping
+ * blocks for the relation. The WAL-skipped blocks need fsync at commit time.
+ * Creates one if needed when create is true. If the rel doesn't support this
+ * feature, returns NULL even if create is true.
+ */
+static inline RelWalSkip *
+getWalSkipEntry(Relation rel, bool create)
+{
+ RelWalSkip *walskip_entry = NULL;
+
+ if (rel->rd_walskip)
+ return rel->rd_walskip;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->rd_nowalskip)
+ return NULL;
+
+ /* check for support for this feature */
+ if (rel->rd_tableam == NULL ||
+ rel->rd_tableam->relation_register_walskip == NULL)
+ {
+ rel->rd_nowalskip = true;
+ return NULL;
+ }
+
+ walskip_entry = getWalSkipEntryRNode(&rel->rd_node, create);
+
+ if (!walskip_entry)
+ {
+ /* prevent further hash lookup */
+ rel->rd_nowalskip = true;
+ return NULL;
+ }
+
+ walskip_entry->forks[MAIN_FORKNUM] = true;
+ walskip_entry->tableam = rel->rd_tableam;
+
+ /* hold shortcut in Relation */
+ rel->rd_nowalskip = false;
+ rel->rd_walskip = walskip_entry;
+
+ return walskip_entry;
+}
+
+/*
+ * getWalSkipEntryRNode: get WAL skip entry by rnode
+ *
+ * Returns a WAL skip entry for the RelFileNode.
+ */
+static RelWalSkip *
+getWalSkipEntryRNode(RelFileNode *rnode, bool create)
+{
+ RelWalSkip *walskip_entry = NULL;
+ bool found;
+
+ if (!walSkipHash)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(RelWalSkip);
+ ctl.hash = tag_hash;
+ walSkipHash = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ walskip_entry = (RelWalSkip *)
+ hash_search(walSkipHash, (void *) rnode,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!walskip_entry)
+ return NULL;
+
+ /* new entry created */
+ if (!found)
+ {
+ memset(&walskip_entry->forks, 0, sizeof(walskip_entry->forks));
+ walskip_entry->wal_log_min_blk = InvalidBlockNumber;
+ walskip_entry->skip_wal_min_blk = InvalidBlockNumber;
+ walskip_entry->create_sxid = GetCurrentSubTransactionId();
+ walskip_entry->invalidate_sxid = InvalidSubTransactionId;
+ walskip_entry->tableam = NULL;
+ }
+
+ return walskip_entry;
}
/*
@@ -506,6 +917,107 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/*
+ * Finish bulk insert of files.
+ */
+void
+smgrFinishBulkInsert(bool isCommit)
+{
+ if (!walSkipHash)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ RelWalSkip *walskip;
+
+ hash_seq_init(&status, walSkipHash);
+
+ while ((walskip = hash_seq_search(&status)) != NULL)
+ {
+ /*
+			 * On commit, process valid entries. On rollback nothing needs to
+			 * be synced, since all changes made during the transaction are
+			 * discarded.
+ */
+ if (walskip->skip_wal_min_blk != InvalidBlockNumber &&
+ walskip->invalidate_sxid == InvalidSubTransactionId)
+ {
+ int f;
+
+ FlushRelationBuffersWithoutRelCache(walskip->relnode, false);
+
+ /*
+ * We mustn't create an entry when the table AM doesn't
+ * support WAL-skipping.
+ */
+ Assert (walskip->tableam->finish_bulk_insert);
+
+ /* flush all requested forks */
+ for (f = MAIN_FORKNUM ; f <= MAX_FORKNUM ; f++)
+ {
+ if (walskip->forks[f])
+ {
+ walskip->tableam->finish_bulk_insert(walskip->relnode, f);
+ STORAGE_elog(DEBUG2, "finishing bulk insert to rel %u/%u/%u fork %d",
+ walskip->relnode.spcNode,
+ walskip->relnode.dbNode,
+ walskip->relnode.relNode, f);
+ }
+ }
+ }
+ }
+ }
+
+ hash_destroy(walSkipHash);
+ walSkipHash = NULL;
+}
+
+/*
+ * Process pending invalidations of WAL skipping that happened in the subtransaction
+ */
+void
+smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+ SubTransactionId parentSubid)
+{
+ HASH_SEQ_STATUS status;
+ RelWalSkip *walskip;
+
+ if (!walSkipHash)
+ return;
+
+ /* We expect that we don't have walSkipHash in almost all cases */
+ hash_seq_init(&status, walSkipHash);
+
+ while ((walskip = hash_seq_search(&status)) != NULL)
+ {
+ if (walskip->create_sxid == mySubid)
+ {
+ /*
+ * The entry was created in this subxact. Remove it on abort, or
+ * on commit after invalidation.
+ */
+ if (!isCommit || walskip->invalidate_sxid == mySubid)
+ hash_search(walSkipHash, &walskip->relnode,
+ HASH_REMOVE, NULL);
+ /* Treat committing valid entry as creation by the parent. */
+ else if (walskip->invalidate_sxid == InvalidSubTransactionId)
+ walskip->create_sxid = parentSubid;
+ }
+ else if (walskip->invalidate_sxid == mySubid)
+ {
+ /*
+			 * This entry was created elsewhere, then invalidated by this
+			 * subxact. On commit, treat the invalidation as made by the
+			 * parent; otherwise cancel the invalidation.
+ */
+ if (isCommit)
+ walskip->invalidate_sxid = parentSubid;
+ else
+ walskip->invalidate_sxid = InvalidSubTransactionId;
+ }
+ }
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
@@ -535,7 +1047,7 @@ PostPrepare_smgr(void)
* Reassign all items in the pending-deletes list to the parent transaction.
*/
void
-AtSubCommit_smgr(void)
+AtSubCommit_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
{
int nestLevel = GetCurrentTransactionNestLevel();
PendingRelDelete *pending;
@@ -545,6 +1057,9 @@ AtSubCommit_smgr(void)
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
}
+
+ /* Remove invalidated WAL skip in this subtransaction */
+ smgrProcessWALSkipInval(true, mySubid, parentSubid);
}
/*
@@ -555,9 +1070,12 @@ AtSubCommit_smgr(void)
* subtransaction will not commit.
*/
void
-AtSubAbort_smgr(void)
+AtSubAbort_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
{
smgrDoPendingDeletes(false);
+
+ /* Remove invalidated WAL skip in this subtransaction */
+ smgrProcessWALSkipInval(false, mySubid, parentSubid);
}
void
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 654179297c..8908b77d98 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11983,8 +11983,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
- RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
/* copy those extra forks that exist */
for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -12002,8 +12001,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM))
log_smgrcreate(&newrnode, forkNum);
- RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, forkNum);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..f00826712a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,40 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3204,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3234,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 64f3c2e887..f06d55a8fe 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -5644,6 +5645,8 @@ load_relcache_init_file(bool shared)
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+ rel->rd_nowalskip = false;
+ rel->rd_walskip = NULL;
/*
* Recompute lock and physical addressing info. This is needed in
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 882dc65c89..83fee7dbfe 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,8 +23,14 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
- ForkNumber forkNum, char relpersistence);
+extern void RelationCopyStorage(Relation srcrel, SMgrRelation dst,
+ ForkNumber forkNum);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern void RecordWALSkipping(Relation rel);
+extern void RecordPendingSync(Relation rel, SMgrRelation srel,
+ ForkNumber forknum);
+extern void RelationInvalidateWALSkip(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
@@ -32,8 +38,11 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
-extern void AtSubCommit_smgr(void);
-extern void AtSubAbort_smgr(void);
+extern void smgrFinishBulkInsert(bool isCommit);
+extern void AtSubCommit_smgr(SubTransactionId mySubid,
+ SubTransactionId parentSubid);
+extern void AtSubAbort_smgr(SubTransactionId mySubid,
+ SubTransactionId parentSubid);
extern void PostPrepare_smgr(void);
#endif /* STORAGE_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..b2b46322b2 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,13 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * rd_nowalskip is true if this relation is known not to skip WAL.
+ * Otherwise we need to ask smgr for an entry if rd_walskip is NULL.
+ */
+ bool rd_nowalskip;
+ struct RelWalSkip *rd_walskip;
} RelationData;
--
2.16.3
v10-0006-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From 3e816b09365dc8d388832460820a3ee2ca58dc5b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:29:23 +0900
Subject: [PATCH 6/7] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on HEAP_INSERT_SKIP_WAL
with the new infrastructure.
---
src/backend/access/heap/heapam.c | 114 +++++++++++++++++++++++--------
src/backend/access/heap/heapam_handler.c | 88 ++++++++++++++++++------
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 28 ++------
src/backend/access/heap/vacuumlazy.c | 6 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/cluster.c | 27 ++++++++
src/backend/commands/copy.c | 15 +++-
src/backend/commands/createas.c | 7 +-
src/backend/commands/matview.c | 7 +-
src/backend/commands/tablecmds.c | 8 ++-
src/include/access/rewriteheap.h | 2 +-
12 files changed, 219 insertions(+), 89 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 267570b461..cc516e599d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,27 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or WAL
+ * archival purposes (i.e. if wal_level=minimal), and we fsync() the file
+ * to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because for
+ * a small number of changes, it's cheaper to just create the WAL records
+ * than fsync()ing the whole relation at COMMIT. It is only worthwhile for
+ * (presumably) large operations like COPY, CLUSTER, or VACUUM FULL. Use
+ * table_relation_register_sync() to initiate such an operation; it will
+ * cause any subsequent updates to the table to skip WAL-logging, if
+ * possible, and cause the heap to be synced to disk at COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -51,6 +72,7 @@
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -1948,7 +1970,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2058,7 +2080,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2066,7 +2087,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2108,6 +2128,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2119,6 +2140,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2671,7 +2693,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2805,6 +2827,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
vmbuffer = InvalidBuffer,
vmbuffer_new = InvalidBuffer;
bool need_toast;
+ bool oldbuf_needs_wal,
+ newbuf_needs_wal;
Size newtupsize,
pagefree;
bool have_tuple_lock = false;
@@ -3356,7 +3380,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -3570,26 +3594,55 @@ l2:
MarkBufferDirty(newbuf);
MarkBufferDirty(buffer);
- /* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ /*
+ * XLOG stuff
+ *
+	 * Emit a heap-update record. When wal_level = minimal, we may instead
+	 * emit an insert or delete record, depending on which buffers need WAL.
+ */
+ oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+ if (newbuf == buffer)
+ newbuf_needs_wal = oldbuf_needs_wal;
+ else
+ newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+ if (oldbuf_needs_wal || newbuf_needs_wal)
{
XLogRecPtr recptr;
/*
* For logical decoding we need combocids to properly decode the
- * catalog.
+ * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+ * when logical decoding is active.
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
{
+ Assert(oldbuf_needs_wal && newbuf_needs_wal);
+
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
}
- recptr = log_heap_update(relation, buffer,
- newbuf, &oldtup, heaptup,
- old_key_tuple,
- all_visible_cleared,
- all_visible_cleared_new);
+ /*
+ * Insert log record. Using delete or insert log loses HOT chain
+ * information but that happens only when newbuf is different from
+ * buffer, where HOT cannot happen.
+ */
+ if (oldbuf_needs_wal && newbuf_needs_wal)
+ recptr = log_heap_update(relation, buffer, newbuf,
+ &oldtup, heaptup,
+ old_key_tuple,
+ all_visible_cleared,
+ all_visible_cleared_new);
+ else if (oldbuf_needs_wal)
+ recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+ xmax_old_tuple, false,
+ all_visible_cleared);
+ else
+ recptr = log_heap_insert(relation, buffer, newtup,
+ 0, all_visible_cleared_new);
+
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4467,7 +4520,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5219,7 +5272,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5379,7 +5432,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
htup->t_ctid = *tid;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5511,7 +5564,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -5620,7 +5673,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7030,8 +7083,8 @@ log_heap_clean(Relation reln, Buffer buffer,
xl_heap_clean xlrec;
XLogRecPtr recptr;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me on non-WAL-logged buffers */
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7078,8 +7131,8 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me on non-WAL-logged buffers */
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7305,8 +7358,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool init;
int bufflags;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me when no buffer needs WAL-logging */
+ Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8910,9 +8963,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. table_relation_register_walskip() should
+ * be used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5c96fc91b7..bddf026b81 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -57,6 +57,9 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static void heapam_relation_register_walskip(Relation rel);
+static void heapam_relation_invalidate_walskip(Relation rel);
+
static const TableAmRoutine heapam_methods;
@@ -541,14 +544,10 @@ tuple_lock_retry:
}
static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_finish_bulk_insert(RelFileNode rnode, ForkNumber forkNum)
{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
+	/* Sync the file immediately */
+ smgrimmedsync(smgropen(rnode, InvalidBackendId), forkNum);
}
@@ -616,6 +615,12 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
dstrel = smgropen(newrnode, rel->rd_backend);
RelationOpenSmgr(rel);
+ /*
+	 * Register WAL-skipping for the relation. If the AM supports the feature,
+	 * WAL-logging is skipped and the file is synced at commit instead.
+ */
+ table_relation_register_walskip(rel);
+
/*
* Create and copy all forks of the relation, and schedule unlinking of
* old physical files.
@@ -626,8 +631,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
- RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
/* copy those extra forks that exist */
for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -645,8 +649,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM))
log_smgrcreate(&newrnode, forkNum);
- RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, forkNum);
}
}
@@ -670,7 +673,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -684,15 +686,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
- Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
-
/* Preallocate values/isnull arrays */
natts = newTupDesc->natts;
values = (Datum *) palloc(natts * sizeof(Datum));
@@ -700,7 +693,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
- MultiXactCutoff, use_wal);
+ MultiXactCutoff);
/* Set up sorting if wanted */
@@ -946,6 +939,55 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pfree(isnull);
}
+/*
+ * heapam_relation_register_walskip - register a heap to be WAL-skipped then
+ * synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file. This makes
+ * note of the current size of the relation, and ensures that when the
+ * relation is extended, any changes to the new blocks in the heap, in the
+ * same transaction, will not be WAL-logged. Instead, the heap contents are
+ * flushed to disk at commit.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+static void
+heapam_relation_register_walskip(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordWALSkipping(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordWALSkipping(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+
+ return;
+}
+
+/*
+ * heapam_relation_invalidate_walskip - invalidate registered WAL skipping
+ *
+ * After some file-replacing operations like CLUSTER, the old file no longer
+ * needs to be synced to disk. This function invalidates the registered
+ * WAL-skipping on the current relfilenode of the relation.
+ */
+static void
+heapam_relation_invalidate_walskip(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RelationInvalidateWALSkip(rel);
+}
+
static bool
heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
BufferAccessStrategy bstrategy)
@@ -2423,6 +2465,8 @@ static const TableAmRoutine heapam_methods = {
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
+ .relation_register_walskip = heapam_relation_register_walskip,
+ .relation_invalidate_walskip = heapam_relation_invalidate_walskip,
.relation_vacuum = heap_vacuum_rel,
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "access/xloginsert.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "lib/ilist.h"
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
(char *) state->rs_buffer, true);
}
- /*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
- * reason is the same as in tablecmds.c's copy_relation_data(): we're
- * writing data that's not in shared buffers, and so a CHECKPOINT
- * occurring during the rewriteheap operation won't have fsync'd data we
- * wrote before the checkpoint.
- */
- if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
logical_end_heap_rewrite(state);
@@ -654,9 +639,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b5b464e4a9..45139ec70e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -945,7 +945,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1209,7 +1209,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1591,7 +1591,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4f4be1efbf..b5db26fda5 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -612,6 +612,18 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence,
AccessExclusiveLock);
+ /*
+	 * If wal_level is minimal, we skip WAL-logging even for WAL-logged
+	 * relations. The filenode is synced at commit instead.
+ */
+ if (!XLogIsNeeded())
+ {
+ /* make_new_heap doesn't lock OIDNewHeap */
+ Relation newheap = table_open(OIDNewHeap, AccessShareLock);
+ table_relation_register_walskip(newheap);
+ table_close(newheap, AccessShareLock);
+ }
+
/* Copy the heap data into the new table in the desired order */
copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -1355,6 +1367,21 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
+ /*
+	 * Unregister the now-useless pending file sync.
+	 * table_relation_invalidate_walskip relies on the premise that the
+	 * relation cache has the correct relfilenode and related members. After
+	 * swap_relation_files, the relcache entries for the heaps become
+	 * inconsistent with the pg_class entries, so we must do this before
+	 * that call.
+ */
+ if (!XLogIsNeeded())
+ {
+ Relation oldheap = table_open(OIDOldHeap, AccessShareLock);
+
+ table_relation_invalidate_walskip(oldheap);
+ table_close(oldheap, AccessShareLock);
+ }
+
/*
* Swap the contents of the heap relations (including any toast tables).
* Also set old heap's relfrozenxid to frozenXid.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c1fd7b78ce..6a85ab890e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2437,9 +2437,13 @@ CopyFrom(CopyState cstate)
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
{
- ti_options |= TABLE_INSERT_SKIP_FSM;
+ /*
+ * We can skip WAL-logging the insertions, unless PITR or streaming
+ * replication is in use. We can skip the FSM in any case.
+ */
if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
+ table_relation_register_walskip(cstate->rel);
+ ti_options |= TABLE_INSERT_SKIP_FSM;
}
/*
@@ -3106,7 +3110,12 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
- table_finish_bulk_insert(cstate->rel, ti_options);
+ /*
+ * If we skipped writing WAL, then we will sync the heap at the end of
+	 * the transaction. (We used to do it here, but it was later found that,
+	 * to be safe, we must also avoid WAL-logging any subsequent actions on
+	 * the pages we skipped WAL for.) Indexes always use WAL.
+ */
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..8b73654413 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ table_relation_register_walskip(intoRelationDesc);
+ myState->ti_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,7 +605,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 2aac63296b..33b7bc4c16 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -462,9 +462,10 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
+ table_relation_register_walskip(transientrel);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
+
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,7 +510,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8908b77d98..deb147c45a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4716,7 +4716,11 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
ti_options = TABLE_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
+ {
+			/* Forget the old relation's registered sync */
+ table_relation_invalidate_walskip(oldrel);
+ table_relation_register_walskip(newrel);
+ }
}
else
{
@@ -5000,7 +5004,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
+		/* If we skipped writing WAL, we will sync the relation at commit. */
table_close(newrel, NoLock);
}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
--
2.16.3
v10-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch (text/x-patch; charset=us-ascii)
From f4a0cc5382805500c3db3d4ec2231cee383841f3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:31:33 +0900
Subject: [PATCH 7/7] Remove TABLE/HEAP_INSERT_SKIP_WAL
Remove no-longer-used symbol TABLE/HEAP_INSERT_SKIP_WAL.
---
src/include/access/heapam.h | 3 +--
src/include/access/tableam.h | 11 +++--------
2 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c077755d5..5b084c2f5a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1a3a3c6711..b5203dd485 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -100,10 +100,9 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
-#define TABLE_INSERT_SKIP_FSM 0x0002
-#define TABLE_INSERT_FROZEN 0x0004
-#define TABLE_INSERT_NO_LOGICAL 0x0008
+#define TABLE_INSERT_SKIP_FSM 0x0001
+#define TABLE_INSERT_FROZEN 0x0002
+#define TABLE_INSERT_NO_LOGICAL 0x0004
/* flag bits for table_lock_tuple */
/* Follow tuples whose update is in progress if lock modes don't conflict */
@@ -1017,10 +1016,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
--
2.16.3
On Tue, Apr 2, 2019 at 6:54 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> > chain and infomask bits that were present before crash recovery. If that's
> > okay in these circumstances, please write a comment explaining why.
>
> Sounds reasonable. Added a comment. (Honestly I completely forgot
> about that.. Thanks!) (0006)
If you haven't already, I think you should set up a master and a
standby and wal_consistency_checking=all and run tests of this feature
on the master and see if anything breaks on the master or the standby.
I'm not sure that emitting an insert or delete record is going to
reproduce the exact same state on the standby that exists on the
master.
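For reference, here is a minimal sketch of that kind of primary/standby check, in the same TAP style as the attached 018_wal_optimize.pl (this sketch is not part of the patch set; the node and table names are invented, and since a streaming standby needs wal_level = replica or higher, it does not by itself reach the minimal-only code paths):

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Hypothetical sketch: wal_consistency_checking adds full-page images to WAL
# records and makes redo compare each replayed page against them, so any
# replay inconsistency aborts recovery on the standby.
my $primary = get_new_node('primary');
$primary->init(allows_streaming => 1);
$primary->append_conf('postgresql.conf', qq(
wal_level = replica
wal_consistency_checking = 'all'
));
$primary->start;

$primary->backup('bkp');
my $standby = get_new_node('standby');
$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
$standby->start;

# Create, fill and update a table in one transaction on the primary.
$primary->safe_psql('postgres', "
	BEGIN;
	CREATE TABLE t (id int PRIMARY KEY, v int);
	INSERT INTO t SELECT g, g FROM generate_series(1, 1000) g;
	UPDATE t SET v = v + 1;
	COMMIT;");

# A consistency-check failure would stop replay before this catches up.
$primary->wait_for_catchup($standby, 'replay', $primary->lsn('insert'));
is($standby->safe_psql('postgres', 'SELECT sum(v) FROM t;'),
   $primary->safe_psql('postgres', 'SELECT sum(v) FROM t;'),
   'standby content matches primary after replay');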
+ * Insert log record. Using delete or insert log loses HOT chain
+ * information but that happens only when newbuf is different from
+ * buffer, where HOT cannot happen.
"HOT chain information" seems pretty vague.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for looking at this.
At Wed, 3 Apr 2019 10:16:02 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYEST4xYaU10gM=XXeA-oxbFh=qSfy0X4PXDCWubcgj=g@mail.gmail.com>
> On Tue, Apr 2, 2019 at 6:54 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> > > chain and infomask bits that were present before crash recovery. If that's
> > > okay in these circumstances, please write a comment explaining why.
> >
> > Sounds reasonable. Added a comment. (Honestly I completely forgot
> > about that.. Thanks!) (0006)
>
> If you haven't already, I think you should set up a master and a
> standby and wal_consistency_checking=all and run tests of this feature
> on the master and see if anything breaks on the master or the standby.
> I'm not sure that emitting an insert or delete record is going to
> reproduce the exact same state on the standby that exists on the
> master.
All of this patch applies only to wal_level = minimal; it doesn't change
behavior in other cases. Updates are always replicated as
XLOG_HEAP_(HOT_)UPDATE. Crash recovery cases involving log_insert
or log_update are exercised by the TAP test.
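Purely as an illustration of that kind of crash-recovery check (not one of the attached tests), a sketch in the style of 018_wal_optimize.pl follows; the table name, file name and exact statements are invented, and it relies on patch 0006's behavior that COPY into a table created in the same transaction registers WAL skipping under wal_level = minimal:

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Hypothetical sketch: crash after a mixed WAL-logged/WAL-skipped UPDATE.
my $node = get_new_node('minimal');
$node->init;
$node->append_conf('postgresql.conf', "wal_level = minimal\n");
$node->start;

# Data file for the COPY that registers WAL skipping on the new table.
my $copy_file = $node->basedir . '/upd_data.txt';
TestLib::append_to_file($copy_file, "20001,0\n20002,0\n20003,0");

# The UPDATE is intended to touch both WAL-logged blocks (filled by the
# INSERT before WAL skipping was registered) and WAL-skipped blocks
# (extended afterwards), mirroring the test3b case in the attached TAP test.
$node->safe_psql('postgres', "
	BEGIN;
	CREATE TABLE upd (id int PRIMARY KEY, v int);
	INSERT INTO upd SELECT g, 0 FROM generate_series(1, 10000) g;
	COPY upd FROM '$copy_file' DELIMITER ',';
	UPDATE upd SET v = v + 1;
	COMMIT;");

$node->stop('immediate');	# simulate a crash
$node->start;

is($node->safe_psql('postgres', "SELECT count(*) FROM upd WHERE v = 1;"),
   '10003', 'updated rows survive crash recovery with wal_level = minimal');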
> > + * Insert log record. Using delete or insert log loses HOT chain
> > + * information but that happens only when newbuf is different from
> > + * buffer, where HOT cannot happen.
>
> "HOT chain information" seems pretty vague.
Thanks. Actually I was a bit uneasy with "information". Does the
following make sense?
* Insert log record, using delete or insert instead of update log
* when only one of the two buffers needs WAL-logging. If this were a
* HOT-update, redoing the WAL record would result in a broken
* hot-chain. However, that never happens because updates complete on
* a single page always use log_update.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Apr 3, 2019 at 10:03 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
* Insert log record, using delete or insert instead of update log
* when only one of the two buffers needs WAL-logging. If this were a
* HOT-update, redoing the WAL record would result in a broken
* hot-chain. However, that never happens because updates complete on
* a single page always use log_update.
It makes sense grammatically, but I'm not sure I believe that it's
sound technically. Even though it's only used in the non-HOT case,
it's still important that the CTID, XMIN, and XMAX fields are set
correctly during both normal operation and recovery.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Thu, 4 Apr 2019 10:52:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZE0jW0jbQxAtoJgJNwrR1hyx3x8pUjQr=ggenLxnPoEQ@mail.gmail.com>
> On Wed, Apr 3, 2019 at 10:03 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > * Insert log record, using delete or insert instead of update log
> > * when only one of the two buffers needs WAL-logging. If this were a
> > * HOT-update, redoing the WAL record would result in a broken
> > * hot-chain. However, that never happens because updates complete on
> > * a single page always use log_update.
>
> It makes sense grammatically, but I'm not sure I believe that it's

Great to hear that! I rewrote it as the following.
+ * Insert log record. When we are not running WAL-skipping, always use
+ * update log. Otherwise use delete or insert log instead when only
+ * one of the two buffers needs WAL-logging. If this were a
+ * HOT-update, redoing the WAL record would result in a broken
+ * hot-chain. However, that never happens because updates that complete
+ * on a single page always use log_update.
+ *
+ * Using delete or insert log in place of update log leads to an
+ * inconsistent series of WAL records. But note that WAL-skipping
+ * happens only when we are updating a tuple in a relation that has
+ * been created in the same transaction. Once committed, the WAL records
+ * recover the same state of the relation as the synced state at the
+ * commit. Or the maybe-broken relation due to a crash before commit
+ * will be removed in recovery.
> sound technically. Even though it's only used in the non-HOT case,
> it's still important that the CTID, XMIN, and XMAX fields are set
> correctly during both normal operation and recovery.
log_heap_delete()/log_heap_update() record the infomasks of the
deleted tuple as is. Xmax is stored from the same
variable. offnum is taken from the deleted tuple, the buffer is
registered, and xlrec.flags is set to the same value. As a
result, Xmax, the infomasks and the ctid are restored to the same state by
heap_xlog_delete(). I didn't add a comment about that.
log_heap_insert()/log_heap_update() record the infomasks of the
inserted tuple as is. Xmin/Cmin and ctid-related info are handled
the same way. But log_heap_insert() assumes that Xmax is
invalid; a valid Xmax can appear only when another transaction can see
the tuple, which is not the case here. I added a comment and an assertion
before calling log_heap_insert().
+ * Coming here means that the old tuple is invisible and
+ * inoperable to another transaction. So xmax_new_tuple is
+ * expected to be InvalidTransactionId here.
+ */
+ Assert (xmax_new_tuple == InvalidTransactionId);
+ recptr = log_heap_insert(relation, buffer, newtup,
I noticed that I accidentally moved log_heap_new_cid stuff to
log_heap_insert/delete(). I restored them.
The attached v11 is the new version, addressing the above points and
rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v11-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 12a6bc81a98bd15d7c8059c797fdca558d82f0d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/7] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging for TRUNCATE and COPY is optimized in some cases, and those
+# optimizations can interact badly with each other depending on the
+# wal_level setting, particularly with "minimal" or "replica". The
+# optimization may be enabled or disabled depending on the scenarios dealt
+# with here, and should never result in any type of failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+	# Set up a node with the wal_level under test
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # Like the previous test, but RELEASE the subtransaction instead of
+ # rolling it back.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in released subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in nested subtransactions");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with TRUNCATE triggers");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v11-0002-Write-WAL-for-empty-nbtree-index-build.patch (text/x-patch)
From 694d146936a0fe0943854b7ca81a59b251fa9c2a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/7] Write WAL for empty nbtree index build
After a relation truncation, its indexes are also rebuilt. The rebuild
does not emit WAL in minimal mode, so if the truncation happened within
the index's creation transaction, crash recovery leaves an empty index
file, which is considered broken. This patch forces WAL to be emitted
when index_build produces an empty nbtree index.
---
src/backend/access/nbtree/nbtsort.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9ac4c1e1c0..a31d58025f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -654,8 +654,16 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
/* Ensure rd_smgr is open (could have been closed by relcache flush!) */
RelationOpenSmgr(wstate->index);
- /* XLOG stuff */
- if (wstate->btws_use_wal)
+ /* XLOG stuff
+ *
+ * Even when wal_level is minimal, WAL is required here if the relation
+ * was truncated after being created in the same transaction. This is
+ * hacky, but we cannot use the BufferNeedsWAL() machinery for nbtree
+ * since it can emit atomic WAL records covering multiple buffers.
+ */
+ if (wstate->btws_use_wal ||
+ (RelationNeedsWAL(wstate->index) &&
+ (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
{
/* We use the heap NEWPAGE record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
--
2.16.3
v11-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch (text/x-patch)
From 0d2a38f20dabec2d87d7d021b3d0cc12c3fa016b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/7] Move XLOG stuff from heap_insert and heap_delete
A succeeding commit makes heap_update emit insert and delete WAL
records. Move the XLOG code for insert and delete out into separate
functions so that heap_update can reuse them.
---
src/backend/access/heap/heapam.c | 252 ++++++++++++++++++++++-----------------
1 file changed, 145 insertions(+), 107 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a07ad..223be30eb3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -72,6 +72,11 @@
static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
@@ -1875,6 +1880,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
+ Page page;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
@@ -1911,16 +1917,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ page = BufferGetPage(buffer);
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
(options & HEAP_INSERT_SPECULATIVE) != 0);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllVisible(page);
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1942,12 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
- xl_heap_insert xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
- Page page = BufferGetPage(buffer);
- uint8 info = XLOG_HEAP_INSERT;
- int bufflags = 0;
/*
* If this is a catalog, we need to transmit combocids to properly
@@ -1956,61 +1959,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
- /*
- * If this is the single and first tuple on page, we can reinit the
- * page instead of restoring the whole thing. Set flag, and hide
- * buffer references from XLogInsert.
- */
- if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
- PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
- {
- info |= XLOG_HEAP_INIT_PAGE;
- bufflags |= REGBUF_WILL_INIT;
- }
-
- xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
- if (options & HEAP_INSERT_SPECULATIVE)
- xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
- Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
- /*
- * For logical decoding, we need the tuple even if we're doing a full
- * page write, so make sure it's included even if we take a full-page
- * image. (XXX We could alternatively store a pointer into the FPW).
- */
- if (RelationIsLogicallyLogged(relation) &&
- !(options & HEAP_INSERT_NO_LOGICAL))
- {
- xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
- bufflags |= REGBUF_KEEP_DATA;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
- xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
- xlhdr.t_infomask = heaptup->t_data->t_infomask;
- xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
- /*
- * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
- * write the whole page to the xlog, we don't need to store
- * xl_heap_header in the xlog.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
- XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- XLogRegisterBufData(0,
- (char *) heaptup->t_data + SizeofHeapTupleHeader,
- heaptup->t_len - SizeofHeapTupleHeader);
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, info);
+ recptr = log_heap_insert(relation, buffer, heaptup,
+ options, all_visible_cleared);
PageSetLSN(page, recptr);
}
@@ -2733,58 +2683,15 @@ l1:
*/
if (RelationNeedsWAL(relation))
{
- xl_heap_delete xlrec;
- xl_heap_header xlhdr;
XLogRecPtr recptr;
/* For logical decode we need combocids to properly decode the catalog */
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = 0;
- if (all_visible_cleared)
- xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
- if (changingPart)
- xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
- xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
- tp.t_data->t_infomask2);
- xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
- xlrec.xmax = new_xmax;
-
- if (old_key_tuple != NULL)
- {
- if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
- else
- xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
- }
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
- /*
- * Log replica identity of the deleted tuple if there is one
- */
- if (old_key_tuple != NULL)
- {
- xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
- xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
- xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
- XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
- XLogRegisterData((char *) old_key_tuple->t_data
- + SizeofHeapTupleHeader,
- old_key_tuple->t_len
- - SizeofHeapTupleHeader);
- }
-
- /* filtering by origin on a row level is much more efficient */
- XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
- recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+ recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+ changingPart, all_visible_cleared);
PageSetLSN(page, recptr);
}
@@ -7248,6 +7155,137 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
return recptr;
}
+/*
+ * Perform XLogInsert for a heap-insert operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+static XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+ HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+ xl_heap_insert xlrec;
+ xl_heap_header xlhdr;
+ uint8 info = XLOG_HEAP_INSERT;
+ int bufflags = 0;
+ Page page = BufferGetPage(buffer);
+
+ /*
+ * If this is the single and first tuple on page, we can reinit the
+ * page instead of restoring the whole thing. Set flag, and hide
+ * buffer references from XLogInsert.
+ */
+ if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+ PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+ {
+ info |= XLOG_HEAP_INIT_PAGE;
+ bufflags |= REGBUF_WILL_INIT;
+ }
+
+ xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (options & HEAP_INSERT_SPECULATIVE)
+ xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+ Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+ /*
+ * For logical decoding, we need the tuple even if we're doing a full
+ * page write, so make sure it's included even if we take a full-page
+ * image. (XXX We could alternatively store a pointer into the FPW).
+ */
+ if (RelationIsLogicallyLogged(relation) &&
+ !(options & HEAP_INSERT_NO_LOGICAL))
+ {
+ xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+ bufflags |= REGBUF_KEEP_DATA;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+ xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+ xlhdr.t_infomask = heaptup->t_data->t_infomask;
+ xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+ /*
+ * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+ * write the whole page to the xlog, we don't need to store
+ * xl_heap_header in the xlog.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+ XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+ /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+ XLogRegisterBufData(0,
+ (char *) heaptup->t_data + SizeofHeapTupleHeader,
+ heaptup->t_len - SizeofHeapTupleHeader);
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-delete operation. Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+ HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+ bool changingPart, bool all_visible_cleared)
+{
+ xl_heap_delete xlrec;
+ xl_heap_header xlhdr;
+
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changingPart)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+ xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+ tp->t_data->t_infomask2);
+ xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+ xlrec.xmax = new_xmax;
+
+ if (old_key_tuple != NULL)
+ {
+ if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+ else
+ xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+ /*
+ * Log replica identity of the deleted tuple if there is one
+ */
+ if (old_key_tuple != NULL)
+ {
+ xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+ xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+ xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+ XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+ XLogRegisterData((char *) old_key_tuple->t_data
+ + SizeofHeapTupleHeader,
+ old_key_tuple->t_len
+ - SizeofHeapTupleHeader);
+ }
+
+ /* filtering by origin on a row level is much more efficient */
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+ return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
/*
* Perform XLogInsert for a heap-update operation. Caller must already
* have modified the buffer(s) and marked them dirty.
--
2.16.3
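For context on where 0003 is headed: with the insert and delete XLOG code
factored out, a later patch can let heap_update decide per page what to
log. The following is only a rough sketch of that call pattern, not code
taken from the attached patches; the variable names just follow
heap_update's conventions, and BufferNeedsWAL() is the per-buffer test
added later in this series.

	/* sketch only: per-buffer WAL decisions inside heap_update */
	if (RelationNeedsWAL(relation))
	{
		XLogRecPtr	recptr;

		if (BufferNeedsWAL(relation, buffer) &&
			BufferNeedsWAL(relation, newbuf))
		{
			/* both pages still need WAL: emit a normal update record */
			recptr = log_heap_update(relation, buffer, newbuf,
									 &oldtup, heaptup, old_key_tuple,
									 all_visible_cleared,
									 all_visible_cleared_new);
			PageSetLSN(BufferGetPage(newbuf), recptr);
			PageSetLSN(BufferGetPage(buffer), recptr);
		}
		else if (BufferNeedsWAL(relation, buffer))
		{
			/* only the old page needs WAL: log just the deletion side */
			recptr = log_heap_delete(relation, buffer, &oldtup,
									 old_key_tuple, xmax_old_tuple,
									 false, all_visible_cleared);
			PageSetLSN(BufferGetPage(buffer), recptr);
		}
		else if (BufferNeedsWAL(relation, newbuf))
		{
			/* only the new page needs WAL: log just the insertion side */
			recptr = log_heap_insert(relation, newbuf, heaptup, 0,
									 all_visible_cleared_new);
			PageSetLSN(BufferGetPage(newbuf), recptr);
		}
	}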
v11-0004-Add-new-interface-to-TableAmRoutine.patch (text/x-patch)
From 9e1295b47c3a55298b96e183f158328c29d1adf8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 11:53:36 +0900
Subject: [PATCH 4/7] Add new interface to TableAmRoutine
Add two interface functions to TableAmRoutine related to the
WAL-skipping feature.
---
src/backend/access/table/tableamapi.c | 4 ++
src/include/access/tableam.h | 79 +++++++++++++++++++++++------------
2 files changed, 56 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index bfd713f3af..56b5d521de 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -93,6 +93,10 @@ GetTableAmRoutine(Oid amhandler)
(routine->scan_bitmap_next_tuple == NULL));
Assert(routine->scan_sample_next_block != NULL);
Assert(routine->scan_sample_next_tuple != NULL);
+ Assert((routine->relation_register_walskip == NULL) ==
+ (routine->relation_invalidate_walskip == NULL) &&
+ (routine->relation_register_walskip == NULL) ==
+ (routine->finish_bulk_insert == NULL));
return routine;
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a647e7db32..38a00d8823 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -389,19 +389,15 @@ typedef struct TableAmRoutine
/*
* Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert, or page-level copying performed by an
+ * ALTER TABLE rewrite. This is called at commit time if WAL-skipping is
+ * activated and the caller decided that finish work is required for the
+ * file.
*
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags the apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert
- * that make sense for a specific AM.
- *
- * Optional callback.
+ * Optional callback. Must be provided when relation_register_walskip is
+ * provided.
*/
- void (*finish_bulk_insert) (Relation rel, int options);
-
+ void (*finish_bulk_insert) (RelFileNode rnode, ForkNumber forkNum);
/* ------------------------------------------------------------------------
* DDL related functionality.
@@ -454,6 +450,26 @@ typedef struct TableAmRoutine
double *tups_vacuumed,
double *tups_recently_dead);
+ /*
+ * Register WAL-skipping on the current storage of rel. WAL-logging on the
+ * relation is skipped and the storage will be synced at commit. If
+ * registration succeeds, finish_bulk_insert() is called on the storage at
+ * commit.
+ *
+ * Optional callback.
+ */
+ void (*relation_register_walskip) (Relation rel);
+
+ /*
+ * Invalidate registered WAL skipping on the current storage of rel. The
+ * function is called when the storage of the relation is going to be
+ * out-of-use after commit.
+ *
+ * Optional callback. Must be provided when relation_register_walskip is
+ * provided.
+ */
+ void (*relation_invalidate_walskip) (Relation rel);
+
/*
* React to VACUUM command on the relation. The VACUUM might be user
* triggered or by autovacuum. The specific actions performed by the AM
@@ -1034,8 +1050,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
*
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1231,20 +1246,6 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
/* ------------------------------------------------------------------------
* DDL related functionality.
@@ -1328,6 +1329,30 @@ table_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tups_recently_dead);
}
+/*
+ * Register WAL-skipping for the relation. WAL-logging is skipped for new
+ * pages after this call and the relation file is going to be synced at
+ * commit.
+ */
+static inline void
+table_relation_register_walskip(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->relation_register_walskip)
+ rel->rd_tableam->relation_register_walskip(rel);
+}
+
+/*
+ * Unregister WAL-skipping for the relation. Call this when the relation is
+ * going to be out of use after commit. WAL-skipping continues, but the
+ * relation won't be synced at commit.
+ */
+static inline void
+table_relation_invalidate_walskip(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->relation_invalidate_walskip)
+ rel->rd_tableam->relation_invalidate_walskip(rel);
+}
+
/*
* Perform VACUUM on the relation. The VACUUM can be user triggered or by
* autovacuum. The specific actions performed by the AM will depend heavily on
--
2.16.3
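To make the optional-callback contract in 0004 concrete: a table AM that
wants WAL skipping has to provide all three callbacks together, which is
what the new Assert in GetTableAmRoutine() enforces. Below is a minimal
sketch of the wiring; the heapam_* names are placeholders rather than
identifiers from the attached patches, and the bookkeeping functions they
call are the ones added in 0005.

	static void
	heapam_relation_register_walskip(Relation rel)
	{
		/* start skipping WAL for blocks added from now on (see 0005) */
		RecordWALSkipping(rel);
	}

	static void
	heapam_relation_invalidate_walskip(Relation rel)
	{
		/* the storage goes away at commit; no need to sync it */
		RelationInvalidateWALSkip(rel);
	}

	static void
	heapam_finish_bulk_insert(RelFileNode rnode, ForkNumber forkNum)
	{
		/* fsync the fork whose changes were not WAL-logged */
		smgrimmedsync(smgropen(rnode, InvalidBackendId), forkNum);
	}

	static const TableAmRoutine heapam_methods = {
		/* ... other callbacks ... */
		.relation_register_walskip = heapam_relation_register_walskip,
		.relation_invalidate_walskip = heapam_relation_invalidate_walskip,
		.finish_bulk_insert = heapam_finish_bulk_insert,
	};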
v11-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch (text/x-patch)
From 0b8b3ce573e27941692ac5462db7cd6f8d0b2209 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 18:05:10 +0900
Subject: [PATCH 5/7] Add infrastructure to WAL-logging skip feature
We used to optimize WAL-logging for truncation of tables created in the
same transaction in minimal mode just by signaling it with the
HEAP_INSERT_SKIP_WAL option on heap operations. That mechanism can emit
WAL records that result in a corrupt state for certain sequences of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation as well as in-transaction
truncations. table_relation_register_walskip() should be used to start
tracking before batch operations like COPY and CLUSTER,
BufferNeedsWAL() should be used instead of RelationNeedsWAL() at the
places that decide on WAL-logging of heap-modifying operations, and the
call to table_finish_bulk_insert() and the old tableam interface are
removed.
---
src/backend/access/transam/xact.c | 12 +-
src/backend/catalog/storage.c | 612 +++++++++++++++++++++++++++++++++---
src/backend/commands/tablecmds.c | 6 +-
src/backend/storage/buffer/bufmgr.c | 39 ++-
src/backend/utils/cache/relcache.c | 3 +
src/include/catalog/storage.h | 17 +-
src/include/storage/bufmgr.h | 2 +
src/include/utils/rel.h | 7 +
8 files changed, 631 insertions(+), 67 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd5024ef00..a2c689f414 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2111,6 +2111,9 @@ CommitTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrFinishBulkInsert(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2343,6 +2346,9 @@ PrepareTransaction(void)
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+ /* Flush updates to relations that we didn't WAL-log */
+ smgrFinishBulkInsert(true);
+
/*
* Mark serializable transaction as complete for predicate locking
* purposes. This should be done as late as we can put it and still allow
@@ -2668,6 +2674,7 @@ AbortTransaction(void)
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();
+ smgrFinishBulkInsert(false); /* abandon pending syncs */
/*
* Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4801,8 +4808,7 @@ CommitSubTransaction(void)
AtEOSubXact_RelationCache(true, s->subTransactionId,
s->parent->subTransactionId);
AtEOSubXact_Inval(true);
- AtSubCommit_smgr();
-
+ AtSubCommit_smgr(s->subTransactionId, s->parent->subTransactionId);
/*
* The only lock we actually release here is the subtransaction XID lock.
*/
@@ -4979,7 +4985,7 @@ AbortSubTransaction(void)
ResourceOwnerRelease(s->curTransactionOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
- AtSubAbort_smgr();
+ AtSubAbort_smgr(s->subTransactionId, s->parent->subTransactionId);
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 72242b2476..4cd112f86c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -21,6 +21,7 @@
#include "miscadmin.h"
+#include "access/tableam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
@@ -29,10 +30,18 @@
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
-#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+ /* #define STORAGEDEBUG */ /* turns DEBUG elogs on */
+
+#ifdef STORAGEDEBUG
+#define STORAGE_elog(...) elog(__VA_ARGS__)
+#else
+#define STORAGE_elog(...)
+#endif
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -64,6 +73,61 @@ typedef struct PendingRelDelete
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a RelWalSkip entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any
+ * operations on blocks < skip_wal_min_blk need to be WAL-logged as usual, but
+ * for operations on higher blocks, WAL-logging is skipped.
+ *
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalSkip
+{
+ RelFileNode relnode; /* relation created in same xact */
+ bool forks[MAX_FORKNUM + 1]; /* target forknums */
+ BlockNumber skip_wal_min_blk; /* WAL-logging skipped for blocks >=
+ * skip_wal_min_blk */
+ BlockNumber wal_log_min_blk; /* The minimum blk number that requires
+ * WAL-logging even if skipped by the
+ * above */
+ SubTransactionId create_sxid; /* subxid where this entry is created */
+ SubTransactionId invalidate_sxid; /* subxid where this entry is
+ * invalidated */
+ const TableAmRoutine *tableam; /* Table access routine */
+} RelWalSkip;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *walSkipHash = NULL;
+
+static RelWalSkip *getWalSkipEntry(Relation rel, bool create);
+static RelWalSkip *getWalSkipEntryRNode(RelFileNode *node,
+ bool create);
+static void smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+ SubTransactionId parentSubid);
+
/*
* RelationCreateStorage
* Create physical storage for a relation.
@@ -261,31 +325,59 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (RelationNeedsWAL(rel))
{
- /*
- * Make an XLOG entry reporting the file truncation.
- */
- XLogRecPtr lsn;
- xl_smgr_truncate xlrec;
+ RelWalSkip *walskip;
- xlrec.blkno = nblocks;
- xlrec.rnode = rel->rd_node;
- xlrec.flags = SMGR_TRUNCATE_ALL;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ /* get pending sync entry, create if not yet */
+ walskip = getWalSkipEntry(rel, true);
/*
- * Flush, because otherwise the truncation of the main relation might
- * hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
- * contain entries for the non-existent heap pages.
+ * walskip is NULL here if rel doesn't support WAL-logging skip;
+ * otherwise check the WAL-skipping status.
*/
- if (fsm || vm)
- XLogFlush(lsn);
+ if (walskip == NULL ||
+ walskip->skip_wal_min_blk == InvalidBlockNumber ||
+ walskip->skip_wal_min_blk < nblocks)
+ {
+ /*
+ * If WAL-skipping is enabled, this is the first time truncation
+ * of this relation in this transaction or truncation that leaves
+ * pages that need at-commit fsync. Make an XLOG entry reporting
+ * the file truncation.
+ */
+ XLogRecPtr lsn;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = nblocks;
+ xlrec.rnode = rel->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ lsn = XLogInsert(RM_SMGR_ID,
+ XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ STORAGE_elog(DEBUG2,
+ "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, nblocks);
+ /*
+ * Flush, because otherwise the truncation of the main relation
+ * might hit the disk before the WAL record, and the truncation of
+ * the FSM or visibility map. If we crashed during that window,
+ * we'd be left with a truncated heap, but the FSM or visibility
+ * map would still contain entries for the non-existent heap
+ * pages.
+ */
+ if (fsm || vm)
+ XLogFlush(lsn);
+
+ if (walskip)
+ {
+ /* no longer skip WAL-logging for the blocks */
+ walskip->wal_log_min_blk = nblocks;
+ }
+ }
}
/* Do the real work */
@@ -296,8 +388,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* Copy a fork's data, block by block.
*/
void
-RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
- ForkNumber forkNum, char relpersistence)
+RelationCopyStorage(Relation srcrel, SMgrRelation dst, ForkNumber forkNum)
{
PGAlignedBlock buf;
Page page;
@@ -305,6 +396,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
bool copying_initfork;
BlockNumber nblocks;
BlockNumber blkno;
+ SMgrRelation src = srcrel->rd_smgr;
+ char relpersistence = srcrel->rd_rel->relpersistence;
page = (Page) buf.data;
@@ -316,12 +409,33 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM;
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a permanent relation.
- */
- use_wal = XLogIsNeeded() &&
- (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+ if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+ {
+ /*
+ * We need to log the copied data in WAL iff WAL archiving/streaming
+ * is enabled AND it's a permanent relation.
+ */
+ if (XLogIsNeeded())
+ use_wal = true;
+
+ /*
+ * If the rel is WAL-logged, it must be fsync'd before commit; we
+ * register a pending sync here so that happens at commit. (For a
+ * temp or unlogged rel we don't care since the data will be gone
+ * after a crash anyway.)
+ *
+ * It's obvious that we must do this when not WAL-logging the
+ * copy. It's less obvious that we have to do it even if we did
+ * WAL-log the copied pages. The reason is that since we're copying
+ * outside shared buffers, a CHECKPOINT occurring during the copy has
+ * no way to flush the previously written data to disk (indeed it
+ * won't know the new rel even exists). A crash later on would replay
+ * WAL from the checkpoint, therefore it wouldn't replay our earlier
+ * WAL entries. If we do not fsync those pages here, they might still
+ * not be on disk when the crash occurs.
+ */
+ RecordPendingSync(srcrel, dst, forkNum);
+ }
nblocks = smgrnblocks(src, forkNum);
@@ -358,24 +472,321 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
*/
smgrextend(dst, forkNum, blkno, buf.data, true);
}
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+ BlockNumber blkno = InvalidBlockNumber;
+ RelWalSkip *walskip;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walskip = getWalSkipEntry(rel, false);
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too. (For a temp or
- * unlogged rel we don't care since the data will be gone after a crash
- * anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the copy. It's
- * less obvious that we have to do it even if we did WAL-log the copied
- * pages. The reason is that since we're copying outside shared buffers, a
- * CHECKPOINT occurring during the copy has no way to flush the previously
- * written data to disk (indeed it won't know the new rel even exists). A
- * crash later on would replay WAL from the checkpoint, therefore it
- * wouldn't replay our earlier WAL entries. If we do not fsync those pages
- * here, they might still not be on disk when the crash occurs.
+ * no point in doing further work if we know that we don't skip
+ * WAL-logging.
*/
- if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ if (!walskip)
+ {
+ STORAGE_elog(DEBUG2,
+ "not skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, BufferGetBlockNumber(buf));
+ return true;
+ }
+
+ Assert(BufferIsValid(buf));
+
+ blkno = BufferGetBlockNumber(buf);
+
+ /*
+ * Don't skip WAL-logging if skipping isn't active yet, or for blocks that existed before it was registered.
+ */
+ if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+ walskip->skip_wal_min_blk > blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+ return true;
+ }
+
+ /*
+ * we don't skip WAL-logging for blocks that have got WAL-logged
+ * truncation
+ */
+ if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+ walskip->wal_log_min_blk <= blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+ return true;
+ }
+
+ STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno);
+
+ return false;
+}
+
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+ RelWalSkip *walskip;
+
+ if (!RelationNeedsWAL(rel))
+ return false;
+
+ /* fetch existing pending sync entry */
+ walskip = getWalSkipEntry(rel, false);
+
+ /*
+ * no point in doing further work if we know that we don't skip
+ * WAL-logging.
+ */
+ if (!walskip)
+ return true;
+
+ /*
+ * Don't skip WAL-logging if skipping isn't active yet, or for blocks that existed before it was registered.
+ */
+ if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+ walskip->skip_wal_min_blk > blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+ return true;
+ }
+
+ /*
+ * we don't skip WAL-logging for blocks that have got WAL-logged
+ * truncation
+ */
+ if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+ walskip->wal_log_min_blk <= blkno)
+ {
+ STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+
+ return true;
+ }
+
+ STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, blkno);
+
+ return false;
+}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for blocks at
+ * or beyond its current size; those blocks will instead be synced to disk
+ * at commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+ RelWalSkip *walskip;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* get pending sync entry, create if not yet */
+ walskip = getWalSkipEntry(rel, true);
+
+ if (walskip == NULL)
+ return;
+
+ /*
+ * Record only the first registration.
+ */
+ if (walskip->skip_wal_min_blk != InvalidBlockNumber)
+ {
+ STORAGE_elog(DEBUG2, "WAL skipping for rel %u/%u/%u was already registered at block %u (new %u)",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, walskip->skip_wal_min_blk,
+ RelationGetNumberOfBlocks(rel));
+ return;
+ }
+
+ STORAGE_elog(DEBUG2, "registering new WAL skipping rel %u/%u/%u at block %u",
+ rel->rd_node.spcNode, rel->rd_node.dbNode,
+ rel->rd_node.relNode, RelationGetNumberOfBlocks(rel));
+
+ walskip->skip_wal_min_blk = RelationGetNumberOfBlocks(rel);
+}
+
+/*
+ * Record a commit-time file sync. This shouldn't be mixed with
+ * RecordWALSkipping.
+ */
+void
+RecordPendingSync(Relation rel, SMgrRelation targetsrel, ForkNumber forknum)
+{
+ RelWalSkip *walskip;
+
+ Assert(RelationNeedsWAL(rel));
+
+ /* check for support for this feature */
+ if (rel->rd_tableam == NULL ||
+ rel->rd_tableam->relation_register_walskip == NULL)
+ return;
+
+ walskip = getWalSkipEntryRNode(&targetsrel->smgr_rnode.node, true);
+ walskip->forks[forknum] = true;
+ walskip->skip_wal_min_blk = 0;
+ walskip->tableam = rel->rd_tableam;
+
+ STORAGE_elog(DEBUG2,
+ "registering new pending sync for rel %u/%u/%u at block %u",
+ walskip->relnode.spcNode, walskip->relnode.dbNode,
+ walskip->relnode.relNode, 0);
+}
+
+/*
+ * RelationInvalidateWALSkip() -- invalidate WAL-skip entry
+ */
+void
+RelationInvalidateWALSkip(Relation rel)
+{
+ RelWalSkip *walskip;
+
+ /* we know we don't have one */
+ if (rel->rd_nowalskip)
+ return;
+
+ walskip = getWalSkipEntry(rel, false);
+
+ if (!walskip)
+ return;
+
+ /*
+ * The state is reset at subtransaction commit/abort. No second
+ * invalidation request may come for the same relation in the same subtransaction.
+ */
+ Assert(walskip->invalidate_sxid == InvalidSubTransactionId);
+
+ walskip->invalidate_sxid = GetCurrentSubTransactionId();
+
+ STORAGE_elog(DEBUG2,
+ "WAL skip of rel %u/%u/%u invalidated by sxid %d",
+ walskip->relnode.spcNode, walskip->relnode.dbNode,
+ walskip->relnode.relNode, walskip->invalidate_sxid);
+}
+
+/*
+ * getWalSkipEntry: get WAL skip entry.
+ *
+ * Returns WAL skip entry for the relation. The entry tracks WAL-skipping
+ * blocks for the relation. The WAL-skipped blocks need fsync at commit time.
+ * Creates one if needed when create is true. If rel doesn't support this
+ * feature, returns NULL even if create is true.
+ */
+static inline RelWalSkip *
+getWalSkipEntry(Relation rel, bool create)
+{
+ RelWalSkip *walskip_entry = NULL;
+
+ if (rel->rd_walskip)
+ return rel->rd_walskip;
+
+ /* we know we don't have pending sync entry */
+ if (!create && rel->rd_nowalskip)
+ return NULL;
+
+ /* check for support for this feature */
+ if (rel->rd_tableam == NULL ||
+ rel->rd_tableam->relation_register_walskip == NULL)
+ {
+ rel->rd_nowalskip = true;
+ return NULL;
+ }
+
+ walskip_entry = getWalSkipEntryRNode(&rel->rd_node, create);
+
+ if (!walskip_entry)
+ {
+ /* prevent further hash lookup */
+ rel->rd_nowalskip = true;
+ return NULL;
+ }
+
+ walskip_entry->forks[MAIN_FORKNUM] = true;
+ walskip_entry->tableam = rel->rd_tableam;
+
+ /* hold shortcut in Relation */
+ rel->rd_nowalskip = false;
+ rel->rd_walskip = walskip_entry;
+
+ return walskip_entry;
+}
+
+/*
+ * getWalSkipEntryRNode: get WAL skip entry by rnode
+ *
+ * Returns a WAL skip entry for the RelFileNode.
+ */
+static RelWalSkip *
+getWalSkipEntryRNode(RelFileNode *rnode, bool create)
+{
+ RelWalSkip *walskip_entry = NULL;
+ bool found;
+
+ if (!walSkipHash)
+ {
+ /* First time through: initialize the hash table */
+ HASHCTL ctl;
+
+ if (!create)
+ return NULL;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(RelWalSkip);
+ ctl.hash = tag_hash;
+ walSkipHash = hash_create("pending relation sync table", 5,
+ &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
+ walskip_entry = (RelWalSkip *)
+ hash_search(walSkipHash, (void *) rnode,
+ create ? HASH_ENTER: HASH_FIND, &found);
+
+ if (!walskip_entry)
+ return NULL;
+
+ /* new entry created */
+ if (!found)
+ {
+ memset(&walskip_entry->forks, 0, sizeof(walskip_entry->forks));
+ walskip_entry->wal_log_min_blk = InvalidBlockNumber;
+ walskip_entry->skip_wal_min_blk = InvalidBlockNumber;
+ walskip_entry->create_sxid = GetCurrentSubTransactionId();
+ walskip_entry->invalidate_sxid = InvalidSubTransactionId;
+ walskip_entry->tableam = NULL;
+ }
+
+ return walskip_entry;
}
/*
@@ -506,6 +917,107 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/*
+ * Finish bulk inserts: sync files registered for WAL skipping at commit.
+ */
+void
+smgrFinishBulkInsert(bool isCommit)
+{
+ if (!walSkipHash)
+ return;
+
+ if (isCommit)
+ {
+ HASH_SEQ_STATUS status;
+ RelWalSkip *walskip;
+
+ hash_seq_init(&status, walSkipHash);
+
+ while ((walskip = hash_seq_search(&status)) != NULL)
+ {
+ /*
+ * On commit, process valid entries. Rollback doesn't need to sync
+ * the changes made during the transaction.
+ */
+ if (walskip->skip_wal_min_blk != InvalidBlockNumber &&
+ walskip->invalidate_sxid == InvalidSubTransactionId)
+ {
+ int f;
+
+ FlushRelationBuffersWithoutRelCache(walskip->relnode, false);
+
+ /*
+ * We mustn't create an entry when the table AM doesn't
+ * support WAL-skipping.
+ */
+ Assert (walskip->tableam->finish_bulk_insert);
+
+ /* flush all requested forks */
+ for (f = MAIN_FORKNUM ; f <= MAX_FORKNUM ; f++)
+ {
+ if (walskip->forks[f])
+ {
+ walskip->tableam->finish_bulk_insert(walskip->relnode, f);
+ STORAGE_elog(DEBUG2, "finishing bulk insert to rel %u/%u/%u fork %d",
+ walskip->relnode.spcNode,
+ walskip->relnode.dbNode,
+ walskip->relnode.relNode, f);
+ }
+ }
+ }
+ }
+ }
+
+ hash_destroy(walSkipHash);
+ walSkipHash = NULL;
+}
+
+/*
+ * Process pending invalidations of WAL skipping that happened in the subtransaction
+ */
+static void
+smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+ SubTransactionId parentSubid)
+{
+ HASH_SEQ_STATUS status;
+ RelWalSkip *walskip;
+
+ if (!walSkipHash)
+ return;
+
+ /* We expect that we don't have walSkipHash in almost all cases */
+ hash_seq_init(&status, walSkipHash);
+
+ while ((walskip = hash_seq_search(&status)) != NULL)
+ {
+ if (walskip->create_sxid == mySubid)
+ {
+ /*
+ * The entry was created in this subxact. Remove it on abort, or
+ * on commit after invalidation.
+ */
+ if (!isCommit || walskip->invalidate_sxid == mySubid)
+ hash_search(walSkipHash, &walskip->relnode,
+ HASH_REMOVE, NULL);
+ /* Treat committing valid entry as creation by the parent. */
+ else if (walskip->invalidate_sxid == InvalidSubTransactionId)
+ walskip->create_sxid = parentSubid;
+ }
+ else if (walskip->invalidate_sxid == mySubid)
+ {
+ /*
+ * This entry was created elsewhere and then invalidated by this
+ * subxact. Treat commit as invalidation by the parent. Otherwise
+ * cancel invalidation.
+ */
+ if (isCommit)
+ walskip->invalidate_sxid = parentSubid;
+ else
+ walskip->invalidate_sxid = InvalidSubTransactionId;
+ }
+ }
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
@@ -535,7 +1047,7 @@ PostPrepare_smgr(void)
* Reassign all items in the pending-deletes list to the parent transaction.
*/
void
-AtSubCommit_smgr(void)
+AtSubCommit_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
{
int nestLevel = GetCurrentTransactionNestLevel();
PendingRelDelete *pending;
@@ -545,6 +1057,9 @@ AtSubCommit_smgr(void)
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
}
+
+ /* Remove invalidated WAL skip in this subtransaction */
+ smgrProcessWALSkipInval(true, mySubid, parentSubid);
}
/*
@@ -555,9 +1070,12 @@ AtSubCommit_smgr(void)
* subtransaction will not commit.
*/
void
-AtSubAbort_smgr(void)
+AtSubAbort_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
{
smgrDoPendingDeletes(false);
+
+ /* Remove invalidated WAL skip in this subtransaction */
+ smgrProcessWALSkipInval(false, mySubid, parentSubid);
}
void
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e842f9152b..013eb203f4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -12452,8 +12452,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
- RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
/* copy those extra forks that exist */
for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -12471,8 +12470,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM))
log_smgrcreate(&newrnode, forkNum);
- RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, forkNum);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 887023fc8a..0c6598d9af 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BufferAccessStrategy strategy,
bool *foundPtr);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,40 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
/* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3183,7 +3204,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3213,18 +3234,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 64f3c2e887..f06d55a8fe 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
#include "partitioning/partdesc.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
#include "utils/array.h"
@@ -5644,6 +5645,8 @@ load_relcache_init_file(bool shared)
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+ rel->rd_nowalskip = false;
+ rel->rd_walskip = NULL;
/*
* Recompute lock and physical addressing info. This is needed in
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 882dc65c89..83fee7dbfe 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,8 +23,14 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
- ForkNumber forkNum, char relpersistence);
+extern void RelationCopyStorage(Relation srcrel, SMgrRelation dst,
+ ForkNumber forkNum);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern void RecordWALSkipping(Relation rel);
+extern void RecordPendingSync(Relation rel, SMgrRelation srel,
+ ForkNumber forknum);
+extern void RelationInvalidateWALSkip(Relation rel);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
@@ -32,8 +38,11 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
-extern void AtSubCommit_smgr(void);
-extern void AtSubAbort_smgr(void);
+extern void smgrFinishBulkInsert(bool isCommit);
+extern void AtSubCommit_smgr(SubTransactionId mySubid,
+ SubTransactionId parentSubid);
+extern void AtSubAbort_smgr(SubTransactionId mySubid,
+ SubTransactionId parentSubid);
extern void PostPrepare_smgr(void);
#endif /* STORAGE_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 89a7fbf73a..0adc2aba06 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,13 @@ typedef struct RelationData
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /*
+ * rd_nowalskip is true if this relation is known not to skip WAL.
+ * Otherwise we need to ask smgr for an entry if rd_walskip is NULL.
+ */
+ bool rd_nowalskip;
+ struct RelWalSkip *rd_walskip;
} RelationData;
--
2.16.3
Attachment: v11-0006-Fix-WAL-skipping-feature.patch (text/x-patch)
From 5f0b1c61b7f73b08000a5b4288662b13e6fe51f4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:29:23 +0900
Subject: [PATCH 6/7] Fix WAL skipping feature.
This patch replaces the WAL-skipping mechanism based on HEAP_INSERT_SKIP_WAL
with the new infrastructure.
---
src/backend/access/heap/heapam.c | 133 ++++++++++++++++++++++++-------
src/backend/access/heap/heapam_handler.c | 87 +++++++++++++++-----
src/backend/access/heap/pruneheap.c | 3 +-
src/backend/access/heap/rewriteheap.c | 28 ++-----
src/backend/access/heap/vacuumlazy.c | 6 +-
src/backend/access/heap/visibilitymap.c | 3 +-
src/backend/commands/cluster.c | 27 +++++++
src/backend/commands/copy.c | 15 +++-
src/backend/commands/createas.c | 7 +-
src/backend/commands/matview.c | 7 +-
src/backend/commands/tablecmds.c | 8 +-
src/include/access/rewriteheap.h | 2 +-
12 files changed, 237 insertions(+), 89 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 223be30eb3..ae70798b3c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,27 @@
* the POSTGRES heap access method used for all POSTGRES
* relations.
*
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged, but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or WAL
+ * archival purposes (i.e. if wal_level=minimal), and we fsync() the file
+ * to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transaction, because for
+ * a small number of changes, it's cheaper to just create the WAL records
+ * than fsync()ing the whole relation at COMMIT. It is only worthwhile for
+ * (presumably) large operations like COPY, CLUSTER, or VACUUM FULL. Use
+ * table_relation_register_walskip() to initiate such an operation; it will
+ * cause any subsequent updates to the table to skip WAL-logging, if
+ * possible, and cause the heap to be synced to disk at COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
@@ -51,6 +72,7 @@
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
@@ -1948,7 +1970,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2065,7 +2087,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
int ndone;
PGAlignedBlock scratch;
Page page;
- bool needwal;
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2073,7 +2094,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -2122,6 +2142,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
+ bool needwal;
CHECK_FOR_INTERRUPTS();
@@ -2133,6 +2154,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);
page = BufferGetPage(buffer);
+ needwal = BufferNeedsWAL(relation, buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2681,7 +2703,7 @@ l1:
* NB: heap_abort_speculative() uses the same xlog record and replay
* routines.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
@@ -2820,6 +2842,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
vmbuffer = InvalidBuffer,
vmbuffer_new = InvalidBuffer;
bool need_toast;
+ bool oldbuf_needs_wal,
+ newbuf_needs_wal;
Size newtupsize,
pagefree;
bool have_tuple_lock = false;
@@ -3371,7 +3395,7 @@ l2:
MarkBufferDirty(buffer);
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -3585,26 +3609,74 @@ l2:
MarkBufferDirty(newbuf);
MarkBufferDirty(buffer);
- /* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ /*
+ * XLOG stuff
+ *
+ * Emit a heap-update record. When wal_level = minimal, we may instead emit
+ * an insert or delete record, depending on the WAL-skip optimization.
+ */
+ oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+ if (newbuf == buffer)
+ newbuf_needs_wal = oldbuf_needs_wal;
+ else
+ newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+ if (oldbuf_needs_wal || newbuf_needs_wal)
{
XLogRecPtr recptr;
/*
* For logical decoding we need combocids to properly decode the
- * catalog.
+ * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+ * when logical decoding is active.
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
{
+ Assert(oldbuf_needs_wal && newbuf_needs_wal);
+
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
}
- recptr = log_heap_update(relation, buffer,
- newbuf, &oldtup, heaptup,
- old_key_tuple,
- all_visible_cleared,
- all_visible_cleared_new);
+ /*
+ * Insert the log record. When WAL-skipping is not in effect, always use
+ * an update record. Otherwise, use a delete or insert record instead
+ * when only one of the two buffers needs WAL-logging. If this were a
+ * HOT update, redoing such a record would result in a broken HOT chain;
+ * however, that never happens, because an update completed on a single
+ * page always uses log_heap_update.
+ *
+ * Using a delete or insert record in place of an update record leads to
+ * an inconsistent series of WAL records. But note that WAL-skipping
+ * happens only when we are updating a tuple in a relation that has
+ * been created in the same transaction. Once committed, the WAL records
+ * reproduce the same state of the relation as the state synced at
+ * commit. And a relation possibly left broken by a crash before commit
+ * will simply be removed during recovery.
+ */
+ if (oldbuf_needs_wal && newbuf_needs_wal)
+ recptr = log_heap_update(relation, buffer, newbuf,
+ &oldtup, heaptup,
+ old_key_tuple,
+ all_visible_cleared,
+ all_visible_cleared_new);
+ else if (oldbuf_needs_wal)
+ recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+ xmax_old_tuple, false,
+ all_visible_cleared);
+ else
+ {
+ /*
+ * Coming here means that the old tuple is invisible to, and cannot
+ * be operated on by, any other transaction. So xmax_new_tuple is
+ * expected to be InvalidTransactionId here.
+ */
+ Assert (xmax_new_tuple == InvalidTransactionId);
+ recptr = log_heap_insert(relation, buffer, newtup,
+ 0, all_visible_cleared_new);
+ }
+
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4482,7 +4554,7 @@ failed:
* (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
* entries for everything anyway.)
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, *buffer))
{
xl_heap_lock xlrec;
XLogRecPtr recptr;
@@ -5234,7 +5306,7 @@ l4:
MarkBufferDirty(buf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, buf))
{
xl_heap_lock_updated xlrec;
XLogRecPtr recptr;
@@ -5394,7 +5466,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
htup->t_ctid = *tid;
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_confirm xlrec;
XLogRecPtr recptr;
@@ -5526,7 +5598,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
* The WAL records generated here match heap_delete(). The same recovery
* routines are used.
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
@@ -5635,7 +5707,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
@@ -7045,8 +7117,8 @@ log_heap_clean(Relation reln, Buffer buffer,
xl_heap_clean xlrec;
XLogRecPtr recptr;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me on non-WAL-logged buffers */
+ Assert(BufferNeedsWAL(reln, buffer));
xlrec.latestRemovedXid = latestRemovedXid;
xlrec.nredirected = nredirected;
@@ -7093,8 +7165,8 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me on non-WAL-logged buffers */
+ Assert(BufferNeedsWAL(reln, buffer));
/* nor when there are no tuples to freeze */
Assert(ntuples > 0);
@@ -7309,8 +7381,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
bool init;
int bufflags;
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
+ /* Caller should not call me unless both buffers need WAL-logging */
+ Assert(BufferNeedsWAL(reln, newbuf) && BufferNeedsWAL(reln, oldbuf));
XLogBeginInsert();
@@ -8914,9 +8986,16 @@ heap2_redo(XLogReaderState *record)
* heap_sync - sync a heap, for use when no WAL has been written
*
* This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. table_relation_register_sync() should
+ * be used for that purpose instead.
*
* Indexes are not touched. (Currently, index operations associated with
* the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index add0d65f81..0c763f3a33 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -58,6 +58,8 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
OffsetNumber tupoffset);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
+static void heapam_relation_register_walskip(Relation rel);
+static void heapam_relation_invalidate_walskip(Relation rel);
static const TableAmRoutine heapam_methods;
@@ -543,14 +545,10 @@ tuple_lock_retry:
}
static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_finish_bulk_insert(RelFileNode rnode, ForkNumber forkNum)
{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
+ /* Sync the file immediately */
+ smgrimmedsync(smgropen(rnode, InvalidBackendId), forkNum);
}
@@ -618,6 +616,12 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
dstrel = smgropen(newrnode, rel->rd_backend);
RelationOpenSmgr(rel);
+ /*
+ * Register WAL-skipping for the relation. If the AM supports the feature,
+ * WAL-logging is skipped and the file is synced at commit.
+ */
+ table_relation_register_walskip(rel);
+
/*
* Create and copy all forks of the relation, and schedule unlinking of
* old physical files.
@@ -628,8 +632,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
/* copy main fork */
- RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
/* copy those extra forks that exist */
for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -647,8 +650,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM))
log_smgrcreate(&newrnode, forkNum);
- RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
- rel->rd_rel->relpersistence);
+ RelationCopyStorage(rel, dstrel, forkNum);
}
}
@@ -672,7 +674,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -686,15 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
- Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
-
/* Preallocate values/isnull arrays */
natts = newTupDesc->natts;
values = (Datum *) palloc(natts * sizeof(Datum));
@@ -702,7 +694,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
- MultiXactCutoff, use_wal);
+ MultiXactCutoff);
/* Set up sorting if wanted */
@@ -948,6 +940,55 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pfree(isnull);
}
+/*
+ * heapam_relation_register_walskip - register a heap to be WAL-skipped then
+ * synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file. This makes
+ * note of the current size of the relation, and ensures that when the
+ * relation is extended, any changes to the new blocks in the heap, in the
+ * same transaction, will not be WAL-logged. Instead, the heap contents are
+ * flushed to disk at commit.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+static void
+heapam_relation_register_walskip(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RecordWALSkipping(rel);
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ RecordWALSkipping(toastrel);
+ heap_close(toastrel, AccessShareLock);
+ }
+
+ return;
+}
+
+/*
+ * heapam_relation_invalidate_walskip - invalidate registered WAL skipping
+ *
+ * After some file-replacing operations like CLUSTER, the old file no longer
+ * needs to be synced to disk. This function invalidates the registered
+ * WAL-skipping on the current relfilenode of the relation.
+ */
+static void
+heapam_relation_invalidate_walskip(Relation rel)
+{
+ /* non-WAL-logged tables never need fsync */
+ if (!RelationNeedsWAL(rel))
+ return;
+
+ RelationInvalidateWALSkip(rel);
+}
+
static bool
heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
BufferAccessStrategy bstrategy)
@@ -2531,6 +2572,8 @@ static const TableAmRoutine heapam_methods = {
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
+ .relation_register_walskip = heapam_relation_register_walskip,
+ .relation_invalidate_walskip = heapam_relation_invalidate_walskip,
.relation_vacuum = heap_vacuum_rel,
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/*
* Emit a WAL HEAP_CLEAN record showing what we did
*/
- if (RelationNeedsWAL(relation))
+ if (BufferNeedsWAL(relation, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
#include "access/xloginsert.h"
#include "catalog/catalog.h"
+#include "catalog/storage.h"
#include "lib/ilist.h"
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
(char *) state->rs_buffer, true);
}
- /*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
- * reason is the same as in tablecmds.c's copy_relation_data(): we're
- * writing data that's not in shared buffers, and so a CHECKPOINT
- * occurring during the rewriteheap operation won't have fsync'd data we
- * wrote before the checkpoint.
- */
- if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ /* If we skipped using WAL, we will sync the relation at commit */
logical_end_heap_rewrite(state);
@@ -654,9 +639,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c9d83128d5..3d8d01b10f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -959,7 +959,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
* page has been previously WAL-logged, and if not, do that
* now.
*/
- if (RelationNeedsWAL(onerel) &&
+ if (BufferNeedsWAL(onerel, buf) &&
PageGetLSN(page) == InvalidXLogRecPtr)
log_newpage_buffer(buf, true);
@@ -1233,7 +1233,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buf))
{
XLogRecPtr recptr;
@@ -1644,7 +1644,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (RelationNeedsWAL(onerel))
+ if (BufferNeedsWAL(onerel, buffer))
{
XLogRecPtr recptr;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
#include "access/heapam_xlog.h"
#include "access/visibilitymap.h"
#include "access/xlog.h"
+#include "catalog/storage.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map[mapByte] |= (flags << mapOffset);
MarkBufferDirty(vmBuf);
- if (RelationNeedsWAL(rel))
+ if (BufferNeedsWAL(rel, heapBuf))
{
if (XLogRecPtrIsInvalid(recptr))
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4f4be1efbf..b5db26fda5 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -612,6 +612,18 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence,
AccessExclusiveLock);
+ /*
+ * If wal_level is minimal, we skip WAL-logging even for otherwise
+ * WAL-logged relations. The relfilenode is synced at commit.
+ */
+ if (!XLogIsNeeded())
+ {
+ /* make_new_heap doesn't lock OIDNewHeap */
+ Relation newheap = table_open(OIDNewHeap, AccessShareLock);
+ table_relation_register_walskip(newheap);
+ table_close(newheap, AccessShareLock);
+ }
+
/* Copy the heap data into the new table in the desired order */
copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -1355,6 +1367,21 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
+ /*
+ * Unregister the now-useless pending file sync. table_relation_invalidate_walskip
+ * relies on the premise that the relation cache has the correct relfilenode
+ * and related members. After swap_relation_files, the relcache entries for
+ * the heaps become inconsistent with their pg_class entries, so we must do
+ * this before that call.
+ */
+ if (!XLogIsNeeded())
+ {
+ Relation oldheap = table_open(OIDOldHeap, AccessShareLock);
+
+ table_relation_invalidate_walskip(oldheap);
+ table_close(oldheap, AccessShareLock);
+ }
+
/*
* Swap the contents of the heap relations (including any toast tables).
* Also set old heap's relfrozenxid to frozenXid.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c39218f8db..046acc9fbf 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2762,9 +2762,13 @@ CopyFrom(CopyState cstate)
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
{
- ti_options |= TABLE_INSERT_SKIP_FSM;
+ /*
+ * We can skip WAL-logging the insertions, unless PITR or streaming
+ * replication is in use. We can skip the FSM in any case.
+ */
if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
+ table_relation_register_walskip(cstate->rel);
+ ti_options |= TABLE_INSERT_SKIP_FSM;
}
/*
@@ -3369,7 +3373,12 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
- table_finish_bulk_insert(cstate->rel, ti_options);
+ /*
+ * If we skipped writing WAL, then we will sync the heap at the end of
+ * the transaction. (We used to do it here, but it was later found out
+ * that to be safe, we must also avoid WAL-logging any subsequent
+ * actions on the pages we skipped WAL for). Indexes always use WAL.
+ */
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..8b73654413 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ if (!XLogIsNeeded())
+ table_relation_register_walskip(intoRelationDesc);
+ myState->ti_options = HEAP_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,7 +605,7 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 2aac63296b..33b7bc4c16 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -462,9 +462,10 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
+ table_relation_register_walskip(transientrel);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
+
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,7 +510,7 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
+ /* If we skipped using WAL, we will sync the relation at commit */
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 013eb203f4..85555f87fb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4728,7 +4728,11 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
ti_options = TABLE_INSERT_SKIP_FSM;
if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
+ {
+ /* Forget the old relation's registered sync */
+ table_relation_invalidate_walskip(oldrel);
+ table_relation_register_walskip(newrel);
+ }
}
else
{
@@ -5012,7 +5016,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
+ /* If we skipped writing WAL, then it will be done at commit. */
table_close(newrel, NoLock);
}
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
--
2.16.3
Attachment: v11-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch (text/x-patch)
From e3d5ca858c56678bb0ee6fbd9d9e89bef17667bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:31:33 +0900
Subject: [PATCH 7/7] Remove TABLE/HEAP_INSERT_SKIP_WAL
Remove no-longer-used symbol TABLE/HEAP_INSERT_SKIP_WAL.
---
src/include/access/heapam.h | 3 +--
src/include/access/tableam.h | 11 +++--------
2 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 77e5e603b0..f632e2758d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
typedef struct BulkInsertStateData *BulkInsertState;
struct TupleTableSlot;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 38a00d8823..9840bf0258 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -103,10 +103,9 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
-#define TABLE_INSERT_SKIP_FSM 0x0002
-#define TABLE_INSERT_FROZEN 0x0004
-#define TABLE_INSERT_NO_LOGICAL 0x0008
+#define TABLE_INSERT_SKIP_FSM 0x0001
+#define TABLE_INSERT_FROZEN 0x0002
+#define TABLE_INSERT_NO_LOGICAL 0x0004
/* flag bits for table_lock_tuple */
/* Follow tuples whose update is in progress if lock modes don't conflict */
@@ -1025,10 +1024,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
--
2.16.3
On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
I also liked the design in the /messages/by-id/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.
Now I am asking for that. Would anyone like to try implementing that other
design, to see how much simpler it would be?
Anyone? I've been deferring review of v10 and v11 in hopes of seeing the
above-described patch first.
Hello.
At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah@leadboat.com> wrote in <20190513003705.GA1202614@rfd.leadboat.com>
On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
I also liked the design in the /messages/by-id/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.
Now I am asking for that. Would anyone like to try implementing that other
design, to see how much simpler it would be?
Yeah, I think it is a bit too-complex for the value. But I think
it is the best way as far as we keep reusing a file on
truncation of the whole file.
Anyone? I've been deferring review of v10 and v11 in hopes of seeing the
above-described patch first.
The significant portion of the complexity in this patch comes
from the need to behave differently per block, according to the
remembered logged and truncated block numbers.
0005:
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
If this consideration holds, and given the WAL-skip and truncation
optimizations, there's no way to avoid the per-block behavior as
long as we allow a mixture of WAL-logged modifications and
WAL-skipped COPY on the same relation within a transaction.
We could avoid the per-block behavior by making the WAL inhibition
per-relation. That would reduce the patch size by the BufferNeedsWAL
and log_heap_update changes, but not by much. The rules would be
(a rough sketch follows the list):
- inhibit WAL-skipping after any WAL-logged modification in the relation.
- inhibit WAL-logging after any WAL-skipped modification in the relation.
- WAL-skipped relations are synced at commit time.
- truncation of a WAL-skipped relation creates a new relfilenode.
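To illustrate, a rough, untested sketch of how those per-relation rules could
be expressed; the struct and field names here are invented for illustration
and appear in none of the attached patches:

#include "postgres.h"
#include "utils/rel.h"			/* Relation, RelationNeedsWAL() */

/* Hypothetical per-relation, per-transaction state. */
typedef struct RelWalState
{
	bool		xact_logged;	/* some change in this xact was WAL-logged */
	bool		xact_skipped;	/* some change in this xact skipped WAL */
} RelWalState;

/* Rule 1: once anything in the relation was WAL-logged, don't start skipping. */
static bool
walskip_allowed(const RelWalState *st)
{
	return !st->xact_logged;
}

/*
 * Rule 2: once anything in the relation skipped WAL, stop WAL-logging it;
 * rule 3 then syncs the whole relation at commit instead of relying on WAL.
 */
static bool
rel_needs_wal(Relation rel, const RelWalState *st)
{
	return RelationNeedsWAL(rel) && !st->xact_skipped;
}

Rule 4 (truncation of a WAL-skipped relation allocates a new relfilenode)
would be enforced in the TRUNCATE path rather than here. With only these two
flags the per-block BufferNeedsWAL() checks would collapse back to a single
per-relation test, at the cost of giving up WAL-skipping as soon as the
relation sees one WAL-logged change.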
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, May 14, 2019 at 01:59:10PM +0900, Kyotaro HORIGUCHI wrote:
At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah@leadboat.com> wrote in <20190513003705.GA1202614@rfd.leadboat.com>
On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
I also liked the design in the /messages/by-id/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.
Now I am asking for that. Would anyone like to try implementing that other
design, to see how much simpler it would be?
Yeah, I think it is a bit too-complex for the value. But I think
it is the best way as far as we keep reusing a file on
truncation of the whole file.
The design of v11-0006-Fix-WAL-skipping-feature.patch doesn't, in general,
work for WAL records touching more than one buffer. For heapam, that patch
works around this problem by emitting XLOG_HEAP_INSERT or XLOG_HEAP_DELETE
when we'd normally emit XLOG_HEAP_UPDATE. As a result, post-crash-recovery
heap page bits differ from the bits present when we don't crash. Though I'm
85% confident this does not introduce a bug today, this is fragile. That is
the main complexity I wish to avoid.
I suspect the design in the /messages/by-id/559FA0BA.3080808@iki.fi last
paragraph will be simpler, not more complex. In the implementation I'm
envisioning, smgrDoPendingDeletes() would change name, perhaps to
AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
durability by syncing (for large nodes) or by WAL-logging each page (for small
nodes). RelationNeedsWAL() would return false whenever the applicable
relfilenode appears in pendingDeletes. Access methods would remove their
smgrimmedsync() calls, but they would otherwise not change. Would anyone like
to try implementing that?
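For concreteness, a rough, untested sketch of what such an end-of-transaction
routine could look like (assuming the usual catalog/storage.c and bufmgr
includes). The pendingSyncs list, the PendingRelSync struct and the size
threshold are invented for illustration; FlushRelationBuffersWithoutRelCache
is the helper added by the 0005 patch earlier in this thread, and the
pending-delete processing itself is omitted:

#define WAL_SKIP_SYNC_THRESHOLD		64	/* blocks; arbitrary cutoff */

static void
AtEOXact_Storage(bool isCommit)
{
	ListCell   *cell;

	if (!isCommit)
		return;					/* on abort, the new files get deleted anyway */

	/* pendingSyncs: assumed List of relfilenodes whose changes skipped WAL */
	foreach(cell, pendingSyncs)
	{
		PendingRelSync *pending = (PendingRelSync *) lfirst(cell);
		SMgrRelation srel = smgropen(pending->rnode, InvalidBackendId);
		BlockNumber nblocks = smgrnblocks(srel, MAIN_FORKNUM);
		BlockNumber blkno;

		if (nblocks >= WAL_SKIP_SYNC_THRESHOLD)
		{
			/* Large node: write out dirty buffers and fsync the file. */
			FlushRelationBuffersWithoutRelCache(pending->rnode, false);
			smgrimmedsync(srel, MAIN_FORKNUM);
		}
		else
		{
			/* Small node: cheaper to WAL-log a full image of every page. */
			for (blkno = 0; blkno < nblocks; blkno++)
			{
				Buffer		buf;

				buf = ReadBufferWithoutRelcache(pending->rnode, MAIN_FORKNUM,
												blkno, RBM_NORMAL, NULL);
				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
				START_CRIT_SECTION();
				MarkBufferDirty(buf);
				log_newpage_buffer(buf, false);
				END_CRIT_SECTION();
				UnlockReleaseBuffer(buf);
			}
		}
	}
}

Whether 64 blocks is the right cutoff, and how this interacts with the
pending-delete list itself, are details a real patch would have to settle.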
Hello.
At Thu, 16 May 2019 23:50:50 -0700, Noah Misch <noah@leadboat.com> wrote in <20190517065050.GA1298884@rfd.leadboat.com>
On Tue, May 14, 2019 at 01:59:10PM +0900, Kyotaro HORIGUCHI wrote:
At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah@leadboat.com> wrote in <20190513003705.GA1202614@rfd.leadboat.com>
On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
I also liked the design in the /messages/by-id/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.
Now I am asking for that. Would anyone like to try implementing that other
design, to see how much simpler it would be?
Yeah, I think it is a bit too-complex for the value. But I think
it is the best way as far as we keep reusing a file on
truncation of the whole file.
The design of v11-0006-Fix-WAL-skipping-feature.patch doesn't, in general,
work for WAL records touching more than one buffer. For heapam, that patch
works around this problem by emitting XLOG_HEAP_INSERT or XLOG_HEAP_DELETE
when we'd normally emit XLOG_HEAP_UPDATE. As a result, post-crash-recovery
heap page bits differ from the bits present when we don't crash. Though I'm
85% confident this does not introduce a bug today, this is fragile. That is
the main complexity I wish to avoid.
Ok, I see your point. The same issue happens even more often on index
pages. I didn't allow WAL-skipping on indexes for that reason.
I suspect the design in the /messages/by-id/559FA0BA.3080808@iki.fi last
paragraph will be simpler, not more complex. In the implementation I'm
envisioning, smgrDoPendingDeletes() would change name, perhaps to
AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
durability by syncing (for large nodes) or by WAL-logging each page (for small
nodes). RelationNeedsWAL() would return false whenever the applicable
relfilenode appears in pendingDeletes. Access methods would remove their
smgrimmedsync() calls, but they would otherwise not change. Would anyone like
to try implementing that?
Following this direction, the attached PoC works, at least for
the wal_optimization TAP tests, but it does the pending flush in
the relcache rather than in smgr. It also extends the WAL-skip
feature to indexes, which makes the old 0002 patch for nbtree unnecessary.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v12-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From ebca88dea9f9458cbd58f15e370ff3fc8fbd371b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# TRUNCATE and COPY take an optimized WAL-logging path in some cases, which
+# can interact badly with other optimizations depending on the wal_level
+# setting, particularly "minimal" and "replica". The optimization may be
+# enabled or disabled depending on the scenario tested here, and should
+# never result in any kind of failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Create a new node using the wal_level under test.
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in rolled-back subtransaction");
+
+ # Like the previous test, but with other subtransaction patterns (RELEASE and nested savepoints).
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in released subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in nested subtransactions");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
Attachment: v12-0002-Fix-WAL-skipping-feature.patch (text/x-patch)
From 3859609090a274fc1ba59964f3819d19217bd8ef Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature
This patch is a PoC of how to change the WAL-skipping feature to avoid
table corruption caused by mixing WAL-logged and WAL-skipped
operations.
---
src/backend/access/heap/heapam.c | 4 ++--
src/backend/access/heap/heapam_handler.c | 7 +------
src/backend/access/heap/rewriteheap.c | 3 ---
src/backend/access/transam/xact.c | 6 ++++++
src/backend/commands/copy.c | 4 ----
src/backend/commands/createas.c | 3 +--
src/backend/commands/tablecmds.c | 2 --
src/backend/utils/cache/relcache.c | 22 ++++++++++++++++++++++
src/include/access/heapam.h | 1 -
src/include/utils/rel.h | 3 ++-
src/include/utils/relcache.h | 1 +
11 files changed, 35 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 19d2c529d8..dda76c8736 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8d8161fd97..f4af981a35 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -560,12 +560,7 @@ tuple_lock_retry:
static void
heapam_finish_bulk_insert(Relation relation, int options)
{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
+ /* heapam doesn't need to do anything here */
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 20feeec327..fb35992a13 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2133,6 +2133,9 @@ CommitTransaction(void)
/* Commit updates to the relation map --- do this as late as possible */
AtEOXact_RelationMap(true, is_parallel_worker);
+ /* Perform pending flush */
+ AtEOXact_DoPendingFlush();
+
/*
* set the current transaction state information appropriately during
* commit processing
@@ -2349,6 +2352,9 @@ PrepareTransaction(void)
*/
PreCommit_CheckForSerializationFailure();
+ /* Perform pending flush */
+ AtEOXact_DoPendingFlush();
+
/* NOTIFY will be handled below */
/*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6ffc3a62f6..9bae04b8a7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2761,11 +2761,7 @@ CopyFrom(CopyState cstate)
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..83e5f9220f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bfcf9472d7..b686497443 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4741,8 +4741,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d0f6f715e6..10fd405171 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2913,6 +2913,28 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
}
+/*
+ * AtEOXact_DoPendingFlush
+ *		PoC: scan the relcache and sync relations whose changes may have
+ *		skipped WAL in this transaction.
+ */
+void
+AtEOXact_DoPendingFlush(void)
+{
+ HASH_SEQ_STATUS status;
+ RelIdCacheEnt *idhentry;
+
+ if (!RelationIdCache)
+ return;
+
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Relation rel = idhentry->reldesc;
+ if (RELATION_IS_LOCAL(rel) && !XLogIsNeeded() && rel->rd_smgr)
+ {
+ FlushRelationBuffers(rel);
+ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ }
+ }
+}
+
+
/*
* AtEOXact_RelationCache
*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 62aaa08eff..0fb7d86bf2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..41ab634ff5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -514,7 +514,8 @@ typedef struct ViewOptions
* True if relation needs WAL.
*/
#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ !(RELATION_IS_LOCAL(relation) && !XLogIsNeeded()))
/*
* RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 364495a5f0..cd9b1a6f68 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -123,6 +123,7 @@ extern void RelationCloseSmgrByOid(Oid relationId);
extern void AtEOXact_RelationCache(bool isCommit);
extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
+extern void AtEOXact_DoPendingFlush(void);
/*
* Routines to help manage rebuilding of relcache init files
--
2.16.3
Hello.
At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190520.155430.215084510.horiguchi.kyotaro@lab.ntt.co.jp>
I suspect the design in the /messages/by-id/559FA0BA.3080808@iki.fi last
paragraph will be simpler, not more complex. In the implementation I'm
envisioning, smgrDoPendingDeletes() would change name, perhaps to
AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
durability by syncing (for large nodes) or by WAL-logging each page (for small
nodes). RelationNeedsWAL() would return false whenever the applicable
relfilenode appears in pendingDeletes. Access methods would remove their
smgrimmedsync() calls, but they would otherwise not change. Would anyone like
to try implementing that?
Following this direction, the attached PoC works *at least for*
the wal_optimization TAP tests, but doing pending flush not in
smgr but in relcache. This is extending the skip-wal feature to
indexes. And makes the old 0002 patch on nbtree useless.
This is a tidier version of the patch.
- Passes regression tests, including 018_wal_optimize.pl.
- Moves the substantial work to table/index AMs.
  Each AM can decide whether to support WAL skipping or not.
  Currently heap and nbtree support it.
- The timing of the sync is moved from AtEOXact to PreCommit,
  because heap_sync() needs the xact state to be INPROGRESS.
- matview and cluster are broken, since swapping to the new
  relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
  that.
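
To make the division of labor concrete, here is a condensed sketch of the
interface the attached v13 patch introduces. The names come from the
attachment, but the bodies are simplified and omit the relcache bookkeeping;
see the patch itself for the complete version.

    /* Optional table AM callback, run at commit for WAL-skipped relations. */
    static void
    heapam_at_commit_sync(Relation relation)
    {
        heap_sync(relation);    /* flush dirty buffers, then fsync */
    }

    /* nbtree registers the equivalent hook for indexes. */
    void
    btatcommitsync(Relation index)
    {
        FlushRelationBuffers(index);
        smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
    }

    /*
     * WAL is skipped only when the AM registered such a callback
     * (rd_can_skipwal), the relation is transaction-local, and
     * wal_level = minimal.
     */
    #define RelationNeedsWAL(relation) \
        ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
         (!(relation)->rd_can_skipwal || \
          !(RELATION_IS_LOCAL(relation) && !XLogIsNeeded())))

PreCommit_RelationSync() then walks the relcache just before commit and calls
the callback for every relation for which WAL was skipped, so the data reaches
disk even though it was never logged.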
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v13-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 680462288cb82da23c19a02239787fc1ea08cdde Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v13-0002-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From 75b90a8020275af6ee5e6ee5a4433c5582bd9148 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all; such
relations are instead synced at commit.
---
src/backend/access/brin/brin.c | 2 +
src/backend/access/gin/ginutil.c | 2 +
src/backend/access/gist/gist.c | 2 +
src/backend/access/hash/hash.c | 2 +
src/backend/access/heap/heapam.c | 8 +--
src/backend/access/heap/heapam_handler.c | 15 +++---
src/backend/access/heap/rewriteheap.c | 3 --
src/backend/access/index/indexam.c | 16 ++++++
src/backend/access/nbtree/nbtree.c | 13 +++++
src/backend/access/transam/xact.c | 6 +++
src/backend/commands/copy.c | 6 ---
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 --
src/backend/commands/tablecmds.c | 4 --
src/backend/utils/cache/relcache.c | 87 ++++++++++++++++++++++++++++++++
src/include/access/amapi.h | 8 +++
src/include/access/genam.h | 1 +
src/include/access/heapam.h | 1 -
src/include/access/nbtree.h | 1 +
src/include/access/tableam.h | 36 +++++++------
src/include/utils/rel.h | 21 +++++++-
src/include/utils/relcache.h | 1 +
22 files changed, 188 insertions(+), 56 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index aba234c0af..681520852f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -125,6 +125,8 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..f4f0eebec5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -77,6 +77,8 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index d70a138f54..3a23e7c4b2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -99,6 +99,8 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 048e40e46f..3fa8262319 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -98,6 +98,8 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 19d2c529d8..7f78122b81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -8906,10 +8906,6 @@ heap2_redo(XLogReaderState *record)
void
heap_sync(Relation rel)
{
- /* non-WAL-logged tables never need fsync */
- if (!RelationNeedsWAL(rel))
- return;
-
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8d8161fd97..a2e1464845 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -557,15 +557,14 @@ tuple_lock_retry:
return result;
}
+/* ------------------------------------------------------------------------
+ * WAL-skipping related routine
+ * ------------------------------------------------------------------------
+ */
static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_at_commit_sync(Relation relation)
{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
+ heap_sync(relation);
}
@@ -2573,7 +2572,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
+ .at_commit_sync = heapam_at_commit_sync,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0fc9139bad..1d089603b7 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_at_commit_sync - perform at_commit_sync
*
* NOTES
* This file contains the index_ routines which used
@@ -837,6 +838,21 @@ index_getprocinfo(Relation irel,
return locinfo;
}
+/* ----------------
+ * index_at_commit_sync
+ *
+ * This routine performs the at-commit sync of index storage. It is called
+ * when a permanent index created in the current transaction is committed.
+ * ----------------
+ */
+void
+index_at_commit_sync(Relation irel)
+{
+ Assert(irel->rd_indam != NULL && irel->rd_indam->amatcommitsync != NULL);
+
+ irel->rd_indam->amatcommitsync(irel);
+}
+
/* ----------------
* index_store_float8_orderby_distances
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 02fb352b94..39377f35eb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,8 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = btinitparallelscan;
amroutine->amparallelrescan = btparallelrescan;
+ amroutine->amatcommitsync = btatcommitsync;
+
PG_RETURN_POINTER(amroutine);
}
@@ -1385,3 +1387,14 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+/*
+ * btatcommitsync() -- Perform at-commit sync of WAL-skipped index
+ */
+void
+btatcommitsync(Relation index)
+{
+ FlushRelationBuffers(index);
+ smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
+}
+
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 20feeec327..bc38a53195 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2120,6 +2120,9 @@ CommitTransaction(void)
if (!is_parallel_worker)
PreCommit_CheckForSerializationFailure();
+ /* Sync WAL-skipped relations */
+ PreCommit_RelationSync();
+
/*
* Insert notifications sent by NOTIFY commands into the queue. This
* should be late in the pre-commit sequence to minimize time spent
@@ -2395,6 +2398,9 @@ PrepareTransaction(void)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
+ /* Sync WAL-skipped relations */
+ PreCommit_RelationSync();
+
/* Prevent cancel/die interrupt while cleaning up */
HOLD_INTERRUPTS();
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5f81aa57d4..a25c82438e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2761,11 +2761,7 @@ CopyFrom(CopyState cstate)
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
@@ -3364,8 +3360,6 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
- table_finish_bulk_insert(cstate->rel, ti_options);
-
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..859b869b0d 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 99bf3c29f2..c84edd0db0 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bfcf9472d7..75f11a327d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4741,8 +4741,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5026,8 +5024,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d0f6f715e6..4bffbfff5d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1512,6 +1512,9 @@ RelationInitIndexAccessInfo(Relation relation)
relation->rd_exclprocs = NULL;
relation->rd_exclstrats = NULL;
relation->rd_amcache = NULL;
+
+ if (relation->rd_indam->amatcommitsync != NULL)
+ relation->rd_can_skipwal = true;
}
/*
@@ -1781,6 +1784,9 @@ RelationInitTableAccessMethod(Relation relation)
* Now we can fetch the table AM's API struct
*/
InitTableAmRoutine(relation);
+
+ if (relation->rd_tableam && relation->rd_tableam->at_commit_sync)
+ relation->rd_can_skipwal = true;
}
/*
@@ -2913,6 +2919,73 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
}
+/*
+ * PreCommit_RelationSync
+ *
+ * Sync relations that were WAL-skipped in this transaction.
+ *
+ * AMs may have skipped WAL-logging for relations created in the current
+ * transaction. This lets such relations be synced. This operation can only
+ * be performed while the transaction status is INPROGRESS, so it is separated from
+ * AtEOXact_RelationCache.
+ */
+void
+PreCommit_RelationSync(void)
+{
+ HASH_SEQ_STATUS status;
+ RelIdCacheEnt *idhentry;
+ int i;
+
+ /* See AtEOXact_RelationCache for details on eoxact_list */
+ if (eoxact_list_overflowed)
+ {
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ {
+ Relation rel = idhentry->reldesc;
+
+ if (!RelationNeedsAtCommitSync(rel))
+ continue;
+
+ if (rel->rd_tableam != NULL)
+ table_at_commit_sync(rel);
+ else
+ {
+ Assert (rel->rd_indam != NULL);
+ index_at_commit_sync(rel);
+ }
+ }
+ }
+ else
+ {
+ for (i = 0; i < eoxact_list_len; i++)
+ {
+ Relation rel;
+
+ idhentry = (RelIdCacheEnt *) hash_search(RelationIdCache,
+ (void *) &eoxact_list[i],
+ HASH_FIND,
+ NULL);
+
+ if (idhentry == NULL)
+ continue;
+
+ rel = idhentry->reldesc;
+
+ if (!RelationNeedsAtCommitSync(rel))
+ continue;
+
+ if (rel->rd_tableam != NULL)
+ table_at_commit_sync(rel);
+ else
+ {
+ Assert (rel->rd_indam != NULL);
+ index_at_commit_sync(rel);
+ }
+ }
+ }
+}
+
/*
* AtEOXact_RelationCache
*
@@ -3032,7 +3105,21 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
if (relation->rd_createSubid != InvalidSubTransactionId)
{
if (isCommit)
+ {
+ /*
+ * When wal_level is minimal, we have skipped WAL-logging for
+ * persistent relations created in this transaction. Sync those
+ * tables out before they become publicly accessible.
+ */
+ if (!XLogIsNeeded() && relation->rd_smgr &&
+ relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ {
+ FlushRelationBuffers(relation);
+ smgrimmedsync(relation->rd_smgr, MAIN_FORKNUM);
+ }
+
relation->rd_createSubid = InvalidSubTransactionId;
+ }
else if (RelationHasReferenceCountZero(relation))
{
RelationClearRelation(relation, false);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 09a7404267..fc6981d98a 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -156,6 +156,11 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* sync relation at commit */
+typedef void (*amatcommitsync_function) (Relation indexRelation);
+
+ /* interface function to support WAL-skipping feature */
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -230,6 +235,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to support WAL-skipping feature */
+ amatcommitsync_function amatcommitsync; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9717183ef2..b225fd622e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,6 +177,7 @@ extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
uint16 procnum);
+extern void index_at_commit_sync(Relation irel);
extern void index_store_float8_orderby_distances(IndexScanDesc scan,
Oid *orderByTypes, double *distances,
bool recheckOrderBy);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 62aaa08eff..0fb7d86bf2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6c1acd4855..1d042e89b5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -717,6 +717,7 @@ extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool btcanreturn(Relation index, int attno);
+extern void btatcommitsync(Relation index);
/*
* prototypes for internal functions in nbtree.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06eae2337a..90254cb278 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -409,19 +409,15 @@ typedef struct TableAmRoutine
TM_FailureData *tmfd);
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
+ * Sync relation at commit-time if needed.
*
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
+ * A table AM may skip WAL-logging for relations created in the current
+ * transaction. This routine is called at commit time, and the table AM
+ * must flush buffers and sync the underlying storage.
*
* Optional callback.
*/
- void (*finish_bulk_insert) (Relation rel, int options);
+ void (*at_commit_sync) (Relation rel);
/* ------------------------------------------------------------------------
@@ -1104,8 +1100,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
*
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1300,20 +1295,23 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
}
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Sync relation at commit-time if needed.
+ *
+ * A table AM that defines this interface can allow derived objects created
+ * in the current transaction to skip WAL-logging. This routine is called
+ * at commit time, and the table AM must flush buffers and sync the underlying
+ * storage.
+ *
+ * Optional callback.
*/
static inline void
-table_finish_bulk_insert(Relation rel, int options)
+table_at_commit_sync(Relation rel)
{
/* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
+ if (rel->rd_tableam && rel->rd_tableam->at_commit_sync)
+ rel->rd_tableam->at_commit_sync(rel);
}
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..c09fd84a1c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,6 +64,9 @@ typedef struct RelationData
* rd_replidindex) */
bool rd_statvalid; /* is rd_statlist valid? */
+ /* Some relations can omit WAL-logging under certain conditions. */
+ bool rd_can_skipwal; /* can skip WAL-logging? */
+
/*
* rd_createSubid is the ID of the highest subtransaction the rel has
* survived into; or zero if the rel was not created in the current top
@@ -512,9 +515,25 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * If the underlying table AM has the at_commit_sync interface, this returns false
+ * when wal_level = minimal and the relation was created in the current transaction.
*/
#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (!relation->rd_can_skipwal || \
+ !(RELATION_IS_LOCAL(relation) && !XLogIsNeeded())))
+
+/*
+ * RelationNeedsAtCommitSync
+ * True if the relation needs an at-commit sync.
+ */
+#define RelationNeedsAtCommitSync(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ relation->rd_can_skipwal && \
+ (RELATION_IS_LOCAL(relation) || \
+ relation->rd_newRelfilenodeSubid != InvalidSubTransactionId) \
+ && !XLogIsNeeded())
/*
* RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 364495a5f0..07c4cfa565 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -120,6 +120,7 @@ extern void RelationCacheInvalidate(void);
extern void RelationCloseSmgrByOid(Oid relationId);
+extern void PreCommit_RelationSync(void);
extern void AtEOXact_RelationCache(bool isCommit);
extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
--
2.16.3
Attached is a new version.
At Tue, 21 May 2019 21:29:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190521.212948.34357392.horiguchi.kyotaro@lab.ntt.co.jp>
At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190520.155430.215084510.horiguchi.kyotaro@lab.ntt.co.jp>
I suspect the design in the /messages/by-id/559FA0BA.3080808@iki.fi last
paragraph will be simpler, not more complex. In the implementation I'm
envisioning, smgrDoPendingDeletes() would change name, perhaps to
AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
durability by syncing (for large nodes) or by WAL-logging each page (for small
nodes). RelationNeedsWAL() would return false whenever the applicable
relfilenode appears in pendingDeletes. Access methods would remove their
smgrimmedsync() calls, but they would otherwise not change. Would anyone like
to try implementing that?
Following this direction, the attached PoC works *at least for*
the wal_optimization TAP tests, but doing pending flush not in
smgr but in relcache. This is extending the skip-wal feature to
indexes. And makes the old 0002 patch on nbtree useless.
This is a tidier version of the patch.
- Passes regression tests, including 018_wal_optimize.pl.
- Moves the substantial work to table/index AMs.
  Each AM can decide whether to support WAL skipping or not.
  Currently heap and nbtree support it.
- The timing of the sync is moved from AtEOXact to PreCommit,
  because heap_sync() needs the xact state to be INPROGRESS.
- matview and cluster are broken, since swapping to the new
  relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
  that.
cluster/matview are fixed.
An obstacle to fixing them was the unreliability of
newRelfilenodeSubid. As mentioned in the comment in
RelationData, newRelfilenodeSubid may disappear after certain
sequences of commands.
In the attached v14, I added "rd_firstRelfilenodeSubid", which
stores the subtransaction id of the first relfilenode
replacement in the current transaction. It survives any sequence
of commands, including the one mentioned in CopyFrom's comment
(which I removed in this patch).
With the attached patch, for relations based on table/index AMs
that support WAL skipping, WAL-logging is eliminated if the
relation was created in the current transaction or its relfilenode
was replaced in the current transaction. The at-commit file sync is
always performed. (Only heap and btree support it.)
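
For illustration, the WAL-skip test in v14 roughly amounts to checking both
subtransaction ids. The sketch below is only meant to show the shape of that
condition; the macro name is made up here, and the exact definition lives in
the attached rel.h hunk.

    /*
     * Simplified sketch: WAL can be skipped when wal_level = minimal and
     * the relation was either created in the current transaction or had
     * its relfilenode replaced in it (tracked by rd_firstRelfilenodeSubid).
     */
    #define RelationSkipsWAL(relation) \
        (!XLogIsNeeded() && \
         ((relation)->rd_createSubid != InvalidSubTransactionId || \
          (relation)->rd_firstRelfilenodeSubid != InvalidSubTransactionId))

At commit, PreCommit_RelationSync() flushes and fsyncs every relation for
which that condition held, so the contents survive a crash despite never
having been WAL-logged.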
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v14-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From 0430cf502bc8d04f3e71cc69a748a9a035706cb6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v14-0002-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From effbb1cdc777e0612a51682dd41f0f46b7881798 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all; such
relations are instead synced at commit.
---
src/backend/access/brin/brin.c | 2 +
src/backend/access/gin/ginutil.c | 2 +
src/backend/access/gist/gist.c | 2 +
src/backend/access/hash/hash.c | 2 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 24 ++----
src/backend/access/heap/rewriteheap.c | 12 +--
src/backend/access/index/indexam.c | 18 +++++
src/backend/access/nbtree/nbtree.c | 13 ++++
src/backend/access/transam/xact.c | 6 ++
src/backend/commands/cluster.c | 29 ++++++++
src/backend/commands/copy.c | 38 ++--------
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 -
src/backend/commands/tablecmds.c | 10 +--
src/backend/utils/cache/relcache.c | 123 ++++++++++++++++++++++++++++++-
src/include/access/amapi.h | 6 ++
src/include/access/genam.h | 1 +
src/include/access/heapam.h | 1 -
src/include/access/nbtree.h | 1 +
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 47 ++++++------
src/include/utils/rel.h | 35 ++++++++-
src/include/utils/relcache.h | 4 +
24 files changed, 289 insertions(+), 106 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..4b48f44949 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -125,6 +125,8 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..f4f0eebec5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -77,6 +77,8 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 45c00aaa87..ebaf4495b8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -99,6 +99,8 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e9f2c84af1..ce7ac58204 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -98,6 +98,8 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = NULL;
amroutine->amparallelrescan = NULL;
+ amroutine->amatcommitsync = NULL;
+
PG_RETURN_POINTER(amroutine);
}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6c342635e8..642e7d0cc5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -8906,10 +8906,6 @@ heap2_redo(XLogReaderState *record)
void
heap_sync(Relation rel)
{
- /* non-WAL-logged tables never need fsync */
- if (!RelationNeedsWAL(rel))
- return;
-
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a4a28e88ec..17126e599b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -559,15 +559,14 @@ tuple_lock_retry:
return result;
}
+/* ------------------------------------------------------------------------
+ * WAL-skipping related routine
+ * ------------------------------------------------------------------------
+ */
static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_at_commit_sync(Relation relation)
{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
+ heap_sync(relation);
}
@@ -702,7 +701,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +714,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
@@ -732,7 +724,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2626,7 +2618,7 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
+ .at_commit_sync = heapam_at_commit_sync,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 131ec7b8d7..617eec582b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -245,8 +244,7 @@ static void logical_end_heap_rewrite(RewriteState state);
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +651,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +689,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..ade721a383 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_at_commit_sync - perform at_commit_sync
*
* NOTES
* This file contains the index_ routines which used
@@ -837,6 +838,23 @@ index_getprocinfo(Relation irel,
return locinfo;
}
+/* ----------------
+ * index_at_commit_sync
+ *
+ * An index AM that defines this interface can allow derived objects created
+ * in the current transaction to skip WAL-logging. This routine is called at
+ * commit time, and the AM must flush its buffers and sync the underlying
+ * storage.
+ *
+ * Optional interface
+ * ----------------
+ */
+void
+index_at_commit_sync(Relation irel)
+{
+ if (irel->rd_indam && irel->rd_indam->amatcommitsync)
+ irel->rd_indam->amatcommitsync(irel);
+}
+
/* ----------------
* index_store_float8_orderby_distances
*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..695b058b85 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,8 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->aminitparallelscan = btinitparallelscan;
amroutine->amparallelrescan = btparallelrescan;
+ amroutine->amatcommitsync = btatcommitsync;
+
PG_RETURN_POINTER(amroutine);
}
@@ -1385,3 +1387,14 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+/*
+ * btatcommitsync() -- Perform at-commit sync of WAL-skipped index
+ */
+void
+btatcommitsync(Relation index)
+{
+ FlushRelationBuffers(index);
+ smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
+}
+
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f1108ccc8b..0670985bc2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2120,6 +2120,9 @@ CommitTransaction(void)
if (!is_parallel_worker)
PreCommit_CheckForSerializationFailure();
+ /* Sync WAL-skipped relations */
+ PreCommit_RelationSync();
+
/*
* Insert notifications sent by NOTIFY commands into the queue. This
* should be late in the pre-commit sequence to minimize time spent
@@ -2395,6 +2398,9 @@ PrepareTransaction(void)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
+ /* Sync WAL-skipped relations */
+ PreCommit_RelationSync();
+
/* Prevent cancel/die interrupt while cleaning up */
HOLD_INTERRUPTS();
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..504a04104f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,41 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode was created in the current transaction
+ * and becomes the old relation's new relfilenode, so set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ {
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ /* Flag the old relation as needing eoxact cleanup */
+ RelationEOXactListAdd(rel1);
+ }
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b00891ffd2..77608c09c3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2720,28 +2720,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_insert and RelationGetBufferForTuple specify that
- * skipping WAL logging is only safe if we ensure that our tuples do not
- * go into pages containing tuples from any other transactions --- but this
- * must be the case if we have a new table or new relfilenode, so we need
- * no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2757,15 +2738,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is the creation check; firstRelfilenodeSubid is the
+ * truncation and cluster check. Partitioned tables have no storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
@@ -3364,8 +3344,6 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
- table_finish_bulk_insert(cstate->rel, ti_options);
-
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..859b869b0d 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index dc2940cd4e..583c542121 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 602a8dbd1c..f63662f4ed 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4733,9 +4733,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_insert. Because we're
- * building a new heap, we can skip WAL-logging and fsync it to disk at
- * the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * building a new heap, the underlying table AM can skip WAL-logging and
+ * fsync the relation to disk at the end of the current transaction
+ * instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4743,8 +4743,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5028,8 +5026,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..cd418c5f80 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -177,6 +177,13 @@ static bool eoxact_list_overflowed = false;
eoxact_list_overflowed = true; \
} while (0)
+/* Function version of the macro above */
+void
+RelationEOXactListAdd(Relation rel)
+{
+ EOXactListAdd(rel);
+}
+
/*
* EOXactTupleDescArray stores TupleDescs that (might) need AtEOXact
* cleanup work. The array expands as needed; there is no hashtable because
@@ -263,6 +270,7 @@ static void RelationReloadIndexInfo(Relation relation);
static void RelationReloadNailed(Relation relation);
static void RelationFlushRelation(Relation relation);
static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+static void PreCommit_SyncOneRelation(Relation relation);
static void AtEOXact_cleanup(Relation relation, bool isCommit);
static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1512,6 +1520,10 @@ RelationInitIndexAccessInfo(Relation relation)
relation->rd_exclprocs = NULL;
relation->rd_exclstrats = NULL;
relation->rd_amcache = NULL;
+
+ /* set AM-type-independent WAL-skip flag if this am supports it */
+ if (relation->rd_indam->amatcommitsync != NULL)
+ relation->rd_can_skipwal = true;
}
/*
@@ -1781,6 +1793,10 @@ RelationInitTableAccessMethod(Relation relation)
* Now we can fetch the table AM's API struct
*/
InitTableAmRoutine(relation);
+
+ /* set AM-type-independent WAL-skip flag if this am supports it */
+ if (relation->rd_tableam && relation->rd_tableam->at_commit_sync)
+ relation->rd_can_skipwal = true;
}
/*
@@ -2594,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2661,7 +2678,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2818,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -2913,6 +2930,93 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
}
+/*
+ * PreCommit_RelationSync
+ *
+ * Sync relations that were WAL-skipped in this transaction.
+ *
+ * An access method may have skipped WAL-logging for relations created in the
+ * current transaction. Such relations need to be synced at top-transaction
+ * commit. The operation requires an active transaction state, so it is
+ * performed separately from AtEOXact_RelationCache.
+ */
+void
+PreCommit_RelationSync(void)
+{
+ HASH_SEQ_STATUS status;
+ RelIdCacheEnt *idhentry;
+ int i;
+
+ /* See AtEOXact_RelationCache about eoxact_list */
+ if (eoxact_list_overflowed)
+ {
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ PreCommit_SyncOneRelation(idhentry->reldesc);
+ }
+ else
+ {
+ for (i = 0; i < eoxact_list_len; i++)
+ {
+ idhentry = (RelIdCacheEnt *) hash_search(RelationIdCache,
+ (void *) &eoxact_list[i],
+ HASH_FIND,
+ NULL);
+
+ if (idhentry != NULL)
+ PreCommit_SyncOneRelation(idhentry->reldesc);
+ }
+ }
+}
+
+/*
+ * PreCommit_SyncOneRelation
+ *
+ * Sync one relation if needed
+ *
+ * NB: this processing must be idempotent, because EOXactListAdd() doesn't
+ * bother to prevent duplicate entries in eoxact_list[].
+ */
+static void
+PreCommit_SyncOneRelation(Relation relation)
+{
+ HeapTuple reltup;
+ Form_pg_class relform;
+
+ /* return immediately if no need for sync */
+ if (!RelationNeedsAtCommitSync(relation))
+ return;
+
+ /*
+ * We are about to sync a WAL-skipped relation. The relfilenode here is
+ * wrong if the last subtransaction that created a new relfilenode was
+ * aborted.
+ */
+ if (relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId &&
+ relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ {
+ reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relation->rd_id));
+ if (!HeapTupleIsValid(reltup))
+ elog(ERROR, "cache lookup failed for relation %u", relation->rd_id);
+ relform = (Form_pg_class) GETSTRUCT(reltup);
+ relation->rd_rel->relfilenode = relform->relfilenode;
+ relation->rd_node.relNode = relform->relfilenode;
+ ReleaseSysCache(reltup);
+ }
+
+ if (relation->rd_tableam != NULL)
+ table_at_commit_sync(relation);
+ else
+ {
+ Assert(relation->rd_indam != NULL);
+ index_at_commit_sync(relation);
+ }
+
+ /* We have synced the files, forget about relfilenode change */
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+}
+
/*
* AtEOXact_RelationCache
*
@@ -3058,6 +3162,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3149,7 +3254,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3158,6 +3263,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3440,6 +3553,10 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
*/
RelationDropStorage(relation);
+ /* Record the subxid where the first relfilenode change happened */
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
/*
* Create storage for the main fork of the new relfilenode. If it's a
* table-like object, call into the table AM to do so, which'll also
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..75159d10d4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -156,6 +156,9 @@ typedef void (*aminitparallelscan_function) (void *target);
/* (re)start parallel index scan */
typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+/* sync relation at commit after skipping WAL-logging */
+typedef void (*amatcommitsync_function) (Relation indexRelation);
+
/*
* API struct for an index AM. Note this must be stored in a single palloc'd
* chunk of memory.
@@ -230,6 +233,9 @@ typedef struct IndexAmRoutine
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
aminitparallelscan_function aminitparallelscan; /* can be NULL */
amparallelrescan_function amparallelrescan; /* can be NULL */
+
+ /* interface function to do at-commit sync after skipping WAL-logging */
+ amatcommitsync_function amatcommitsync; /* can be NULL */
} IndexAmRoutine;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..8e661edfdd 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,6 +177,7 @@ extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
uint16 procnum);
+extern void index_at_commit_sync(Relation irel);
extern void index_store_float8_orderby_distances(IndexScanDesc scan,
Oid *orderByTypes, double *distances,
bool recheckOrderBy);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b88bd8a4d7..187c668878 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..f33d2b38b5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -717,6 +717,7 @@ extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats);
extern bool btcanreturn(Relation index, int attno);
+extern void btatcommitsync(Relation index);
/*
* prototypes for internal functions in nbtree.c
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6f1cd382d8..759a1e806d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -409,19 +409,15 @@ typedef struct TableAmRoutine
TM_FailureData *tmfd);
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
+ * Sync relation at commit-time after skipping WAL-logging.
*
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
+ * A table AM may skip WAL-logging for relations created in the current
+ * transaction. This routine is called at commit time, and the table AM
+ * must flush its buffers and sync the underlying storage.
*
* Optional callback.
*/
- void (*finish_bulk_insert) (Relation rel, int options);
+ void (*at_commit_sync) (Relation rel);
/* ------------------------------------------------------------------------
@@ -1089,10 +1085,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
@@ -1112,10 +1104,12 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* Note that most of these options will be applied when inserting into the
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
+ * The core function RelationNeedsWAL() allows WAL-logging to be skipped for
+ * relations created or truncated in the current transaction when the AM
+ * provides the at_commit_sync interface.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1205,6 +1199,8 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
* delete it. Failure return codes are TM_SelfModified, TM_Updated, and
* TM_BeingModified (the last only possible if wait == false).
*
+ * See table_insert about the WAL-skipping feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for
* struct TM_FailureData for additional info.
@@ -1249,6 +1245,8 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about the WAL-skipping feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1310,20 +1308,23 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
}
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Sync relation at commit-time if needed.
+ *
+ * A table AM that defines this interface can allow derived objects created
+ * in the current transaction to skip WAL-logging. This routine is called
+ * commit-time and the table AM must flush buffer and sync the underlying
+ * storage.
+ *
+ * Optional callback.
*/
static inline void
-table_finish_bulk_insert(Relation rel, int options)
+table_at_commit_sync(Relation rel)
{
/* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
+ if (rel->rd_tableam && rel->rd_tableam->at_commit_sync)
+ rel->rd_tableam->at_commit_sync(rel);
}
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..6a3ef80575 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,6 +63,7 @@ typedef struct RelationData
bool rd_indexvalid; /* is rd_indexlist valid? (also rd_pkindex and
* rd_replidindex) */
bool rd_statvalid; /* is rd_statlist valid? */
+ bool rd_can_skipwal; /* can the underlying AM skip WAL-logging? */
/*
* rd_createSubid is the ID of the highest subtransaction the rel has
@@ -76,10 +77,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the highest subtransaction in which
+ * the first relfilenode change of the current transaction took place. Unlike
+ * newRelfilenodeSubid, it is not forgotten within the transaction. A valid
+ * value means that the currently active relfilenode is transaction-local and
+ * needs no WAL-logging.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -512,9 +520,32 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * If the underlying AM supports the WAL-skipping feature, this returns false
+ * when wal_level = minimal and the relation was created or truncated in the
+ * current transaction.
*/
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (!relation->rd_can_skipwal || \
+ XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
+
+/*
+ * RelationNeedsAtCommitSync
+ * True if relation needs at-commit sync
+ *
+ * This macro is used in only a few places, but it lives here because it is
+ * tightly related to RelationNeedsWAL() above. We don't need to sync local
+ * or temp relations.
+ */
+#define RelationNeedsAtCommitSync(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ !(!relation->rd_can_skipwal || \
+ XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d9c10ffcba..b681d3afb2 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -120,6 +120,7 @@ extern void RelationCacheInvalidate(void);
extern void RelationCloseSmgrByOid(Oid relationId);
+extern void PreCommit_RelationSync(void);
extern void AtEOXact_RelationCache(bool isCommit);
extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
@@ -138,4 +139,7 @@ extern bool criticalRelcachesBuilt;
/* should be used only by relcache.c and postinit.c */
extern bool criticalSharedRelcachesBuilt;
+/* add rel to eoxact cleanup list */
+void RelationEOXactListAdd(Relation rel);
+
#endif /* RELCACHE_H */
--
2.16.3
On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
Following this direction, the attached PoC works *at least for*
the wal_optimization TAP tests, but doing pending flush not in
smgr but in relcache.
This task, syncing files created in the current transaction, is not the kind
of task normally assigned to a cache. We already have a module, storage.c,
that maintains state about files created in the current transaction. Why did
you use relcache instead of storage.c?
On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
This is a tidier version of the patch.
- Move the substantial work to table/index AMs.
Each AM can decide whether to support WAL skip or not.
Currently heap and nbtree support it.
Why would an AM find it important to disable WAL skip?
Thanks for the comment!
At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah@leadboat.com> wrote in <20190525023332.GE1624191@rfd.leadboat.com>
On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
Following this direction, the attached PoC works *at least for*
the wal_optimization TAP tests, but doing pending flush not in
smgr but in relcache.
This task, syncing files created in the current transaction, is not the kind
of task normally assigned to a cache. We already have a module, storage.c,
that maintains state about files created in the current transaction. Why did
you use relcache instead of storage.c?
The reason was that at-commit sync needs a buffer flush beforehand. But
FlushRelationBufferWithoutRelCache() in v11 can do
that, so storage.c is a reasonable place for it.
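To illustrate the shape of that approach: a minimal sketch of the per-relation
commit-time step, assuming the FlushRelationBuffersWithoutRelcache() helper
added by the v16 patch attached below. The function name
sync_pending_relation() is made up for this example; in the patch the
equivalent code lives inside smgrDoPendingDeletes().

    /* Sketch: sync one WAL-skipped relation at commit, from storage.c */
    static void
    sync_pending_relation(RelFileNode rnode, BackendId backend)
    {
        SMgrRelation srel;

        /* write out any dirty shared buffers for this relfilenode first */
        FlushRelationBuffersWithoutRelcache(rnode, false);

        /* then force the underlying file to disk */
        srel = smgropen(rnode, backend);
        smgrimmedsync(srel, MAIN_FORKNUM);
        smgrclose(srel);
    }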
On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
This is a tidier version of the patch.
- Move the substantial work to table/index AMs.
Each AM can decide whether to support WAL skip or not.
Currently heap and nbtree support it.
Why would an AM find it important to disable WAL skip?
The reason is that it is currently the AM's responsibility to decide
whether to skip WAL or not.
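Concretely, with the interface in the patch above an AM opts in simply by
filling in the callback; nbtree's implementation is just a buffer flush
followed by an immediate sync (copied from that patch for illustration):

    /* In bthandler(): advertise support for WAL skipping */
    amroutine->amatcommitsync = btatcommitsync;

    /* The callback flushes dirty buffers and syncs the main fork at commit */
    void
    btatcommitsync(Relation index)
    {
        FlushRelationBuffers(index);
        smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
    }

An AM that does not set amatcommitsync keeps WAL-logging as before, since
rd_can_skipwal is only set when the callback is present.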
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah@leadboat.com> wrote in <20190525023332.GE1624191@rfd.leadboat.com>
On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
Following this direction, the attached PoC works *at least for*
the wal_optimization TAP tests, but doing pending flush not in
smgr but in relcache.
This task, syncing files created in the current transaction, is not the kind
of task normally assigned to a cache. We already have a module, storage.c,
that maintains state about files created in the current transaction. Why did
you use relcache instead of storage.c?
The reason was at-commit sync needs buffer flush beforehand. But
FlushRelationBufferWithoutRelCache() in v11 can do
that. storage.c is reasonable as the place.
Okay. I do want this to work in 9.5 and later, but I'm not aware of a reason
relcache.c would be a better code location in older branches. Unless you
think of a reason to prefer relcache.c, please use storage.c.
On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
This is a tidier version of the patch.
- Move the substantial work to table/index AMs.
Each AM can decide whether to support WAL skip or not.
Currently heap and nbtree support it.
Why would an AM find it important to disable WAL skip?
The reason is currently it's AM's responsibility to decide
whether to skip WAL or not.
I see. Skipping the sync would be a mere optimization; no AM would require it
for correctness. An AM might want RelationNeedsWAL() to keep returning true
despite the sync happening, perhaps because it persists data somewhere other
than the forks of pg_class.relfilenode. Since the index and table APIs
already assume one relfilenode captures all persistent data, I'm not seeing a
use case for an AM overriding this behavior. Let's take away the AM's
responsibility for this decision, making the system simpler. A future patch
could let AM code decide, if someone finds a real-world use case for
AM-specific logic around when to skip WAL.
It seems there is some feedback for this patch and the CF is going to
start in 2 days. Are you planning to work on this patch for the next CF?
If not, it is better to bump it. It is not a good idea to have the patch
in "Waiting on Author" at the beginning of the CF unless the author is
actively working on it and is going to produce a new version in the next
few days.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello. Rebased the patch to master (bd56cd75d2).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v16-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From ac52e2c1c56a96c1745149ff4220a3a116d6c811 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL-logging of TRUNCATE and COPY is optimized in some cases, and these
+# optimizations can interact badly with each other depending on the value of
+# wal_level, particularly "minimal" and "replica". The optimizations may be
+# enabled or disabled depending on the scenarios dealt with here, and should
+# never result in any kind of failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine that runs the test suite for a given wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Set up the primary with the wal_level value under test
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::real_dir($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v16-0002-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From 4363a50092dc8aa536b24582a3160f4f47c85349 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Mon, 27 May 2019 16:06:30 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all; such
relations are instead synced at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +----------
src/backend/access/heap/rewriteheap.c | 13 ++-----
src/backend/catalog/storage.c | 64 +++++++++++++++++++++++++-------
src/backend/commands/cluster.c | 24 ++++++++++++
src/backend/commands/copy.c | 38 ++++---------------
src/backend/commands/createas.c | 5 +--
src/backend/commands/matview.c | 4 --
src/backend/commands/tablecmds.c | 10 ++---
src/backend/storage/buffer/bufmgr.c | 33 +++++++++++-----
src/backend/utils/cache/relcache.c | 16 ++++++--
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 41 ++------------------
src/include/storage/bufmgr.h | 1 +
src/include/utils/rel.h | 17 ++++++++-
16 files changed, 148 insertions(+), 147 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b061..eca98fb063 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1941,7 +1941,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2124,7 +2124,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe98a..b9554f6064 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -556,18 +556,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -699,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +700,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
@@ -729,7 +710,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2517,7 +2498,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 72a448ad31..992d4b9880 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..e4bcdc390f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -57,7 +57,8 @@ typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
+ bool dosync; /* T=work is sync; F=work is delete */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,10 +115,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
+ pending->dosync = false;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * We are going to skip WAL-logging for storage of persistent relations
+ * created in the current transaction when wal_level = minimal. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->dosync = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+
return srel;
}
@@ -155,6 +175,7 @@ RelationDropStorage(Relation rel)
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
+ pending->dosync = false;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -428,21 +449,34 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ if (pending->dosync)
{
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
+ /* Perform pending sync of WAL-skipped relation */
+ FlushRelationBuffersWithoutRelcache(pending->relnode,
+ false);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrimmedsync(srel, MAIN_FORKNUM);
+ smgrclose(srel);
}
- else if (maxrels <= nrels)
+ else
{
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ /* Collect pending deletions */
+ srel = smgropen(pending->relnode, pending->backend);
- srels[nrels++] = srel;
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
}
/* must explicitly free the list entry */
pfree(pending);
@@ -489,8 +523,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ /* Pending syncs are excluded */
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && !pending->dosync)
nrels++;
}
if (nrels == 0)
@@ -502,8 +537,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
*ptr = rptr;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ /* Pending syncs are excluded */
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && !pending->dosync)
{
*rptr = pending->relnode;
rptr++;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..6fc9d7d64e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode was created in the current transaction
+ * and is used as the old relation's new relfilenode, so set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f1161f0fee..f4beff0001 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2722,28 +2722,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2759,15 +2740,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is the creation check; firstRelfilenodeSubid is the
+ * truncation and cluster check. Partitioned tables don't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
@@ -3366,8 +3346,6 @@ CopyFrom(CopyState cstate)
FreeExecutorState(estate);
- table_finish_bulk_insert(cstate->rel, ti_options);
-
return processed;
}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4c1d909d38..39ebd73691 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0f1a9f0e54..ac7336ef58 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4761,9 +4761,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and fsync the relation to disk at the end of the current transaction
+ * instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4771,8 +4771,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5057,8 +5055,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7332e6b590..280fdf8080 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3190,20 +3191,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3220,7 +3233,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3250,18 +3263,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..812bfadb40 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2661,7 +2661,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2801,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3058,6 +3058,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3149,7 +3150,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3158,6 +3159,15 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c2b0481e7e..ac0e981acb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1088,10 +1072,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
@@ -1111,10 +1091,8 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* Note that most of these options will be applied when inserting into the
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
- *
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1249,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about skipping WAL-logging feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1309,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d35b4a5061..5cbb5a7b27 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -76,10 +76,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the highest subtransaction in
+ * which a relfilenode change first took place in the current
+ * transaction. Unlike newRelfilenodeSubid, this is never forgotten. A
+ * valid value means that the currently active relfilenode is
+ * transaction-local, so no WAL-logging is needed for it.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -512,9 +519,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
*/
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.16.3
v16-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
From 63fc1a432f20e99df6f081bc6af640bf6907879c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations
The function no longer does only deletions but also syncs. Rename the
function to reflect that. smgrGetPendingDeletes is not renamed since it
does not change behavior.
---
src/backend/access/transam/xact.c | 4 +--
src/backend/catalog/storage.c | 57 ++++++++++++++++++++-------------------
src/include/catalog/storage.h | 2 +-
3 files changed, 32 insertions(+), 31 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d7930c077d..cc0c43b2dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
* Other backends will observe the attendant catalog changes and not
* attempt to access affected files.
*/
- smgrDoPendingDeletes(true);
+ smgrDoPendingOperations(true);
AtCommit_Notify();
AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
false, true);
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
AtEOXact_GUC(false, 1);
AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e4bcdc390f..6ebe75aa37 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -53,17 +53,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOps
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=work at commit; F=work at abort */
bool dosync; /* T=work is sync; F=work is delete */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOps *next; /* linked-list link */
+} PendingRelOps;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOps *pendingDeletes = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -79,7 +79,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOps *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -110,8 +110,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -127,8 +127,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
*/
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = true;
@@ -167,11 +167,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOps *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -185,9 +185,9 @@ RelationDropStorage(Relation rel)
* present in the pending-delete list twice, once with atCommit true and
* once with atCommit false. Hence, it will be physically deleted at end
* of xact in either case (and the other entry will be ignored by
- * smgrDoPendingDeletes, so no error will occur). We could instead remove
- * the existing list entry and delete the physical file immediately, but
- * for now I'll keep the logic simple.
+ * smgrDoPendingOperations, so no error will occur). We could instead
+ * remove the existing list entry and delete the physical file
+ * immediately, but for now I'll keep the logic simple.
*/
RelationCloseSmgr(rel);
@@ -213,9 +213,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *prev;
+ PendingRelOps *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -406,7 +406,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ * smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ * end of xact.
*
* This also runs when aborting a subxact; we want to clean up a failed
* subxact immediately.
@@ -417,12 +418,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* already recovered the physical storage.
*/
void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *prev;
+ PendingRelOps *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -518,7 +519,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOps *pending;
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -558,8 +559,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -580,7 +581,7 @@ void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOps *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
@@ -599,7 +600,7 @@ AtSubCommit_smgr(void)
void
AtSubAbort_smgr(void)
{
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
}
void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..43836cf11c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,7 +30,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
--
2.16.3
On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
Hello. Rebased the patch to master(bd56cd75d2).
It looks like you did more than just a rebase, because this v16 no longer
modifies many files that v14 did modify. (That's probably good, since you had
pending review comments.) What other changes did you make?
Many messages seem to have been lost during the move to the new environment..
I'm digging through the archive but couldn't find the message for v15..
At Thu, 11 Jul 2019 18:03:35 -0700, Noah Misch <noah@leadboat.com> wrote in <20190712010335.GB1610889@rfd.leadboat.com>
On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
Hello. Rebased the patch to master(bd56cd75d2).
It looks like you did more than just a rebase, because this v16 no longer
modifies many files that v14 did modify. (That's probably good, since you had
pending review comments.) What other changes did you make?
Yeah.. Maybe I forgot to send pre-v15 or v16 before rebasing.
v14: WAL-logging is controlled by AMs and syncing at commit is
controlled according to that behavior. At-commit sync is still
controlled on a per-relation basis, which means it must be
processed before the transaction state becomes TRANS_COMMIT. So
it needs to be separated out into PreCommit_RelationSync() from
AtEOXact_RelationCache().
v15: The biggest change is that at-commit sync now works on an smgr
basis. The sync is registered at creation of a storage file
(RelationCreateStorage), and smgrDoPendingDeletes (or
smgrDoPendingOperations after the rename) runs the syncs. AMs are
no longer involved, and all permanent relations are WAL-skipped
entirely in the creation transaction while wal_level = minimal.
All storage files created for a relation are synced once and then
removed at commit (see the SQL sketch after this list).
v16: rebased.
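For illustration, a minimal SQL sketch of the case v15 targets, assuming
wal_level = minimal (the table name and file path are made up; the comments
are my reading of the design, not part of any patch):

BEGIN;
CREATE TABLE wal_skip_demo (id int, payload text);
-- bulk load into the just-created relfilenode; no per-tuple WAL is written
COPY wal_skip_demo FROM '/tmp/demo.csv' WITH (FORMAT csv);
-- TRUNCATE assigns another relfilenode, which is likewise WAL-skipped
TRUNCATE wal_skip_demo;
INSERT INTO wal_skip_demo VALUES (1, 'durable only thanks to the at-commit sync');
COMMIT;  -- the pending-operations machinery flushes and fsyncs the relation here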
The v16 no longer seems to work, so I'll send a further rebased version.
Sorry for the late reply and confusion..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190712.173041.236938840.horikyota.ntt@gmail.com>
The v16 no longer seems to work, so I'll send a further rebased version.
It was broken just by the renaming of TestLib::real_dir to perl2host.
This is the rebased version, v17.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v17-0001-TAP-test-for-copy-truncation-optimization.patch
From 9bcd4acb14c5cef2d4bdf20c9be8c86597a9cf7c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b26cd8efd5
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging of TRUNCATE and COPY is optimized away in some cases, and
+# those optimizations can interact badly with each other depending on the
+# wal_level setting, particularly with "minimal" or "replica". The
+# optimizations may be enabled or disabled depending on the scenarios
+# exercised here, and should never result in any kind of failure or data
+# loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to run with the wal_level under test here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # Like the previous test, but exercising different subtransaction patterns.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more rows into the same table via
+ # triggers. If the INSERTs from the triggers go to the same block the data
+ # is copied into, and those INSERTs are WAL-logged, WAL replay will fail
+ # when it tries to replay the WAL record because the "before" image doesn't
+ # match, since not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v17-0002-Fix-WAL-skipping-feature.patch
From 5d56e218b7771b3277d3aa97145dea16fdd48dbc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Mon, 27 May 2019 16:06:30 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all; instead, such
relations are synced at commit.
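A hedged sketch of the kind of mix described above, assuming wal_level =
minimal and the pre-patch behavior (the table name and file path are made
up; compare test6 in the TAP test, which exercises a similar pattern):

BEGIN;
CREATE TABLE t (id serial PRIMARY KEY, v text);
COPY t FROM '/tmp/data.txt' DELIMITER ',';   -- WAL-skipped under the old optimization
INSERT INTO t (v) VALUES ('logged row');     -- WAL-logged, may share pages with copied rows
COMMIT;
-- As described above, a crash after this commit can corrupt the table,
-- because recovery replays the logged change against pages whose other
-- contents were never WAL-logged.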
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +----------
src/backend/access/heap/rewriteheap.c | 13 ++-----
src/backend/catalog/storage.c | 64 +++++++++++++++++++++++++-------
src/backend/commands/cluster.c | 24 ++++++++++++
src/backend/commands/copy.c | 39 ++++---------------
src/backend/commands/createas.c | 5 +--
src/backend/commands/matview.c | 4 --
src/backend/commands/tablecmds.c | 10 ++---
src/backend/storage/buffer/bufmgr.c | 33 +++++++++++-----
src/backend/utils/cache/relcache.c | 16 ++++++--
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 41 ++------------------
src/include/storage/bufmgr.h | 1 +
src/include/utils/rel.h | 17 ++++++++-
16 files changed, 148 insertions(+), 148 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b061..eca98fb063 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1941,7 +1941,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2124,7 +2124,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe98a..b9554f6064 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -556,18 +556,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -699,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +700,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
@@ -729,7 +710,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2517,7 +2498,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 72a448ad31..992d4b9880 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..e4bcdc390f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -57,7 +57,8 @@ typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
+ bool dosync; /* T=work is sync; F=work is delete */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,10 +115,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
+ pending->dosync = false;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * We are going to skip WAL-logging for storage of persistent relations
+ * created in the current transaction when wal_level = minimal. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->dosync = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+
return srel;
}
@@ -155,6 +175,7 @@ RelationDropStorage(Relation rel)
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
+ pending->dosync = false;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -428,21 +449,34 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ if (pending->dosync)
{
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
+ /* Perform pending sync of WAL-skipped relation */
+ FlushRelationBuffersWithoutRelcache(pending->relnode,
+ false);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrimmedsync(srel, MAIN_FORKNUM);
+ smgrclose(srel);
}
- else if (maxrels <= nrels)
+ else
{
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ /* Collect pending deletions */
+ srel = smgropen(pending->relnode, pending->backend);
- srels[nrels++] = srel;
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
}
/* must explicitly free the list entry */
pfree(pending);
@@ -489,8 +523,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ /* Pending syncs are excluded */
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && !pending->dosync)
nrels++;
}
if (nrels == 0)
@@ -502,8 +537,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
*ptr = rptr;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ /* Pending syncs are excluded */
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && !pending->dosync)
{
*rptr = pending->relnode;
rptr++;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..6fc9d7d64e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode was created in the current transaction
+ * and is used as the old relation's new relfilenode, so set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4f04d122c3..f02efd59fc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2535,9 +2535,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2726,28 +2723,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2763,15 +2741,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is the creation check; firstRelfilenodeSubid is the
+ * truncation and cluster check. Partitioned tables don't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4c1d909d38..39ebd73691 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0f1a9f0e54..ac7336ef58 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4761,9 +4761,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and fsync the relation to disk at the end of the current transaction
+ * instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4771,8 +4771,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5057,8 +5055,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7332e6b590..280fdf8080 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3190,20 +3191,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3220,7 +3233,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3250,18 +3263,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..812bfadb40 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2661,7 +2661,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2801,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3058,6 +3058,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3149,7 +3150,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3158,6 +3159,15 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c2b0481e7e..ac0e981acb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1088,10 +1072,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* behaviour of the AM. Several options might be ignored by AMs not supporting
* them.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space. It's
@@ -1111,10 +1091,8 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* Note that most of these options will be applied when inserting into the
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
- *
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1249,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about the WAL-logging skip feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1309,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d35b4a5061..5cbb5a7b27 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -76,10 +76,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the highest subtransaction in
+ * which the first relfilenode change took place in the current
+ * transaction. This won't be forgotten as rd_newRelfilenodeSubid can be.
+ * A valid value means that the currently active relfilenode is
+ * transaction-local and needs no WAL-logging.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -512,9 +519,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
*/
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.16.3
v17-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
From 264bb593502db35ab8dbd7ddd505d2e729807293 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations
The function no longer does only deletions but also syncs. Rename the
function to reflect that. smgrGetPendingDeletes is not renamed since it
does not change behavior.
---
src/backend/access/transam/xact.c | 4 +--
src/backend/catalog/storage.c | 57 ++++++++++++++++++++-------------------
src/include/catalog/storage.h | 2 +-
3 files changed, 32 insertions(+), 31 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d7930c077d..cc0c43b2dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
* Other backends will observe the attendant catalog changes and not
* attempt to access affected files.
*/
- smgrDoPendingDeletes(true);
+ smgrDoPendingOperations(true);
AtCommit_Notify();
AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
false, true);
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
AtEOXact_GUC(false, 1);
AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e4bcdc390f..6ebe75aa37 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -53,17 +53,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOps
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=work at commit; F=work at abort */
bool dosync; /* T=work is sync; F=work is delete */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOps *next; /* linked-list link */
+} PendingRelOps;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOps *pendingDeletes = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -79,7 +79,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOps *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -110,8 +110,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -127,8 +127,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
*/
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = true;
@@ -167,11 +167,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOps *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -185,9 +185,9 @@ RelationDropStorage(Relation rel)
* present in the pending-delete list twice, once with atCommit true and
* once with atCommit false. Hence, it will be physically deleted at end
* of xact in either case (and the other entry will be ignored by
- * smgrDoPendingDeletes, so no error will occur). We could instead remove
- * the existing list entry and delete the physical file immediately, but
- * for now I'll keep the logic simple.
+ * smgrDoPendingOperations, so no error will occur). We could instead
+ * remove the existing list entry and delete the physical file
+ * immediately, but for now I'll keep the logic simple.
*/
RelationCloseSmgr(rel);
@@ -213,9 +213,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *prev;
+ PendingRelOps *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -406,7 +406,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ * smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ * end of xact.
*
* This also runs when aborting a subxact; we want to clean up a failed
* subxact immediately.
@@ -417,12 +418,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* already recovered the physical storage.
*/
void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *prev;
+ PendingRelOps *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -518,7 +519,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOps *pending;
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -558,8 +559,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -580,7 +581,7 @@ void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOps *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
@@ -599,7 +600,7 @@ AtSubCommit_smgr(void)
void
AtSubAbort_smgr(void)
{
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
}
void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..43836cf11c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,7 +30,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
--
2.16.3
I found the CF-bot complaining about this.
It seems that some comment fixes in the recent commit 21039555cd are the
cause.
No substantial changes have been made by this rebasing.
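For readers skimming the attached patches, the simplest scenario the new TAP
test (018_wal_optimize.pl) exercises under wal_level = minimal is, taken from
the test itself, a sequence like this; after an immediate stop and restart,
the row inserted after the TRUNCATE must still be visible:

    BEGIN;
    CREATE TABLE test2 (id serial PRIMARY KEY);
    INSERT INTO test2 VALUES (DEFAULT);
    TRUNCATE test2;
    INSERT INTO test2 VALUES (DEFAULT);
    COMMIT;
    -- immediate shutdown, restart, then:
    SELECT count(*) FROM test2;  -- must return 1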
regards.
On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190712.173041.236938840.horikyota.ntt@gmail.com>
v16 no longer seems to work, so I'll send a further rebased version.
It's just due to the renaming of TestLib::real_dir to perl2host.
This is rebased version v17. regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v18-0001-TAP-test-for-copy-truncation-optimization.patch
From c6181fce2a5418a6f9c7ab63d7db924fa13eb6f5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
1 file changed, 291 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b26cd8efd5
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging of TRUNCATE and COPY is optimized (skipped) in some cases, and
+# those optimizations can interact badly with other operations depending on
+# the wal_level setting, particularly "minimal" or "replica". The
+# optimization may be enabled or disabled in the scenarios dealt with here,
+# and should never result in any kind of failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v18-0002-Fix-WAL-skipping-feature.patch
From 29b68dfd1560b574201bb44ec7692828650de9b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Mon, 27 May 2019 16:06:30 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification on such relations is WAL-logged at
all; instead, the relations are synced to disk at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +----------
src/backend/access/heap/rewriteheap.c | 13 ++-----
src/backend/catalog/storage.c | 64 +++++++++++++++++++++++++-------
src/backend/commands/cluster.c | 24 ++++++++++++
src/backend/commands/copy.c | 39 ++++---------------
src/backend/commands/createas.c | 5 +--
src/backend/commands/matview.c | 4 --
src/backend/commands/tablecmds.c | 10 ++---
src/backend/storage/buffer/bufmgr.c | 33 +++++++++++-----
src/backend/utils/cache/relcache.c | 16 ++++++--
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 40 ++------------------
src/include/storage/bufmgr.h | 1 +
src/include/utils/rel.h | 17 ++++++++-
16 files changed, 148 insertions(+), 147 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94309949fa..2f1d68762b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1941,7 +1941,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2124,7 +2124,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe98a..b9554f6064 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -556,18 +556,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -699,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +700,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
@@ -729,7 +710,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2517,7 +2498,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 72a448ad31..992d4b9880 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* min_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..e4bcdc390f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -57,7 +57,8 @@ typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
+ bool dosync; /* T=work is sync; F=work is delete */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,10 +115,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
+ pending->dosync = false;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * We are going to skip WAL-logging for storage of persistent relations
+ * created in the current transaction when wal_level = minimal. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->dosync = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+
return srel;
}
@@ -155,6 +175,7 @@ RelationDropStorage(Relation rel)
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
+ pending->dosync = false;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -428,21 +449,34 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ if (pending->dosync)
{
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
+ /* Perform pending sync of WAL-skipped relation */
+ FlushRelationBuffersWithoutRelcache(pending->relnode,
+ false);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrimmedsync(srel, MAIN_FORKNUM);
+ smgrclose(srel);
}
- else if (maxrels <= nrels)
+ else
{
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ /* Collect pending deletions */
+ srel = smgropen(pending->relnode, pending->backend);
- srels[nrels++] = srel;
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
}
/* must explicitly free the list entry */
pfree(pending);
@@ -489,8 +523,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ /* Pending syncs are excluded */
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && !pending->dosync)
nrels++;
}
if (nrels == 0)
@@ -502,8 +537,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
*ptr = rptr;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ /* Pending syncs are excluded */
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && !pending->dosync)
{
*rptr = pending->relnode;
rptr++;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index cedb4ee844..29f7bf6dbd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode is created in the current transaction
+ * and used as the old relation's new relfilenode, so set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4f04d122c3..f02efd59fc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2535,9 +2535,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2726,28 +2723,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2763,15 +2741,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is creation check, firstRelfilenodeSubid is truncation and
+ * cluster check. A partitioned table doesn't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fb2be10794..bdb7b53d2a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4762,9 +4762,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and fsync the relation to disk at the end of the current transaction
+ * instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4772,8 +4772,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5058,8 +5056,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..41ff6da9d9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3191,20 +3192,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3221,7 +3234,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3251,18 +3264,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 7aa5d7c7fa..b6b61d0b1b 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2660,7 +2660,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2800,7 +2800,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3057,6 +3057,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3148,7 +3149,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3157,6 +3158,15 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about the WAL-logging skip feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d35b4a5061..5cbb5a7b27 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -76,10 +76,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the highest subtransaction in
+ * which a relfilenode change first took place in the current
+ * transaction. It is not forgotten the way newRelfilenodeSubid is. A
+ * valid value means that the currently active relfilenode is
+ * transaction-local and needs no WAL-logging.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -512,9 +519,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
*/
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.16.3
v18-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
From 753f498e101552b3ea2f7690373b367040b53005 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations
The function no longer does only deletions but also syncs. Rename the
function to reflect that. smgrGetPendingDeletes is not renamed since it
does not change behavior.
---
src/backend/access/transam/xact.c | 4 +--
src/backend/catalog/storage.c | 57 ++++++++++++++++++++-------------------
src/include/catalog/storage.h | 2 +-
3 files changed, 32 insertions(+), 31 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d7930c077d..cc0c43b2dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
* Other backends will observe the attendant catalog changes and not
* attempt to access affected files.
*/
- smgrDoPendingDeletes(true);
+ smgrDoPendingOperations(true);
AtCommit_Notify();
AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
false, true);
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
AtEOXact_GUC(false, 1);
AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e4bcdc390f..6ebe75aa37 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -53,17 +53,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOps
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=work at commit; F=work at abort */
bool dosync; /* T=work is sync; F=work is delete */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOps *next; /* linked-list link */
+} PendingRelOps;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOps *pendingDeletes = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -79,7 +79,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOps *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -110,8 +110,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -127,8 +127,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
*/
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = true;
@@ -167,11 +167,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOps *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOps *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -185,9 +185,9 @@ RelationDropStorage(Relation rel)
* present in the pending-delete list twice, once with atCommit true and
* once with atCommit false. Hence, it will be physically deleted at end
* of xact in either case (and the other entry will be ignored by
- * smgrDoPendingDeletes, so no error will occur). We could instead remove
- * the existing list entry and delete the physical file immediately, but
- * for now I'll keep the logic simple.
+ * smgrDoPendingOperations, so no error will occur). We could instead
+ * remove the existing list entry and delete the physical file
+ * immediately, but for now I'll keep the logic simple.
*/
RelationCloseSmgr(rel);
@@ -213,9 +213,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *prev;
+ PendingRelOps *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -406,7 +406,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ * smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ * end of xact.
*
* This also runs when aborting a subxact; we want to clean up a failed
* subxact immediately.
@@ -417,12 +418,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* already recovered the physical storage.
*/
void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *prev;
+ PendingRelOps *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -518,7 +519,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOps *pending;
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -558,8 +559,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOps *pending;
+ PendingRelOps *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -580,7 +581,7 @@ void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOps *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
@@ -599,7 +600,7 @@ AtSubCommit_smgr(void)
void
AtSubAbort_smgr(void)
{
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
}
void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..43836cf11c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,7 +30,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
--
2.16.3
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
No substantial changes have been made by this rebasing.
Thanks. I'll likely review this on 2019-08-20. If someone opts to review it
earlier, I welcome that.
On Sat, Jul 27, 2019 at 6:26 PM Noah Misch <noah@leadboat.com> wrote:
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
No substantial changes have been made by this rebasing.
Thanks. I'll likely review this on 2019-08-20. If someone opts to review it
earlier, I welcome that.
Cool. That'll be in time to be marked committed in the September CF,
this patch's 16th.
--
Thomas Munro
https://enterprisedb.com
Hello.
At Fri, 2 Aug 2019 11:35:06 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in <CA+hUKGJKcMFocY71nV3XM-8U=+0T278h0DQ8CPOcO_uzERZ8Og@mail.gmail.com>
On Sat, Jul 27, 2019 at 6:26 PM Noah Misch <noah@leadboat.com> wrote:
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
No substantial changes have been made by this rebasing.
Thanks. I'll likely review this on 2019-08-20. If someone opts to review it
earlier, I welcome that.
Cool. That'll be in time to be marked committed in the September CF,
this patch's 16th.
Yeah, thanks to Noah this patch has been reborn far simpler and more
generic (or robust).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
For two-phase commit, PrepareTransaction() needs to execute pending syncs.
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 /* Remember if it's a system catalog */
 is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
- */
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
Since you're deleting the use_wal variable, update that last comment.
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
 {
 SMgrRelation srel;
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ if (pending->dosync)
 {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
+ /* Perform pending sync of WAL-skipped relation */
+ FlushRelationBuffersWithoutRelcache(pending->relnode,
+ false);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrimmedsync(srel, MAIN_FORKNUM);
This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
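For illustration, a minimal sketch of what syncing every fork could look like, reusing the existing smgr API (smgropen/smgrexists/smgrimmedsync and MAX_FORKNUM); the surrounding pending-sync loop is assumed to be the one in the patch:

    ForkNumber  fork;

    /* sketch: flush and sync every existing fork of the WAL-skipped rel */
    FlushRelationBuffersWithoutRelcache(pending->relnode, false);
    srel = smgropen(pending->relnode, pending->backend);
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        if (smgrexists(srel, fork))
            smgrimmedsync(srel, fork);
    }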
The /messages/by-id/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
 * If it does commit, we'll have done the table_finish_bulk_insert() at
 * the bottom of this routine first.
 *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -74,11 +74,13 @@ typedef struct RelationData
 SubTransactionId rd_createSubid; /* rel was created in current xact */
 SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
 * current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed. Many bits of code need to look at all three,
e.g. RelationClose(). This field needs to be 100% reliable. In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
nm
Attachments:
wal-optimize-noah-tests-v2.patch
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 6ebe75a..d74e9a5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -405,6 +405,21 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
smgrimmedsync(dst, forkNum);
}
+bool
+hasPendingSync(Relation rel)
+{
+ PendingRelOps *pending;
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rel->rd_node) &&
+ pending->backend == rel->rd_backend &&
+ pending->dosync)
+ return true;
+ }
+ return false;
+}
+
/*
* smgrDoPendingOperations() -- Take care of relation deletes and syncs at
* end of xact.
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index cd51df4..66fb8dc 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -513,6 +513,20 @@ typedef struct ViewOptions
(relation)->rd_smgr->smgr_targblock = (targblock); \
} while (0)
+static inline bool
+subids_need_wal(Relation relation)
+{
+ extern bool hasPendingSync(Relation rel);
+ bool relcache_verdict =
+ (relation->rd_createSubid == InvalidSubTransactionId &&
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId);
+ bool storage_verdict = !hasPendingSync(relation);
+
+ Assert(relcache_verdict == storage_verdict);
+
+ return relcache_verdict;
+}
+
/*
* RelationNeedsWAL
* True if relation needs WAL.
@@ -521,10 +535,8 @@ typedef struct ViewOptions
* truncated in the current transaction.
*/
#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
- (XLogIsNeeded() || \
- (relation->rd_createSubid == InvalidSubTransactionId && \
- relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || subids_need_wal(relation)))
/*
* RelationUsesLocalBuffers
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index b26cd8e..56b92b4 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -157,6 +157,16 @@ wal_level = $wal_level
is($result, qq(3),
"wal_level = $wal_level, SET TABLESPACE in subtransaction");
+ $node->safe_psql('postgres', "
+ CREATE TABLE test2a (id int);
+ BEGIN;
+ TRUNCATE test2a;
+ SAVEPOINT s;
+ TRUNCATE test2a;
+ ROLLBACK TO s;
+ INSERT INTO test2a DEFAULT VALUES;
+ COMMIT;");
+
# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
$node->safe_psql('postgres', "
BEGIN;
Thank you for taking time.
At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in <20190818035230.GB3021338@rfd.leadboat.com>
For two-phase commit, PrepareTransaction() needs to execute pending syncs.
Now TwoPhaseFileHeader has two new members for (commit-time)
pending syncs. Pending syncs are useless during WAL replay, but they
are needed for COMMIT PREPARED.
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
...
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
Since you're deleting the use_wal variable, update that last comment.
Oops. Rewrote it.
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
...
+ smgrimmedsync(srel, MAIN_FORKNUM);
This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
I agree that all forks need syncing, but FSM and VM are checking the
modified RelationNeedsWAL(). To make sure: are you suggesting that we
sync all forks instead of emitting WAL for them, or that VM and FSM
emit WAL even when the modified RelationNeedsWAL() returns false (and
we sync all forks)?
The /messages/by-id/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
I'm not sure of the point of that behavior. I suppose the "log" would
be a sequence of new-page records. It also needs to be synced, and it
is always larger than the file to be synced, so I can't think of an
appropriate threshold without understanding the point.
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
 * If it does commit, we'll have done the table_finish_bulk_insert() at
 * the bottom of this routine first.
 *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.
(Un?)Fortunately, that doesn't fail (with the version rebased on
recent master). I'll recheck that tomorrow.
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -74,11 +74,13 @@ typedef struct RelationData
 SubTransactionId rd_createSubid; /* rel was created in current xact */
 SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
 * current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed. Many bits of code need to look at all three,
e.g. RelationClose().
Agreed. I'll recheck that.
This field needs to be 100% reliable. In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
I don't get this. I think the variable moves as you suggested. It is
handled the same way as rd_new* in AtEOSubXact_cleanup, but the
difference is in assignment, not rollback. rd_first* won't change
after the first assignment, so rollback of the subid means the
relfilenode is also rolled back to the initial value at the beginning
of the top transaction.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in <20190818035230.GB3021338@rfd.leadboat.com>
For two-phase commit, PrepareTransaction() needs to execute pending syncs.
Now TwoPhaseFileHeader has two new members for (commit-time)
pending syncs. Pending-syncs are useless on wal-replay, but that
is needed for commit-prepared.
There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files. (A crash may
have already destroyed their data.) PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.
A failure in these new syncs needs to prevent the transaction from being
marked committed. Hence, in CommitTransaction(), these new syncs need to
happen after the last step that could assign a new relfilenode and
before RecordTransactionCommit(). I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().
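To make that ordering concrete, here is a rough sketch of where the new step could sit in CommitTransaction(); smgrDoPendingSyncs() is a hypothetical name, the neighboring calls are the existing ones:

    PreCommit_on_commit_actions();

    /*
     * Hypothetical: fsync (or WAL-log) relations whose changes skipped WAL.
     * An error here aborts the transaction before it is marked committed.
     */
    smgrDoPendingSyncs();

    AtEOXact_LargeObject(true);
    ...
    RecordTransactionCommit();  /* only after the skipped data is durable */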
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
...
+ smgrimmedsync(srel, MAIN_FORKNUM);
This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
I agree that all forks need syncing, but FSM and VM are checking the
modified RelationNeedsWAL(). To make sure: are you suggesting that we
sync all forks instead of emitting WAL for them, or that VM and FSM
emit WAL even when the modified RelationNeedsWAL() returns false (and
we sync all forks)?
I hadn't thought that far. What do you think is best?
The /messages/by-id/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
I'm not sure of the point of that behavior. I suppose the "log" would
be a sequence of new-page records. It also needs to be synced, and it
is always larger than the file to be synced, so I can't think of an
appropriate threshold without understanding the point.
Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
every buffer header containing a buffer of the current database. The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.
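As a sketch of that idea (log_newpage_buf(), ReadBufferExtended(), FlushRelationBuffers() and smgrimmedsync() are existing functions; the threshold variable and its placement at commit time are assumptions), at commit one might do:

    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber blk;

    if (nblocks <= wal_skip_threshold)  /* hypothetical threshold */
    {
        /* small file: write its contents to WAL as new-page records */
        for (blk = 0; blk < nblocks; blk++)
        {
            Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, blk,
                                             RBM_NORMAL, NULL);

            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            log_newpage_buf(buf, false);
            UnlockReleaseBuffer(buf);
        }
    }
    else
    {
        /* large file: flush dirty buffers and fsync, as in the patch */
        FlushRelationBuffers(rel);
        RelationOpenSmgr(rel);
        smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
    }

Either way, the data is durable (in WAL or in the file) before the commit record is written.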
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
 * If it does commit, we'll have done the table_finish_bulk_insert() at
 * the bottom of this routine first.
 *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.
(Un?)Fortunately, that doesn't fail (with the version rebased on
recent master). I'll recheck that tomorrow.
Did you build with --enable-cassert?
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -74,11 +74,13 @@ typedef struct RelationData
 SubTransactionId rd_createSubid; /* rel was created in current xact */
 SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
 * current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
This field needs to be 100% reliable. In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
I don't get this. I think the variable moves as you suggested. It is
handled the same way as rd_new* in AtEOSubXact_cleanup, but the
difference is in assignment, not rollback. rd_first* won't change
after the first assignment, so rollback of the subid means the
relfilenode is also rolled back to the initial value at the beginning
of the top transaction.
$ git grep -n 'rd_firstRelfilenodeSubid = '
src/backend/commands/cluster.c:1061: rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
src/backend/utils/cache/relcache.c:3067: relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
src/backend/utils/cache/relcache.c:3173: relation->rd_firstRelfilenodeSubid = parentSubid;
src/backend/utils/cache/relcache.c:3175: relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
swap_relation_files() is the only place initializing this field. Many paths
that assign a new relfilenode will never call swap_relation_files().
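For instance, a sketch of what each such path would have to do when it assigns a new relfilenode (placement, e.g. in RelationSetNewRelfilenode(), is an assumption):

    /* remember the most recent relfilenode change in this transaction */
    relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();

    /* ... and the first one, which must never be forgotten or overwritten */
    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
        relation->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();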
Hello.
At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190819.185959.118543656.horikyota.ntt@gmail.com>
The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.
(Un?)Fortunately, that doesn't fail (with the version rebased on
recent master). I'll recheck that tomorrow.
I saw the assertion failure. It's part of the intended behavior. In
this patch, the relcache doesn't hold the whole history of
relfilenodes, so we cannot remove useless pending syncs perfectly. On
the other hand, they are harmless except that they cause extra syncs
of files that are removed immediately. So I chose not to remove
pending syncs once they are registered.
If we want consistency here, we need to record the creator subxid in
the PendingRelOps (PendingRelDelete) struct and do rather large work
at subtransaction end, as sketched below.
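A rough sketch of that bookkeeping; the extra field and its use at subtransaction abort are only an assumption, not part of the posted patch:

    typedef struct PendingRelOps
    {
        RelFileNode relnode;        /* relation that may need delete/sync */
        BackendId   backend;        /* InvalidBackendId if not a temp rel */
        bool        atCommit;       /* T=work at commit; F=work at abort */
        bool        dosync;         /* T=work is sync; F=work is delete */
        int         nestLevel;      /* xact nesting level of request */
        SubTransactionId createSubid;   /* hypothetical: subxact that queued
                                         * this entry, letting an aborted
                                         * subxact discard its pending syncs */
        struct PendingRelOps *next; /* linked-list link */
    } PendingRelOps;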
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -74,11 +74,13 @@ typedef struct RelationData
 SubTransactionId rd_createSubid; /* rel was created in current xact */
 SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
 * current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed. Many bits of code need to look at all three,
e.g. RelationClose().
Agreed. I'll recheck that.
This field needs to be 100% reliable. In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
I don't get this. I think the variable moves as you suggested. It is
handled the same way as rd_new* in AtEOSubXact_cleanup, but the
difference is in assignment, not rollback. rd_first* won't change
after the first assignment, so rollback of the subid means the
relfilenode is also rolled back to the initial value at the beginning
of the top transaction.
So I'll add this in the next version to see how it looks.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. New version is attached.
At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190819.185959.118543656.horikyota.ntt@gmail.com>
Thank you for taking time.
At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in <20190818035230.GB3021338@rfd.leadboat.com>
For two-phase commit, PrepareTransaction() needs to execute pending syncs.
Now TwoPhaseFileHeader has two new members for pending syncs. They
are useless during WAL replay, but they are needed for COMMIT PREPARED.
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
...
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
/* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
Since you're deleting the use_wal variable, update that last comment.
Oops! Rewrote it.
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
...
+ smgrimmedsync(srel, MAIN_FORKNUM);
This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
I agree that all forks need syncing, but FSM and VM are checking the
modified RelationNeedsWAL(). To make sure: are you suggesting that we
sync all forks instead of emitting WAL for them, or that VM and FSM
emit WAL even when the modified RelationNeedsWAL() returns false (and
we sync all forks)?
In the attached version 19, all forks are synced and no WAL is
emitted for them (as before). FSM and VM are not changed.
The /messages/by-id/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
I'm not sure of the point of that behavior. I suppose the "log" would
be a sequence of new-page records. It also needs to be synced, and it
is always larger than the file to be synced, so I can't think of an
appropriate threshold without understanding the point.
This is not included in this version. I'll continue to consider
this.
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
 * If it does commit, we'll have done the table_finish_bulk_insert() at
 * the bottom of this routine first.
 *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
The comment material being deleted is still correct, so don't delete it.
The code is changed to use rd_firstRelfilenodeSubid instead of
rd_newRelfilenodeSubid, which has the issue mentioned in the deleted
section. So the deleted material is correct but no longer relevant to
the code here. The same thing is written in the comment in
RelationData. (In short, not reverted.)
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.
..
In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed. Many bits of code need to look at all three,
e.g. RelationClose().
I had forgotten to maintain rd_firstRelfilenodeSubid in many places,
and the assertion failure no longer happens after I fixed that.
Contrary to my previous mail, useless pending entries are of course
removed at subtransaction abort, so no needless syncs happen in that
sense. But another type of useless sync was seen with the previous
version 18.
(In short, fixed.)
This field needs to be 100% reliable. In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
Sorry, I confused this with another, similar behavior of the previous
version 18, where files were synced even if they were to be removed
immediately at commit. In this version, smgrDoPendingOperations
doesn't sync to-be-deleted files.
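A minimal sketch of that check, assuming a small helper over the pending list (the helper name is hypothetical; PendingRelOp, PENDING_DELETE and pendingDeletes follow the v19 patch):

    /* is this relfilenode also queued for deletion at commit? */
    static bool
    pendingSyncIsUseless(const PendingRelOp *sync)
    {
        PendingRelOp *p;

        for (p = pendingDeletes; p != NULL; p = p->next)
        {
            if (p->op == PENDING_DELETE && p->atCommit &&
                RelFileNodeEquals(p->relnode, sync->relnode))
                return true;    /* file is going away anyway; skip the sync */
        }
        return false;
    }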
While checking this, I found that smgrDoPendingDeletes was making an
unnecessary call to smgrclose(), which led the server to crash while
deleting files. I removed it.
Please find the new version attached.
Changes:
- Rebased to f8cf524da1.
- Fixed prepare transaction. test2a catches this.
(twophase.c)
- Fixed a comment in heapam_relation_copy_for_cluster.
- All forks are synced. (smgrDoPendingDeletes/Operations, SyncRelationFiles)
- Fixed handling of rd_firstRelfilenodeSubid.
(RelationBuildLocalRelation, RelationSetNewRelfilenode,
load_relcache_init_file)
- Prevent to-be-deleted files from being synced. (smgrDoPendingDeletes/Operations)
- Fixed a crash bug caused by smgrclose() in smgrDoPendingOperations.
Minor changes:
- Renamed: PendingRelOps => PendingRelOp
- Type changed: bool PendingRelOp.dosync => PendingOpType PendingRelOp.op
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v19-0001-TAP-test-for-copy-truncation-optimization.patch
From b4144d7e1f1fb22f4387e3af9d37a29b68c9795f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++++++++++
1 file changed, 312 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# In some cases WAL-logging of TRUNCATE and COPY is optimized away, and
+# these optimizations can interact badly with one another depending on the
+# wal_level setting, particularly with "minimal" or "replica". The
+# optimization may be enabled or disabled depending on the scenarios dealt
+# with here, and should never result in any type of failure or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+ # Same for prepared transaction
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2a (id serial PRIMARY KEY);
+ INSERT INTO test2a VALUES (DEFAULT);
+ TRUNCATE test2a;
+ INSERT INTO test2a VALUES (DEFAULT);
+ PREPARE TRANSACTION 't';
+ COMMIT PREPARED 't';");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v19-0002-Fix-WAL-skipping-feature.patch
From d62a337281024c1f9df09596e62724057b02cdfb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 21 Aug 2019 13:57:00 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all, and such
relations are instead synced at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +----
src/backend/access/heap/rewriteheap.c | 13 +--
src/backend/access/transam/twophase.c | 23 ++++-
src/backend/catalog/storage.c | 158 ++++++++++++++++++++++++++-----
src/backend/commands/cluster.c | 24 +++++
src/backend/commands/copy.c | 39 ++------
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 -
src/backend/commands/tablecmds.c | 10 +-
src/backend/storage/buffer/bufmgr.c | 33 +++++--
src/backend/storage/smgr/md.c | 30 ++++++
src/backend/utils/cache/relcache.c | 28 ++++--
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 40 +-------
src/include/catalog/storage.h | 8 ++
src/include/storage/bufmgr.h | 1 +
src/include/storage/md.h | 1 +
src/include/utils/rel.h | 17 +++-
20 files changed, 300 insertions(+), 163 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb811d345a..ef18b61c55 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e8cb..27f414a361 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * smgr_targblock must be initially invalid if we are to skip WAL logging
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index a17508a82f..9e0d7295af 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 477709bbc2..e3512fc415 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -921,6 +921,7 @@ typedef struct TwoPhaseFileHeader
Oid owner; /* user running the transaction */
int32 nsubxacts; /* number of following subxact XIDs */
int32 ncommitrels; /* number of delete-on-commit rels */
+ int32 npendsyncrels; /* number of sync-on-commit rels */
int32 nabortrels; /* number of delete-on-abort rels */
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
@@ -1009,6 +1010,7 @@ StartPrepare(GlobalTransaction gxact)
TwoPhaseFileHeader hdr;
TransactionId *children;
RelFileNode *commitrels;
+ RelFileNode *pendsyncrels;
RelFileNode *abortrels;
SharedInvalidationMessage *invalmsgs;
@@ -1034,6 +1036,7 @@ StartPrepare(GlobalTransaction gxact)
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
+ hdr.npendsyncrels = smgrGetPendingSyncs(true, &pendsyncrels);
hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
&hdr.initfileinval);
@@ -1057,6 +1060,11 @@ StartPrepare(GlobalTransaction gxact)
save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileNode));
pfree(commitrels);
}
+ if (hdr.npendsyncrels > 0)
+ {
+ save_state_data(pendsyncrels, hdr.npendsyncrels * sizeof(RelFileNode));
+ pfree(pendsyncrels);
+ }
if (hdr.nabortrels > 0)
{
save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode));
@@ -1464,6 +1472,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
TransactionId latestXid;
TransactionId *children;
RelFileNode *commitrels;
+ RelFileNode *pendsyncrels;
RelFileNode *abortrels;
RelFileNode *delrels;
int ndelrels;
@@ -1499,6 +1508,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
children = (TransactionId *) bufptr;
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->npendsyncrels * sizeof(RelFileNode));
+ pendsyncrels = (RelFileNode *) bufptr;
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
abortrels = (RelFileNode *) bufptr;
bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
@@ -1544,9 +1555,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gxact->valid = false;
/*
- * We have to remove any files that were supposed to be dropped. For
- * consistency with the regular xact.c code paths, must do this before
- * releasing locks, so do it before running the callbacks.
+ * We have to sync or remove any files as scheduled. For consistency
+ * with the regular xact.c code paths, must do this before releasing
+ * locks, so do it before running the callbacks.
*
* NB: this code knows that we couldn't be dropping any temp rels ...
*/
@@ -1554,11 +1565,17 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
{
delrels = commitrels;
ndelrels = hdr->ncommitrels;
+
+ /* Make sure files supposed to be synced are synced */
+ SyncRelationFiles(pendsyncrels, hdr->npendsyncrels);
}
else
{
delrels = abortrels;
ndelrels = hdr->nabortrels;
+
+ /* We don't have an at-abort pending sync */
+ Assert(hdr->npendsyncrels == 0);
}
/* Make sure files supposed to be dropped are dropped */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..354a74c27c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,6 +30,7 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -53,11 +54,13 @@
* but I'm being paranoid.
*/
+/* entry type of pendingDeletes */
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
+ PendingOpType op; /* type of operation to do */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,10 +117,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
+ pending->op = PENDING_DELETE;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * We are going to skip WAL-logging for storage of persistent relations
+ * created in the current transaction when wal_level = minimal. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->op = PENDING_SYNC;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+
return srel;
}
@@ -155,6 +177,7 @@ RelationDropStorage(Relation rel)
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
+ pending->op = PENDING_DELETE;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -201,7 +224,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
next = pending->next;
if (RelFileNodeEquals(rnode, pending->relnode)
- && pending->atCommit == atCommit)
+ && pending->atCommit == atCommit
+ && pending->op == PENDING_DELETE)
{
/* unlink and delete list entry */
if (prev)
@@ -406,6 +430,7 @@ smgrDoPendingDeletes(bool isCommit)
i = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ struct HTAB *synchash = NULL;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -428,21 +453,50 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ if (pending->op == PENDING_SYNC)
{
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ /* We don't have abort-time pending syncs */
+ Assert(isCommit);
- srels[nrels++] = srel;
+ /* Create hash if not yet */
+ if (synchash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(SMgrRelation*);
+ hash_ctl.entrysize = sizeof(SMgrRelation*);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ synchash =
+ hash_create("pending sync hash", 8,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ /* Collect pending syncs */
+ srel = smgropen(pending->relnode, pending->backend);
+ (void) hash_search(synchash, (void *) &srel,
+ HASH_ENTER, NULL);
+ }
+ else
+ {
+ /* Collect pending deletions */
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
}
/* must explicitly free the list entry */
pfree(pending);
@@ -450,6 +504,43 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+ /* Sync only files that are not to be removed. */
+ if (synchash)
+ {
+ HASH_SEQ_STATUS hstat;
+ SMgrRelation *psrel;
+
+ /* remove to-be-removed files from synchash */
+ if (nrels > 0)
+ {
+ int i;
+ bool found;
+
+ for (i = 0 ; i < nrels ; i++)
+ (void) hash_search(synchash, (void *) &(srels[i]),
+ HASH_REMOVE, &found);
+ }
+
+ /* sync surviving files */
+ hash_seq_init(&hstat, synchash);
+ while ((psrel = (SMgrRelation *) hash_seq_search(&hstat)) != NULL)
+ {
+ ForkNumber fork;
+
+ /* Perform pending sync of WAL-skipped relation */
+ FlushRelationBuffersWithoutRelcache((*psrel)->smgr_rnode.node,
+ false);
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(*psrel, fork))
+ smgrimmedsync(*psrel, fork);
+ }
+ }
+
+ hash_destroy(synchash);
+ synchash = NULL;
+ }
+
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
@@ -462,11 +553,12 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ * deleted or synced.
*
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * The return value is the number of relations scheduled for the operation
+ * specified by op. *ptr is set to point to a freshly-palloc'd array of
+ * RelFileNodes. If there are no matching relations, *ptr is set to NULL.
*
* Only non-temporary relations are included in the returned list. This is OK
* because the list is used only in contexts where temporary relations don't
@@ -475,11 +567,11 @@ smgrDoPendingDeletes(bool isCommit)
* (and all temporary files will be zapped if we restart anyway, so no need
* for redo to do it also).
*
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
*/
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingOpType op, bool forCommit, RelFileNode **ptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
@@ -490,7 +582,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == op)
nrels++;
}
if (nrels == 0)
@@ -503,7 +596,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == op)
{
*rptr = pending->relnode;
rptr++;
@@ -512,6 +606,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(PENDING_DELETE, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(PENDING_SYNC, forCommit, ptr);
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 28985a07ec..f665ee8358 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode is created in the current transaction
+ * and becomes the old relation's new relfilenode, so set rel1's
+ * newRelfilenodeSubid to the new relation's createSubid. We don't fix
+ * rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..3ce04f7efc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is creation check, firstRelfilenodeSubid is truncation and
+ * cluster check. Partitioned tables don't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cceefbdd49..2468b178cb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4762,9 +4762,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and smgr will sync the relation to disk at the end of the current
+ * transaction instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4772,8 +4772,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5058,8 +5056,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..41ff6da9d9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3191,20 +3192,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3221,7 +3234,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3251,18 +3264,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
}
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+ int i;
+
+ for (i = 0; i < nsyncrels; i++)
+ {
+ SMgrRelation srel;
+ ForkNumber fork;
+
+ /* sync all existing forks of the relation */
+ FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+ srel = smgropen(syncrels[i], InvalidBackendId);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+
+ smgrclose(srel);
+ }
+}
+
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 248860758c..147babb6b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+ * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+ * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
* rewrite-rule, partition key, and partition descriptor substructures
* in place, because various places assume that these structures won't
* move while they are working with an open relcache entry. (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
* operations on the rel in the same transaction.
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
/* Flag relation as needing eoxact cleanup (to remove the hint) */
EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_tuple_insert() about the WAL-skipping feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..1de6f1655c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,13 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+ PENDING_DELETE,
+ PENDING_SYNC
+} PendingOpType;
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -32,6 +39,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c5d36680a2..f372dc2086 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction in which
+ * a relfilenode change took place in the current transaction. Unlike
+ * rd_newRelfilenodeSubid, this is never accidentally forgotten. A valid
+ * value means that the currently active relfilenode is transaction-local
+ * and we sync the relation at commit instead of WAL-logging it.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -514,9 +521,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
*/
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.16.3
v19-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
From 6f6b87ef06e26ad8222f5900f8e3b146d2f18cba Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations
The function no longer performs only deletions; it also performs syncs. Rename
the function to reflect that. smgrGetPendingDeletes is not renamed since its
behavior does not change.
---
src/backend/access/transam/xact.c | 4 +-
src/backend/catalog/storage.c | 91 ++++++++++++++++++++-------------------
src/include/catalog/storage.h | 2 +-
3 files changed, 49 insertions(+), 48 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f594d33e7a..0123fb0f7f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
* Other backends will observe the attendant catalog changes and not
* attempt to access affected files.
*/
- smgrDoPendingDeletes(true);
+ smgrDoPendingOperations(true);
AtCommit_Notify();
AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_AFTER_LOCKS,
false, true);
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
AtEOXact_GUC(false, 1);
AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 354a74c27c..544ef3aa55 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -54,18 +54,18 @@
* but I'm being paranoid.
*/
-/* entry type of pendingDeletes */
-typedef struct PendingRelDelete
+/* entry type of pendingOperations */
+typedef struct PendingRelOp
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=work at commit; F=work at abort */
PendingOpType op; /* type of operation to do */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOp *next; /* linked-list link */
+} PendingRelOp;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingOperations = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -81,7 +81,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -112,15 +112,15 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->op = PENDING_DELETE;
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pending->next = pendingOperations;
+ pendingOperations = pending;
/*
* We are going to skip WAL-logging for storage of persistent relations
@@ -129,15 +129,15 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
*/
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = true;
pending->op = PENDING_SYNC;
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pending->next = pendingOperations;
+ pendingOperations = pending;
}
return srel;
@@ -169,27 +169,27 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->op = PENDING_DELETE;
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pending->next = pendingOperations;
+ pendingOperations = pending;
/*
* NOTE: if the relation was created in this transaction, it will now be
* present in the pending-delete list twice, once with atCommit true and
* once with atCommit false. Hence, it will be physically deleted at end
* of xact in either case (and the other entry will be ignored by
- * smgrDoPendingDeletes, so no error will occur). We could instead remove
- * the existing list entry and delete the physical file immediately, but
- * for now I'll keep the logic simple.
+ * smgrDoPendingOperations, so no error will occur). We could instead
+ * remove the existing list entry and delete the physical file
+ * immediately, but for now I'll keep the logic simple.
*/
RelationCloseSmgr(rel);
@@ -215,12 +215,12 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
prev = NULL;
- for (pending = pendingDeletes; pending != NULL; pending = next)
+ for (pending = pendingOperations; pending != NULL; pending = next)
{
next = pending->next;
if (RelFileNodeEquals(rnode, pending->relnode)
@@ -231,7 +231,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
if (prev)
prev->next = next;
else
- pendingDeletes = next;
+ pendingOperations = next;
pfree(pending);
/* prev does not change */
}
@@ -409,7 +409,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ * smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ * end of xact.
*
* This also runs when aborting a subxact; we want to clean up a failed
* subxact immediately.
@@ -420,12 +421,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* already recovered the physical storage.
*/
void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -433,7 +434,7 @@ smgrDoPendingDeletes(bool isCommit)
struct HTAB *synchash = NULL;
prev = NULL;
- for (pending = pendingDeletes; pending != NULL; pending = next)
+ for (pending = pendingOperations; pending != NULL; pending = next)
{
next = pending->next;
if (pending->nestLevel < nestLevel)
@@ -447,7 +448,7 @@ smgrDoPendingDeletes(bool isCommit)
if (prev)
prev->next = next;
else
- pendingDeletes = next;
+ pendingOperations = next;
/* do deletion if called for */
if (pending->atCommit == isCommit)
{
@@ -576,10 +577,10 @@ smgrGetPendingOperations(PendingOpType op, bool forCommit, RelFileNode **ptr)
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOp *pending;
nrels = 0;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = pendingOperations; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId
@@ -593,7 +594,7 @@ smgrGetPendingOperations(PendingOpType op, bool forCommit, RelFileNode **ptr)
}
rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
*ptr = rptr;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = pendingOperations; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId
@@ -630,13 +631,13 @@ smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *next;
- for (pending = pendingDeletes; pending != NULL; pending = next)
+ for (pending = pendingOperations; pending != NULL; pending = next)
{
next = pending->next;
- pendingDeletes = next;
+ pendingOperations = next;
/* must explicitly free the list entry */
pfree(pending);
}
@@ -646,15 +647,15 @@ PostPrepare_smgr(void)
/*
* AtSubCommit_smgr() --- Take care of subtransaction commit.
*
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
*/
void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOp *pending;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = pendingOperations; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
@@ -671,7 +672,7 @@ AtSubCommit_smgr(void)
void
AtSubAbort_smgr(void)
{
- smgrDoPendingDeletes(false);
+ smgrDoPendingOperations(false);
}
void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 1de6f1655c..dcb3bc4b69 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -37,7 +37,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
--
2.16.3
On Wed, Aug 21, 2019 at 04:32:38PM +0900, Kyotaro Horiguchi wrote:
At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190819.185959.118543656.horikyota.ntt@gmail.com>
At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in <20190818035230.GB3021338@rfd.leadboat.com>
For two-phase commit, PrepareTransaction() needs to execute pending syncs.
Now TwoPhaseFileHeader has two new members for pending syncs. It
is useless on wal-replay, but that is needed for commit-prepared.
Syncs need to happen in PrepareTransaction(), not in commit-prepared. I wrote
about that in /messages/by-id/20190820060314.GA3086296@rfd.leadboat.com
Hello.
At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in <20190820060314.GA3086296@rfd.leadboat.com>
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in <20190818035230.GB3021338@rfd.leadboat.com>
For two-phase commit, PrepareTransaction() needs to execute pending syncs.
Now TwoPhaseFileHeader has two new members for (commit-time)
pending syncs. Pending syncs are useless during WAL replay, but
they are needed for commit-prepared.
There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files. (A crash may
have already destroyed their data.) PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.
A failure in these new syncs needs to prevent the transaction from being
marked committed. Hence, in CommitTransaction(), these new syncs need to
Agreed.
happen after the last step that could assign a new relfilenode and
before RecordTransactionCommit(). I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().
I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.
This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
I agree that all forks need syncing, but FSM and VM check the
modified RelationNeedsWAL(). To make sure: are you suggesting to
sync all forks instead of emitting WAL for them, or suggesting
that VM and FSM emit WAL even when the modified
RelationNeedsWAL() returns false (+ sync all forks)?
I hadn't thought that far. What do you think is best?
As in the latest patch, sync ALL forks and emit no WAL. We could
skip syncing the FSM but I'm not sure it's worth doing.
The /messages/by-id/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
I'm not sure I see the point of that behavior. I suppose the "log"
is a sequence of new-page records. It also needs to be synced and
it is always larger than the file to be synced. I can't think of
an appropriate threshold without knowing the point.
Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
every buffer header containing a buffer of the current database. The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.
I'm at a loss. The decision between WAL and sync is made at
commit time, when we no longer hold a pin on any buffer. When
emitting WAL, contrary to that assumption, a lock needs to be
re-acquired on every page to emit log_newpage. What is worse,
we may need to reload evicted buffers. If the file has been
CopyFrom'ed, the ring buffer strategy makes the situation even
worse. That doesn't seem cheap at all.
If WAL has any chance of winning here, it would be for files
smaller than the ring size of the bulk-write strategy (16MB).
If we look up every buffer page of the file individually instead
of scanning through all buffers, that makes things worse through
conflicts on the buffer-mapping partition locks.
Any thoughts?
# Sorry time's up today.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in <20190820060314.GA3086296@rfd.leadboat.com>
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in <20190818035230.GB3021338@rfd.leadboat.com>
The /messages/by-id/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
I'm not sure I see the point of that behavior. I suppose the "log"
is a sequence of new-page records. It also needs to be synced and
it is always larger than the file to be synced. I can't think of
an appropriate threshold without knowing the point.
Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
every buffer header containing a buffer of the current database. The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.
I'm at a loss. The decision between WAL and sync is made at
commit time, when we no longer hold a pin on any buffer. When
emitting WAL, contrary to that assumption, a lock needs to be
re-acquired on every page to emit log_newpage. What is worse,
we may need to reload evicted buffers. If the file has been
CopyFrom'ed, the ring buffer strategy makes the situation even
worse. That doesn't seem cheap at all.
Consider a one-page relfilenode. Doing all the things you list for a single
page may be cheaper than locking millions of buffer headers.
If WAL has any chance of winning here, it would be for files
smaller than the ring size of the bulk-write strategy (16MB).
Like you, I expect the optimal threshold is less than 16MB, though you should
benchmark to see. Under the ideal threshold, when a transaction creates a new
relfilenode just smaller than the threshold, that transaction will be somewhat
slower than it would be if the threshold were zero. Locking every buffer
header causes a distributed slow-down for other queries, and protecting the
latency of non-DDL queries is typically more useful than accelerating
TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
than the buffer scan harms them. That's about where the threshold should be.
This should be GUC-controlled, especially since this is back-patch material.
We won't necessarily pick the best value on the first attempt, and the best
value could depend on factors like the filesystem, the storage hardware, and
the database's latency goals. One could define the GUC as an absolute size
(e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
threshold is 1MB when shared_buffers is 1GB). I'm not sure which is better.
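For illustration only (the variable names below are placeholders, not from any
posted patch), the two interpretations differ only in how the byte threshold
is derived; the comparison site would only ever see a byte count either way:

    /*
     * Illustrative sketch; wal_skip_threshold_kb and wal_skip_ratio are
     * hypothetical GUC variables, not part of any posted patch.
     */
    static uint64
    wal_skip_threshold_bytes(void)
    {
    #ifdef ABSOLUTE_THRESHOLD
        /* absolute interpretation: the GUC is a size in kilobytes, e.g. 1024 */
        return (uint64) wal_skip_threshold_kb * 1024;
    #else
        /* ratio interpretation: the GUC is a fraction of shared_buffers, e.g. 0.001 */
        return (uint64) (wal_skip_ratio * NBuffers * (double) BLCKSZ);
    #endif
    }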
Hello.
At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <noah@leadboat.com> wrote in <20190826050843.GB3153606@rfd.leadboat.com>
noah> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
noah> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in <20190820060314.GA3086296@rfd.leadboat.com>
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
I'm not sure I see the point of that behavior. I suppose the "log"
is a sequence of new-page records. It also needs to be synced and
it is always larger than the file to be synced. I can't think of
an appropriate threshold without knowing the point.
Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
every buffer header containing a buffer of the current database. The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.
I'm at a loss. The decision between WAL and sync is made at
commit time, when we no longer hold a pin on any buffer. When
emitting WAL, contrary to that assumption, a lock needs to be
re-acquired on every page to emit log_newpage. What is worse,
we may need to reload evicted buffers. If the file has been
CopyFrom'ed, the ring buffer strategy makes the situation even
worse. That doesn't seem cheap at all.
Consider a one-page relfilenode. Doing all the things you list for a single
page may be cheaper than locking millions of buffer headers.
If I understand you correctly, I would say that *all* buffers
that don't belong to in-transaction-created files are skipped
before taking locks. No lock conflict happens with other
backends.
FlushRelationBuffers uses double-checked locking as follows:

FlushRelationBuffers_common():
  ..
  if (!islocal)
  {
    for (i over all buffers)
    {
      bufHdr = GetBufferDescriptor(i);
      if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))   /* unlocked pre-check */
      {
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && valid && dirty)
        {
          PinBuffer_Locked(bufHdr);
          LWLockAcquire(...);
          FlushBuffer(...);
128GB shared buffers contain 16M buffers. On my
perhaps-Windows-Vista-era box, such a loop takes 15ms. (Since the box
has only 6GB of RAM, the test ignores the caching effect that would
come from a larger shared_buffers setting.) (attached 1)
With WAL emission we find every buffer of the file through the buffer
hash, so we suffer partition locks instead of the 15ms of local
latency. That seems worse.
If WAL has any chance of winning here, it would be for files
smaller than the ring size of the bulk-write strategy (16MB).
Like you, I expect the optimal threshold is less than 16MB, though you should
benchmark to see. Under the ideal threshold, when a transaction creates a new
relfilenode just smaller than the threshold, that transaction will be somewhat
slower than it would be if the threshold were zero. Locking every buffer
I looked more closely at this.
For a 16MB file, the write-fsync cost is almost the same as the
WAL-emission cost. It was about 200 ms on the Vista-era machine
with slow rotating magnetic disks and xfs. (attached 2, 3)
Although write-fsyncing a relation file causes no lock conflict
with other backends, WAL emission delays other backends' commits
by up to that many milliseconds.
In summary, the characteristics of the two methods on a 16MB file
are as follows.
File write:
- 15ms of buffer scan without locks (@128GB shared buffer)
+ no hash search for a buffer
= take locks on all buffers only of the file one by one (to write)
+ plus 200ms of write-fdatasync (of the whole relation file),
which doesn't conflict with other backends. (except via CPU
time slots and IO bandwidth.)
WAL write :
+ no buffer scan
- 2048 times (16M/8k) of partition lock on finding every buffer
for the target file, which can conflict with other backends.
= take locks on all buffers only of the file one by one (to take FPW)
- plus 200ms of open(create)-write-fdatasync (of a WAL file (of
default size)), which can delay commits on other backends at
most by that duration.
header causes a distributed slow-down for other queries, and protecting the
latency of non-DDL queries is typically more useful than accelerating
TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
than the buffer scan harms them. That's about where the threshold should be.
If the discussion above is correct, we shouldn't use WAL-write
even for files around 16MB. For smaller shared_buffers and file
size, the delays are:
Scan all buffers takes:
15 ms for 128GB shared_buffers
4.5ms for 32GB shared_buffers
fdatasync takes:
200 ms for 16MB/sync
51 ms for 1MB/sync
46 ms for 512kB/sync
40 ms for 256kB/sync
37 ms for 128kB/sync
35 ms for <64kB/sync
It seems reasonable for 5400rpm disks. The threshold seems to be
64kB on my configuration. It can differ by configuration, but I
think not by much. (I'm not sure about SSDs or in-memory
filesystems.)
So for files smaller than 64kB:
File write:
-- 15ms of buffer scan without locks
+ no hash search for a buffer
= plus 35 ms of write-fdatasync
WAL write :
++ no buffer scan
- one partition lock on finding every buffer for the target
file, which can conflict with other backends. (but ignorable.)
= plus 35 ms of (open(create)-)write-fdatasync
It's possible that smaller WAL records need no extra time for
their own sync. This is the most obvious gain from WAL emission.
Considering the 5-15ms of buffer scanning time, 256 or 512
kilobytes would be candidate default thresholds, but it would be
safe to use 64kB.
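To make the trade-off concrete, here is a rough sketch of the commit-time
decision being discussed. This is not the posted patch; the function name and
the GUC effective_io_block_size (in kB) are assumptions, and it relies on
storage/bufmgr.h, storage/smgr.h, and access/xloginsert.h:

    static void
    sync_or_log_relation(SMgrRelation srel)
    {
        BlockNumber nblocks = smgrnblocks(srel, MAIN_FORKNUM);

        if ((uint64) nblocks * BLCKSZ <= (uint64) effective_io_block_size * 1024)
        {
            BlockNumber blkno;

            /* Small file: emit a new-page WAL record for every block. */
            for (blkno = 0; blkno < nblocks; blkno++)
            {
                Buffer      buf;

                buf = ReadBufferWithoutRelcache(srel->smgr_rnode.node,
                                                MAIN_FORKNUM, blkno,
                                                RBM_NORMAL, NULL);
                LockBuffer(buf, BUFFER_LOCK_SHARE);
                log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM, blkno,
                            BufferGetPage(buf), true);
                LockBuffer(buf, BUFFER_LOCK_UNLOCK);
                ReleaseBuffer(buf);
            }
        }
        else
        {
            ForkNumber  fork;

            /* Large file: write out dirty buffers, then fsync every fork. */
            FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
            for (fork = 0; fork <= MAX_FORKNUM; fork++)
            {
                if (smgrexists(srel, fork))
                    smgrimmedsync(srel, fork);
            }
        }
    }

The small-file branch is exactly where the per-page buffer lookups and content
locks discussed above come back, which is why the threshold has to stay small.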
> This should be GUC-controlled, especially since this is back-patch material.

Is this size of patch back-patchable?

> We won't necessarily pick the best value on the first attempt, and the best
> value could depend on factors like the filesystem, the storage hardware, and
> the database's latency goals.  One could define the GUC as an absolute size
> (e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
> threshold is 1MB when shared_buffers is 1GB).  I'm not sure which is better.
I'm not sure whether the knob will show an apparent performance gain,
or whether we can offer criteria for identifying the proper value.
But I'll add this feature in the next version, with a GUC
effective_io_block_size defaulting to 64kB as the threshold.
(The name and default value are arguable, of course.)
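To make concrete what that threshold would gate, here is a condensed C
sketch of the commit-time decision. It only paraphrases the shape of
smgrDoPendingSyncs() in the v20 patch later in this thread; the helper
names and the GUC come from that patch, everything else is simplified
(for example, the patch also special-cases the FSM fork), so treat it
as an illustration rather than the implementation.

/*
 * Illustrative sketch only: making a WAL-skipped relfilenode durable at
 * commit, switching on the effective_io_block_size threshold.  See
 * smgrDoPendingSyncs() in the attached v20 patch for the real code.
 */
static void
make_pending_relation_durable(SMgrRelation srel, BlockNumber total_blocks)
{
	ForkNumber	fork;

	if ((uint64) total_blocks * BLCKSZ >= (uint64) effective_io_block_size * 1024)
	{
		/* Large file: scan shared buffers once, then fsync each fork. */
		FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
		for (fork = 0; fork <= MAX_FORKNUM; fork++)
			if (smgrexists(srel, fork))
				smgrimmedsync(srel, fork);
	}
	else
	{
		/* Small file: emit full-page WAL; it can be flushed along with
		 * another backend's commit record. */
		Relation	rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);

		for (fork = 0; fork <= MAX_FORKNUM; fork++)
			if (smgrexists(srel, fork))
				log_newpage_range(rel, fork, 0, smgrnblocks(srel, fork),
								  fork == MAIN_FORKNUM);
		FreeFakeRelcacheEntry(rel);
	}
}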
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190827.154932.250364935.horikyota.ntt@gmail.com>
> 128GB shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such loop takes 15ms. (Since it
> has only 6GB, the test is ignoring the effect of cache that comes
> from the difference of the buffer size). (attached 1)
> ...
> For a 16MB file, the cost of write-fsyncing is almost the same as
> the cost of WAL-emitting: about 200 ms on the Vista-era machine
> with non-performant rotating magnetic disks and xfs. (attached 2, 3)
> Although write-fsyncing of a relation file makes no lock conflict
> with other backends, WAL-emitting delays other backends' commits
> by up to that many milliseconds.
FWIW, attached are the programs I used to take these numbers.
testloop.c: times the loop over buffers in FlushRelationBuffers
testfile.c: times syncing a heap file (one file per size)
testfile2.c: times emitting a WAL record (16MB per WAL file)
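The attachments themselves are not reproduced in this archive, so as a
rough illustration of the kind of measurement involved, a minimal
harness along the lines of testfile.c might look like the following
(my reconstruction under that assumption, not the actual attachment):
write a file of the requested size in 8kB chunks, fdatasync it, and
print the elapsed time.

/* Stand-in sketch for testfile.c: time write+fdatasync of an N-kilobyte file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int		kb = (argc > 1) ? atoi(argv[1]) : 16 * 1024;	/* default: 16MB */
	char	buf[8192];
	struct timespec t0, t1;
	int		fd, i;

	memset(buf, 0x5a, sizeof(buf));
	clock_gettime(CLOCK_MONOTONIC, &t0);

	fd = open("testfile.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0) { perror("open"); return 1; }
	for (i = 0; i < kb / 8; i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
		{ perror("write"); return 1; }
	if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
	close(fd);

	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%d kB: %.1f ms\n", kb,
		   (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
	unlink("testfile.tmp");
	return 0;
}

Running it for the sizes above (for instance with arguments 16384 and
64) should reproduce the general shape of the 35-200 ms figures on
rotating disks, though the absolute numbers will of course differ per
machine.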
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, Noah.
At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190827.154932.250364935.horikyota.ntt@gmail.com>
> I'm not sure whether the knob will show an apparent performance gain,
> or whether we can offer criteria for identifying the proper value.
> But I'll add this feature in the next version, with a GUC
> effective_io_block_size defaulting to 64kB as the threshold.
> (The name and default value are arguable, of course.)
This is a new version of the patch based on the discussion.
The differences from v19 are as follows.
- Removed the new stuff in twophase.c.
The action on PREPARE TRANSACTION is now taken in
PrepareTransaction(). Instead of storing pending syncs in
two-phase files, the function immediately syncs all files that
can survive the transaction end. (twophase.c, xact.c)
- Separated pendingSyncs from pendingDeletes.
pendingSyncs is handled differently from pendingDeletes, so it is
now kept in a separate list.
- Let smgrDoPendingSyncs() avoid performing fsync on
to-be-deleted files.
In previous versions the function synced all recorded files even
if they were about to be deleted. Since we now use WAL-logging as
the alternative to fsync, performance matters more than before,
so this version avoids useless fsyncs.
- Use log_newpage instead of fsync for small tables.
As in the discussion up-thread, WAL-logging is expected to work
better than fsync for small files. smgrDoPendingSyncs() issues
log_newpage for all blocks of a table smaller than the GUC
variable "effective_io_block_size". I found that
log_newpage_range() does exactly what is needed here, but it
requires a Relation, which is not available there, so I removed
an assertion in CreateFakeRelcacheEntry so that it also works
outside recovery mode.
- Rebased and fixed some bugs.
I'm trying to measure the performance difference between WAL and fsync.
By the way, smgrDoPendingDeletes() is called directly from
CommitTransaction and AbortTransaction, but from AbortSubTransaction
it is called via AtSubAbort_smgr(), which does nothing but call
smgrDoPendingDeletes() and is itself called only from
AbortSubTransaction. I think these call paths should be unified one
way or the other (one possible shape is sketched after the call tree
below). Any opinions?
CommitTransaction()
  + smgrDoPendingDeletes()
AbortTransaction()
  + smgrDoPendingDeletes()
AbortSubTransaction()
  AtSubAbort_smgr()
    + smgrDoPendingDeletes()
# Looking around, the prefixes AtEOXact/PreCommit/AtAbort don't
# seem to be used according to any consistent principle.
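For what it's worth, one possible shape of such a unification, purely
as an illustration (the wrapper name below is made up and nothing like
it appears in the attached patches), would be to give the top-level
paths the same kind of wrapper the subtransaction path already has:

/* Illustrative sketch only -- not part of the attached patches. */
static void
AtEOXact_smgr(bool isCommit)
{
	smgrDoPendingDeletes(isCommit);
}

/*
 * CommitTransaction()   -> AtEOXact_smgr(true)
 * AbortTransaction()    -> AtEOXact_smgr(false)
 * AbortSubTransaction() -> AtSubAbort_smgr() -> smgrDoPendingDeletes(false)
 */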
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v20-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++++++++++
1 file changed, 312 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL logging is skipped in some cases for TRUNCATE and COPY, and that
+# optimization can interact badly with other optimizations depending on
+# the wal_level setting, particularly "minimal" versus "replica". Whether
+# or not the optimization kicks in for the scenarios exercised here, crash
+# recovery should never result in any kind of failure or data loss after
+# replaying the WAL that these scenarios produce.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+ # Same for prepared transaction
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2a (id serial PRIMARY KEY);
+ INSERT INTO test2a VALUES (DEFAULT);
+ TRUNCATE test2a;
+ INSERT INTO test2a VALUES (DEFAULT);
+ PREPARE TRANSACTION 't';
+ COMMIT PREPARED 't';");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # Like the previous test, but with different subtransaction patterns.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.16.3
v20-0002-Fix-WAL-skipping-feature.patch (text/x-patch)
From e0650491226a689120d19060ad5da0917f7d3bd6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 21 Aug 2019 13:57:00 +0900
Subject: [PATCH 2/4] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all for such
relations; instead, they are synced to disk at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/heap/rewriteheap.c | 13 +-
src/backend/access/transam/xact.c | 17 ++
src/backend/access/transam/xlogutils.c | 11 +-
src/backend/catalog/storage.c | 295 +++++++++++++++++++++++++++----
src/backend/commands/cluster.c | 24 +++
src/backend/commands/copy.c | 39 +---
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 -
src/backend/commands/tablecmds.c | 10 +-
src/backend/storage/buffer/bufmgr.c | 41 +++--
src/backend/storage/smgr/md.c | 30 ++++
src/backend/utils/cache/relcache.c | 28 ++-
src/backend/utils/misc/guc.c | 13 ++
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 40 +----
src/include/catalog/storage.h | 12 ++
src/include/storage/bufmgr.h | 1 +
src/include/storage/md.h | 1 +
src/include/utils/rel.h | 17 +-
22 files changed, 455 insertions(+), 175 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb811d345a..ef18b61c55 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e8cb..27f414a361 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * smgr_targblock must be initially invalid if we are to skip WAL logging
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index a17508a82f..9e0d7295af 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f594d33e7a..1c4b264947 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2107,6 +2107,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before emitting commit record so that we
+ * don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs(true, false);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2339,6 +2346,14 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Sync all WAL-skipped files now. Some of them may be deleted at
+ * transaction end, but we don't bother to store that information in the
+ * PREPARE record or two-phase files. Like commit, we should sync WAL-skipped
+ * files before emitting the PREPARE record. See CommitTransaction().
+ */
+ smgrDoPendingSyncs(true, true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2657,6 +2672,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ smgrDoPendingSyncs(false, false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4941,6 +4957,7 @@ AbortSubTransaction(void)
s->parent->curTransactionOwner);
AtEOSubXact_LargeObject(false, s->subTransactionId,
s->parent->subTransactionId);
+ smgrDoPendingSyncs(false, false);
AtSubAbort_Notify();
/* Advertise the fact that we aborted in pg_xact. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 1fc39333f1..ff7dba429a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or syncing
+ * WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..43926ecaba 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,9 +30,13 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int effective_io_block_size = 64; /* threshold of WAL-skipping in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -53,16 +57,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOp
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOp *next; /* linked-list link */
+} PendingRelOp;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingSyncs = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * When wal_level = minimal, we are going to skip WAL-logging for storage
+ * of persistent relations created in the current transaction. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = nestLevel;
+ pending->next = pendingSyncs;
+ pendingSyncs = pending;
+ }
+
return srel;
}
@@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -192,9 +216,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -399,9 +423,9 @@ void
smgrDoPendingDeletes(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -462,11 +486,195 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * This should be called before smgrDoPendingDeletes() at every subtransaction
+ * end. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ *
+ * If sync_all is true, sync all files, including those that are scheduled
+ * to be deleted.
+ */
+void
+smgrDoPendingSyncs(bool isCommit, bool sync_all)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
+ SMgrRelation srel = NULL;
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ HTAB *delhash = NULL;
+
+ /* Return if nothing to be synced in this nestlevel */
+ if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel)
+ return;
+
+ Assert (pendingSyncs->nestLevel <= nestLevel);
+ Assert (pendingSyncs->backend == InvalidBackendId);
+
+ /*
+ * If sync_all is false, pending syncs on the relation that are to be
+ * deleted in this transaction-end should be ignored. Collect pending
+ * deletes that will happen in the following call to
+ * smgrDoPendingDeletes().
+ */
+ if (!sync_all)
+ {
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (pending->nestLevel < pendingSyncs->nestLevel ||
+ pending->atCommit != isCommit)
+ continue;
+
+ /* create the hash if not yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &(pending->relnode),
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+ }
+
+ /* Loop over pendingSyncs */
+ prev = NULL;
+ for (pending = pendingSyncs; pending != NULL; pending = next)
+ {
+ bool to_be_removed = (!isCommit); /* don't sync if aborted */
+
+ next = pending->next;
+
+ /* outer-level entries should not be processed yet */
+ if (pending->nestLevel < nestLevel)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash && !to_be_removed)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+
+ /* remove the entry if no longer useful */
+ if (to_be_removed)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ continue;
+ }
+
+ /* actual sync happens at the end of top transaction */
+ if (nestLevel > 1)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* Now it is time to sync the rnode */
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /*
+ * We emit newpage WAL records for smaller relations.
+ *
+ * Small WAL records have a chance to be flushed together with other
+ * backends' WAL records. We therefore emit WAL records instead of
+ * syncing for files that are smaller than a certain threshold,
+ * expecting a faster commit. The threshold is defined by the GUC
+ * effective_io_block_size.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ /* FSM doesn't need WAL nor sync */
+ if (fork != FSM_FORKNUM && smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= effective_io_block_size * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /*
+ * Emit WAL records for all blocks. Some of the blocks might have
+ * been synced or evicted, but we don't bother checking that. The
+ * file is small enough.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ bool page_std = (fork == MAIN_FORKNUM);
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /* Emit WAL for the whole file */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, page_std);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+
+ /* done; remove from list */
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
+/*
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ * deleted or synced.
+ *
+ * The return value is the number of relations scheduled in the given
+ * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes. If
+ * there are no matching relations, *ptr is set to NULL.
*
* Only non-temporary relations are included in the returned list. This is OK
* because the list is used only in contexts where temporary relations don't
@@ -475,19 +683,19 @@ smgrDoPendingDeletes(bool isCommit)
* (and all temporary files will be zapped if we restart anyway, so no need
* for redo to do it also).
*
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
*/
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOp *pending;
nrels = 0;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -500,7 +708,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
}
rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
*ptr = rptr;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -512,6 +720,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingDeletes, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingSyncs, forCommit, ptr);
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
@@ -522,8 +744,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -532,25 +754,34 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* We shouldn't have an entry in pendingSyncs */
+ Assert(pendingSyncs == NULL);
}
/*
* AtSubCommit_smgr() --- Take care of subtransaction commit.
*
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
*/
void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOp *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
}
+
+ for (pending = pendingSyncs; pending != NULL; pending = pending->next)
+ {
+ if (pending->nestLevel >= nestLevel)
+ pending->nestLevel = nestLevel - 1;
+ }
}
/*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 28985a07ec..f665ee8358 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode is created in the current transaction
+ * and becomes the old relation's new relfilenode, so set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..3ce04f7efc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is creation check, firstRelfilenodeSubid is truncation and
+ * cluster check. Partitioned table doesn't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cceefbdd49..2468b178cb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4762,9 +4762,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and smgr will sync the relation to disk at the end of the current
+ * transaction instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4772,8 +4772,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5058,8 +5056,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..55c122b3a7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
* ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
* a relcache entry for the relation.
*
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay. If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations,
+ * which is OK, because we only use it during XLOG replay and processing
+ * pending syncs. If in the future we want to use it on temporary or unlogged
+ * relations, we could pass additional parameters.
*/
Buffer
ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -3191,20 +3192,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3221,7 +3234,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3251,18 +3264,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
}
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+ int i;
+
+ for (i = 0; i < nsyncrels; i++)
+ {
+ SMgrRelation srel;
+ ForkNumber fork;
+
+ /* sync all existing forks of the relation */
+ FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+ srel = smgropen(syncrels[i], InvalidBackendId);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+
+ smgrclose(srel);
+ }
+}
+
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 248860758c..147babb6b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+ * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+ * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
* rewrite-rule, partition key, and partition descriptor substructures
* in place, because various places assume that these structures won't
* move while they are working with an open relcache entry. (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
* operations on the rel in the same transaction.
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
/* Flag relation as needing eoxact cleanup (to remove the hint) */
EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..1e4fc256fc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/user.h"
@@ -2774,6 +2775,18 @@ static struct config_int ConfigureNamesInt[] =
check_effective_io_concurrency, assign_effective_io_concurrency, NULL
},
+ {
+ {"effective_io_block_size", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of file that can be fsync'ed in the minimum required duration."),
+ gettext_noop("For rotating magnetic disks, it is around the size of a track or sylinder."),
+ GUC_UNIT_KB
+ },
+ &effective_io_block_size,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about skipping WAL-logging.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..1c1cf5d252 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,16 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+ PENDING_DELETE,
+ PENDING_SYNC
+} PendingOpType;
+
+/* GUC variables */
+extern int effective_io_block_size; /* threshold for WAL-skipping */
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit, bool sync_all);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c5d36680a2..f372dc2086 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction in which
+ * a relfilenode change took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentally forgotten. A valid value
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging it.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -514,9 +521,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
*/
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.16.3
v20-0003-Documentation-for-effective_io_block_size.patch (text/x-patch)
From cce02653f263211b1c777c3aac4d25423035a68d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:05:30 +0900
Subject: [PATCH 3/4] Documentation for effective_io_block_size
---
doc/src/sgml/config.sgml | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..2d38d897ca 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1832,6 +1832,27 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-effective-io-block-size" xreflabel="effective_io_block_size">
+ <term><varname>effective_io_block_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>effective_io_block_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the expected maximum size of a file for which <function>fsync</function> returns in the minimum required duration. It is approximately the size of a track or cylinder for magnetic disks.
+ The value is specified in kilobytes and the default is <literal>64</literal> kilobytes.
+ </para>
+ <para>
+ When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+ WAL-logging is skipped for tables created in-transaction. If a table
+ is smaller than that size at commit, it is WAL-logged instead of
+ issuing <function>fsync</function> on it.
+
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
--
2.16.3
v20-0004-Additional-test-for-new-GUC-setting.patch (text/x-patch)
From b31533b895a3b239339aeb466d6f1abc0a1a4669 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:12:18 +0900
Subject: [PATCH 4/4] Additional test for new GUC setting.
This patchset adds a new GUC variable effective_io_block_size that
controls whether WAL-skipped tables are finally WAL-logged or
fsync'ed. All of the existing TAP tests perform WAL-logging, so this adds
an item that performs a file sync.
---
src/test/recovery/t/018_wal_optimize.pl | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index b041121745..95063ab131 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 26;
+use Test::More tests => 28;
sub check_orphan_relfilenodes
{
@@ -102,7 +102,23 @@ max_prepared_transactions = 1
$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
is($result, qq(1),
"wal_level = $wal_level, optimized truncation with prepared transaction");
+ # Same for file sync mode
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ SET effective_io_block_size to 0;
+ BEGIN;
+ CREATE TABLE test2b (id serial PRIMARY KEY);
+ INSERT INTO test2b VALUES (DEFAULT);
+ TRUNCATE test2b;
+ INSERT INTO test2b VALUES (DEFAULT);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with file-sync");
# Data file for COPY query in follow-up tests.
my $basedir = $node->basedir;
--
2.16.3
I have updated this patch's status to "needs review", since v20 has not
received any comments yet.
Noah, you're listed as committer for this patch. Are you still on the
hook for getting it done during the v13 timeframe?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 02, 2019 at 05:15:00PM -0400, Alvaro Herrera wrote:
I have updated this patch's status to "needs review", since v20 has not
received any comments yet.
Noah, you're listed as committer for this patch. Are you still on the
hook for getting it done during the v13 timeframe?
Yes, assuming "getting it done" = "getting the CF entry to state other than
Needs Review".
[Casual readers with opinions on GUC naming: consider skipping to the end.]
MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
rd_createSubid is set; see attached test case. It needs to skip WAL whenever
RelationNeedsWAL() returns false.
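Roughly speaking, the hint-bit path only has a Buffer to look at, so it tests
BM_PERMANENT plus XLogHintBitIsNeeded() rather than RelationNeedsWAL(); a
paraphrased sketch of that path (not the verbatim bufmgr.c source):

    /*
     * Paraphrased sketch of the pre-fix hint-bit path.  With wal_log_hints
     * or data checksums enabled, a full-page image is emitted for any
     * BM_PERMANENT buffer, even one belonging to a relation whose other
     * changes are currently being skipped from WAL.
     */
    if (XLogHintBitIsNeeded() &&
        (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
    {
        /* No Relation is available here, so RelationNeedsWAL() cannot be consulted. */
        lsn = XLogSaveBufferForHint(buffer, buffer_std);
    }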
On Tue, Aug 27, 2019 at 03:49:32PM +0900, Kyotaro Horiguchi wrote:
At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <noah@leadboat.com> wrote in <20190826050843.GB3153606@rfd.leadboat.com>
Consider a one-page relfilenode. Doing all the things you list for a single
page may be cheaper than locking millions of buffer headers.
If I understand you correctly, I would say that *all* buffers
that don't belong to in-transaction-created files are skipped
before taking locks. No lock conflict happens with other
backends.
FlushRelationBuffers uses double-checked-locking as follows:
I had misread the code; you're right.
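For anyone following along, the double-checked pattern referred to above is
roughly the following (a paraphrase of FlushRelationBuffers, not the verbatim
source):

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        uint32      buf_state;

        /* unlocked precheck: cheaply skip buffers of other relations */
        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
            continue;

        /* recheck under the buffer header lock before doing real work */
        buf_state = LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
        {
            /* ... pin, share-lock, and flush the buffer ... */
        }
        else
            UnlockBufHdr(bufHdr, buf_state);
    }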
This should be GUC-controlled, especially since this is back-patch material.
Is this size of patch back-patchable?
Its size is not an obstacle. It's not ideal to back-patch such a user-visible
performance change, but it would be worse to leave back branches able to
corrupt data during recovery.
On Wed, Aug 28, 2019 at 03:42:10PM +0900, Kyotaro Horiguchi wrote:
- Use log_newpage instead of fsync for small tables.
I'm trying to measure performance difference on WAL/fsync.
I would measure it with simultaneous pgbench instances:
1. DDL pgbench instance repeatedly creates and drops a table of X kilobytes,
using --rate to make this happen a fixed number of times per second.
2. Regular pgbench instance runs the built-in script at maximum qps.
For each X, try one test run with effective_io_block_size = X-1 and one with
effective_io_block_size = X. If the regular pgbench instance gets materially
higher qps with effective_io_block_size = X-1, the ideal default is <X.
Otherwise, the ideal default is >=X.
+ <varlistentry id="guc-effective-io-block-size" xreflabel="effective_io_block_size">
+ <term><varname>effective_io_block_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>effective_io_block_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the expected maximum size of a file for which <function>fsync</function> returns in the minimum required duration. It is approximately the size of a track or cylinder for magnetic disks.
+ The value is specified in kilobytes and the default is <literal>64</literal> kilobytes.
+ </para>
+ <para>
+ When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+ WAL-logging is skipped for tables created in-transaction. If a table
+ is smaller than that size at commit, it is WAL-logged instead of
+ issuing <function>fsync</function> on it.
+
+ </para>
+ </listitem>
+ </varlistentry>
Cylinder and track sizes are obsolete as user-visible concepts. (They're not
constant for a given drive, and I think modern disks provide no way to read
the relevant parameters.) I like the name "wal_skip_threshold", and my second
choice would be "wal_skip_min_size". Possibly documented as follows:
When wal_level is minimal and a transaction commits after creating or
rewriting a permanent table, materialized view, or index, this setting
determines how to persist the new data. If the data is smaller than this
setting, write it to the WAL log; otherwise, use an fsync of the data file.
Depending on the properties of your storage, raising or lowering this value
might help if such commits are slowing concurrent transactions. The default
is 64 kilobytes (64kB).
Any other opinions on the GUC name?
Attachments:
wal-optimize-noah-tests-v3.patch (text/x-diff)
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index 95063ab..5d476a4 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 28;
+use Test::More tests => 32;
sub check_orphan_relfilenodes
{
@@ -43,6 +43,8 @@ sub run_wal_optimize
$node->append_conf('postgresql.conf', qq(
wal_level = $wal_level
max_prepared_transactions = 1
+wal_log_hints = on
+effective_io_block_size = 0
));
$node->start;
@@ -194,6 +196,24 @@ max_prepared_transactions = 1
is($result, qq(3),
"wal_level = $wal_level, SET TABLESPACE in subtransaction");
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a5 (c int PRIMARY KEY);
+ SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+ CHECKPOINT;
+ INSERT INTO test3a5 VALUES (1); -- set index hint bit
+ INSERT INTO test3a5 VALUES (2);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ my($ret, $stdout, $stderr) = $node->psql(
+ 'postgres', "INSERT INTO test3a5 VALUES (2);");
+ is($ret, qq(3),
+ "wal_level = $wal_level, unique index LP_DEAD");
+ like($stderr, qr/violates unique/,
+ "wal_level = $wal_level, unique index LP_DEAD message");
+
# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
$node->safe_psql('postgres', "
BEGIN;
Hello. Thanks for the comment.
# Sorry in advance for possibly breaking the thread.
MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
rd_createSubid is set; see attached test case. It needs to skip WAL whenever
RelationNeedsWAL() returns false.
Thanks for pointing that out. And the test patch helped me very much.
Most callers can tell the function whether WAL is needed, but
SetHintBits() cannot easily. Rather, I think we shouldn't even try to
do that. Instead, in the attached, MarkBufferDirtyHint() asks storage.c
for the sync-pending state of the buffer's relfilenode. In the attached
patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs, but it is
called only when an FPW is added, so I believe it doesn't affect
performance much. However, we could use a hash for pendingSyncs
instead of a linked list. Anyway, the change is in its own file,
v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
0002.
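A minimal sketch of that function, assuming the PendingRelOp list from the
0002 patch (the actual 0003 patch may differ in details):

bool
RelFileNodeSkippingWAL(RelFileNode rnode)
{
    PendingRelOp *pending;

    /*
     * Sketch only: return true if the relfilenode has a pending sync, i.e.
     * its changes are being skipped from WAL in this transaction.
     * MarkBufferDirtyHint() knows the buffer's RelFileNode, so it can call
     * this before emitting a full-page image for a hint-bit change.
     */
    for (pending = pendingSyncs; pending != NULL; pending = pending->next)
    {
        if (RelFileNodeEquals(pending->relnode, rnode))
            return true;
    }
    return false;
}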
AFAICS all XLogInsert calls are guarded by RelationNeedsWAL() or are in
non-wal_minimal code paths.
Cylinder and track sizes are obsolete as user-visible concepts. (They're not
constant for a given drive, and I think modern disks provide no way to read
the relevant parameters.) I like the name "wal_skip_threshold", and my second
I strongly agree. Thanks for the draft; I used it as-is. I couldn't come
up with an appropriate second description of the GUC, so I just removed
it.
# it was "For rotating magnetic disks, it is around the size of a
# track or cylinder."
the relevant parameters.) I like the name "wal_skip_threshold", and
my second choice would be "wal_skip_min_size". Possibly documented
as follows:
..
Any other opinions on the GUC name?
I prefer the first candidate. I already used that terminology in
storage.c, and the name fits the context better.
* We emit newpage WAL records for smaller size of relations.
*
* Small WAL records have a chance to be emitted at once along with
* other backends' WAL records. We emit WAL records instead of syncing
* for files that are smaller than a certain threshold expecting faster
- * commit. The threshold is defined by the GUC effective_io_block_size.
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
The attached are:
- v21-0001-TAP-test-for-copy-truncation-optimization.patch
same as v20
- v21-0002-Fix-WAL-skipping-feature.patch
GUC name changed.
- v21-0003-Fix-MarkBufferDirtyHint.patch
PoC of fixing the function. Will be merged into 0002. (New)
- v21-0004-Documentation-for-wal_skip_threshold.patch
GUC name and description changed. (Previous 0003)
- v21-0005-Additional-test-for-new-GUC-setting.patch
including adjusted version of wal-optimize-noah-tests-v3.patch
Maybe test names need further adjustment. (Previous 0004)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v21-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch)
From 34149545942480d8dcc1cc587f40091b19b5aa39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH v21 1/5] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++
1 file changed, 312 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries,
+# which can interact badly with the other optimizations under several
+# settings of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt with here, and should never result in any type of failure
+# or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+ # Same for prepared transaction
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2a (id serial PRIMARY KEY);
+ INSERT INTO test2a VALUES (DEFAULT);
+ TRUNCATE test2a;
+ INSERT INTO test2a VALUES (DEFAULT);
+ PREPARE TRANSACTION 't';
+ COMMIT PREPARED 't';");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.23.0
v21-0002-Fix-WAL-skipping-feature.patch (text/x-patch)
From e297f55d0d9215d9e828ec32dc0ebadb8e04bb2c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:09 +0900
Subject: [PATCH v21 2/5] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all and such
relations are instead synced at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +-
src/backend/access/heap/rewriteheap.c | 13 +-
src/backend/access/transam/xact.c | 17 ++
src/backend/access/transam/xlogutils.c | 11 +-
src/backend/catalog/storage.c | 294 ++++++++++++++++++++---
src/backend/commands/cluster.c | 24 ++
src/backend/commands/copy.c | 39 +--
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 -
src/backend/commands/tablecmds.c | 10 +-
src/backend/storage/buffer/bufmgr.c | 41 ++--
src/backend/storage/smgr/md.c | 30 +++
src/backend/utils/cache/relcache.c | 28 ++-
src/backend/utils/misc/guc.c | 13 +
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 40 +--
src/include/catalog/storage.h | 12 +
src/include/storage/bufmgr.h | 1 +
src/include/storage/md.h | 1 +
src/include/utils/rel.h | 19 +-
22 files changed, 455 insertions(+), 176 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..a7ead9405a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2dd8821fac..0871df7730 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * smgr_targblock must be initially invalid if we are to skip WAL logging
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d41dbcf5f7..9b757cacf4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fc55fa6d53..59d65bc214 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2107,6 +2107,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before emitting commit record so that we
+ * don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs(true, false);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2339,6 +2346,14 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Sync all WAL-skipped files now. Some of them may be deleted at
+ * transaction end, but we don't bother storing that information in the
+ * PREPARE record or two-phase files. As with commit, we should sync
+ * WAL-skipped files before emitting the PREPARE record. See CommitTransaction().
+ */
+ smgrDoPendingSyncs(true, true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2657,6 +2672,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ smgrDoPendingSyncs(false, false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4964,6 +4980,7 @@ AbortSubTransaction(void)
s->parent->curTransactionOwner);
AtEOSubXact_LargeObject(false, s->subTransactionId,
s->parent->subTransactionId);
+ smgrDoPendingSyncs(false, false);
AtSubAbort_Notify();
/* Advertise the fact that we aborted in pg_xact. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..fc296abf91 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or syncing
+ * WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 625af8d49a..806f235a24 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,9 +30,13 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int wal_skip_threshold = 64; /* threshold of WAL-skipping in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -53,16 +57,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOp
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOp *next; /* linked-list link */
+} PendingRelOp;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingSyncs = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * When wal_level = minimal, we are going to skip WAL-logging for storage
+ * of persistent relations created in the current transaction. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = nestLevel;
+ pending->next = pendingSyncs;
+ pendingSyncs = pending;
+ }
+
return srel;
}
@@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -192,9 +216,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -431,9 +455,9 @@ void
smgrDoPendingDeletes(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -494,11 +518,194 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ *
+ * This should be called before smgrDoPendingDeletes() at every subtransaction
+ * end. Also this should be called before emitting the commit WAL record so
+ * that a sync failure prevents the commit.
+ *
+ * If sync_all is true, this syncs all files, including those that are
+ * scheduled to be deleted.
+ */
+void
+smgrDoPendingSyncs(bool isCommit, bool sync_all)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
+ SMgrRelation srel = NULL;
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ HTAB *delhash = NULL;
+
+ /* Return if nothing to be synced in this nestlevel */
+ if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel)
+ return;
+
+ Assert (pendingSyncs->nestLevel <= nestLevel);
+ Assert (pendingSyncs->backend == InvalidBackendId);
+
+ /*
+ * If sync_all is false, pending syncs on relations that are to be deleted
+ * at this transaction end should be ignored. Collect the pending deletes
+ * that will happen in the following call to
+ * smgrDoPendingDeletes().
+ */
+ if (!sync_all)
+ {
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (pending->nestLevel < pendingSyncs->nestLevel ||
+ pending->atCommit != isCommit)
+ continue;
+
+ /* create the hash if not yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &(pending->relnode),
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+ }
+
+ /* Loop over pendingSyncs */
+ prev = NULL;
+ for (pending = pendingSyncs; pending != NULL; pending = next)
+ {
+ bool to_be_removed = (!isCommit); /* don't sync if aborted */
+
+ next = pending->next;
+
+ /* outer-level entries should not be processed yet */
+ if (pending->nestLevel < nestLevel)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash && !to_be_removed)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+
+ /* remove the entry if no longer useful */
+ if (to_be_removed)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ continue;
+ }
+
+ /* actual sync happens at the end of top transaction */
+ if (nestLevel > 1)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* Now the time to sync the rnode */
+ srel = smgropen(pendingSyncs->relnode, pendingSyncs->backend);
+
+ /*
+ * We emit newpage WAL records for smaller size of relations.
+ *
+ * Small WAL records have a chance to be emitted at once along with
+ * other backends' WAL records. We emit WAL records instead of syncing
+ * for files that are smaller than a certain threshold expecting faster
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ /* FSM doesn't need WAL nor sync */
+ if (fork != FSM_FORKNUM && smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /*
+ * Emit WAL records for all blocks. Some of the blocks might have
+ * been synced or evicted, but we don't bother checking that. The
+ * file is small enough.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ bool page_std = (fork == MAIN_FORKNUM);
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /* Emit WAL for the whole file */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, page_std);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+
+ /* done remove from list */
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
+/*
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ * deleted or synced.
*
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * The return value is the number of relations scheduled in the given
+ * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes. If
+ * there are no matching relations, *ptr is set to NULL.
*
* Only non-temporary relations are included in the returned list. This is OK
* because the list is used only in contexts where temporary relations don't
@@ -507,19 +714,19 @@ smgrDoPendingDeletes(bool isCommit)
* (and all temporary files will be zapped if we restart anyway, so no need
* for redo to do it also).
*
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
*/
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOp *pending;
nrels = 0;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -532,7 +739,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
}
rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
*ptr = rptr;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -544,6 +751,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingDeletes, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingSyncs, forCommit, ptr);
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
@@ -554,8 +775,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -564,25 +785,34 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* We shouldn't have an entry in pendingSyncs */
+ Assert(pendingSyncs == NULL);
}
/*
* AtSubCommit_smgr() --- Take care of subtransaction commit.
*
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
*/
void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOp *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
}
+
+ for (pending = pendingSyncs; pending != NULL; pending = pending->next)
+ {
+ if (pending->nestLevel >= nestLevel)
+ pending->nestLevel = nestLevel - 1;
+ }
}
/*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a23128d7a0..fba44de88a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode is created in the current transaction
+ * and used as the old relation's new relfilenode. So set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..3ce04f7efc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is creation check, firstRelfilenodeSubid is truncation and
+ * cluster check. Partitioned tables don't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8d25d14772..54c8b0fb04 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4764,9 +4764,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and smgr will sync the relation to disk at the end of the current
+ * transaction instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4774,8 +4774,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5070,8 +5068,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 483f705305..827626b330 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
* ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
* a relcache entry for the relation.
*
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay. If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations,
+ * which is OK, because we only use it during XLOG replay and processing
+ * pending syncs. If in the future we want to use it on temporary or unlogged
+ * relations, we could pass additional parameters.
*/
Buffer
ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -3203,20 +3204,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3233,7 +3246,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3263,18 +3276,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
}
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+ int i;
+
+ for (i = 0; i < nsyncrels; i++)
+ {
+ SMgrRelation srel;
+ ForkNumber fork;
+
+ /* sync all existing forks of the relation */
+ FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+ srel = smgropen(syncrels[i], InvalidBackendId);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+
+ smgrclose(srel);
+ }
+}
+
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 585dcee5db..892462873f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+ * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+ * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
* rewrite-rule, partition key, and partition descriptor substructures
* in place, because various places assume that these structures won't
* move while they are working with an open relcache entry. (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
* operations on the rel in the same transaction.
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
/* Flag relation as needing eoxact cleanup (to remove the hint) */
EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a5ef0474..559f96a6dc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/user.h"
@@ -2774,6 +2775,18 @@ static struct config_int ConfigureNamesInt[] =
check_effective_io_concurrency, assign_effective_io_concurrency, NULL
},
+ {
+ {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of file that can be fsync'ed in the minimum required duration."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &wal_skip_threshold,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about skipping WAL-logging feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..24e71651c3 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,16 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+ PENDING_DELETE,
+ PENDING_SYNC
+} PendingOpType;
+
+/* GUC variables */
+extern int wal_skip_threshold; /* threshold for WAL-skipping */
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit, bool sync_all);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..f31a36de17 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a5cf804f9f..b2062efa63 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction in which
+ * a relfilenode change took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentally forgotten. A valid value
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -517,9 +524,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
+ */
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.23.0
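(For illustration only, not part of the patch: under wal_level = minimal, the
RelationNeedsWAL() change above makes a transaction like the following skip
WAL for the table data and rely on the commit-time sync instead. The table
and file names here are made up; whether the data is synced or WAL-logged at
commit depends on wal_skip_threshold, discussed below.)

    -- postgresql.conf: wal_level = minimal, max_wal_senders = 0
    BEGIN;
    CREATE TABLE t (id serial PRIMARY KEY, v int);  -- sets rd_createSubid
    TRUNCATE t;                                      -- sets rd_firstRelfilenodeSubid
    COPY t FROM '/tmp/t.csv' WITH (FORMAT csv);      -- data pages not WAL-logged
    COMMIT;                                          -- relation synced, or WAL-logged if small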
Attachment: v21-0003-Fix-MarkBufferDirtyHint.patch (text/x-patch)
From 96ad8bd4537e5055509ec9fdbbef502b52f136b5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:52 +0900
Subject: [PATCH v21 3/5] Fix MarkBufferDirtyHint
---
src/backend/catalog/storage.c | 17 +++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 7 +++++++
src/include/catalog/storage.h | 1 +
3 files changed, 25 insertions(+)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 806f235a24..6d5a3d53e7 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -440,6 +440,23 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
smgrimmedsync(dst, forkNum);
}
+/*
+ * RelFileNodeSkippingWAL - check if this relfilenode needs WAL
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+ PendingRelOp *pending;
+
+ for (pending = pendingSyncs ; pending != NULL ; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rnode))
+ return true;
+ }
+
+ return false;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 827626b330..06ec7cc186 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3506,6 +3506,13 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (RecoveryInProgress())
return;
+ /*
+ * Skip WAL logging if this buffer belongs to a relation that is
+ * skipping WAL-logging.
+ */
+ if (RelFileNodeSkippingWAL(bufHdr->tag.rnode))
+ return;
+
/*
* If the block is already dirty because we either made a change
* or set a hint already, then we don't need to write a full page
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 24e71651c3..eb2666e001 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,6 +35,7 @@ extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
--
2.23.0
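(Illustration only: the scenario this fix targets is the one exercised later
by the new TAP test, roughly the following under wal_level = minimal; without
the fix, MarkBufferDirtyHint() could emit a full-page image for a relation
whose other changes are not WAL-logged. The table name is made up.)

    BEGIN;
    CREATE TABLE t (c int PRIMARY KEY);          -- WAL-skipped under wal_level = minimal
    SAVEPOINT q; INSERT INTO t VALUES (1); ROLLBACK TO q;
    CHECKPOINT;
    INSERT INTO t VALUES (1);                    -- sets index hint bits (LP_DEAD)
    INSERT INTO t VALUES (2);
    COMMIT;
    -- crash here; after recovery the unique index must still reject a duplicate:
    INSERT INTO t VALUES (2);                    -- should fail with a unique violation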
Attachment: v21-0004-Documentation-for-wal_skip_threshold.patch (text/x-patch)
From 6ad62905d8a256c3531c9225bdb3212c45f5faff Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:05:30 +0900
Subject: [PATCH v21 4/5] Documentation for wal_skip_threshold
---
doc/src/sgml/config.sgml | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 886632ff43..f928c5aa0b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1833,6 +1833,32 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-skip-min_size" xreflabel="wal_skip_threshold">
+ <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When wal_level is minimal and a transaction commits after creating or
+ rewriting a permanent table, materialized view, or index, this
+ setting determines how to persist the new data. If the data is
+ smaller than this setting, write it to the WAL log; otherwise, use an
+ fsync of the data file. Depending on the properties of your storage,
+ raising or lowering this value might help if such commits are slowing
+ concurrent transactions. The default is 64 kilobytes (64kB).
+ </para>
+ <para>
+ When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+ WAL-logging is skipped for tables created in-transaction. If such a
+ table is smaller than this threshold at commit, it is WAL-logged
+ instead of issuing <function>fsync</function> on it.
+
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
--
2.23.0
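(Usage note, not part of the patch: wal_skip_threshold is PGC_USERSET per the
guc.c hunk above, so it can be tried per session before changing it
cluster-wide. The values below are only examples.)

    SET wal_skip_threshold = '1MB';           -- WAL-log relations smaller than 1MB at commit, fsync larger ones
    ALTER SYSTEM SET wal_skip_threshold = 0;  -- always fsync, never WAL-log at commit
    SELECT pg_reload_conf();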
Attachment: v21-0005-Additional-test-for-new-GUC-setting.patch (text/x-patch)
From 67baae223a93bc4c9827e1c8d99a040a058ad6ad Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:12:18 +0900
Subject: [PATCH v21 5/5] Additional test for new GUC setting.
This patchset adds a new GUC variable effective_io_block_size that
controls whether WAL-skipped tables are finally WAL-logged or
fsync'ed. All of the existing TAP tests perform WAL-logging, so this
adds an item that performs file sync.
---
src/test/recovery/t/018_wal_optimize.pl | 38 ++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index b041121745..ba9185e2ba 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 26;
+use Test::More tests => 32;
sub check_orphan_relfilenodes
{
@@ -43,6 +43,8 @@ sub run_wal_optimize
$node->append_conf('postgresql.conf', qq(
wal_level = $wal_level
max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
));
$node->start;
@@ -102,7 +104,23 @@ max_prepared_transactions = 1
$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
is($result, qq(1),
"wal_level = $wal_level, optimized truncation with prepared transaction");
+ # Same for file sync mode
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ SET wal_skip_threshold to 0;
+ BEGIN;
+ CREATE TABLE test2b (id serial PRIMARY KEY);
+ INSERT INTO test2b VALUES (DEFAULT);
+ TRUNCATE test2b;
+ INSERT INTO test2b VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with file-sync");
# Data file for COPY query in follow-up tests.
my $basedir = $node->basedir;
@@ -178,6 +196,24 @@ max_prepared_transactions = 1
is($result, qq(3),
"wal_level = $wal_level, SET TABLESPACE in subtransaction");
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a5 (c int PRIMARY KEY);
+ SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+ CHECKPOINT;
+ INSERT INTO test3a5 VALUES (1); -- set index hint bit
+ INSERT INTO test3a5 VALUES (2);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ my($ret, $stdout, $stderr) = $node->psql(
+ 'postgres', "INSERT INTO test3a5 VALUES (2);");
+ is($ret, qq(3),
+ "wal_level = $wal_level, unique index LP_DEAD");
+ like($stderr, qr/violates unique/,
+ "wal_level = $wal_level, unique index LP_DEAD message");
+
# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
$node->safe_psql('postgres', "
BEGIN;
--
2.23.0
Ugh!
On Fri, Oct 25, 2019 at 13:13 Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
that. Instead, in the attached, MarkBufferDirtyHint() asks storage.c
for sync-pending state of the relfilenode for the buffer. In the
attached patch (0003)
regards.
It's wrong that it also skips changing flags.
I'll fix it soon
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Oct 25, 2019 at 1:13 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Hello. Thanks for the comment.
# Sorry in advance for possibly breaking the thread.
MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
rd_createSubid is set; see attached test case. It needs to skip WAL whenever
RelationNeedsWAL() returns false.

Thanks for pointing that out. And the test patch helped me very much.
Most callers can tell that to the function, but SetHintBits()
cannot easily. Rather I think we shouldn't even try to do
that. Instead, in the attached, MarkBufferDirtyHint() asks storage.c
for the sync-pending state of the relfilenode for the buffer. In the
attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs,
but it is called only at the time an FPW is added, so I believe it
doesn't affect performance much. However, we can use a hash for
pendingSyncs instead of a linked list. Anyway the change is in its own
file v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
0002.

AFAICS all XLogInsert is guarded by RelationNeedsWAL() or in the
non-wal_minimal code paths.

Cylinder and track sizes are obsolete as user-visible concepts. (They're not
constant for a given drive, and I think modern disks provide no way to read
the relevant parameters.) I like the name "wal_skip_threshold", and my second

I strongly agree. Thanks for the draft. I used it as-is. I couldn't come
up with an appropriate second description of the GUC so I just removed
it.

# it was "For rotating magnetic disks, it is around the size of a
# track or cylinder."

the relevant parameters.) I like the name "wal_skip_threshold", and
my second choice would be "wal_skip_min_size". Possibly documented
as follows:..

Any other opinions on the GUC name?

I prefer the first candidate. I already used the terminology in
storage.c and the name fits the context better.

* We emit newpage WAL records for smaller size of relations.
*
* Small WAL records have a chance to be emitted at once along with
* other backends' WAL records. We emit WAL records instead of syncing
* for files that are smaller than a certain threshold expecting faster
- * commit. The threshold is defined by the GUC effective_io_block_size.
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
It's wrong that it also skips changing flags.
I"ll fix it soon
This is the fixed version v22.
The attached are:
- v22-0001-TAP-test-for-copy-truncation-optimization.patch
Same as v20, 21
- v22-0002-Fix-WAL-skipping-feature.patch
GUC name changed. Same as v21.
- v22-0003-Fix-MarkBufferDirtyHint.patch
PoC of fixing the function; will be merged into 0002. (New in v21,
fixed in v22)
- v21-0004-Documentation-for-wal_skip_threshold.patch
GUC name and description changed. (Previous 0003, same as v21)
- v21-0005-Additional-test-for-new-GUC-setting.patch
including adjusted version of wal-optimize-noah-tests-v3.patch
Maybe test names need further adjustment. (Previous 0004, same as v21)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Attachment: v22-0004-Documentation-for-wal_skip_threshold.patch (application/octet-stream)
From f5d58a918925c345f5e3efd75d81564892818312 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:05:30 +0900
Subject: [PATCH v22 4/5] Documentation for wal_skip_threshold
---
doc/src/sgml/config.sgml | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 886632f..f928c5a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1833,6 +1833,32 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-skip-min_size" xreflabel="wal_skip_threshold">
+ <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When wal_level is minimal and a transaction commits after creating or
+ rewriting a permanent table, materialized view, or index, this
+ setting determines how to persist the new data. If the data is
+ smaller than this setting, write it to the WAL log; otherwise, use an
+ fsync of the data file. Depending on the properties of your storage,
+ raising or lowering this value might help if such commits are slowing
+ concurrent transactions. The default is 64 kilobytes (64kB).
+ </para>
+ <para>
+ When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+ WAL-logging is skipped for tables created in-transaction. If such a
+ table is smaller than this threshold at commit, it is WAL-logged
+ instead of issuing <function>fsync</function> on it.
+
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
--
2.9.2
Attachment: v22-0001-TAP-test-for-copy-truncation-optimization.patch (application/octet-stream)
From 93f64c49050f7aee5ec65aee49255a725be9b97a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH v22 1/5] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++++++++++
1 file changed, 312 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000..b041121
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ # Primary needs to have wal_level = minimal here
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+ # Same for prepared transaction
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2a (id serial PRIMARY KEY);
+ INSERT INTO test2a VALUES (DEFAULT);
+ TRUNCATE test2a;
+ INSERT INTO test2a VALUES (DEFAULT);
+ PREPARE TRANSACTION 't';
+ COMMIT PREPARED 't';");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # in different subtransaction patterns
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.9.2
Attachment: v22-0002-Fix-WAL-skipping-feature.patch (application/octet-stream)
From 18441252eac4e1996ba777a34e90c985efa4f43d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:09 +0900
Subject: [PATCH v22 2/5] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all; instead, such
relations are synced at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/heap/rewriteheap.c | 13 +-
src/backend/access/transam/xact.c | 17 ++
src/backend/access/transam/xlogutils.c | 11 +-
src/backend/catalog/storage.c | 294 +++++++++++++++++++++++++++----
src/backend/commands/cluster.c | 24 +++
src/backend/commands/copy.c | 39 +---
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 -
src/backend/commands/tablecmds.c | 10 +-
src/backend/storage/buffer/bufmgr.c | 41 +++--
src/backend/storage/smgr/md.c | 30 ++++
src/backend/utils/cache/relcache.c | 28 ++-
src/backend/utils/misc/guc.c | 13 ++
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 40 +----
src/include/catalog/storage.h | 12 ++
src/include/storage/bufmgr.h | 1 +
src/include/storage/md.h | 1 +
src/include/utils/rel.h | 19 +-
22 files changed, 455 insertions(+), 176 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb3..a7ead94 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2dd8821..0871df7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * smgr_targblock must be initially invalid if we are to skip WAL logging
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d41dbcf..9b757ca 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fc55fa6..59d65bc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2107,6 +2107,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before emitting commit record so that we
+ * don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs(true, false);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2339,6 +2346,14 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Sync all WAL-skipped files now. Some of them may be deleted at
+ * transaction end but we don't bother storing that information in PREPARE
+ * record or two-phase files. Like commit, we should sync WAL-skipped
+ * files before emitting PREPARE record. See CommitTransaction().
+ */
+ smgrDoPendingSyncs(true, true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2657,6 +2672,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ smgrDoPendingSyncs(false, false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4964,6 +4980,7 @@ AbortSubTransaction(void)
s->parent->curTransactionOwner);
AtEOSubXact_LargeObject(false, s->subTransactionId,
s->parent->subTransactionId);
+ smgrDoPendingSyncs(false, false);
AtSubAbort_Notify();
/* Advertise the fact that we aborted in pg_xact. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba..fc296ab 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or syncing
+ * WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 625af8d..806f235 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,9 +30,13 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int wal_skip_threshold = 64; /* threshold of WAL-skipping in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -53,16 +57,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOp
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOp *next; /* linked-list link */
+} PendingRelOp;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingSyncs = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * When wal_level = minimal, we are going to skip WAL-logging for storage
+ * of persistent relations created in the current transaction. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = nestLevel;
+ pending->next = pendingSyncs;
+ pendingSyncs = pending;
+ }
+
return srel;
}
@@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -192,9 +216,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -431,9 +455,9 @@ void
smgrDoPendingDeletes(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -494,11 +518,194 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ *
+ * This should be called before smgrDoPendingDeletes() at every subtransaction
+ * end. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ *
+ * If sync_all is true, sync all files, including those that are scheduled
+ * to be deleted.
+ */
+void
+smgrDoPendingSyncs(bool isCommit, bool sync_all)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
+ SMgrRelation srel = NULL;
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ HTAB *delhash = NULL;
+
+ /* Return if nothing to be synced in this nestlevel */
+ if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel)
+ return;
+
+ Assert (pendingSyncs->nestLevel <= nestLevel);
+ Assert (pendingSyncs->backend == InvalidBackendId);
+
+ /*
+ * If sync_all is false, pending syncs on relations that are to be
+ * deleted at this transaction end should be ignored. Collect pending
+ * deletes that will happen in the following call to
+ * smgrDoPendingDeletes().
+ */
+ if (!sync_all)
+ {
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (pending->nestLevel < pendingSyncs->nestLevel ||
+ pending->atCommit != isCommit)
+ continue;
+
+ /* create the hash if not yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &(pending->relnode),
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+ }
+
+ /* Loop over pendingSyncs */
+ prev = NULL;
+ for (pending = pendingSyncs; pending != NULL; pending = next)
+ {
+ bool to_be_removed = (!isCommit); /* don't sync if aborted */
+
+ next = pending->next;
+
+ /* outer-level entries should not be processed yet */
+ if (pending->nestLevel < nestLevel)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash && !to_be_removed)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+
+ /* remove the entry if no longer useful */
+ if (to_be_removed)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ continue;
+ }
+
+ /* actual sync happens at the end of top transaction */
+ if (nestLevel > 1)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* Now it's time to sync the rnode */
+ srel = smgropen(pendingSyncs->relnode, pendingSyncs->backend);
+
+ /*
+ * We emit newpage WAL records for relations smaller than a threshold.
+ *
+ * Small WAL records have a chance to be flushed along with other
+ * backends' WAL records. We therefore emit WAL records instead of syncing
+ * for files that are smaller than a certain threshold, expecting a faster
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ /* FSM doesn't need WAL nor sync */
+ if (fork != FSM_FORKNUM && smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /*
+ * Emit WAL records for all blocks. Some of the blocks might have
+ * been synced or evicted, but we don't bother checking that. The
+ * file is small enough.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ bool page_std = (fork == MAIN_FORKNUM);
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /* Emit WAL for the whole file */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, page_std);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+
+ /* done; remove it from the list */
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
+/*
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ * deleted or synced.
*
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * The return value is the number of relations scheduled in the given
+ * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes. If
+ * there are no matching relations, *ptr is set to NULL.
*
* Only non-temporary relations are included in the returned list. This is OK
* because the list is used only in contexts where temporary relations don't
@@ -507,19 +714,19 @@ smgrDoPendingDeletes(bool isCommit)
* (and all temporary files will be zapped if we restart anyway, so no need
* for redo to do it also).
*
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
*/
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOp *pending;
nrels = 0;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -532,7 +739,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
}
rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
*ptr = rptr;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -544,6 +751,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingDeletes, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingSyncs, forCommit, ptr);
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
@@ -554,8 +775,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -564,25 +785,34 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* We shouldn't have an entry in pendingSyncs */
+ Assert(pendingSyncs == NULL);
}
/*
* AtSubCommit_smgr() --- Take care of subtransaction commit.
*
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
*/
void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOp *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
}
+
+ for (pending = pendingSyncs; pending != NULL; pending = pending->next)
+ {
+ if (pending->nestLevel >= nestLevel)
+ pending->nestLevel = nestLevel - 1;
+ }
}
/*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a23128d..fba44de 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /* Update creation subid hints of relcache */
+ rel1 = relation_open(r1, ExclusiveLock);
+ rel2 = relation_open(r2, ExclusiveLock);
+
+ /*
+ * The new relation's relfilenode is created in the current transaction
+ * and becomes the old relation's new relfilenode, so we set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, ExclusiveLock);
+ relation_close(rel2, ExclusiveLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30..3ce04f7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is creation check, firstRelfilenodeSubid is truncation and
+ * cluster check. Partitioned tables don't have storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d2206..8a91d94 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8..1c854dc 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8d25d14..54c8b0f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4764,9 +4764,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and smgr will sync the relation to disk at the end of the current
+ * transaction instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4774,8 +4774,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5070,8 +5068,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 483f705..827626b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
* ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
* a relcache entry for the relation.
*
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay. If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations,
+ * which is OK, because we only use it during XLOG replay and processing
+ * pending syncs. If in the future we want to use it on temporary or unlogged
+ * relations, we could pass additional parameters.
*/
Buffer
ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -3203,20 +3204,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3233,7 +3246,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3263,18 +3276,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93..514c609 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -995,6 +995,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
}
/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+ int i;
+
+ for (i = 0; i < nsyncrels; i++)
+ {
+ SMgrRelation srel;
+ ForkNumber fork;
+
+ /* sync all existing forks of the relation */
+ FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+ srel = smgropen(syncrels[i], InvalidBackendId);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+
+ smgrclose(srel);
+ }
+}
+
+/*
* DropRelationFiles -- drop files of all given relations
*/
void
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 585dcee..8924628 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+ * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+ * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
* rewrite-rule, partition key, and partition descriptor substructures
* in place, because various places assume that these structures won't
* move while they are working with an open relcache entry. (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
* operations on the rel in the same transaction.
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
/* Flag relation as needing eoxact cleanup (to remove the hint) */
EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a5ef0..559f96a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/user.h"
@@ -2775,6 +2776,18 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of file that can be fsync'ed in the minimum required duration."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &wal_skip_threshold,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
+ {
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
NULL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6..80c2e1b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253..7f9736e 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703..b652cd6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about skipping WAL-logging feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f..24e7165 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,16 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+ PENDING_DELETE,
+ PENDING_SYNC
+} PendingOpType;
+
+/* GUC variables */
+extern int wal_skip_threshold; /* threshold for WAL-skipping */
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit, bool sync_all);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7..f31a36d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e2..2bb2947 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a5cf804..b2062ef 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
* transaction, with one of them occurring in a subsequently aborted
* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
* ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+ * relfilenode change has took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -517,9 +524,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
+ */
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.9.2
Attachment: v22-0005-Additional-test-for-new-GUC-setting.patch (application/octet-stream)
From 15693c7a3d2c3a64931c73ebaecfcf15c90e97b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:12:18 +0900
Subject: [PATCH v22 5/5] Additional test for new GUC setting.
This patchset adds a new GUC variable effective_io_block_size that
controls whether WAL-skipped tables are finally WAL-logged or
fsync'ed. All of the existing TAP test items exercise the WAL-logging
path, so this adds an item that exercises the file-sync path.
---
src/test/recovery/t/018_wal_optimize.pl | 38 ++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index b041121..ba9185e 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 26;
+use Test::More tests => 32;
sub check_orphan_relfilenodes
{
@@ -43,6 +43,8 @@ sub run_wal_optimize
$node->append_conf('postgresql.conf', qq(
wal_level = $wal_level
max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
));
$node->start;
@@ -102,7 +104,23 @@ max_prepared_transactions = 1
$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
is($result, qq(1),
"wal_level = $wal_level, optimized truncation with prepared transaction");
+ # Same for file sync mode
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ SET wal_skip_threshold to 0;
+ BEGIN;
+ CREATE TABLE test2b (id serial PRIMARY KEY);
+ INSERT INTO test2b VALUES (DEFAULT);
+ TRUNCATE test2b;
+ INSERT INTO test2b VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with file-sync");
# Data file for COPY query in follow-up tests.
my $basedir = $node->basedir;
@@ -178,6 +196,24 @@ max_prepared_transactions = 1
is($result, qq(3),
"wal_level = $wal_level, SET TABLESPACE in subtransaction");
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a5 (c int PRIMARY KEY);
+ SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+ CHECKPOINT;
+ INSERT INTO test3a5 VALUES (1); -- set index hint bit
+ INSERT INTO test3a5 VALUES (2);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->psql('postgres', );
+ my($ret, $stdout, $stderr) = $node->psql(
+ 'postgres', "INSERT INTO test3a5 VALUES (2);");
+ is($ret, qq(3),
+ "wal_level = $wal_level, unique index LP_DEAD");
+ like($stderr, qr/violates unique/,
+ "wal_level = $wal_level, unique index LP_DEAD message");
+
# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
$node->safe_psql('postgres', "
BEGIN;
--
2.9.2
Attachment: v22-0003-Fix-MarkBufferDirtyHint.patch (application/octet-stream)
From 0512c840432995b9c8968c7561b5575ee2ff4f51 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:52 +0900
Subject: [PATCH v22 3/5] Fix MarkBufferDirtyHint
---
src/backend/catalog/storage.c | 22 ++++++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 5 ++++-
src/include/catalog/storage.h | 1 +
3 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 806f235..5e54b62 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -441,6 +441,28 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
+ * RelFileNodeSkippingWAL - check whether WAL-logging is skipped for the relfilenode
+ *
+ * When wal_level is minimal, we skip WAL-logging for permanent relations
+ * created in the current transaction. Changes to such relfilenodes shouldn't
+ * be WAL-logged. Although this can be determined efficiently from a Relation,
+ * this function is intended for code paths that don't have access to one.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+ PendingRelOp *pending;
+
+ for (pending = pendingSyncs ; pending != NULL ; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rnode))
+ return true;
+ }
+
+ return false;
+}
+
+/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
* This also runs when aborting a subxact; we want to clean up a failed
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 827626b..9065d55 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3492,9 +3492,12 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
*
* We don't check full_page_writes here because that logic is included
* when we call XLogInsert() since the value changes dynamically.
+ *
+ * We mustn't emit WAL for WAL-skipping relations.
*/
if (XLogHintBitIsNeeded() &&
- (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
+ (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT) &&
+ !RelFileNodeSkippingWAL(bufHdr->tag.rnode))
{
/*
* If we're in recovery we cannot dirty a page because of a hint.
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 24e7165..eb2666e 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,6 +35,7 @@ extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
--
2.9.2
On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
This is the fixed version v22.
I'd like to offer a few thoughts on this thread and on these patches,
which is now more than 4 years old and more than 150 messages in
length.
First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time. The idea is that if the transaction aborts or is aborted by a
crash, the contents of the relation don't need to be reproduced
because they are irrelevant, so no WAL is needed, and if the
transaction commits we can't lose any data on a crash because we've
already fsync'd, and standbys don't matter because wal_level=minimal
precludes having any. However, we're not entirely consistent about
skipping WAL-logging: some operations do and others don't, and this
causes confusion if a crash occurs, because we might try to replay
some of the things that happened to that relation but not all of them.
For example, the original poster complained about a sequence of steps
where an index truncation was logged but subsequent index insertions
were not; a badly-timed crash will replay the truncation but can't
replay the index insertions because they weren't logged in the first
place; consequently, while the state was actually OK at the beginning
of replay, it's no longer OK by the end. Replaying nothing would've
been OK, but replaying some things and not others isn't.
Second, for anyone who is not following this thread closely but is
interested in a summary, I'd like to summarize how I believe that the
current patch proposes to solve the problem. As I understand it, the
approach taken by the patch is to try to change things so that we log
nothing at all for relations created or truncated in the current
top-level transaction, and everything for others. To achieve this, the
patch makes a number of changes, three of which seem to me to be
particularly key. One, the patch changes the relcache infrastructure
with the goal of making it possible to reliably identify whether a
relation has been created or truncated in the current toplevel
transaction; our current code does have tracking for this, but it's
not 100% accurate. Two, the patch changes the definition of
RelationNeedsWAL() so that it not only checks that the relation is a
permanent one, but also that either wal_level != minimal or the
relation was not created in the current transaction. It seems to me
that if RelationNeedsWAL() is used to gate every test for whether or
not to write WAL pertaining to a particular relation, this ought to
achieve the desired behavior of logging either everything or nothing.
It is not quite clear to me how we can be sure that we use that in
every relevant place. Three, the patch replaces the various ad-hoc
bits of code which fsync relations which perform unlogged operations
on permanent relations with a new tracking mechanism that arranges to
perform all of the relevant fsync() calls at commit time. This is
further augmented with a mechanism that instead logs all the relation
pages in lieu of fsync()ing if the relation is very small, on the
theory that logging a few FPIs will be cheaper than an fsync(). I view
this additional mechanism as perhaps a bit much for a bug fix patch,
but I understand that the goal is to prevent a performance regression,
and it's not really over the top, so I think it's probably OK.
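To make that concrete, here is a rough illustration of the two commit-time
paths under wal_level = minimal with the patch applied (this is only a
sketch, not taken from the patch or its tests, and the table names are
made up):

    BEGIN;
    CREATE TABLE tiny_t (id int);
    INSERT INTO tiny_t VALUES (1), (2), (3);   -- data writes skip WAL
    COMMIT;  -- tiny_t is below wal_skip_threshold (64kB by default), so its
             -- few pages are WAL-logged as full-page images rather than fsync'ed

    BEGIN;
    CREATE TABLE big_t (id int);
    INSERT INTO big_t SELECT generate_series(1, 100000);   -- data writes skip WAL
    COMMIT;  -- big_t exceeds the threshold, so its buffers are flushed and its
             -- files fsync'ed before the commit record is emitted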
Third, I'd like to offer a couple of general comments on the state of
these patches. Broadly, I think they look pretty good. They seem quite
well-engineered to me and as far as I can see the overall approach is
sound. I think there are a number of places where the comments could
be better; I'll include a few points on that further down. I also
think that the code in swap_relation_files() which takes ExclusiveLock
on the relations looks quite strange. It's hard to understand why it's
needed at all, or why that lock level is used. On the flip side, I
think that the test suite looks really impressive and should be of
considerable help not only in making sure that this is fixed but
detecting if it gets broken again in the future. Perhaps it doesn't
cover every scenario we care about, but if that turns out to be the
case, it seems like it would be easy to further generalize. I really
like the idea of this *kind* of test framework.
Comments on comments, and other nitpicking:
- in-trasaction is mis-spelled in the doc patch. accidentially is
mis-spelled in the 0002 patch.
- I think the header comment for the new TAP test could do a far
better job explaining the overall goal of this testing than it
actually does.
- I think somewhere in relcache.c or rel.h there ought to be comments
explaining the precise degree to which rd_createSubid,
rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
including problem scenarios. This patch removes some language of this
sort from CopyFrom(), which was a funny place to have that information
in the first place, but I don't see that it adds anything to replace
it. I also think that we ought to explain - for the fields that are
reliable - that they need to be reliable precisely for the purpose of
not breaking this stuff. There's a bit of this right now:
+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+ * relfilenode change has took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging.
...but I think that needs to be somewhat expanded and clarified.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 05, 2019 at 04:16:14PM -0500, Robert Haas wrote:
On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
This is the fixed version v22.
I'd like to offer a few thoughts on this thread and on these patches,
which is now more than 4 years old and more than 150 messages in
length.
...
Your understanding matches mine. Thanks for studying this. I had been
feeling nervous about being the sole reviewer of the latest design.
Comments on comments, and other nitpicking:
I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.
Thank you for looking at this.
At Tue, 5 Nov 2019 16:16:14 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
This is the fixed version v22.
First, I'd like to restate my understanding of the problem just to see
..
Second, for anyone who is not following this thread closely but is
Thanks for restating the issue and summarizing this patch. All of the
description matches my understanding.
perform all of the relevant fsync() calls at commit time. This is
further augmented with a mechanism that instead logs all the relation
pages in lieu of fsync()ing if the relation is very small, on the
theory that logging a few FPIs will be cheaper than an fsync(). I view
this additional mechanism as perhaps a bit much for a bug fix patch,
but I understand that the goal is to prevent a performance regression,
and it's not really over the top, so I think it's probably OK.
Thanks. It would need some benchmarking, as mentioned upthread. My new
machine now works steadily, so I will do that.
sound. I think there are a number of places where the comments could
be better; I'll include a few points on that further down. I also
think that the code in swap_relation_files() which takes ExclusiveLock
on the relations looks quite strange. It's hard to understand why it's
needed at all, or why that lock level is used. On the flip side, I
Right. Using AccessExclusiveLock there *was* a mistake. On second thought,
callers must already have taken locks on the relations at the level required
for relfilenode swapping. However, one problematic case is the TOAST indexes
of the target relation, which are not locked at all. In the end I used
AccessShareLock, since it raises the lock level only where no lock (NoLock)
is held. In any case the TOAST relation is not accessible outside the
session. (Done in the attached patch.)
think that the test suite looks really impressive and should be of
considerable help not only in making sure that this is fixed but
detecting if it gets broken again in the future. Perhaps it doesn't
cover every scenario we care about, but if that turns out to be the
case, it seems like it would be easily to further generalize. I really
like the idea of this *kind* of test framework.
The paths that run swap_relation_files() are not covered: CLUSTER,
REFRESH MATERIALIZED VIEW and ALTER TABLE. CLUSTER and ALTER TABLE can
interact with INSERTs, but a materialized view cannot. Copying some of the
existing test cases to use them will work. (Not yet done; a sketch follows.)
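For instance, the ALTER TABLE path could be exercised with something like
the following (a sketch only; the table name test_swap is made up, and it
would be wrapped in the same crash-and-restart pattern as the existing
cases, adjusting the test count accordingly):

    CREATE TABLE test_swap (id int PRIMARY KEY);
    INSERT INTO test_swap VALUES (1);
    BEGIN;
    -- forces a table (and index) rewrite, which goes through swap_relation_files()
    ALTER TABLE test_swap ALTER COLUMN id TYPE bigint;
    INSERT INTO test_swap VALUES (2);
    COMMIT;
    -- after an immediate-mode stop and restart,
    -- SELECT count(*) FROM test_swap must return 2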
Comments on comments, and other nitpicking:
- in-trasaction is mis-spelled in the doc patch. accidentially is
mis-spelled in the 0002 patch.
Thanks. I found another couple of typos ("issueing" for "issuing",
"skpped" for "skipped") by ispell'ing the git diff output; all are fixed.
- I think the header comment for the new TAP test could do a far
better job explaining the overall goal of this testing than it
actually does.
I rewrote it...
- I think somewhere in relcache.c or rel.h there ought to be comments
explaining the precise degree to which rd_createSubid,
rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
including problem scenarios. This patch removes some language of this
sort from CopyFrom(), which was a funny place to have that information
in the first place, but I don't see that it adds anything to replace
it. I also think that we ought to explain - for the fields that are
reliable - that they need to be reliable precisely for the purpose of
not breaking this stuff. There's a bit of this right now:
+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+ * relfilenode change has took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging.
...but I think that needs to be somewhat expanded and clarified.
Agreed. It may be crude, but I added descriptions of how the variables
work, and of how they contrast with rd_first*.
# rd_first* is not a hint in the sense that it is reliable, but it is
# referred to as a hint in some places, which will need fixing.
If the fix of MarkBufferDirtyHint is ok, I'll merge it into 0002.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v23-0001-TAP-test-for-copy-truncation-optimization.patch (text/x-patch; charset=us-ascii)
From c5e7243ba05677ad84bc8b6b03077cadcaadf4b8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 13:07:41 +0900
Subject: [PATCH v23 1/5] TAP test for copy-truncation optimization.
---
src/test/recovery/t/018_wal_optimize.pl | 321 ++++++++++++++++++++++++
1 file changed, 321 insertions(+)
create mode 100644 src/test/recovery/t/018_wal_optimize.pl
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..ac62c77a42
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,321 @@
+# Test recovery for skipping WAL-logging of objects created in-transaction
+#
+# When wal_level is "minimal", WAL records are omitted for relations
+# that are created in the current transaction and then fsync'ed at
+# commit. The feature decides which relfilenodes need to be synced at
+# commit and which can be dropped, following state changes caused by
+# subtransaction operations. A wrong decision leads to orphan
+# relfilenodes or a broken table after recovery from a crash just
+# after commit, and accidentally emitting WAL records for WAL-skipped
+# relations causes corruption.
+#
+# This test also contains regression tests for the data loss that the
+# old implementation of the feature caused through bad interaction
+# with certain sequences of COPY and INSERT.
+
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+# Make sure no orphan relfilenode files exist.
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 and relpersistence <> 't' and
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+#
+# We run this same test suite for both wal_level=minimal and replica.
+#
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test1 (id serial PRIMARY KEY);
+ TRUNCATE test1;
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2 (id serial PRIMARY KEY);
+ INSERT INTO test2 VALUES (DEFAULT);
+ TRUNCATE test2;
+ INSERT INTO test2 VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+ # Same for prepared transaction
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test2a (id serial PRIMARY KEY);
+ INSERT INTO test2a VALUES (DEFAULT);
+ TRUNCATE test2a;
+ INSERT INTO test2a VALUES (DEFAULT);
+ PREPARE TRANSACTION 't';
+ COMMIT PREPARED 't';");
+
+ $node->stop('immediate');
+ $node->start;
+
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+ # Data file for COPY query in follow-up tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after the
+ # truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3;
+ COPY test3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like the previous test, but roll back a SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a;
+ SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+ COPY test3a FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+ # Like the previous test, but with different subtransaction patterns.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a2;
+ SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+ COPY test3a2 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in released subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE test3a3;
+ SAVEPOINT s;
+ ALTER TABLE test3a3 SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE test3a3 SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY test3a3 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE in nested subtransactions");
+
+ # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above
+ UPDATE test3b SET id2 = id2 + 1;
+ DELETE FROM test3b;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE test4;
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COPY test4 FROM '$copy_file' DELIMITER ',';
+ INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+ INSERT INTO test5 VALUES (DEFAULT, 1);
+ COPY test5 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER test6_before_row_insert
+ BEFORE INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+ CREATE TRIGGER test6_after_row_insert
+ AFTER INSERT ON test6
+ FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+ COPY test6 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER test7_before_stat_truncate
+ BEFORE TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+ CREATE TRIGGER test7_after_stat_truncate
+ AFTER TRUNCATE ON test7
+ FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+ INSERT INTO test7 VALUES (DEFAULT, 1);
+ TRUNCATE test7;
+ COPY test7 FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
--
2.23.0
v23-0002-Fix-WAL-skipping-feature.patch (text/x-patch; charset=us-ascii)
From a22895a0d9ca9f69258e1a9c5d915ea3b5d48641 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:09 +0900
Subject: [PATCH v23 2/5] Fix WAL skipping feature
WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification on such relations is WAL-logged at
all; instead, the relations are synced to disk at commit.
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/heap/heapam_handler.c | 22 +-
src/backend/access/heap/rewriteheap.c | 13 +-
src/backend/access/transam/xact.c | 17 ++
src/backend/access/transam/xlogutils.c | 11 +-
src/backend/catalog/storage.c | 294 ++++++++++++++++++++---
src/backend/commands/cluster.c | 28 +++
src/backend/commands/copy.c | 39 +--
src/backend/commands/createas.c | 5 +-
src/backend/commands/matview.c | 4 -
src/backend/commands/tablecmds.c | 10 +-
src/backend/storage/buffer/bufmgr.c | 41 ++--
src/backend/storage/smgr/md.c | 30 +++
src/backend/utils/cache/relcache.c | 28 ++-
src/backend/utils/misc/guc.c | 13 +
src/include/access/heapam.h | 1 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 40 +--
src/include/catalog/storage.h | 12 +
src/include/storage/bufmgr.h | 1 +
src/include/storage/md.h | 1 +
src/include/utils/rel.h | 52 +++-
22 files changed, 483 insertions(+), 185 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..a7ead9405a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2dd8821fac..0871df7730 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
* ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * smgr_targblock must be initially invalid if we are to skip WAL logging
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d41dbcf5f7..9b757cacf4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fc55fa6d53..59d65bc214 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2107,6 +2107,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that were created and not WAL-logged during this
+ * transaction. This must happen before emitting the commit record so that
+ * we don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs(true, false);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2339,6 +2346,14 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Sync all WAL-skipped files now. Some of them may be deleted at
+ * transaction end, but we don't bother storing that information in the
+ * PREPARE record or two-phase files. As at commit, we sync WAL-skipped
+ * files before emitting the PREPARE record. See CommitTransaction().
+ */
+ smgrDoPendingSyncs(true, true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2657,6 +2672,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ smgrDoPendingSyncs(false, false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4964,6 +4980,7 @@ AbortSubTransaction(void)
s->parent->curTransactionOwner);
AtEOSubXact_LargeObject(false, s->subTransactionId,
s->parent->subTransactionId);
+ smgrDoPendingSyncs(false, false);
AtSubAbort_Notify();
/* Advertise the fact that we aborted in pg_xact. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..e566f01eef 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or while syncing
+ * WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 625af8d49a..806f235a24 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,9 +30,13 @@
#include "catalog/storage_xlog.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int wal_skip_threshold = 64; /* threshold of WAL-skipping in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -53,16 +57,17 @@
* but I'm being paranoid.
*/
-typedef struct PendingRelDelete
+typedef struct PendingRelOp
{
RelFileNode relnode; /* relation that may need to be deleted */
BackendId backend; /* InvalidBackendId if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
+ bool atCommit; /* T=work at commit; F=work at abort */
int nestLevel; /* xact nesting level of request */
- struct PendingRelDelete *next; /* linked-list link */
-} PendingRelDelete;
+ struct PendingRelOp *next; /* linked-list link */
+} PendingRelOp;
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingSyncs = NULL; /* head of linked list */
/*
* RelationCreateStorage
@@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rnode;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
@@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * When wal_level = minimal, we are going to skip WAL-logging for storage
+ * of persistent relations created in the current transaction. The
+ * relation needs to be synced at commit.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
+ pending->relnode = rnode;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = nestLevel;
+ pending->next = pendingSyncs;
+ pendingSyncs = pending;
+ }
+
return srel;
}
@@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
void
RelationDropStorage(Relation rel)
{
- PendingRelDelete *pending;
+ PendingRelOp *pending;
/* Add the relation to the list of stuff to delete at commit */
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending = (PendingRelOp *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
pending->relnode = rel->rd_node;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
@@ -192,9 +216,9 @@ RelationDropStorage(Relation rel)
void
RelationPreserveStorage(RelFileNode rnode, bool atCommit)
{
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -431,9 +455,9 @@ void
smgrDoPendingDeletes(bool isCommit)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
- PendingRelDelete *prev;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
int nrels = 0,
i = 0,
maxrels = 0;
@@ -494,11 +518,194 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ *
+ * This should be called before smgrDoPendingDeletes() at every subtransaction
+ * end. It should also be called before emitting the commit WAL record, so
+ * that a sync failure prevents the commit.
+ *
+ * If sync_all is true, sync all files, including those that are scheduled
+ * to be deleted.
+ */
+void
+smgrDoPendingSyncs(bool isCommit, bool sync_all)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingRelOp *pending;
+ PendingRelOp *prev;
+ PendingRelOp *next;
+ SMgrRelation srel = NULL;
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ HTAB *delhash = NULL;
+
+ /* Return if nothing to be synced in this nestlevel */
+ if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel)
+ return;
+
+ Assert (pendingSyncs->nestLevel <= nestLevel);
+ Assert (pendingSyncs->backend == InvalidBackendId);
+
+ /*
+ * If sync_all is false, pending syncs on relations that are to be
+ * deleted at this transaction end should be ignored. Collect the pending
+ * deletes that will happen in the following call to
+ * smgrDoPendingDeletes().
+ */
+ if (!sync_all)
+ {
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (pending->nestLevel < pendingSyncs->nestLevel ||
+ pending->atCommit != isCommit)
+ continue;
+
+ /* create the hash if not yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &(pending->relnode),
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+ }
+
+ /* Loop over pendingSyncs */
+ prev = NULL;
+ for (pending = pendingSyncs; pending != NULL; pending = next)
+ {
+ bool to_be_removed = (!isCommit); /* don't sync if aborted */
+
+ next = pending->next;
+
+ /* outer-level entries should not be processed yet */
+ if (pending->nestLevel < nestLevel)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash && !to_be_removed)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+
+ /* remove the entry if no longer useful */
+ if (to_be_removed)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ continue;
+ }
+
+ /* actual sync happens at the end of top transaction */
+ if (nestLevel > 1)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* Now it is time to sync the rnode */
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /*
+ * We emit newpage WAL records for relations smaller than a threshold.
+ *
+ * Small amounts of WAL have a chance to be flushed together with other
+ * backends' WAL records, so for files smaller than a certain threshold we
+ * emit WAL records instead of syncing, expecting a faster commit. The
+ * threshold is defined by the GUC wal_skip_threshold.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ /* FSM doesn't need WAL nor sync */
+ if (fork != FSM_FORKNUM && smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /*
+ * Emit WAL records for all blocks. Some of the blocks might have
+ * been synced or evicted, but we don't bother checking that. The
+ * file is small enough.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ bool page_std = (fork == MAIN_FORKNUM);
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /* Emit WAL for the whole file */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, page_std);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+
+ /* done; remove the entry from the list */
+ if (prev)
+ prev->next = next;
+ else
+ pendingSyncs = next;
+ pfree(pending);
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
+/*
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ * deleted or synced.
*
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * The return value is the number of relations scheduled in the given
+ * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes. If
+ * there are no matching relations, *ptr is set to NULL.
*
* Only non-temporary relations are included in the returned list. This is OK
* because the list is used only in contexts where temporary relations don't
@@ -507,19 +714,19 @@ smgrDoPendingDeletes(bool isCommit)
* (and all temporary files will be zapped if we restart anyway, so no need
* for redo to do it also).
*
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
*/
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileNode *rptr;
- PendingRelDelete *pending;
+ PendingRelOp *pending;
nrels = 0;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -532,7 +739,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
}
rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
*ptr = rptr;
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ for (pending = list; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->backend == InvalidBackendId)
@@ -544,6 +751,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
return nrels;
}
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingDeletes, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+ return smgrGetPendingOperations(pendingSyncs, forCommit, ptr);
+}
+
/*
* PostPrepare_smgr -- Clean up after a successful PREPARE
*
@@ -554,8 +775,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
void
PostPrepare_smgr(void)
{
- PendingRelDelete *pending;
- PendingRelDelete *next;
+ PendingRelOp *pending;
+ PendingRelOp *next;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -564,25 +785,34 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* We shouldn't have an entry in pendingSyncs */
+ Assert(pendingSyncs == NULL);
}
/*
* AtSubCommit_smgr() --- Take care of subtransaction commit.
*
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
*/
void
AtSubCommit_smgr(void)
{
int nestLevel = GetCurrentTransactionNestLevel();
- PendingRelDelete *pending;
+ PendingRelOp *pending;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel)
pending->nestLevel = nestLevel - 1;
}
+
+ for (pending = pendingSyncs; pending != NULL; pending = pending->next)
+ {
+ if (pending->nestLevel >= nestLevel)
+ pending->nestLevel = nestLevel - 1;
+ }
}
/*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a23128d7a0..3559d11eb7 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,40 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
{
+ Relation rel1;
+ Relation rel2;
+
/*
* Normal non-mapped relations: swap relfilenodes, reltablespaces,
* relpersistence
*/
Assert(!target_is_pg_class);
+ /*
+ * Update the creation-subid hints in the relcache. We need no stronger
+ * lock, but use AccessShareLock rather than NoLock because the caller may
+ * have omitted locks on relations that cannot be concurrently accessed.
+ */
+ rel1 = relation_open(r1, AccessShareLock);
+ rel2 = relation_open(r2, AccessShareLock);
+
+ /*
+ * The new relation's relfilenode was created in the current transaction
+ * and becomes the old relation's new relfilenode, so set the old
+ * relation's newRelfilenodeSubid to the new relation's createSubid. We
+ * don't fix rel2 since it will be deleted soon.
+ */
+ Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+ rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+ /* record the first relfilenode change in the current transaction */
+ if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+ relation_close(rel1, AccessShareLock);
+ relation_close(rel2, AccessShareLock);
+
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e17d8c760f..e6abc11e4c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2532,9 +2532,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
ExecDropSingleTupleTableSlot(buffer->slots[i]);
- table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
- miinfo->ti_options);
-
pfree(buffer);
}
@@ -2723,28 +2720,9 @@ CopyFrom(CopyState cstate)
* If it does commit, we'll have done the table_finish_bulk_insert() at
* the bottom of this routine first.
*
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time, even if we must use WAL because of
+ * archiving. This could possibly be wrong, but it's unlikely.
*
* We currently don't support this optimization if the COPY target is a
* partitioned table as we currently only lazily initialize partition
@@ -2760,15 +2738,14 @@ CopyFrom(CopyState cstate)
* are not supported as per the description above.
*----------
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+ /*
+ * createSubid is the creation check; firstRelfilenodeSubid is the
+ * truncation and CLUSTER check. A partitioned table has no storage.
+ */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* We can skip WAL-logging the insertions, unless PITR or streaming
* replication is in use. We can skip the FSM in any case.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->rel, myState->ti_options);
-
/* close rel, but keep lock until commit */
table_close(myState->rel, NoLock);
myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
* replication is in use. We can skip the FSM in any case.
*/
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
FreeBulkInsertState(myState->bistate);
- table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
/* close transientrel, but keep lock until commit */
table_close(myState->transientrel, NoLock);
myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5597be6e3d..3ec218aca4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4764,9 +4764,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
/*
* Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * we're building a new heap, the underlying table AM can skip WAL-logging
+ * and smgr will sync the relation to disk at the end of the current
+ * transaction instead. The FSM is empty too, so don't bother using it.
*/
if (newrel)
{
@@ -4774,8 +4774,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
bistate = GetBulkInsertState();
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -5070,8 +5068,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
{
FreeBulkInsertState(bistate);
- table_finish_bulk_insert(newrel, ti_options);
-
table_close(newrel, NoLock);
}
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad10736d5..1d9438ad56 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
static int32 PrivateRefCountOverflowed = 0;
static uint32 PrivateRefCountClock = 0;
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
* ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
* a relcache entry for the relation.
*
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay. If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations,
+ * which is OK, because we only use it during XLOG replay and processing
+ * pending syncs. If in the future we want to use it on temporary or unlogged
+ * relations, we could pass additional parameters.
*/
Buffer
ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -3203,20 +3204,32 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3233,7 +3246,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3263,18 +3276,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
}
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+ int i;
+
+ for (i = 0; i < nsyncrels; i++)
+ {
+ SMgrRelation srel;
+ ForkNumber fork;
+
+ /* sync all existing forks of the relation */
+ FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+ srel = smgropen(syncrels[i], InvalidBackendId);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+
+ smgrclose(srel);
+ }
+}
+
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 585dcee5db..892462873f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+ * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+ * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
* rewrite-rule, partition key, and partition descriptor substructures
* in place, because various places assume that these structures won't
* move while they are working with an open relcache entry. (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
* Likewise, reset the hint about the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
* operations on the rel in the same transaction.
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
/* Flag relation as needing eoxact cleanup (to remove the hint) */
EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a5ef0474..559f96a6dc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/user.h"
@@ -2774,6 +2775,18 @@ static struct config_int ConfigureNamesInt[] =
check_effective_io_concurrency, assign_effective_io_concurrency, NULL
},
+ {
+ {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of file that can be fsync'ed in the minimum required duration."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &wal_skip_threshold,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
uint8 flags,
TM_FailureData *tmfd);
- /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
- *
- * Typically callers of tuple_insert and multi_insert will just pass all
- * the flags that apply to them, and each AM has to decide which of them
- * make sense for it, and then only take actions in finish_bulk_insert for
- * those flags, and ignore others.
- *
- * Optional callback.
- */
- void (*finish_bulk_insert) (Relation rel, int options);
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* heap's TOAST table, too, if the tuple requires any out-of-line data.
*
* The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
*
* On return the slot's tts_tid and tts_tableOid are updated to reflect the
* insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
* update was done. However, any TOAST changes in the new tuple's
* data are not reflected into *newtup.
*
+ * See table_insert about skipping WAL-logging feature.
+ *
* In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
* t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData
* for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
flags, tmfd);
}
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
- /* optional callback */
- if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
- rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
/* ------------------------------------------------------------------------
* DDL related functionality.
* ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..24e71651c3 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,16 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+ PENDING_DELETE,
+ PENDING_SYNC
+} PendingOpType;
+
+/* GUC variables */
+extern int wal_skip_threshold; /* threshold for WAL-skipping */
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit, bool sync_all);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..f31a36de17 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8b8b237f0d..a46c086cc2 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -66,19 +66,41 @@ typedef struct RelationData
/*
* rd_createSubid is the ID of the highest subtransaction the rel has
* survived into; or zero if the rel was not created in the current top
- * transaction. This can be now be relied on, whereas previously it could
- * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
- * the ID of the highest subtransaction the relfilenode change has
- * survived into, or zero if not changed in the current transaction (or we
- * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
- * when a relation has multiple new relfilenodes within a single
- * transaction, with one of them occurring in a subsequently aborted
- * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
- * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * transaction. A valid value means the relation was created in the
+ * subtransaction, and non-rollbackable truncation is usable in the same
+ * subtransaction. This can now be relied on, whereas previously it
+ * could be "forgotten" in earlier releases.
+ *
+ * Likewise, rd_newRelfilenodeSubid is the subtransaction ID where the
+ * current relfilenode can be assumed to have been created, or zero if
+ * not. If this is equal to the current subtransaction ID, we can truncate
+ * the current relfilenode in a non-rollbackable way. It survives being
+ * moved to the parent subtransaction as long as that commits. It is not
+ * totally reliable and is used only as a hint, because it is forgotten
+ * when overwritten in a subsequent subtransaction, e.g. BEGIN; TRUNCATE t;
+ * SAVEPOINT save; TRUNCATE t; ROLLBACK TO save; TRUNCATE t; -- The
+ * ROLLBACK TO doesn't restore the value set by the first TRUNCATE, so the
+ * value is now forgotten. The last TRUNCATE doesn't use non-rollbackable
+ * truncation.
+ *
+ * rd_firstRelfilenodeSubid is the ID of the subtransaction where the
+ * first change of relfilenode in the top transaction took place. A valid
+ * value means that one or more relfilenodes were created in the top
+ * transaction. They are all local and inaccessible from outside. When
+ * wal_level is minimal, WAL-logging is omitted and the relfilenode at
+ * commit is sync'ed (and the others are removed). Unlike
+ * rd_newRelfilenodeSubid, this is reliable. No overwriting happens, and
+ * the value is moved to the parent subtransaction at subtransaction
+ * commit and forgotten at rollback.
+ *
+ * A valid value of rd_createSubid or rd_firstRelfilenodeSubid prevents
+ * the relcache entry from being flushed or rebuilt, in order to preserve
+ * the value.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
* current xact */
+ SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
+ * first in current xact */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -521,9 +543,15 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
+ */
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
--
2.23.0
v23-0003-Fix-MarkBufferDirtyHint.patch
From eb49991713a4658eac7eede81c251c90a0c918b9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:52 +0900
Subject: [PATCH v23 3/5] Fix MarkBufferDirtyHint
---
src/backend/catalog/storage.c | 17 +++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 7 +++++++
src/include/catalog/storage.h | 1 +
3 files changed, 25 insertions(+)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 806f235a24..6d5a3d53e7 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -440,6 +440,23 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
smgrimmedsync(dst, forkNum);
}
+/*
+ * RelFileNodeSkippingWAL - check if this relfilenode is skipping WAL
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+ PendingRelOp *pending;
+
+ for (pending = pendingSyncs ; pending != NULL ; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rnode))
+ return true;
+ }
+
+ return false;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1d9438ad56..288b2d3467 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3506,6 +3506,13 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (RecoveryInProgress())
return;
+ /*
+ * Skip WAL logging if this buffer belongs to a relation that is
+ * skipping WAL-logging.
+ */
+ if (RelFileNodeSkippingWAL(bufHdr->tag.rnode))
+ return;
+
/*
* If the block is already dirty because we either made a change
* or set a hint already, then we don't need to write a full page
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 24e71651c3..eb2666e001 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,6 +35,7 @@ extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
--
2.23.0
v23-0004-Documentation-for-wal_skip_threshold.patch
From 8bebfb74ef8aab5dcf162aee9cd0f44fce113e10 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:05:30 +0900
Subject: [PATCH v23 4/5] Documentation for wal_skip_threshold
---
doc/src/sgml/config.sgml | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0191ec84b1..5d22134a11 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1817,6 +1817,32 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-skip-min_size" xreflabel="wal_skip_threshold">
+ <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When wal_level is minimal and a transaction commits after creating or
+ rewriting a permanent table, materialized view, or index, this
+ setting determines how to persist the new data. If the data is
+ smaller than this setting, write it to the WAL log; otherwise, use an
+ fsync of the data file. Depending on the properties of your storage,
+ raising or lowering this value might help if such commits are slowing
+ concurrent transactions. The default is 64 kilobytes (64kB).
+ </para>
+ <para>
+ When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+ WAL-logging is skipped for tables created in-transaction. If a table
+ is smaller than that size at commit, it is WAL-logged instead of
+ issuing <function>fsync</function> on it.
+
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
--
2.23.0
v23-0005-Additional-test-for-new-GUC-setting.patch
From 6f46a460e8d2e1bced58b8d1f62361476eb3729b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:12:18 +0900
Subject: [PATCH v23 5/5] Additional test for new GUC setting.
This patchset adds the new GUC variable wal_skip_threshold, which
controls whether WAL-skipped tables are ultimately WAL-logged or
fsync'ed. All of the existing TAP tests exercise the WAL-logging path,
so this adds items that exercise the file-sync path.
---
src/test/recovery/t/018_wal_optimize.pl | 38 ++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index ac62c77a42..470d4f048c 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -18,7 +18,7 @@ use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 26;
+use Test::More tests => 32;
# Make sure no orphan relfilenode files exist.
sub check_orphan_relfilenodes
@@ -52,6 +52,8 @@ sub run_wal_optimize
$node->append_conf('postgresql.conf', qq(
wal_level = $wal_level
max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
));
$node->start;
@@ -111,7 +113,23 @@ max_prepared_transactions = 1
$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
is($result, qq(1),
"wal_level = $wal_level, optimized truncation with prepared transaction");
+ # Same for file sync mode
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ SET wal_skip_threshold to 0;
+ BEGIN;
+ CREATE TABLE test2b (id serial PRIMARY KEY);
+ INSERT INTO test2b VALUES (DEFAULT);
+ TRUNCATE test2b;
+ INSERT INTO test2b VALUES (DEFAULT);
+ COMMIT;");
+
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with file-sync");
# Data file for COPY query in follow-up tests.
my $basedir = $node->basedir;
@@ -187,6 +205,24 @@ max_prepared_transactions = 1
is($result, qq(3),
"wal_level = $wal_level, SET TABLESPACE in subtransaction");
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE test3a5 (c int PRIMARY KEY);
+ SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+ CHECKPOINT;
+ INSERT INTO test3a5 VALUES (1); -- set index hint bit
+ INSERT INTO test3a5 VALUES (2);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ my($ret, $stdout, $stderr) = $node->psql(
+ 'postgres', "INSERT INTO test3a5 VALUES (2);");
+ is($ret, qq(3),
+ "wal_level = $wal_level, unique index LP_DEAD");
+ like($stderr, qr/violates unique/,
+ "wal_level = $wal_level, unique index LP_DEAD message");
+
# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
$node->safe_psql('postgres', "
BEGIN;
--
2.23.0
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.
Having dedicated many days to that, I am attaching v24nm. I know of two
remaining defects:
=== Defect 1: gistGetFakeLSN()
When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:
begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;
I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:
(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.
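For illustration only, here is a minimal sketch of what (a) could look like
inside the backend, under the assumption that we simply reuse the existing
XLOG_NOOP record; the name gistReserveFakeLSN() is a placeholder, and a
dedicated record type would likely be nicer:

/*
 * Sketch: reserve a real LSN for a permanent, WAL-skipping GiST relation
 * by emitting a record with no replay effect.  Reusing XLOG_NOOP here is
 * an assumption for illustration, not the patch's design.
 */
static XLogRecPtr
gistReserveFakeLSN(void)
{
    int         dummy = 0;

    XLogBeginInsert();
    /* a WAL record must carry some payload, so register a dummy int */
    XLogRegisterData((char *) &dummy, sizeof(dummy));
    return XLogInsert(RM_XLOG_ID, XLOG_NOOP);
}

gistGetFakeLSN() could call something like this for permanent relations that
are skipping WAL, and keep using its backend-local counter otherwise.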
(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWAL(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.
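On the index build side, (b) might reduce to something like the following at
the end of gistbuild(); this is only a sketch, assuming the build keeps going
through shared buffers as it does today:

/*
 * Sketch for option (b): after a WAL-skipping build of a permanent GiST
 * index, flush and sync the index ourselves rather than relying solely
 * on the commit-time sync.
 */
if (!RelationNeedsWAL(index) &&
    index->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
{
    FlushRelationBuffers(index);
    /* FlushRelationBuffers will have opened rd_smgr */
    smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
}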
Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam.) On balance, I
lean toward (a). Any other ideas or preferences?
=== Defect 2: repetitive work when syncing many relations
For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan. Commit 279628a
introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions. (One could,
however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
you agree, could you modify v24nm to implement that?
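To make the suggestion concrete, smgrDoPendingSyncs() could first gather the
relfilenodes and then flush them in one pass. FlushRelFileNodesAllBuffers()
below is a hypothetical helper (analogous to DropRelFileNodesAllBuffers()),
and the existing checks for to-be-deleted relations, non-main forks, and the
wal_skip_threshold decision are omitted for brevity:

/*
 * Sketch only: batch the to-be-synced relations so shared buffers are
 * scanned once, the way smgrDoPendingDeletes() batches unlinks.
 */
SMgrRelation *srels;
int         nrels = 0;
int         maxrels = 0;
PendingRelDelete *pending;

for (pending = pendingDeletes; pending != NULL; pending = pending->next)
    maxrels++;
srels = palloc(maxrels * sizeof(SMgrRelation));

for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
    if (pending->sync && !pending->atCommit)
        srels[nrels++] = smgropen(pending->relnode, pending->backend);
}

if (nrels > 0)
{
    /* hypothetical: one scan of shared buffers for all relations */
    FlushRelFileNodesAllBuffers(srels, nrels);
    for (int i = 0; i < nrels; i++)
        smgrimmedsync(srels[i], MAIN_FORKNUM);
}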
Notable changes in v24nm:
- Wrote section "Skipping WAL for New RelFileNode" in
src/backend/access/transam/README to be the main source concerning the new
coding rules.
- Updated numerous comments and doc sections.
- Eliminated the pendingSyncs list in favor of a "sync" field in
pendingDeletes. I mostly did this to eliminate the possibility of the lists
getting out of sync. This removed considerable parallel code for managing a
second list at end-of-xact. We now call smgrDoPendingSyncs() only when
committing or preparing a top-level transaction.
- Whenever code sets an rd_*Subid field of a Relation, it must call
EOXactListAdd(). swap_relation_files() was not doing so, so the field
remained set during the next transaction. I introduced
RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
so it also affects the mapped relation case. (A sketch of what that helper
does appears after this list.)
- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
rd_createSubid remained set. (That happened before this patch, but it has
been harmless.) I fixed this in heap_create().
- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
when checksums are enabled. Observe the precedent that
RelationCopyStorage() has not been exempting FSM_FORKNUM.
- Pass log_newpage_range() a "false" for page_std, for the same reason
RelationCopyStorage() does.
- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
the wrong data for non-main forks. Before this patch, callers always passed
MAIN_FORKNUM and "true", hence the lack of complaints.
- Restored table_finish_bulk_insert(), though heapam no longer provides a
callback. The API is still well-defined, and other table AMs might have use
for it. Removing it feels like a separate proposal.
- Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit
itself in light of this patch.
- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
it was overcounting.
- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.
- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
"Write Ahead Log" -> "Settings", between similar settings
wal_writer_flush_after and commit_delay. The other place I considered was
"Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
backend_flush_after.
- Gave each test a unique name. Changed test table names to be descriptive,
e.g. test7 became trunc_trig.
- Squashed all patches into one. Split patches are good when one could
reasonably choose to push a subset, but that didn't apply here. I wouldn't
push a GUC implementation without its documentation. Since the tests fail
without the main bug fix, I wouldn't push tests separately.
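As promised above, a sketch of what RelationAssumeNewRelfilenode() is
described as doing; the body is an assumption pieced together from the list
above, not a quote of the patch, and EOXactListAdd() lives in relcache.c:

/*
 * Sketch (assumed body): record that the relation got a new relfilenode in
 * the current subtransaction and register the entry for end-of-xact
 * processing so the fields are eventually cleared.
 */
void
RelationAssumeNewRelfilenode(Relation relation)
{
    relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;

    /* Flag relation as needing eoxact cleanup (to forget the above) */
    EOXactListAdd(relation);
}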
By the way, based on the comment at zheap_prepare_insert(), I expect zheap
will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL()
and instead test for RELPERSISTENCE_PERMANENT.
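In code terms, that exemption would presumably reduce to a test like the
following (an assumption for illustration, not actual zheap code):

/* Always WAL-log permanent relations, ignoring the new-relfilenode skip. */
needwal = relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT;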
nm
Attachments:
skip-wal-v24nm.patch
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f837703..f078775 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2462,21 +2462,14 @@ include_dir 'conf.d'
levels. This parameter can only be set at server start.
</para>
<para>
- In <literal>minimal</literal> level, WAL-logging of some bulk
- operations can be safely skipped, which can make those
- operations much faster (see <xref linkend="populate-pitr"/>).
- Operations in which this optimization can be applied include:
- <simplelist>
- <member><command>CREATE TABLE AS</command></member>
- <member><command>CREATE INDEX</command></member>
- <member><command>CLUSTER</command></member>
- <member><command>COPY</command> into tables that were created or truncated in the same
- transaction</member>
- </simplelist>
- But minimal WAL does not contain enough information to reconstruct the
- data from a base backup and the WAL logs, so <literal>replica</literal> or
- higher must be used to enable WAL archiving
- (<xref linkend="guc-archive-mode"/>) and streaming replication.
+ In <literal>minimal</literal> level, no information is logged for
+ tables or indexes for the remainder of a transaction that creates or
+ truncates them. This can make bulk operations much faster (see
+ <xref linkend="populate-pitr"/>). But minimal WAL does not contain
+ enough information to reconstruct the data from a base backup and the
+ WAL logs, so <literal>replica</literal> or higher must be used to
+ enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+ streaming replication.
</para>
<para>
In <literal>logical</literal> level, the same information is logged as
@@ -2868,6 +2861,26 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+ <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When <varname>wal_level</varname> is <literal>minimal</literal> and a
+ transaction commits after creating or rewriting a permanent table,
+ materialized view, or index, this setting determines how to persist
+ the new data. If the data is smaller than this setting, write it to
+ the WAL log; otherwise, use an fsync of the data file. Depending on
+ the properties of your storage, raising or lowering this value might
+ help if such commits are slowing concurrent transactions. The default
+ is 64 kilobytes (<literal>64kB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-commit-delay" xreflabel="commit_delay">
<term><varname>commit_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 715aff6..fcc6017 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1605,8 +1605,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
needs to be written, because in case of an error, the files
containing the newly loaded data will be removed anyway.
However, this consideration only applies when
- <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
- non-partitioned tables as all commands must write WAL otherwise.
+ <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+ as all commands must write WAL otherwise.
</para>
</sect2>
@@ -1706,42 +1706,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para>
<para>
- Aside from avoiding the time for the archiver or WAL sender to
- process the WAL data,
- doing this will actually make certain commands faster, because they
- are designed not to write WAL at all if <varname>wal_level</varname>
- is <literal>minimal</literal>. (They can guarantee crash safety more cheaply
- by doing an <function>fsync</function> at the end than by writing WAL.)
- This applies to the following commands:
- <itemizedlist>
- <listitem>
- <para>
- <command>CREATE TABLE AS SELECT</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>CREATE INDEX</command> (and variants such as
- <command>ALTER TABLE ADD PRIMARY KEY</command>)
- </para>
- </listitem>
- <listitem>
- <para>
- <command>ALTER TABLE SET TABLESPACE</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>CLUSTER</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>COPY FROM</command>, when the target table has been
- created or truncated earlier in the same transaction
- </para>
- </listitem>
- </itemizedlist>
+ Aside from avoiding the time for the archiver or WAL sender to process the
+ WAL data, doing this will actually make certain commands faster, because
+ they do not write WAL at all if <varname>wal_level</varname>
+ is <literal>minimal</literal> and the current subtransaction (or top-level
+ transaction) created or truncated the table or index they change. (They
+ can guarantee crash safety more cheaply by doing
+ an <function>fsync</function> at the end than by writing WAL.)
</para>
</sect2>
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 553a6d6..66c52d6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,12 @@ gistGetFakeLSN(Relation rel)
{
static XLogRecPtr counter = FirstNormalUnloggedLSN;
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+ /*
+ * XXX before commit fix this. This is not correct for
+ * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
+ */
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
+ || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
{
/*
* Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb3..be19c34 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
* heap_multi_insert - insert multiple tuples into a relation
* heap_delete - delete a tuple from a relation
* heap_update - replace a tuple in a relation with another tuple
- * heap_sync - sync heap, for when no WAL has been written
*
* NOTES
* This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -8921,46 +8920,6 @@ heap2_redo(XLogReaderState *record)
}
/*
- * heap_sync - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched. (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
- /* non-WAL-logged tables never need fsync */
- if (!RelationNeedsWAL(rel))
- return;
-
- /* main heap */
- FlushRelationBuffers(rel);
- /* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
- /* FSM is not critical, don't bother syncing it */
-
- /* toast heap, if any */
- if (OidIsValid(rel->rd_rel->reltoastrelid))
- {
- Relation toastrel;
-
- toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
- FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
- table_close(toastrel, AccessShareLock);
- }
-}
-
-/*
* Mask a heap page before performing consistency checks on it.
*/
void
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 92073fe..07fe717 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2515,7 +2500,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d285b1f..3e56483 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
}
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
+ * When we WAL-logged rel pages, we must nonetheless fsync them. The
* reason is the same as in storage.c's RelationCopyStorage(): we're
* writing data that's not in shared buffers, and so a CHECKPOINT
* occurring during the rewriteheap operation won't have fsync'd data we
* wrote before the checkpoint.
*/
if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
logical_end_heap_rewrite(state);
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c11a3fb..e9b8ba8 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
* them. They will need to be re-read into shared buffers on first use after
* the build finishes.
*
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build. After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build. However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL. Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
* This code isn't concerned about the FSM at all. The caller is responsible
* for initializing that.
*
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
- /*
- * We need to log index creation in WAL iff WAL archiving/streaming is
- * enabled UNLESS the index isn't WAL-logged anyway.
- */
- wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+ wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
/* reserve the metapage */
wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1266,21 +1249,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
_bt_uppershutdown(wstate, state);
/*
- * If the index is WAL-logged, we must fsync it down to disk before it's
- * safe to commit the transaction. (For a non-WAL-logged index we don't
- * care since the index will be uninteresting after a crash anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the build. It's
- * less obvious that we have to do it even if we did WAL-log the index
- * pages. The reason is that since we're building outside shared buffers,
- * a CHECKPOINT occurring during the build has no way to flush the
- * previously written data to disk (indeed it won't know the index even
- * exists). A crash later on would replay WAL from the checkpoint,
- * therefore it wouldn't replay our earlier WAL entries. If we do not
- * fsync those pages here, they might still not be on disk when the crash
- * occurs.
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
*/
- if (RelationNeedsWAL(wstate->index))
+ if (wstate->btws_use_wal)
{
RelationOpenSmgr(wstate->index);
smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2..641809c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,40 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that
+RollbackAndReleaseCurrentSubTransaction() would unlink, in-tree access methods
+write no WAL for that change. For any access method, CommitTransaction()
+writes and fsyncs affected blocks before recording the commit. This skipping
+is mandatory; if a WAL-writing change preceded a WAL-skipping change for the
+same block, REDO could overwrite the WAL-skipping change. Code that writes
+WAL without calling RelationNeedsWAL() must check for this case.
+
+If skipping were not mandatory, a related problem would arise. Suppose, under
+full_page_writes=off, a WAL-writing change follows a WAL-skipping change.
+When a WAL record contains no full-page image, REDO expects the page to match
+its contents from just before record insertion. A WAL-skipping change may not
+reach disk at all, violating REDO's expectation.
+
+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode. It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE. Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while
+the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
Asynchronous Commit
-------------------
@@ -820,13 +854,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe. In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update. However, all these paths are designed to write data that
-no other transaction can see until after T1 commits. The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe. In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock. However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits. The situation is thus not different from ordinary
+WAL-logged updates.
Transaction Emulation during Recovery
-------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8fe38c3..a662ef9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before AtEOXact_RelationMap(), so that we
+ * don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs();
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2341,6 +2348,13 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before EndPrepare(), so that we don't see
+ * committed-but-broken files after a crash and COMMIT PREPARED.
+ */
+ smgrDoPendingSyncs();
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0..dda1dea 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
BlockNumber startblk, BlockNumber endblk,
bool page_std)
{
+ int flags;
BlockNumber blkno;
+ flags = REGBUF_FORCE_IMAGE;
+ if (page_std)
+ flags |= REGBUF_STANDARD;
+
/*
* Iterate over all the pages in the range. They are collected into
* batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
nbufs = 0;
while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
{
- Buffer buf = ReadBuffer(rel, blkno);
+ Buffer buf = ReadBufferExtended(rel, forkNum, blkno,
+ RBM_NORMAL, NULL);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
START_CRIT_SECTION();
for (i = 0; i < nbufs; i++)
{
- XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+ XLogRegisterBuffer(i, bufpack[i], flags);
MarkBufferDirty(bufpack[i]);
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba..6d61f25 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or while
+ * syncing WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +575,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
/*
* We set up the lockRelId in case anything tries to lock the dummy
* relation. Note that this is fairly bogus since relNode may be
- * different from the relation's OID. It shouldn't really matter though,
- * since we are presumably running by ourselves and can't have any lock
- * conflicts ...
+ * different from the relation's OID. It shouldn't really matter though.
+ * In recovery, we are running by ourselves and can't have any lock
+ * conflicts. While syncing, we already hold AccessExclusiveLock.
*/
rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index b7bcdd9..293ea9a 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -440,6 +440,10 @@ heap_create(const char *relname,
break;
}
}
+ else
+ {
+ rel->rd_createSubid = InvalidSubTransactionId;
+ }
return rel;
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 056ea3d..51c233d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int wal_skip_threshold = 64; /* in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -58,6 +62,7 @@ typedef struct PendingRelDelete
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
+ bool sync; /* whether to fsync at commit */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,6 +119,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->sync =
+ relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -155,6 +162,7 @@ RelationDropStorage(Relation rel)
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->sync = false;
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -355,7 +363,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
/*
* We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a permanent relation.
+ * enabled AND it's a permanent relation. This gives the same answer as
+ * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+ * current operation created a new relfilenode.
*/
use_wal = XLogIsNeeded() &&
(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,25 +407,44 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too. (For a temp or
- * unlogged rel we don't care since the data will be gone after a crash
- * anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the copy. It's
- * less obvious that we have to do it even if we did WAL-log the copied
- * pages. The reason is that since we're copying outside shared buffers, a
- * CHECKPOINT occurring during the copy has no way to flush the previously
- * written data to disk (indeed it won't know the new rel even exists). A
- * crash later on would replay WAL from the checkpoint, therefore it
- * wouldn't replay our earlier WAL entries. If we do not fsync those pages
- * here, they might still not be on disk when the crash occurs.
+ * When we WAL-logged rel pages, we must nonetheless fsync them. The
+ * reason is that since we're copying outside shared buffers, a CHECKPOINT
+ * occurring during the copy has no way to flush the previously written
+ * data to disk (indeed it won't know the new rel even exists). A crash
+ * later on would replay WAL from the checkpoint, therefore it wouldn't
+ * replay our earlier WAL entries. If we do not fsync those pages here,
+ * they might still not be on disk when the crash occurs.
*/
- if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+ if (use_wal || copying_initfork)
smgrimmedsync(dst, forkNum);
}
/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode skips WAL
+ *
+ * Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ * New RelFileNode" in src/backend/access/transam/README. Though this is
+ * known efficiently from the Relation, this function is intended for code
+ * paths that do not have access to the Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+ PendingRelDelete *pending;
+
+ if (XLogIsNeeded())
+ return false; /* no permanent relfilenode skips WAL */
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
+ return true;
+ }
+
+ return false;
+}
+
+/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
* This also runs when aborting a subxact; we want to clean up a failed
@@ -493,6 +522,145 @@ smgrDoPendingDeletes(bool isCommit)
}
/*
+ * smgrDoPendingSyncs() -- Take care of relation syncs at commit.
+ *
+ * This should be called before smgrDoPendingDeletes() at every commit or
+ * prepare. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ */
+void
+smgrDoPendingSyncs(void)
+{
+ PendingRelDelete *pending;
+ HTAB *delhash = NULL;
+
+ if (XLogIsNeeded())
+ return; /* no relation can use this */
+
+ Assert(GetCurrentTransactionNestLevel() == 1);
+ AssertPendingSyncs_RelationCache();
+
+ /*
+ * Pending syncs on relations that are to be deleted at this
+ * transaction's end should be ignored. Collect the pending deletes that
+ * will happen in the following call to smgrDoPendingDeletes().
+ */
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!pending->atCommit)
+ continue;
+
+ /* create the hash if not yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &pending->relnode,
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool to_be_removed = false; /* don't sync if aborted */
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ SMgrRelation srel;
+
+ if (!pending->sync)
+ continue;
+ Assert(!pending->atCommit);
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+ if (to_be_removed)
+ continue;
+
+ /* Now the time to sync the rnode */
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /*
+ * We emit newpage WAL records for smaller relations.
+ *
+ * Small WAL records have a chance to be emitted along with other
+ * backends' WAL records. We emit WAL records instead of syncing for
+ * files that are smaller than a certain threshold, expecting faster
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ if (smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /* Emit WAL records for all blocks. The file is small enough. */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /*
+ * Emit WAL for the whole file. Unfortunately we don't know
+ * what kind of a page this is, so we have to log the full
+ * page including any unused space. ReadBufferExtended()
+ * counts some pgstat events; unfortunately, we discard them.
+ */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, false);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
+/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
* The return value is the number of relations scheduled for termination.
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b8c349f..093fff8 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
relfilenode2;
Oid swaptemp;
char swptmpchr;
+ Relation rel1;
/* We need writable copies of both pg_class tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1039,6 +1040,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
*/
Assert(!target_is_pg_class);
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
@@ -1174,6 +1176,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
}
/*
+ * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+ * subtransaction. Since the next step for rel2 is deletion, don't bother
+ * recording the newness of its relfilenode.
+ */
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);
+ relation_close(rel1, NoLock);
+
+ /*
* Post alter hook for modified relations. The change to r2 is always
* internal, but r1 depends on the invocation context.
*/
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b..607e255 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2711,63 +2711,15 @@ CopyFrom(CopyState cstate)
RelationGetRelationName(cstate->rel))));
}
- /*----------
- * Check to see if we can avoid writing WAL
- *
- * If archive logging/streaming is not enabled *and* either
- * - table was created in same transaction as this COPY
- * - data is being written to relfilenode created in this transaction
- * then we can skip writing WAL. It's safe because if the transaction
- * doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the table_finish_bulk_insert() at
- * the bottom of this routine first.
- *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
- *
- * We currently don't support this optimization if the COPY target is a
- * partitioned table as we currently only lazily initialize partition
- * information when routing the first tuple to the partition. We cannot
- * know at this stage if we can perform this optimization. It should be
- * possible to improve on this, but it does mean maintaining heap insert
- * option flags per partition and setting them when we first open the
- * partition.
- *
- * This optimization is not supported for relation types which do not
- * have any physical storage, with foreign tables and views using
- * INSTEAD OF triggers entering in this category. Partitioned tables
- * are not supported as per the description above.
- *----------
+ /*
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time. This could possibly be wrong, but
+ * it's unlikely.
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bf7083..20225dc 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
myState->rel = intoRelationDesc;
myState->reladdr = intoRelationAddr;
myState->output_cid = GetCurrentCommandId(true);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
+ myState->bistate = GetBulkInsertState();
/*
- * We can skip WAL-logging the insertions, unless PITR or streaming
- * replication is in use. We can skip the FSM in any case.
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
- myState->bistate = GetBulkInsertState();
-
- /* Not using WAL requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8..ae809c9 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->transientrel = transientrel;
myState->output_cid = GetCurrentCommandId(true);
-
- /*
- * We can skip WAL-logging the insertions, unless PITR or streaming
- * replication is in use. We can skip the FSM in any case.
- */
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
- /* Not using WAL requires smgr_targblock be initially invalid */
+ /*
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
+ */
Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 45aae59..95721d7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4768,19 +4768,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
newrel = NULL;
/*
- * Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * Prepare a BulkInsertState and options for table_tuple_insert. The FSM
+ * is empty, so don't bother using it.
*/
if (newrel)
{
mycid = GetCurrentCommandId(true);
bistate = GetBulkInsertState();
-
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -12460,6 +12455,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
table_close(pg_class, RowExclusiveLock);
+ RelationAssumeNewRelfilenode(rel);
+
relation_close(rel, NoLock);
/* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad1073..746ce47 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,20 +3203,27 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
+ RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3233,7 +3240,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3263,18 +3270,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3484,13 +3491,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
(pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
{
/*
- * If we're in recovery we cannot dirty a page because of a hint.
- * We can set the hint, just not dirty the page as a result so the
- * hint is lost when we evict the page or shutdown.
+ * If we must not write WAL, due to a relfilenode-specific
+ * condition or being in recovery, don't dirty the page. We can
+ * set the hint, just not dirty the page as a result so the hint
+ * is lost when we evict the page or shutdown.
*
* See src/backend/storage/page/README for longer discussion.
*/
- if (RecoveryInProgress())
+ if (RecoveryInProgress() ||
+ RelFileNodeSkippingWAL(bufHdr->tag.rnode))
return;
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8a9eaf6..1d408c3 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
* During replay, we would delete the file and then recreate it, which is fine
* if the contents of the file were repopulated by subsequent WAL entries.
* But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever. By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever. By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
*
* We do not need to go through this dance for temp relations, though, because
* we never make WAL entries for temp rels, and so a temp rel poses no threat
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ad1ff01..f3831f0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -262,6 +262,9 @@ static void RelationReloadIndexInfo(Relation relation);
static void RelationReloadNailed(Relation relation);
static void RelationFlushRelation(Relation relation);
static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
static void AtEOXact_cleanup(Relation relation, bool isCommit);
static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1095,6 +1098,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1828,6 +1832,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2035,6 +2040,12 @@ RelationIdGetRelation(Oid relationId)
rd = RelationBuildDesc(relationId, true);
if (RelationIsValid(rd))
RelationIncrementReferenceCount(rd);
+
+#ifdef USE_ASSERT_CHECKING
+ if (!XLogIsNeeded() && RelationIsValid(rd))
+ AssertPendingSyncConsistency(rd);
+#endif
+
return rd;
}
@@ -2093,7 +2104,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2509,13 +2520,13 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
- * rewrite-rule, partition key, and partition descriptor substructures
- * in place, because various places assume that these structures won't
- * move while they are working with an open relcache entry. (Note:
- * the refcount mechanism for tupledescs might someday allow us to
- * remove this hack for the tupledesc.)
+ * rd_*Subid, and rd_toastoid state. Also attempt to preserve the
+ * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+ * and partition descriptor substructures in place, because various
+ * places assume that these structures won't move while they are
+ * working with an open relcache entry. (Note: the refcount
+ * mechanism for tupledescs might someday allow us to remove this hack
+ * for the tupledesc.)
*
* Note that this process does not touch CurrentResourceOwner; which
* is good because whatever ref counts the entry may have do not
@@ -2599,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2666,7 +2678,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2751,11 +2763,10 @@ RelationCacheInvalidateEntry(Oid relationId)
* relation cache and re-read relation mapping data.
*
* This is currently used only to recover from SI message buffer overflow,
- * so we do not touch new-in-transaction relations; they cannot be targets
- * of cross-backend SI updates (and our own updates now go through a
- * separate linked list that isn't limited by the SI message buffer size).
- * Likewise, we need not discard new-relfilenode-in-transaction hints,
- * since any invalidation of those would be a local event.
+ * so we do not touch relations having new-in-transaction relfilenodes; they
+ * cannot be targets of cross-backend SI updates (and our own updates now go
+ * through a separate linked list that isn't limited by the SI message
+ * buffer size).
*
* We do this in two phases: the first pass deletes deletable items, and
* the second one rebuilds the rebuildable items. This is essential for
@@ -2806,7 +2817,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -2918,6 +2929,40 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
}
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+ bool relcache_verdict =
+ relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+ ((relation->rd_createSubid != InvalidSubTransactionId &&
+ RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+ Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ * Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL. It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry. It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+ HASH_SEQ_STATUS status;
+ RelIdCacheEnt *idhentry;
+
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ AssertPendingSyncConsistency(idhentry->reldesc);
+}
+#endif
+
/*
* AtEOXact_RelationCache
*
@@ -3029,10 +3074,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
*
* During commit, reset the flag to zero, since we are now out of the
* creating transaction. During abort, simply delete the relcache entry
- * --- it isn't interesting any longer. (NOTE: if we have forgotten the
- * new-ness of a new relation due to a forced cache flush, the entry will
- * get deleted anyway by shared-cache-inval processing of the aborted
- * pg_class insertion.)
+ * --- it isn't interesting any longer.
*/
if (relation->rd_createSubid != InvalidSubTransactionId)
{
@@ -3060,9 +3102,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
}
/*
- * Likewise, reset the hint about the relfilenode being new.
+ * Likewise, reset any record of the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3154,7 +3197,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3163,6 +3206,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3252,6 +3303,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3549,14 +3601,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
*/
CommandCounterIncrement();
- /*
- * Mark the rel as having been given a new relfilenode in the current
- * (sub) transaction. This is a hint that can be used to optimize later
- * operations on the rel in the same transaction.
- */
+ RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this. The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode. See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
- /* Flag relation as needing eoxact cleanup (to remove the hint) */
+ /* Flag relation as needing eoxact cleanup (to clear these fields) */
EOXactListAdd(relation);
}
@@ -5591,6 +5658,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b3769b..2348bef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/trigger.h"
@@ -2639,6 +2640,18 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of new file to fsync instead of writing WAL."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &wal_skip_threshold,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
+ {
{"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
NULL
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6..22916e8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
-extern void heap_sync(Relation relation);
-
extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *items,
int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253..7f9736e 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6402291..aca88d0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
#define TABLE_INSERT_FROZEN 0x0004
#define TABLE_INSERT_NO_LOGICAL 0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
/*
* Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
+ * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+ * access methods ceased to use this.
*
* Typically callers of tuple_insert and multi_insert will just pass all
* the flags that apply to them, and each AM has to decide which of them
@@ -1087,10 +1086,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1309,10 +1304,9 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
}
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Perform operations necessary to complete insertions made via tuple_insert
+ * and multi_insert with a BulkInsertState specified. In-tree access methods
+ * ceased to use this.
*/
static inline void
table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f..108115a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,23 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* GUC variables */
+extern int wal_skip_threshold;
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(void);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7..8097d5a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
@@ -189,6 +192,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8b8b237..d9abd61 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,22 +63,40 @@ typedef struct RelationData
* rd_replidindex) */
bool rd_statvalid; /* is rd_statlist valid? */
- /*
+ /*----------
* rd_createSubid is the ID of the highest subtransaction the rel has
- * survived into; or zero if the rel was not created in the current top
- * transaction. This can be now be relied on, whereas previously it could
- * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
- * the ID of the highest subtransaction the relfilenode change has
- * survived into, or zero if not changed in the current transaction (or we
- * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
- * when a relation has multiple new relfilenodes within a single
- * transaction, with one of them occurring in a subsequently aborted
- * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
- * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * survived into or zero if the rel was not created in the current top
+ * transaction. rd_firstRelfilenodeSubid is the ID of the highest
+ * subtransaction an rd_node change has survived into or zero if rd_node
+ * matches the value it had at the start of the current top transaction.
+ * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+ * would restore rd_node to the value it had at the start of the current
+ * top transaction. Rolling back any lower subtransaction would not.)
+ * Their accuracy is critical to RelationNeedsWAL().
+ *
+ * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+ * most-recent relfilenode change has survived into or zero if not changed
+ * in the current transaction (or we have forgotten changing it). This
+ * field is accurate when non-zero, but it can be zero when a relation has
+ * multiple new relfilenodes within a single transaction, with one of them
+ * occurring in a subsequently aborted subtransaction, e.g.
+ * BEGIN;
+ * TRUNCATE t;
+ * SAVEPOINT save;
+ * TRUNCATE t;
+ * ROLLBACK TO save;
+ * -- rd_newRelfilenodeSubid is now forgotten
+ *
+ * These fields are read-only outside relcache.c. Other files trigger
+ * rd_node changes by updating pg_class.reltablespace and/or
+ * pg_class.relfilenode. They must call RelationAssumeNewRelfilenode() to
+ * update these fields.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
- SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
- * current xact */
+ SubTransactionId rd_newRelfilenodeSubid; /* highest subxact changing
+ * rd_node to current value */
+ SubTransactionId rd_firstRelfilenodeSubid; /* highest subxact changing
+ * rd_node to any value */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -521,9 +539,16 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction. See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f2ace3..d3e8348 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -105,9 +105,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
char relkind);
/*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
*/
extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
@@ -120,6 +121,11 @@ extern void RelationCacheInvalidate(void);
extern void RelationCloseSmgrByOid(Oid relationId);
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
extern void AtEOXact_RelationCache(bool isCommit);
extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000..415d91c
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,353 @@
+# Test WAL replay when some operation has skipped WAL.
+#
+# These tests exercise code that once violated the mandate described in
+# src/backend/access/transam/README section "Skipping WAL for New
+# RelFileNode". The tests work by committing some transactions, initiating an
+# immediate shutdown, and confirming that the expected data survives recovery.
+# For many years, individual commands made the decision to skip WAL, hence the
+# frequent appearance of COPY in these tests.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 34;
+
+sub check_orphan_relfilenodes
+{
+ my($node, $test_name) = @_;
+
+ my $db_oid = $node->safe_psql('postgres',
+ "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+ my $prefix = "base/$db_oid/";
+ my $filepaths_referenced = $node->safe_psql('postgres', "
+ SELECT pg_relation_filepath(oid) FROM pg_class
+ WHERE reltablespace = 0 AND relpersistence <> 't' AND
+ pg_relation_filepath(oid) IS NOT NULL;");
+ is_deeply([sort(map { "$prefix$_" }
+ grep(/^[0-9]+$/,
+ slurp_dir($node->data_dir . "/$prefix")))],
+ [sort split /\n/, $filepaths_referenced],
+ $test_name);
+ return;
+}
+
+# We run this same test suite for both wal_level=minimal and replica.
+sub run_wal_optimize
+{
+ my $wal_level = shift;
+
+ my $node = get_new_node("node_$wal_level");
+ $node->init;
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
+#wal_debug = on
+));
+ $node->start;
+
+ # Setup
+ my $tablespace_dir = $node->basedir . '/tablespace_other';
+ mkdir ($tablespace_dir);
+ $tablespace_dir = TestLib::perl2host($tablespace_dir);
+ $node->safe_psql('postgres',
+ "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+ # Test direct truncation optimization. No tuples
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE trunc (id serial PRIMARY KEY);
+ TRUNCATE trunc;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
+ is($result, qq(0),
+ "wal_level = $wal_level, optimized truncation with empty table");
+
+ # Test truncation with inserted tuples within the same transaction.
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE trunc_ins (id serial PRIMARY KEY);
+ INSERT INTO trunc_ins VALUES (DEFAULT);
+ TRUNCATE trunc_ins;
+ INSERT INTO trunc_ins VALUES (DEFAULT);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc_ins;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with inserted table");
+
+ # Same for prepared transaction
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE twophase (id serial PRIMARY KEY);
+ INSERT INTO twophase VALUES (DEFAULT);
+ TRUNCATE twophase;
+ INSERT INTO twophase VALUES (DEFAULT);
+ PREPARE TRANSACTION 't';
+ COMMIT PREPARED 't';");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM twophase;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+ # Same with writing WAL at end of xact, instead of syncing
+ # Tuples inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ SET wal_skip_threshold = '1TB';
+ BEGIN;
+ CREATE TABLE noskip (id serial PRIMARY KEY);
+ INSERT INTO noskip VALUES (DEFAULT);
+ TRUNCATE noskip;
+ INSERT INTO noskip VALUES (DEFAULT);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM noskip;");
+ is($result, qq(1),
+ "wal_level = $wal_level, optimized truncation with end-of-xact WAL");
+
+ # Data file for COPY query in subsequent tests.
+ my $basedir = $node->basedir;
+ my $copy_file = "$basedir/copy_data.txt";
+ TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+ # Test truncation with inserted tuples using COPY. Tuples copied after
+ # the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE trunc_copy (id serial PRIMARY KEY, id2 int);
+ INSERT INTO trunc_copy (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE trunc_copy;
+ COPY trunc_copy FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc_copy;");
+ is($result, qq(3),
+ "wal_level = $wal_level, optimized truncation with copied table");
+
+ # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE spc_abort (id serial PRIMARY KEY, id2 int);
+ INSERT INTO spc_abort (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE spc_abort;
+ SAVEPOINT s; ALTER TABLE spc_abort SET TABLESPACE other; ROLLBACK TO s;
+ COPY spc_abort FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_abort;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE abort subtransaction");
+
+ # Like the previous test, but commit the SET TABLESPACE subtransaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
+ INSERT INTO spc_commit (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE spc_commit;
+ SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
+ COPY spc_commit FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_commit;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE commit subtransaction");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE spc_nest (id serial PRIMARY KEY, id2 int);
+ INSERT INTO spc_nest (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+ TRUNCATE spc_nest;
+ SAVEPOINT s;
+ ALTER TABLE spc_nest SET TABLESPACE other;
+ SAVEPOINT s2;
+ ALTER TABLE spc_nest SET TABLESPACE pg_default;
+ ROLLBACK TO s2;
+ SAVEPOINT s2;
+ ALTER TABLE spc_nest SET TABLESPACE pg_default;
+ RELEASE s2;
+ ROLLBACK TO s;
+ COPY spc_nest FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_nest;");
+ is($result, qq(3),
+ "wal_level = $wal_level, SET TABLESPACE nested subtransaction");
+
+ $node->safe_psql('postgres', "
+ CREATE TABLE spc_hint (id int);
+ INSERT INTO spc_hint VALUES (1);
+ BEGIN;
+ ALTER TABLE spc_hint SET TABLESPACE other;
+ CHECKPOINT;
+ SELECT * FROM spc_hint; -- set hint bit
+ INSERT INTO spc_hint VALUES (2);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_hint;");
+ is($result, qq(2),
+ "wal_level = $wal_level, SET TABLESPACE, hint bit");
+
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE idx_hint (c int PRIMARY KEY);
+ SAVEPOINT q; INSERT INTO idx_hint VALUES (1); ROLLBACK TO q;
+ CHECKPOINT;
+ INSERT INTO idx_hint VALUES (1); -- set index hint bit
+ INSERT INTO idx_hint VALUES (2);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ my($ret, $stdout, $stderr) = $node->psql(
+ 'postgres', "INSERT INTO idx_hint VALUES (2);");
+ is($ret, qq(3),
+ "wal_level = $wal_level, unique index LP_DEAD");
+ like($stderr, qr/violates unique/,
+ "wal_level = $wal_level, unique index LP_DEAD message");
+
+ # UPDATE touches two buffers for one row.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE upd (id serial PRIMARY KEY, id2 int);
+ INSERT INTO upd (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ COPY upd FROM '$copy_file' DELIMITER ',';
+ UPDATE upd SET id2 = id2 + 1;
+ DELETE FROM upd;
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM upd;");
+ is($result, qq(0),
+ "wal_level = $wal_level, UPDATE touches two buffers for one row");
+
+ # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+ # inserted after the truncation should be seen.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE ins_trunc (id serial PRIMARY KEY, id2 int);
+ INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+ TRUNCATE ins_trunc;
+ INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+ COPY ins_trunc FROM '$copy_file' DELIMITER ',';
+ INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trunc;");
+ is($result, qq(5),
+ "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+ # Test consistency of COPY with INSERT for table created in the same
+ # transaction.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE ins_copy (id serial PRIMARY KEY, id2 int);
+ INSERT INTO ins_copy VALUES (DEFAULT, 1);
+ COPY ins_copy FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_copy;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+ # Test consistency of COPY that inserts more to the same table using
+ # triggers. If the INSERTS from the trigger go to the same block data
+ # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+ # it tries to replay the WAL record but the "before" image doesn't match,
+ # because not all changes were WAL-logged.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE ins_trig (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION ins_trig_before_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO ins_trig VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE FUNCTION ins_trig_after_row_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ IF new.id2 NOT LIKE 'triggered%' THEN
+ INSERT INTO ins_trig VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+ END IF;
+ RETURN NEW;
+ END; \$\$;
+ CREATE TRIGGER ins_trig_before_row_insert
+ BEFORE INSERT ON ins_trig
+ FOR EACH ROW EXECUTE PROCEDURE ins_trig_before_row_trig();
+ CREATE TRIGGER ins_trig_after_row_insert
+ AFTER INSERT ON ins_trig
+ FOR EACH ROW EXECUTE PROCEDURE ins_trig_after_row_trig();
+ COPY ins_trig FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trig;");
+ is($result, qq(9),
+ "wal_level = $wal_level, replay of optimized copy with INSERT trigger");
+
+ # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+ # with TRUNCATE triggers.
+ $node->safe_psql('postgres', "
+ BEGIN;
+ CREATE TABLE trunc_trig (id serial PRIMARY KEY, id2 text);
+ CREATE FUNCTION trunc_trig_before_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+ RETURN NULL;
+ END; \$\$;
+ CREATE FUNCTION trunc_trig_after_stat_trig() RETURNS trigger
+ LANGUAGE plpgsql as \$\$
+ BEGIN
+ INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat after');
+ RETURN NULL;
+ END; \$\$;
+ CREATE TRIGGER trunc_trig_before_stat_truncate
+ BEFORE TRUNCATE ON trunc_trig
+ FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_before_stat_trig();
+ CREATE TRIGGER trunc_trig_after_stat_truncate
+ AFTER TRUNCATE ON trunc_trig
+ FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_after_stat_trig();
+ INSERT INTO trunc_trig VALUES (DEFAULT, 1);
+ TRUNCATE trunc_trig;
+ COPY trunc_trig FROM '$copy_file' DELIMITER ',';
+ COMMIT;");
+ $node->stop('immediate');
+ $node->start;
+ $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc_trig;");
+ is($result, qq(4),
+ "wal_level = $wal_level, replay of optimized copy with TRUNCATE trigger");
+
+ # Test redo of temp table creation.
+ $node->safe_psql('postgres', "
+ CREATE TEMP TABLE temp (id serial PRIMARY KEY, id2 text);");
+ $node->stop('immediate');
+ $node->start;
+ check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+ return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fb..1ddde3e 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2354,6 +2354,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
fputs("log_lock_waits = on\n", pg_conf);
fputs("log_temp_files = 128kB\n", pg_conf);
fputs("max_prepared_transactions = 2\n", pg_conf);
+ fputs("wal_level = minimal\n", pg_conf); /* XXX before commit remove */
+ fputs("max_wal_senders = 0\n", pg_conf);
for (sl = temp_configs; sl != NULL; sl = sl->next)
{
I'm in the middle of my benchmarking week.
Thanks for reviewing!
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.
I'll look into that soon.
By the way, before finalizing this, I'd like to share the results of some
brief benchmarking.
First, I measured the direct effect of WAL skipping.
I measured the time required to run the following sequence for the
COMMIT-FPW-WAL case and the COMMIT-fsync case. WAL and heap files are on a
non-server-spec HDD.
BEGIN;
TRUNCATE t;
INSERT INTO t (SELECT a FROM generate_series(1, n) a);
COMMIT;
REPLICA is the time with wal_level = replica.
SYNC is the time with wal_level = minimal, forcing file sync at commit.
WAL is the time with wal_level = minimal, forcing commit-time WAL.
pages is the number of pages in the table.
(REPLICA comes from run.sh 1; SYNC/WAL come from run.sh 2.)
pages REPLICA SYNC WAL
1: 144 ms 683 ms 217 ms
3: 303 ms 995 ms 385 ms
5: 271 ms 1007 ms 217 ms
10: 157 ms 1043 ms 224 ms
17: 189 ms 1007 ms 193 ms
31: 202 ms 1091 ms 230 ms
56: 265 ms 1175 ms 226 ms
100: 510 ms 1307 ms 270 ms
177: 790 ms 1523 ms 524 ms
316: 1827 ms 1643 ms 719 ms
562: 1904 ms 2109 ms 1148 ms
1000: 3060 ms 2979 ms 2113 ms
1778: 6077 ms 3945 ms 3618 ms
3162: 13038 ms 7078 ms 6734 ms
There was a crossing point at around 3000 pages (bench1() finds it by
bisecting; run.sh 3).
With multiple sessions, the crossing point moves, but it does not become
all that small.
10 processes (run.pl 4 10). The numbers in parentheses are WAL[n]/WAL[n-1].
pages SYNC WAL
316: 8436 ms 4694 ms
562: 12067 ms 9627 ms (x2.1) # WAL wins
1000: 19154 ms 43262 ms (x4.5) # SYNC wins. WAL's slope becomes steep.
1778: 32495 ms 63863 ms (x1.4)
100 processes (run.pl 4 100)
pages SYNC WAL
10: 13275 ms 1868 ms
17: 15919 ms 4438 ms (x2.3)
31: 17063 ms 6431 ms (x1.5)
56: 23193 ms 14276 ms (x2.2) # WAL wins
100: 35220 ms 67843 ms (x4.8) # SYNC wins. WAL's slope becomes steep.
With 10 pgbench sessions.
pages SYNC WAL
1: 915 ms 301 ms
3: 1634 ms 508 ms
5: 1634 ms 293 ms
10: 1671 ms 1043 ms
17: 1600 ms 333 ms
31: 1864 ms 314 ms
56: 1562 ms 448 ms
100: 1538 ms 394 ms
177: 1697 ms 1047 ms
316: 3074 ms 1788 ms
562: 3306 ms 1245 ms
1000: 3440 ms 2182 ms
1778: 5064 ms 6464 ms # WAL's slope becomes steep
3162: 8675 ms 8165 ms
I don't think the 100-process result is meaningful, so excluding it, a
candidate value for wal_skip_threshold would be 1000.
Thoughts? The benchmark script is attached.
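For context, here is a minimal sketch of the commit-time decision that
wal_skip_threshold drives, which is what the numbers above are trying to
tune. This is illustrative only, not the patch's actual code; the function
name and the rel/srel/nblocks parameters are assumptions for the example.

#include "postgres.h"
#include "access/xloginsert.h"
#include "catalog/storage.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

/*
 * Sketch only: choose between fsyncing a WAL-skipped relation and
 * WAL-logging its pages at commit, based on wal_skip_threshold (in kB).
 */
static void
sketch_sync_or_log(Relation rel, SMgrRelation srel, BlockNumber nblocks)
{
	uint64		size_kb = (uint64) nblocks * (BLCKSZ / 1024);

	if (size_kb >= (uint64) wal_skip_threshold)
	{
		/* Large relation: emit WAL for its pages instead of syncing. */
		log_newpage_range(rel, MAIN_FORKNUM, 0, nblocks, false);
	}
	else
	{
		/* Small relation: write out its buffers and fsync it now. */
		FlushRelationBuffers(rel);
		smgrimmedsync(srel, MAIN_FORKNUM);
	}
}

With the default 8 kB block size, a crossing point of 1000 pages would
correspond to roughly wal_skip_threshold = 8000 (the GUC is in kilobytes;
the patch's default is 64).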
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.
I looked at the version.
Notable changes in v24nm:
- Wrote section "Skipping WAL for New RelFileNode" in
src/backend/access/transam/README to be the main source concerning the new
coding rules.
Thanks for writing this.
+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
Even when using these methods, CommitTransaction flushes out the buffers and
then syncs the files again. Isn't a description along the following lines
needed?
===
Even if an access method has switched an in-transaction-created relfilenode
to WAL-writing, Commit(Prepare)Transaction still flushes all buffers for the
file and then smgrimmedsync()s it.
===
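For what it's worth, a minimal sketch of the first approach quoted above (an
access method irreversibly switching a fork from WAL-skipping to
WAL-writing). This is illustrative only, not code from the patch; it assumes
an open Relation "rel" whose relfilenode is currently skipping WAL.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

/*
 * Sketch only: after this call, the AM must WAL-log every further change
 * to the main fork of rel.
 */
static void
stop_skipping_wal(Relation rel)
{
	/* Write out any dirty shared buffers for this relation ... */
	FlushRelationBuffers(rel);

	/* ... and make sure the data reaches disk before any later WAL. */
	RelationOpenSmgr(rel);
	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
}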
- Updated numerous comments and doc sections.
- Eliminated the pendingSyncs list in favor of a "sync" field in
pendingDeletes. I mostly did this to eliminate the possibility of the lists
getting out of sync. This removed considerable parallel code for managing a
second list at end-of-xact. We now call smgrDoPendingSyncs() only when
committing or preparing a top-level transaction.
Mmm, right. The second list was a leftover from older versions, which
perhaps needed additional work at rollback. Actually, as of v23 the function
syncs no files at rollback. Merging the two is wiser.
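For reference, a sketch of what folding the sync flag into the pending-delete
list could look like. The struct below follows the shape of storage.c's
pre-patch PendingRelDelete; the added "sync" field is an illustration of the
description above, not necessarily the patch's exact code.

#include "postgres.h"
#include "storage/backendid.h"
#include "storage/relfilenode.h"

typedef struct PendingRelDelete
{
	RelFileNode relnode;		/* relation that may need to be deleted */
	BackendId	backend;		/* InvalidBackendId if not a temp rel */
	bool		atCommit;		/* T=delete at commit; F=delete at abort */
	bool		sync;			/* sketch: also needs a sync at commit? */
	int			nestLevel;		/* xact nesting level of request */
	struct PendingRelDelete *next;	/* linked-list link */
} PendingRelDelete;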
- Whenever code sets an rd_*Subid field of a Relation, it must call
EOXactListAdd(). swap_relation_files() was not doing so, so the field
remained set during the next transaction. I introduced
RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
so it also affects the mapped relation case.
Ugh. Thanks for pointing that out. By the way,
+ /*
+ * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+ * subtransaction. Since the next step for rel2 is deletion, don't bother
+ * recording the newness of its relfilenode.
+ */
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);
It cannot be accessed from other sessions. Theoretically it doesn't need a
lock, but NoLock cannot be used there since there's a path that doesn't take
a lock on the relation. Still, AEL seems too strong and causes unnecessary
side effects. Couldn't we use a weaker lock?
... Time is up. I'll continue looking at this.
regards.
- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
rd_createSubid remained set. (That happened before this patch, but it has
been harmless.) I fixed this in heap_create().

- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
when checksums are enabled. Observe the precedent that
RelationCopyStorage() has not been exempting FSM_FORKNUM.

- Pass log_newpage_range() a "false" for page_std, for the same reason
RelationCopyStorage() does.

- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
the wrong data for non-main forks. Before this patch, callers always passed
MAIN_FORKNUM and "true", hence the lack of complaints.

- Restored table_finish_bulk_insert(), though heapam no longer provides a
callback. The API is still well-defined, and other table AMs might have use
for it. Removing it feels like a separate proposal.

- Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit
itself in light of this patch.

- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
it was overcounting.

- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.

- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
"Write Ahead Log" -> "Settings", between similar settings
wal_writer_flush_after and commit_delay. The other place I considered was
"Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
backend_flush_after.

- Gave each test a unique name. Changed test table names to be descriptive,
e.g. test7 became trunc_trig.

- Squashed all patches into one. Split patches are good when one could
reasonably choose to push a subset, but that didn't apply here. I wouldn't
push a GUC implementation without its documentation. Since the tests fail
without the main bug fix, I wouldn't push tests separately.

By the way, based on the comment at zheap_prepare_insert(), I expect zheap
will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL()
and instead test for RELPERSISTENCE_PERMANENT.
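As an illustration of that last point (a sketch only, not zheap code), an
access method that opts out of WAL skipping would test persistence directly
rather than RelationNeedsWAL(); the function name here is hypothetical.

#include "postgres.h"
#include "utils/rel.h"

/* Sketch only: an AM that always WAL-logs permanent relations. */
static bool
am_needs_wal(Relation rel)
{
	/*
	 * RelationNeedsWAL() returns false while the relfilenode is skipping
	 * WAL; testing persistence directly keeps the AM WAL-logging anyway.
	 */
	return rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT;
}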
--
Kyotaro Horiguchi
NTT Open Source Software Center
I should have replied to this first.
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm. I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()
When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:

begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWAL(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.) Overall, I
lean toward (a). Any other ideas or preferences?
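For concreteness, a sketch of what option (a) could look like. This is an
illustration only, not part of the attached patches; the record id
XLOG_GIST_ASSIGN_LSN and the function name are hypothetical.

#include "postgres.h"
#include "access/xloginsert.h"

/*
 * Sketch only: a no-op GiST WAL record whose sole purpose is to reserve a
 * real LSN for a permanent relation that is currently skipping WAL.
 */
XLogRecPtr
gistXLogAssignLSN(void)
{
	int			dummy = 0;

	/* The record carries no meaningful payload; replay would do nothing. */
	XLogBeginInsert();
	XLogRegisterData((char *) &dummy, sizeof(dummy));
	return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
}

gistGetFakeLSN() could then return this value for permanent relations whose
relfilenode is skipping WAL, instead of failing the assertion.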
I don't like (b), either.
What we need there is some sequence of numbers to use as page LSNs that is
compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in that
case? Or, I'm not sure, but I suppose nothing bad happens when an
UNLOGGED GiST index gets turned into a LOGGED one.
Rewriting the table, as SET LOGGED does, would work, but that is not realistic.
=== Defect 2: repetitive work when syncing many relations
For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan. Commit 279628a
introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions. (One could,
however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
you agree, could you modify v24nm to implement that?
Seems reasonable. Please wait a minute.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
use_real_lsn_as_fake_lsn.patch
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..387b1f7d18 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1017,8 +1017,7 @@ gistGetFakeLSN(Relation rel)
* XXX before commit fix this. This is not correct for
* RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
*/
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
- || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
{
/*
* Temporary relations are only accessible in our session, so a simple
@@ -1026,6 +1025,15 @@ gistGetFakeLSN(Relation rel)
*/
return counter++;
}
+ else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ {
+ /*
+ * Even though we are skipping WAL-logging for a permanent relation,
+ * the LSN must be a real one because WAL-logging starts after commit.
+ */
+ Assert(!RelationNeedsWAL(rel));
+ return GetXLogInsertRecPtr();
+ }
else
{
/*
Wow.. This is embarrassing.. *^^*.
At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I should have replied this first.
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm. I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()
When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:

begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWAL(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.) Overall, I
lean toward (a). Any other ideas or preferences?

I don't like (b), either.
What we need there is some sequence of numbers to use as page LSNs that is
compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in that
case? Or, I'm not sure, but I suppose nothing bad happens when an
UNLOGGED GiST index gets turned into a LOGGED one.
Yes, I just forgot to remove these lines when writing the following.
Rewriting the table, as SET LOGGED does, would work, but that is not realistic.
=== Defect 2: repetitive work when syncing many relations
For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan. Commit 279628a
introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions. (One could,
however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
you agree, could you modify v24nm to implement that?

Seems reasonable. Please wait a minute.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan. Commit 279628a
introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions.

Seems reasonable. Please wait a minute.
This is a first cut at that. It makes the function
FlushRelationBuffersWithoutRelcache, which was introduced earlier in this
work, unnecessary. The first patch reverts it, and the second patch adds the
bulk-sync feature.
The new function FlushRelFileNodesAllBuffers, unlike
DropRelFileNodesAllBuffers, needs the SMgrRelation that FlushBuffer()
requires. So it takes a somewhat tricky approach: it uses a type,
SMgrSortArray, a pointer to which is compatible with a pointer to
RelFileNode.
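A sketch of the layout trick described above; the field names are
assumptions for illustration, not necessarily the patch's exact definition.

#include "postgres.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"

/*
 * Because rnode is the first member, a pointer to an SMgrSortArray element
 * can be treated as a pointer to its RelFileNode, so the same sort/search
 * comparator used for plain RelFileNode arrays keeps working, while the
 * srel field stays available for FlushBuffer().
 */
typedef struct SMgrSortArray
{
	RelFileNode rnode;			/* must be first for pointer compatibility */
	SMgrRelation srel;			/* needed later by FlushBuffer() */
} SMgrSortArray;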
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Revert-FlushRelationBuffersWithoutRelcache.patch
From c51b44734d88fb19b568c4c0240848c8be2b7cf4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH 1/2] Revert FlushRelationBuffersWithoutRelcache.
The succeeding patch makes the function unnecessary, and the function is no
longer useful globally. Revert it.
---
src/backend/storage/buffer/bufmgr.c | 27 ++++++++++-----------------
src/include/storage/bufmgr.h | 2 --
2 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..67bbb26cae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,27 +3203,20 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- RelationOpenSmgr(rel);
-
- FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
- RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
- RelFileNode rnode = smgr->smgr_rnode.node;
- int i;
+ int i;
BufferDesc *bufHdr;
- if (islocal)
+ /* Open rel at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+
+ if (RelationUsesLocalBuffers(rel))
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(smgr,
+ smgrwrite(rel->rd_smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..8cd1cf25d9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
- bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
--
2.23.0
0002-Improve-the-performance-of-relation-syncs.patch
From 882731fcf063269d0bf85c57f23c83b9570e5df5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH 2/2] Improve the performance of relation syncs.
We can improve performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans on the whole
shared_buffers from the number of synced relations to one.
---
src/backend/catalog/storage.c | 28 +++++--
src/backend/storage/buffer/bufmgr.c | 113 ++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 38 +++++++++-
src/include/storage/bufmgr.h | 1 +
src/include/storage/smgr.h | 1 +
5 files changed, 174 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51c233dac6..65811b2a9e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,6 +533,9 @@ smgrDoPendingSyncs(void)
{
PendingRelDelete *pending;
HTAB *delhash = NULL;
+ int nrels = 0,
+ maxrels = 0;
+ SMgrRelation *srels = NULL;
if (XLogIsNeeded())
return; /* no relation can use this */
@@ -573,7 +576,7 @@ smgrDoPendingSyncs(void)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- bool to_be_removed = false; /* don't sync if aborted */
+ bool to_be_removed = false;
ForkNumber fork;
BlockNumber nblocks[MAX_FORKNUM + 1];
BlockNumber total_blocks = 0;
@@ -623,14 +626,21 @@ smgrDoPendingSyncs(void)
*/
if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
{
- /* Flush all buffers then sync the file */
- FlushRelationBuffersWithoutRelcache(srel, false);
+ /* relations to sync are passed to smgrdosyncall at once */
- for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
{
- if (smgrexists(srel, fork))
- smgrimmedsync(srel, fork);
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
}
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
else
{
@@ -658,6 +668,12 @@ smgrDoPendingSyncs(void)
if (delhash)
hash_destroy(delhash);
+
+ if (nrels > 0)
+ {
+ smgrdosyncall(srels, nrels);
+ pfree(srels);
+ }
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 67bbb26cae..56314653ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
int index;
} CkptTsStatus;
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelFileNodesAllBuffers shares the same comparator function with
+ * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must
+ * be compatible.
+ */
+typedef struct SMgrSortArray
+{
+ RelFileNode rnode; /* This must be the first member */
+ SMgrRelation srel;
+} SMgrSortArray;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -3283,6 +3296,106 @@ FlushRelationBuffers(Relation rel)
}
}
+/* ---------------------------------------------------------------------
+ * FlushRelFileNodesAllBuffers
+ *
+ * This function flushes out of the buffer pool all the pages of all
+ * forks of the specified smgr relations. It's equivalent to
+ * calling FlushRelationBuffers once per fork per relation, but the
+ * parameter is not Relation but SMgrRelation
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+ int i;
+ SMgrSortArray *srels;
+ bool use_bsearch;
+
+ if (nrels == 0)
+ return;
+
+ /* fill-in array for qsort */
+ srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ {
+ Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+ srels[i].rnode = smgrs[i]->smgr_rnode.node;
+ srels[i].srel = smgrs[i];
+ }
+
+ /*
+ * Save the bsearch overhead for low number of relations to
+ * sync. See DropRelFileNodesAllBuffers for details. The name DROP_*
+ * is for historical reasons.
+ */
+ use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD;
+
+ /* sort the list of SMgrRelations if necessary */
+ if (use_bsearch)
+ pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+ /* Make sure we can handle the pin inside the loop */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ SMgrSortArray *srelent = NULL;
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ /*
+ * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+ * and saves some cycles.
+ */
+
+ if (!use_bsearch)
+ {
+ int j;
+
+ for (j = 0; j < nrels; j++)
+ {
+ if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+ {
+ srelent = &srels[j];
+ break;
+ }
+ }
+
+ }
+ else
+ {
+ srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+ srels, nrels, sizeof(SMgrSortArray),
+ rnode_comparator);
+ }
+
+ /* buffer doesn't belong to any of the given relfilenodes; skip it */
+ if (srelent == NULL)
+ continue;
+
+ /* Ensure there's a free array slot for PinBuffer_Locked */
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+ if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+ (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ FlushBuffer(bufHdr, srelent->srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+
+ pfree(srels);
+}
+
/* ---------------------------------------------------------------------
* FlushDatabaseBuffers
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b50c69b438..f79f2df40f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
}
+/*
+ * smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ * All forks of all given relations are synced out to the store.
+ *
+ * This is equivalent to flushing all buffers with FlushRelationBuffers for each
+ * smgr relation then calling smgrimmedsync for all forks of each smgr
+ * relation, but it's significantly quicker so should be preferred when
+ * possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+ int i = 0;
+ ForkNumber forknum;
+
+ if (nrels == 0)
+ return;
+
+ /* We need to flush all buffers for the relations before sync. */
+ FlushRelFileNodesAllBuffers(rels, nrels);
+
+ /*
+ * Sync the physical file(s).
+ */
+ for (i = 0; i < nrels; i++)
+ {
+ int which = rels[i]->smgr_which;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ if (smgrsw[which].smgr_exists(rels[i], forknum))
+ smgrsw[which].smgr_immedsync(rels[i], forknum);
+ }
+ }
+}
+
/*
* smgrdounlinkall() -- Immediately unlink all forks of all given relations
*
@@ -469,7 +506,6 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
pfree(rnodes);
}
-
/*
* smgrextend() -- Add a new block to a file.
*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8cd1cf25d9..3f85e8c6fe 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -195,6 +195,7 @@ extern void FlushRelationBuffers(Relation rel);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1543d8d870..31a5ecd059 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
--
2.23.0
On 2019-11-05 22:16, Robert Haas wrote:
First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time.
How useful is this behavior, relative to all the effort required?
Even if the benefit is significant, how many users can accept running
with wal_level=minimal and thus without replication or efficient backups?
Is there perhaps an alternative approach involving unlogged tables to
get a similar performance benefit?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 22, 2019 at 01:21:31PM +0100, Peter Eisentraut wrote:
On 2019-11-05 22:16, Robert Haas wrote:
First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time.
How useful is this behavior, relative to all the effort required?
Even if the benefit is significant, how many users can accept running with
wal_level=minimal and thus without replication or efficient backups?
That longstanding optimization is too useful to remove, but likely not useful
enough to add today if we didn't already have it. The initial-data-load use
case remains plausible. I can also imagine using wal_level=minimal for data
warehouse applications where one can quickly rebuild from the authoritative
data.
Is there perhaps an alternative approach involving unlogged tables to get a
similar performance benefit?
At wal_level=replica, it seems inevitable that ALTER TABLE SET LOGGED will
need to WAL-log the table contents. I suppose we could keep wal_level=minimal
and change its only difference from wal_level=replica to be that ALTER TABLE
SET LOGGED skips WAL. Currently, ALTER TABLE SET LOGGED also rewrites the
table; that would need to change. I'd want to add ALTER INDEX SET LOGGED,
too. After all that, users would need to modify their applications. Overall,
it's possible, but it's not a clear win over the status quo.
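For readers unfamiliar with that alternative, here is a sketch of how it looks
with today's commands (the object name and data are made up for illustration);
as the paragraph above says, the final step currently rewrites the table and,
at wal_level=replica, WAL-logs its contents, which is what would have to
change:
CREATE UNLOGGED TABLE load_target (id int, payload text);
INSERT INTO load_target SELECT g, 'row ' || g FROM generate_series(1, 1000000) g;
ALTER TABLE load_target SET LOGGED;  -- rewrites the table; WAL-logs it at wal_level >= replica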
On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
By the way, before finalizing this, I'd like to share the result of a
brief benchmarking.
What non-default settings did you use? Please give the output of this or a
similar command:
select name, setting from pg_settings where setting <> boot_val;
If you run more benchmarks and weren't already using wal_buffers=16MB, I
recommend using it.
With 10 pgbench sessions.
pages SYNC WAL
1: 915 ms 301 ms
3: 1634 ms 508 ms
5: 1634 ms 293 ms
10: 1671 ms 1043 ms
17: 1600 ms 333 ms
31: 1864 ms 314 ms
56: 1562 ms 448 ms
100: 1538 ms 394 ms
177: 1697 ms 1047 ms
316: 3074 ms 1788 ms
562: 3306 ms 1245 ms
1000: 3440 ms 2182 ms
1778: 5064 ms 6464 ms # WAL's slope becomes steep
3162: 8675 ms 8165 ms
For picking a default wal_skip_threshold, it would have been more informative
to see how this changes pgbench latency statistics. Some people want DDL to
be fast, but more people want DDL not to reduce the performance of concurrent
non-DDL. This benchmark procedure may help:
1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
I would compare pgbench tps and latency between the seconds when DDL is and is
not running. As you did in earlier tests, I would repeat it using various
page counts, with and without sync.
On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
Even using these methods, TransactionCommit flushes out the buffers and then
syncs the files again. Isn't a description something like the following
needed?
===
Even if an access method has switched an in-transaction-created relfilenode to
WAL-writing, Commit(Prepare)Transaction still flushes all buffers for the
file and then smgrimmedsync()s the file.
===
It is enough that the text says to prefer the approach that core access
methods use. The extra flush and sync when using a non-preferred approach
wastes some performance, but it is otherwise harmless.
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);
It cannot be accessed from other sessions. Theoretically it doesn't
need a lock, but NoLock cannot be used there since there's a path that
doesn't take a lock on the relation. But AEL seems too strong, and it
causes unnecessary side effects. Couldn't we use weaker locks?
We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
case this has no side effects.
On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in
=== Defect 1: gistGetFakeLSN()
When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:
begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;
I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:
(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.
(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.
Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.) Overall, I
lean toward (a). Any other ideas or preferences?
I don't like (b) either.
What we need there are just sequential numbers for page LSNs, but they must be
compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in that
case?
No. If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
GiST pages need an increasing LSN value.
I noticed an additional defect:
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record
If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
On Sat, Nov 23, 2019 at 11:35:09AM -0500, Noah Misch wrote:
That longstanding optimization is too useful to remove, but likely not useful
enough to add today if we didn't already have it. The initial-data-load use
case remains plausible. I can also imagine using wal_level=minimal for data
warehouse applications where one can quickly rebuild from the authoritative
data.
I can easily imagine cases where a user would like to use the benefit
of the optimization for an initial data load, and afterwards update
wal_level to replica so that they avoid the initial WAL burst, which
serves no real purpose. So the first argument is pretty strong IMO,
the second much less.
--
Michael
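A minimal sketch of the load-then-switch workflow described above (settings and
object names are illustrative; changing wal_level requires a server restart):
-- phase 1: postgresql.conf has wal_level = minimal and max_wal_senders = 0
BEGIN;
CREATE TABLE warehouse_facts (id int, amount numeric);
COPY warehouse_facts FROM '/tmp/facts.csv' (FORMAT csv);  -- skips WAL: the table is new in this transaction
COMMIT;  -- contents are fsync'd at commit (or WAL-logged if smaller than wal_skip_threshold)
-- phase 2: raise wal_level to replica in postgresql.conf, restart, then take a
-- base backup and enable archiving or streaming replication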
At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in
On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
By the way, before finalizing this, I'd like to share the result of a
brief benchmarking.
What non-default settings did you use? Please give the output of this or a
similar command:
Only wal_level=minimal and max_wal_senders=0.
select name, setting from pg_settings where setting <> boot_val;
If you run more benchmarks and weren't already using wal_buffers=16MB, I
recommend using it.
Roger.
With 10 pgbench sessions.
pages SYNC WAL
1: 915 ms 301 ms
3: 1634 ms 508 ms
5: 1634 ms 293 ms
10: 1671 ms 1043 ms
17: 1600 ms 333 ms
31: 1864 ms 314 ms
56: 1562 ms 448 ms
100: 1538 ms 394 ms
177: 1697 ms 1047 ms
316: 3074 ms 1788 ms
562: 3306 ms 1245 ms
1000: 3440 ms 2182 ms
1778: 5064 ms 6464 ms # WAL's slope becomes steep
3162: 8675 ms 8165 ms
For picking a default wal_skip_threshold, it would have been more informative
to see how this changes pgbench latency statistics. Some people want DDL to
be fast, but more people want DDL not to reduce the performance of concurrent
non-DDL. This benchmark procedure may help:
1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
I would compare pgbench tps and latency between the seconds when DDL is and is
not running. As you did in earlier tests, I would repeat it using various
page counts, with and without sync.
I understood that the "DDL" is not pure DDL but a kind of
define-then-load, like "CREATE TABLE AS", or "CREATE TABLE" then "COPY
FROM".
On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
Even using these methods, TransactionCommit flushes out the buffers and then
syncs the files again. Isn't a description something like the following
needed?
===
Even if an access method has switched an in-transaction-created relfilenode to
WAL-writing, Commit(Prepare)Transaction still flushes all buffers for the
file and then smgrimmedsync()s the file.
===
It is enough that the text says to prefer the approach that core access
methods use. The extra flush and sync when using a non-preferred approach
wastes some performance, but it is otherwise harmless.
Ah, right, and I agree.
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);
It cannot be accessed from other sessions. Theoretically it doesn't
need a lock, but NoLock cannot be used there since there's a path that
doesn't take a lock on the relation. But AEL seems too strong, and it
causes unnecessary side effects. Couldn't we use weaker locks?
We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
case this has no side effects.
I forgot that this optimization is used only in non-replication
configurations. So I agree that AEL has no side
effects.
On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in
=== Defect 1: gistGetFakeLSN()
When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:
begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;
I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:
(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.
(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.
Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.) Overall, I
lean toward (a). Any other ideas or preferences?
I don't like (b) either.
What we need there are just sequential numbers for page LSNs, but they must be
compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in that
case?
No. If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
GiST pages need an increasing LSN value.
Sorry, I noticed that after the mail went out. I agree with (a) and will
do that.
I noticed an additional defect:
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record
If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
The TRUNCATE replaces the relfilenode in the catalog, so the pre-TRUNCATE
content wouldn't be seen after COMMIT. Since the file has no pages,
it's right that no FPI is emitted. What we should make sure of is that the empty
file's metadata is synced out. But I think that kind of failure
shouldn't happen on modern file systems. If we don't rely on such
behavior, we can ensure that by turning the zero-pages case from
WAL into file sync. I'll do that in the next version.
I'll post the next version as a single patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in
This benchmark procedure may help:
1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
I would compare pgbench tps and latency between the seconds when DDL is and is
not running. As you did in earlier tests, I would repeat it using various
page counts, with and without sync.
I understood that the "DDL" is not pure DDL but a kind of
define-then-load, like "CREATE TABLE AS", or "CREATE TABLE" then "COPY
FROM".
When I wrote "DDL", I meant the four-command transaction that you already used
in benchmarks.
I noticed an additional defect:
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record
If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
The TRUNCATE replaces the relfilenode in the catalog
No, it does not. Since the relation is new in the transaction, the TRUNCATE
uses the heap_truncate_one_rel() strategy.
Since the file has no pages, it's right that no FPI is emitted.
Correct.
If we don't rely on such
behavior, we can ensure that by turning the zero-pages case from
WAL into file sync. I'll do that in the next version.
The zero-pages case is not special. Here's an example of the problem with a
nonzero size:
BEGIN;
CREATE TABLE t (c) AS SELECT * FROM generate_series(1,100000);
CHECKPOINT; -- write and fsync the table's many pages
TRUNCATE t; -- no WAL
INSERT INTO t VALUES (0); -- no WAL
COMMIT; -- FPI for one page; nothing removes the additional pages
On Sat, Nov 23, 2019 at 4:21 PM Noah Misch <noah@leadboat.com> wrote:
I noticed an additional defect:
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record
If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
Shouldn't the TRUNCATE be triggering an fsync() to happen before
COMMIT is permitted to complete? You'd have the same problem if the
TRUNCATE were replaced by INSERT, unless fsync() happens in that case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 25, 2019 at 03:58:14PM -0500, Robert Haas wrote:
On Sat, Nov 23, 2019 at 4:21 PM Noah Misch <noah@leadboat.com> wrote:
I noticed an additional defect:
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record
If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
Shouldn't the TRUNCATE be triggering an fsync() to happen before
COMMIT is permitted to complete?
With wal_skip_threshold=0, you do get an fsync(). The patch tries to avoid
at-commit fsync of small files by WAL-logging file contents instead. However,
the patch doesn't WAL-log enough to handle files that decreased in size.
You'd have the same problem if the
TRUNCATE were replaced by INSERT, unless fsync() happens in that case.
I think an insert would be fine. You'd get an FPI record for the relation's
one page, which fully reproduces the relation.
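To spell out the contrast with the TRUNCATE example, here is the INSERT variant
as described in the reply above (a sketch; the comments describe the intended
behavior under the patch, with wal_level=minimal and the default
wal_skip_threshold):
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT;                -- the table's one page is written and fsync'd
INSERT INTO t VALUES (2);  -- still skips WAL; the relfilenode is new in this transaction
COMMIT;                    -- the small file is WAL-logged at commit, and the FPI
                           -- alone reproduces the page after a crash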
At Sun, 24 Nov 2019 22:08:39 -0500, Noah Misch <noah@leadboat.com> wrote in
On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in
I noticed an additional defect:
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record
If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
The TRUNCATE replaces the relfilenode in the catalog
No, it does not. Since the relation is new in the transaction, the TRUNCATE
uses the heap_truncate_one_rel() strategy.
..
The zero-pages case is not special. Here's an example of the problem with a
nonzero size:
I got it. That is, if the file has had blocks beyond its size at
commit, we should sync the file even if it is small enough. It needs to
track the before-truncation size, as this patch used to do.
pendingSyncHash is resurrected to do truncate-size tracking. That
information cannot be stored in SMgrRelation, which will disappear
on invalidation, or in Relation, which is not available in the storage layer.
smgrDoPendingDeletes needs to be called at abort again to clean up
the now-useless hash. I'm not sure of the exact cause, but
AssertPendingSyncs_RelationCache() fails at abort (so it is not called
at abort).
smgrDoPendingSyncs and RelFileNodeSkippingWAL() become simpler by
using the hash.
It is not fully checked. I haven't merged or measured performance yet,
but I'm posting the status-quo patches for now.
- v25-0001-version-nm.patch
Noah's v24 patch.
- v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch
Remove a useless function (added by this patch set..).
- v25-0003-Improve-the-performance-of-relation-syncs.patch
Make smgrDoPendingSyncs scan shared buffers only once.
- v25-0004-Adjust-gistGetFakeLSN.patch
Amendment for gistGetFakeLSN. This uses GetXLogInsertRecPtr as long as
it is different from the previous call, and emits a dummy WAL record if we
need a new LSN. Since records other than switch_wal cannot be empty, the
dummy WAL record carries an integer payload for now.
- v25-0005-Sync-files-shrinked-by-truncation.patch
Amendment for the truncation problem.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v25-0001-version-nm.patch
From 86d7c2dee819b1171f0a02c56e4cda065c64246f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v25 1/5] version nm
---
doc/src/sgml/config.sgml | 43 +++--
doc/src/sgml/perform.sgml | 47 ++----
src/backend/access/gist/gistutil.c | 7 +-
src/backend/access/heap/heapam.c | 45 +-----
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/heap/rewriteheap.c | 21 +--
src/backend/access/nbtree/nbtsort.c | 41 ++---
src/backend/access/transam/README | 47 +++++-
src/backend/access/transam/xact.c | 14 ++
src/backend/access/transam/xloginsert.c | 10 +-
src/backend/access/transam/xlogutils.c | 17 +-
src/backend/catalog/heap.c | 4 +
src/backend/catalog/storage.c | 198 +++++++++++++++++++++--
src/backend/commands/cluster.c | 11 ++
src/backend/commands/copy.c | 58 +------
src/backend/commands/createas.c | 11 +-
src/backend/commands/matview.c | 12 +-
src/backend/commands/tablecmds.c | 11 +-
src/backend/storage/buffer/bufmgr.c | 37 +++--
src/backend/storage/smgr/md.c | 9 +-
src/backend/utils/cache/relcache.c | 122 ++++++++++----
src/backend/utils/misc/guc.c | 13 ++
src/include/access/heapam.h | 3 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 18 +--
src/include/catalog/storage.h | 5 +
src/include/storage/bufmgr.h | 5 +
src/include/utils/rel.h | 57 +++++--
src/include/utils/relcache.h | 8 +-
src/test/regress/pg_regress.c | 2 +
30 files changed, 551 insertions(+), 349 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
levels. This parameter can only be set at server start.
</para>
<para>
- In <literal>minimal</literal> level, WAL-logging of some bulk
- operations can be safely skipped, which can make those
- operations much faster (see <xref linkend="populate-pitr"/>).
- Operations in which this optimization can be applied include:
- <simplelist>
- <member><command>CREATE TABLE AS</command></member>
- <member><command>CREATE INDEX</command></member>
- <member><command>CLUSTER</command></member>
- <member><command>COPY</command> into tables that were created or truncated in the same
- transaction</member>
- </simplelist>
- But minimal WAL does not contain enough information to reconstruct the
- data from a base backup and the WAL logs, so <literal>replica</literal> or
- higher must be used to enable WAL archiving
- (<xref linkend="guc-archive-mode"/>) and streaming replication.
+ In <literal>minimal</literal> level, no information is logged for
+ tables or indexes for the remainder of a transaction that creates or
+ truncates them. This can make bulk operations much faster (see
+ <xref linkend="populate-pitr"/>). But minimal WAL does not contain
+ enough information to reconstruct the data from a base backup and the
+ WAL logs, so <literal>replica</literal> or higher must be used to
+ enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+ streaming replication.
</para>
<para>
In <literal>logical</literal> level, the same information is logged as
@@ -2889,6 +2882,26 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+ <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When <varname>wal_level</varname> is <literal>minimal</literal> and a
+ transaction commits after creating or rewriting a permanent table,
+ materialized view, or index, this setting determines how to persist
+ the new data. If the data is smaller than this setting, write it to
+ the WAL log; otherwise, use an fsync of the data file. Depending on
+ the properties of your storage, raising or lowering this value might
+ help if such commits are slowing concurrent transactions. The default
+ is 64 kilobytes (<literal>64kB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-commit-delay" xreflabel="commit_delay">
<term><varname>commit_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 715aff63c8..fcc60173fb 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1605,8 +1605,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
needs to be written, because in case of an error, the files
containing the newly loaded data will be removed anyway.
However, this consideration only applies when
- <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
- non-partitioned tables as all commands must write WAL otherwise.
+ <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+ as all commands must write WAL otherwise.
</para>
</sect2>
@@ -1706,42 +1706,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para>
<para>
- Aside from avoiding the time for the archiver or WAL sender to
- process the WAL data,
- doing this will actually make certain commands faster, because they
- are designed not to write WAL at all if <varname>wal_level</varname>
- is <literal>minimal</literal>. (They can guarantee crash safety more cheaply
- by doing an <function>fsync</function> at the end than by writing WAL.)
- This applies to the following commands:
- <itemizedlist>
- <listitem>
- <para>
- <command>CREATE TABLE AS SELECT</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>CREATE INDEX</command> (and variants such as
- <command>ALTER TABLE ADD PRIMARY KEY</command>)
- </para>
- </listitem>
- <listitem>
- <para>
- <command>ALTER TABLE SET TABLESPACE</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>CLUSTER</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>COPY FROM</command>, when the target table has been
- created or truncated earlier in the same transaction
- </para>
- </listitem>
- </itemizedlist>
+ Aside from avoiding the time for the archiver or WAL sender to process the
+ WAL data, doing this will actually make certain commands faster, because
+ they do not write WAL at all if <varname>wal_level</varname>
+ is <literal>minimal</literal> and the current subtransaction (or top-level
+ transaction) created or truncated the table or index they change. (They
+ can guarantee crash safety more cheaply by doing
+ an <function>fsync</function> at the end than by writing WAL.)
</para>
</sect2>
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 553a6d67b1..66c52d6dd6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,12 @@ gistGetFakeLSN(Relation rel)
{
static XLogRecPtr counter = FirstNormalUnloggedLSN;
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+ /*
+ * XXX before commit fix this. This is not correct for
+ * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
+ */
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
+ || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
{
/*
* Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..be19c34cbd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
* heap_multi_insert - insert multiple tuples into a relation
* heap_delete - delete a tuple from a relation
* heap_update - replace a tuple in a relation with another tuple
- * heap_sync - sync heap, for when no WAL has been written
*
* NOTES
* This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
}
}
-/*
- * heap_sync - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched. (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
- /* non-WAL-logged tables never need fsync */
- if (!RelationNeedsWAL(rel))
- return;
-
- /* main heap */
- FlushRelationBuffers(rel);
- /* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
- /* FSM is not critical, don't bother syncing it */
-
- /* toast heap, if any */
- if (OidIsValid(rel->rd_rel->reltoastrelid))
- {
- Relation toastrel;
-
- toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
- FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
- table_close(toastrel, AccessShareLock);
- }
-}
-
/*
* Mask a heap page before performing consistency checks on it.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 92073fec54..07fe717faa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2515,7 +2500,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d285b1f390..3e564838fa 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
}
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
+ * When we WAL-logged rel pages, we must nonetheless fsync them. The
* reason is the same as in storage.c's RelationCopyStorage(): we're
* writing data that's not in shared buffers, and so a CHECKPOINT
* occurring during the rewriteheap operation won't have fsync'd data we
* wrote before the checkpoint.
*/
if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
logical_end_heap_rewrite(state);
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..b61692aefc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
* them. They will need to be re-read into shared buffers on first use after
* the build finishes.
*
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build. After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build. However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL. Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
* This code isn't concerned about the FSM at all. The caller is responsible
* for initializing that.
*
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
- /*
- * We need to log index creation in WAL iff WAL archiving/streaming is
- * enabled UNLESS the index isn't WAL-logged anyway.
- */
- wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+ wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
/* reserve the metapage */
wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
_bt_uppershutdown(wstate, state);
/*
- * If the index is WAL-logged, we must fsync it down to disk before it's
- * safe to commit the transaction. (For a non-WAL-logged index we don't
- * care since the index will be uninteresting after a crash anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the build. It's
- * less obvious that we have to do it even if we did WAL-log the index
- * pages. The reason is that since we're building outside shared buffers,
- * a CHECKPOINT occurring during the build has no way to flush the
- * previously written data to disk (indeed it won't know the index even
- * exists). A crash later on would replay WAL from the checkpoint,
- * therefore it wouldn't replay our earlier WAL entries. If we do not
- * fsync those pages here, they might still not be on disk when the crash
- * occurs.
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
*/
- if (RelationNeedsWAL(wstate->index))
+ if (wstate->btws_use_wal)
{
RelationOpenSmgr(wstate->index);
smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..641809cfda 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,40 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that
+RollbackAndReleaseCurrentSubTransaction() would unlink, in-tree access methods
+write no WAL for that change. For any access method, CommitTransaction()
+writes and fsyncs affected blocks before recording the commit. This skipping
+is mandatory; if a WAL-writing change preceded a WAL-skipping change for the
+same block, REDO could overwrite the WAL-skipping change. Code that writes
+WAL without calling RelationNeedsWAL() must check for this case.
+
+If skipping were not mandatory, a related problem would arise. Suppose, under
+full_page_writes=off, a WAL-writing change follows a WAL-skipping change.
+When a WAL record contains no full-page image, REDO expects the page to match
+its contents from just before record insertion. A WAL-skipping change may not
+reach disk at all, violating REDO's expectation.
+
+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode. It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE. Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while
+the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
Asynchronous Commit
-------------------
@@ -820,13 +854,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe. In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update. However, all these paths are designed to write data that
-no other transaction can see until after T1 commits. The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe. In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock. However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits. The situation is thus not different from ordinary
+WAL-logged updates.
Transaction Emulation during Recovery
-------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5c0d0f2af0..750f95c482 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before AtEOXact_RelationMap(), so that we
+ * don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs();
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before EndPrepare(), so that we don't see
+ * committed-but-broken files after a crash and COMMIT PREPARED.
+ */
+ smgrDoPendingSyncs();
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0036..dda1dea08b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
BlockNumber startblk, BlockNumber endblk,
bool page_std)
{
+ int flags;
BlockNumber blkno;
+ flags = REGBUF_FORCE_IMAGE;
+ if (page_std)
+ flags |= REGBUF_STANDARD;
+
/*
* Iterate over all the pages in the range. They are collected into
* batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
nbufs = 0;
while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
{
- Buffer buf = ReadBuffer(rel, blkno);
+ Buffer buf = ReadBufferExtended(rel, forkNum, blkno,
+ RBM_NORMAL, NULL);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
START_CRIT_SECTION();
for (i = 0; i < nbufs; i++)
{
- XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+ XLogRegisterBuffer(i, bufpack[i], flags);
MarkBufferDirty(bufpack[i]);
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 446760ed6e..9561e30b08 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or while
+ * syncing WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +575,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
/*
* We set up the lockRelId in case anything tries to lock the dummy
* relation. Note that this is fairly bogus since relNode may be
- * different from the relation's OID. It shouldn't really matter though,
- * since we are presumably running by ourselves and can't have any lock
- * conflicts ...
+ * different from the relation's OID. It shouldn't really matter though.
+ * In recovery, we are running by ourselves and can't have any lock
+ * conflicts. While syncing, we already hold AccessExclusiveLock.
*/
rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index b7bcdd9d0f..293ea9a9dd 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -440,6 +440,10 @@ heap_create(const char *relname,
break;
}
}
+ else
+ {
+ rel->rd_createSubid = InvalidSubTransactionId;
+ }
return rel;
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 056ea3d5d3..51c233dac6 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int wal_skip_threshold = 64; /* in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -58,6 +62,7 @@ typedef struct PendingRelDelete
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
+ bool sync; /* whether to fsync at commit */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,6 +119,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->sync =
+ relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -155,6 +162,7 @@ RelationDropStorage(Relation rel)
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->sync = false;
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -355,7 +363,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
/*
* We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a permanent relation.
+ * enabled AND it's a permanent relation. This gives the same answer as
+ * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+ * current operation created a new relfilenode.
*/
use_wal = XLogIsNeeded() &&
(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +407,43 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too. (For a temp or
- * unlogged rel we don't care since the data will be gone after a crash
- * anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the copy. It's
- * less obvious that we have to do it even if we did WAL-log the copied
- * pages. The reason is that since we're copying outside shared buffers, a
- * CHECKPOINT occurring during the copy has no way to flush the previously
- * written data to disk (indeed it won't know the new rel even exists). A
- * crash later on would replay WAL from the checkpoint, therefore it
- * wouldn't replay our earlier WAL entries. If we do not fsync those pages
- * here, they might still not be on disk when the crash occurs.
+ * When we WAL-logged rel pages, we must nonetheless fsync them. The
+ * reason is that since we're copying outside shared buffers, a CHECKPOINT
+ * occurring during the copy has no way to flush the previously written
+ * data to disk (indeed it won't know the new rel even exists). A crash
+ * later on would replay WAL from the checkpoint, therefore it wouldn't
+ * replay our earlier WAL entries. If we do not fsync those pages here,
+ * they might still not be on disk when the crash occurs.
*/
- if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+ if (use_wal || copying_initfork)
smgrimmedsync(dst, forkNum);
}
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ * Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ * New RelFileNode" in src/backend/access/transam/README. Although this
+ * can be determined from the Relation efficiently, this function is
+ * intended for code paths that do not have access to the Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+ PendingRelDelete *pending;
+
+ if (XLogIsNeeded())
+ return false; /* no permanent relfilenode skips WAL */
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
+ return true;
+ }
+
+ return false;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -492,6 +521,145 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingSyncs() -- Take care of relation syncs at commit.
+ *
+ * This should be called before smgrDoPendingDeletes() at every commit or
+ * prepare. It should also be called before emitting the commit WAL record,
+ * so that a sync failure prevents the commit.
+ */
+void
+smgrDoPendingSyncs(void)
+{
+ PendingRelDelete *pending;
+ HTAB *delhash = NULL;
+
+ if (XLogIsNeeded())
+ return; /* no relation can use this */
+
+ Assert(GetCurrentTransactionNestLevel() == 1);
+ AssertPendingSyncs_RelationCache();
+
+ /*
+ * Pending syncs on relations that are to be deleted at this
+ * transaction end should be ignored. Collect the pending deletes that will
+ * happen in the following call to smgrDoPendingDeletes().
+ */
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!pending->atCommit)
+ continue;
+
+ /* create the hash if not done yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &pending->relnode,
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool to_be_removed = false; /* don't sync if aborted */
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ SMgrRelation srel;
+
+ if (!pending->sync)
+ continue;
+ Assert(!pending->atCommit);
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+ if (to_be_removed)
+ continue;
+
+ /* Now it is time to sync the rnode */
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /*
+ * We emit newpage WAL records for smaller relations.
+ *
+ * Small WAL records can be flushed along with other backends' WAL
+ * records, so we emit WAL records instead of syncing for files that
+ * are smaller than a certain threshold, expecting a faster commit.
+ * The threshold is defined by the GUC wal_skip_threshold.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ if (smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /* Emit WAL records for all blocks. The file is small enough. */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /*
+ * Emit WAL for the whole file. Unfortunately we don't know
+ * what kind of a page this is, so we have to log the full
+ * page including any unused space. ReadBufferExtended()
+ * counts some pgstat events; unfortunately, we discard them.
+ */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, false);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b8c349f245..093fff8c5c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
relfilenode2;
Oid swaptemp;
char swptmpchr;
+ Relation rel1;
/* We need writable copies of both pg_class tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1039,6 +1040,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
*/
Assert(!target_is_pg_class);
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
@@ -1173,6 +1175,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
CacheInvalidateRelcacheByTuple(reltup2);
}
+ /*
+ * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+ * subtransaction. Since the next step for rel2 is deletion, don't bother
+ * recording the newness of its relfilenode.
+ */
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);
+ relation_close(rel1, NoLock);
+
/*
* Post alter hook for modified relations. The change to r2 is always
* internal, but r1 depends on the invocation context.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b67d..607e2558a3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2711,63 +2711,15 @@ CopyFrom(CopyState cstate)
RelationGetRelationName(cstate->rel))));
}
- /*----------
- * Check to see if we can avoid writing WAL
- *
- * If archive logging/streaming is not enabled *and* either
- * - table was created in same transaction as this COPY
- * - data is being written to relfilenode created in this transaction
- * then we can skip writing WAL. It's safe because if the transaction
- * doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the table_finish_bulk_insert() at
- * the bottom of this routine first.
- *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
- *
- * We currently don't support this optimization if the COPY target is a
- * partitioned table as we currently only lazily initialize partition
- * information when routing the first tuple to the partition. We cannot
- * know at this stage if we can perform this optimization. It should be
- * possible to improve on this, but it does mean maintaining heap insert
- * option flags per partition and setting them when we first open the
- * partition.
- *
- * This optimization is not supported for relation types which do not
- * have any physical storage, with foreign tables and views using
- * INSTEAD OF triggers entering in this category. Partitioned tables
- * are not supported as per the description above.
- *----------
+ /*
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time. This could possibly be wrong, but
+ * it's unlikely.
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bf7083719..20225dc62f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
myState->rel = intoRelationDesc;
myState->reladdr = intoRelationAddr;
myState->output_cid = GetCurrentCommandId(true);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
+ myState->bistate = GetBulkInsertState();
/*
- * We can skip WAL-logging the insertions, unless PITR or streaming
- * replication is in use. We can skip the FSM in any case.
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
- myState->bistate = GetBulkInsertState();
-
- /* Not using WAL requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..ae809c9801 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->transientrel = transientrel;
myState->output_cid = GetCurrentCommandId(true);
-
- /*
- * We can skip WAL-logging the insertions, unless PITR or streaming
- * replication is in use. We can skip the FSM in any case.
- */
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
- /* Not using WAL requires smgr_targblock be initially invalid */
+ /*
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
+ */
Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5440eb9015..0e2f5f4259 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4770,19 +4770,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
newrel = NULL;
/*
- * Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * Prepare a BulkInsertState and options for table_tuple_insert. The FSM
+ * is empty, so don't bother using it.
*/
if (newrel)
{
mycid = GetCurrentCommandId(true);
bistate = GetBulkInsertState();
-
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -12462,6 +12457,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
table_close(pg_class, RowExclusiveLock);
+ RelationAssumeNewRelfilenode(rel);
+
relation_close(rel, NoLock);
/* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad10736d5..746ce477fc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,20 +3203,27 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
+ RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3233,7 +3240,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3263,18 +3270,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3484,13 +3491,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
(pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
{
/*
- * If we're in recovery we cannot dirty a page because of a hint.
- * We can set the hint, just not dirty the page as a result so the
- * hint is lost when we evict the page or shutdown.
+ * If we must not write WAL, due to a relfilenode-specific
+ * condition or being in recovery, don't dirty the page. We can
+ * set the hint, just not dirty the page as a result so the hint
+ * is lost when we evict the page or shutdown.
*
* See src/backend/storage/page/README for longer discussion.
*/
- if (RecoveryInProgress())
+ if (RecoveryInProgress() ||
+ RelFileNodeSkippingWAL(bufHdr->tag.rnode))
return;
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8a9eaf6430..1d408c339c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
* During replay, we would delete the file and then recreate it, which is fine
* if the contents of the file were repopulated by subsequent WAL entries.
* But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever. By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever. By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
*
* We do not need to go through this dance for temp relations, though, because
* we never make WAL entries for temp rels, and so a temp rel poses no threat
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ad1ff01b32..f3831f0077 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -262,6 +262,9 @@ static void RelationReloadIndexInfo(Relation relation);
static void RelationReloadNailed(Relation relation);
static void RelationFlushRelation(Relation relation);
static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
static void AtEOXact_cleanup(Relation relation, bool isCommit);
static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1095,6 +1098,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1828,6 +1832,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2035,6 +2040,12 @@ RelationIdGetRelation(Oid relationId)
rd = RelationBuildDesc(relationId, true);
if (RelationIsValid(rd))
RelationIncrementReferenceCount(rd);
+
+#ifdef USE_ASSERT_CHECKING
+ if (!XLogIsNeeded() && RelationIsValid(rd))
+ AssertPendingSyncConsistency(rd);
+#endif
+
return rd;
}
@@ -2093,7 +2104,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2509,13 +2520,13 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
- * rewrite-rule, partition key, and partition descriptor substructures
- * in place, because various places assume that these structures won't
- * move while they are working with an open relcache entry. (Note:
- * the refcount mechanism for tupledescs might someday allow us to
- * remove this hack for the tupledesc.)
+ * rd_*Subid, and rd_toastoid state. Also attempt to preserve the
+ * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+ * and partition descriptor substructures in place, because various
+ * places assume that these structures won't move while they are
+ * working with an open relcache entry. (Note: the refcount
+ * mechanism for tupledescs might someday allow us to remove this hack
+ * for the tupledesc.)
*
* Note that this process does not touch CurrentResourceOwner; which
* is good because whatever ref counts the entry may have do not
@@ -2599,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2666,7 +2678,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2751,11 +2763,10 @@ RelationCacheInvalidateEntry(Oid relationId)
* relation cache and re-read relation mapping data.
*
* This is currently used only to recover from SI message buffer overflow,
- * so we do not touch new-in-transaction relations; they cannot be targets
- * of cross-backend SI updates (and our own updates now go through a
- * separate linked list that isn't limited by the SI message buffer size).
- * Likewise, we need not discard new-relfilenode-in-transaction hints,
- * since any invalidation of those would be a local event.
+ * so we do not touch relations having new-in-transaction relfilenodes; they
+ * cannot be targets of cross-backend SI updates (and our own updates now go
+ * through a separate linked list that isn't limited by the SI message
+ * buffer size).
*
* We do this in two phases: the first pass deletes deletable items, and
* the second one rebuilds the rebuildable items. This is essential for
@@ -2806,7 +2817,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -2918,6 +2929,40 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
}
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+ bool relcache_verdict =
+ relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+ ((relation->rd_createSubid != InvalidSubTransactionId &&
+ RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+ Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ * Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL. It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry. It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+ HASH_SEQ_STATUS status;
+ RelIdCacheEnt *idhentry;
+
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ AssertPendingSyncConsistency(idhentry->reldesc);
+}
+#endif
+
/*
* AtEOXact_RelationCache
*
@@ -3029,10 +3074,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
*
* During commit, reset the flag to zero, since we are now out of the
* creating transaction. During abort, simply delete the relcache entry
- * --- it isn't interesting any longer. (NOTE: if we have forgotten the
- * new-ness of a new relation due to a forced cache flush, the entry will
- * get deleted anyway by shared-cache-inval processing of the aborted
- * pg_class insertion.)
+ * --- it isn't interesting any longer.
*/
if (relation->rd_createSubid != InvalidSubTransactionId)
{
@@ -3060,9 +3102,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
}
/*
- * Likewise, reset the hint about the relfilenode being new.
+ * Likewise, reset any record of the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3154,7 +3197,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction record.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3163,6 +3206,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3252,6 +3303,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3549,14 +3601,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
*/
CommandCounterIncrement();
- /*
- * Mark the rel as having been given a new relfilenode in the current
- * (sub) transaction. This is a hint that can be used to optimize later
- * operations on the rel in the same transaction.
- */
+ RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this. The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode. See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
- /* Flag relation as needing eoxact cleanup (to remove the hint) */
+ /* Flag relation as needing eoxact cleanup (to clear these fields) */
EOXactListAdd(relation);
}
@@ -5591,6 +5658,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..eecaf398c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/trigger.h"
@@ -2651,6 +2652,18 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of new file to fsync instead of writing WAL."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &wal_skip_threshold,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
{
{"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..22916e8e0e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
-extern void heap_sync(Relation relation);
-
extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *items,
int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 64022917e2..aca88d0620 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
#define TABLE_INSERT_FROZEN 0x0004
#define TABLE_INSERT_NO_LOGICAL 0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
/*
* Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
+ * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+ * access methods ceased to use this.
*
* Typically callers of tuple_insert and multi_insert will just pass all
* the flags that apply to them, and each AM has to decide which of them
@@ -1087,10 +1086,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1309,10 +1304,9 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
}
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Perform operations necessary to complete insertions made via tuple_insert
+ * and multi_insert with a BulkInsertState specified. In-tree access methods
+ * ceased to use this.
*/
static inline void
table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..108115a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,23 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* GUC variables */
+extern int wal_skip_threshold;
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(void);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..8097d5ab22 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
@@ -189,6 +192,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a10e..9db3d23897 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,22 +63,40 @@ typedef struct RelationData
* rd_replidindex) */
bool rd_statvalid; /* is rd_statlist valid? */
- /*
+ /*----------
* rd_createSubid is the ID of the highest subtransaction the rel has
- * survived into; or zero if the rel was not created in the current top
- * transaction. This can be now be relied on, whereas previously it could
- * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
- * the ID of the highest subtransaction the relfilenode change has
- * survived into, or zero if not changed in the current transaction (or we
- * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
- * when a relation has multiple new relfilenodes within a single
- * transaction, with one of them occurring in a subsequently aborted
- * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
- * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * survived into or zero if the rel was not created in the current top
+ * transaction. rd_firstRelfilenodeSubid is the ID of the highest
+ * subtransaction an rd_node change has survived into or zero if rd_node
+ * matches the value it had at the start of the current top transaction.
+ * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+ * would restore rd_node to the value it had at the start of the current
+ * top transaction. Rolling back any lower subtransaction would not.)
+ * Their accuracy is critical to RelationNeedsWAL().
+ *
+ * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+ * most-recent relfilenode change has survived into or zero if not changed
+ * in the current transaction (or we have forgotten changing it). This
+ * field is accurate when non-zero, but it can be zero when a relation has
+ * multiple new relfilenodes within a single transaction, with one of them
+ * occurring in a subsequently aborted subtransaction, e.g.
+ * BEGIN;
+ * TRUNCATE t;
+ * SAVEPOINT save;
+ * TRUNCATE t;
+ * ROLLBACK TO save;
+ * -- rd_newRelfilenodeSubid is now forgotten
+ *
+ * These fields are read-only outside relcache.c. Other files trigger
+ * rd_node changes by updating pg_class.reltablespace and/or
+ * pg_class.relfilenode. They must call RelationAssumeNewRelfilenode() to
+ * update these fields.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
- SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
- * current xact */
+ SubTransactionId rd_newRelfilenodeSubid; /* highest subxact changing
+ * rd_node to current value */
+ SubTransactionId rd_firstRelfilenodeSubid; /* highest subxact changing
+ * rd_node to any value */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -520,9 +538,16 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction. See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f2ace35b0..d3e8348c1b 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -105,9 +105,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
char relkind);
/*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
*/
extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
@@ -120,6 +121,11 @@ extern void RelationCacheInvalidate(void);
extern void RelationCloseSmgrByOid(Oid relationId);
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
extern void AtEOXact_RelationCache(bool isCommit);
extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fbd6f..1ddde3ecce 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2354,6 +2354,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
fputs("log_lock_waits = on\n", pg_conf);
fputs("log_temp_files = 128kB\n", pg_conf);
fputs("max_prepared_transactions = 2\n", pg_conf);
+ fputs("wal_level = minimal\n", pg_conf); /* XXX before commit remove */
+ fputs("max_wal_senders = 0\n", pg_conf);
for (sl = temp_configs; sl != NULL; sl = sl->next)
{
--
2.23.0
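To make the behavior introduced above concrete, here is a minimal sketch of a
session against a server carrying this patch, assuming wal_level = minimal and
the default wal_skip_threshold of 64kB; the table name and row counts are only
illustrative:

  -- illustrative sketch only; assumes wal_level = minimal
  SET wal_skip_threshold = '64kB';   -- default; decides between WAL-logging and fsync at commit
  BEGIN;
  CREATE TABLE loaded_in_xact (id int, payload text);
  INSERT INTO loaded_in_xact
      SELECT g, repeat('x', 100) FROM generate_series(1, 1000) g;
  COMMIT;
  -- At COMMIT, smgrDoPendingSyncs() sizes the new relfilenode: if its total
  -- size is below wal_skip_threshold, the pages are WAL-logged through
  -- log_newpage_range(); otherwise the relation's buffers are flushed and
  -- the file is fsync'd.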
Attachment: v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch (text/x-patch; charset=us-ascii)
From 630f770a77f1cf57a3d9c805ab154a2e31f2134e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH v25 2/5] Revert FlushRelationBuffersWithoutRelcache.
A succeeding patch makes this function unnecessary, and it is no longer
useful globally, so revert it.
---
src/backend/storage/buffer/bufmgr.c | 27 ++++++++++-----------------
src/include/storage/bufmgr.h | 2 --
2 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..67bbb26cae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,27 +3203,20 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- RelationOpenSmgr(rel);
-
- FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
- RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
- RelFileNode rnode = smgr->smgr_rnode.node;
- int i;
+ int i;
BufferDesc *bufHdr;
- if (islocal)
+ /* Open rel at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+
+ if (RelationUsesLocalBuffers(rel))
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(smgr,
+ smgrwrite(rel->rd_smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..8cd1cf25d9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
- bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
--
2.23.0
Attachment: v25-0003-Improve-the-performance-of-relation-syncs.patch (text/x-patch; charset=us-ascii)
From 12409838ef6eee0e35dd2730bda19bbb9f889931 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH v25 3/5] Improve the performance of relation syncs.
We can improve the performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans over the whole of
shared_buffers from the number of synced relations to one.
---
src/backend/catalog/storage.c | 28 +++++--
src/backend/storage/buffer/bufmgr.c | 113 ++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 38 +++++++++-
src/include/storage/bufmgr.h | 1 +
src/include/storage/smgr.h | 1 +
5 files changed, 174 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51c233dac6..65811b2a9e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,6 +533,9 @@ smgrDoPendingSyncs(void)
{
PendingRelDelete *pending;
HTAB *delhash = NULL;
+ int nrels = 0,
+ maxrels = 0;
+ SMgrRelation *srels = NULL;
if (XLogIsNeeded())
return; /* no relation can use this */
@@ -573,7 +576,7 @@ smgrDoPendingSyncs(void)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- bool to_be_removed = false; /* don't sync if aborted */
+ bool to_be_removed = false;
ForkNumber fork;
BlockNumber nblocks[MAX_FORKNUM + 1];
BlockNumber total_blocks = 0;
@@ -623,14 +626,21 @@ smgrDoPendingSyncs(void)
*/
if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
{
- /* Flush all buffers then sync the file */
- FlushRelationBuffersWithoutRelcache(srel, false);
+ /* relations to sync are passed to smgrdosyncall at once */
- for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
{
- if (smgrexists(srel, fork))
- smgrimmedsync(srel, fork);
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
}
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
else
{
@@ -658,6 +668,12 @@ smgrDoPendingSyncs(void)
if (delhash)
hash_destroy(delhash);
+
+ if (nrels > 0)
+ {
+ smgrdosyncall(srels, nrels);
+ pfree(srels);
+ }
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 67bbb26cae..56314653ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
int index;
} CkptTsStatus;
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelFileNodesAllBuffers shares the same comparator function with
+ * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must
+ * be compatible.
+ */
+typedef struct SMgrSortArray
+{
+ RelFileNode rnode; /* This must be the first member */
+ SMgrRelation srel;
+} SMgrSortArray;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -3283,6 +3296,106 @@ FlushRelationBuffers(Relation rel)
}
}
+/* ---------------------------------------------------------------------
+ * FlushRelFileNodesAllBuffers
+ *
+ * This function flushes out of the buffer pool all the pages of all
+ * forks of the specified smgr relations. It's equivalent to calling
+ * FlushRelationBuffers once per relation, but the parameter is an
+ * SMgrRelation rather than a Relation.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+ int i;
+ SMgrSortArray *srels;
+ bool use_bsearch;
+
+ if (nrels == 0)
+ return;
+
+ /* fill-in array for qsort */
+ srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ {
+ Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+ srels[i].rnode = smgrs[i]->smgr_rnode.node;
+ srels[i].srel = smgrs[i];
+ }
+
+ /*
+ * Skip the bsearch overhead when only a few relations need syncing.
+ * See DropRelFileNodesAllBuffers for details. The DROP_* name of the
+ * threshold constant is historical.
+ */
+ use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD;
+
+ /* sort the list of SMgrRelations if necessary */
+ if (use_bsearch)
+ pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+ /* Make sure we can handle the pin inside the loop */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ SMgrSortArray *srelent = NULL;
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ /*
+ * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+ * and saves some cycles.
+ */
+
+ if (!use_bsearch)
+ {
+ int j;
+
+ for (j = 0; j < nrels; j++)
+ {
+ if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+ {
+ srelent = &srels[j];
+ break;
+ }
+ }
+
+ }
+ else
+ {
+ srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+ srels, nrels, sizeof(SMgrSortArray),
+ rnode_comparator);
+ }
+
+ /* buffer doesn't belong to any of the given relfilenodes; skip it */
+ if (srelent == NULL)
+ continue;
+
+ /* Ensure there's a free array slot for PinBuffer_Locked */
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+ if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+ (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ FlushBuffer(bufHdr, srelent->srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+
+ pfree(srels);
+}
+
/* ---------------------------------------------------------------------
* FlushDatabaseBuffers
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b50c69b438..f79f2df40f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
}
+/*
+ * smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ * All forks of all given relations are synced out to the store.
+ *
+ * This is equivalent to calling FlushRelationBuffers for each smgr
+ * relation, then calling smgrimmedsync for all forks of each relation,
+ * but it's significantly quicker, so it should be preferred when
+ * possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+ int i = 0;
+ ForkNumber forknum;
+
+ if (nrels == 0)
+ return;
+
+ /* We need to flush all buffers for the relations before sync. */
+ FlushRelFileNodesAllBuffers(rels, nrels);
+
+ /*
+ * Sync the physical file(s).
+ */
+ for (i = 0; i < nrels; i++)
+ {
+ int which = rels[i]->smgr_which;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ if (smgrsw[which].smgr_exists(rels[i], forknum))
+ smgrsw[which].smgr_immedsync(rels[i], forknum);
+ }
+ }
+}
+
/*
* smgrdounlinkall() -- Immediately unlink all forks of all given relations
*
@@ -469,7 +506,6 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
pfree(rnodes);
}
-
/*
* smgrextend() -- Add a new block to a file.
*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8cd1cf25d9..3f85e8c6fe 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -195,6 +195,7 @@ extern void FlushRelationBuffers(Relation rel);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1543d8d870..31a5ecd059 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
--
2.23.0
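As a rough illustration of the case this batching targets (several WAL-skipped
relations reaching commit together), here is a sketch assuming wal_level =
minimal and tables bulk-loaded well past wal_skip_threshold; the table names
are made up:

  -- illustrative sketch only; assumes wal_level = minimal
  BEGIN;
  CREATE TABLE chunk_1 (id int, payload text);
  CREATE TABLE chunk_2 (id int, payload text);
  CREATE TABLE chunk_3 (id int, payload text);
  -- ... bulk-load each table well beyond wal_skip_threshold ...
  COMMIT;
  -- Before this patch, commit flushed and fsync'd each new relfilenode
  -- separately, scanning shared_buffers once per relation; with it, the
  -- SMgrRelations are collected and smgrdosyncall() flushes them in a
  -- single pass via FlushRelFileNodesAllBuffers().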
Attachment: v25-0004-Adjust-gistGetFakeLSN.patch (text/x-patch; charset=us-ascii)
From 9a47b1faaae7c5e12596cc172dcb1f37e2fc971a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 16:12:03 +0900
Subject: [PATCH v25 4/5] Adjust gistGetFakeLSN()
GiST needs to set page LSNs to monotonically increasing numbers on updates
even if the relation is not WAL-logged at all. We use a simple counter for
UNLOGGED/TEMP relations, but for WAL-skipped relations the numbers must be
smaller than the LSN at the next commit. The WAL insertion pointer works in
most cases, but we sometimes need to emit a WAL record to generate a unique
LSN for an update. This patch adds a new WAL record kind,
XLOG_GIST_ASSIGN_LSN, which conveys no substantial content, and emits it
when needed.
---
src/backend/access/gist/gistutil.c | 30 +++++++++++++++++++-------
src/backend/access/gist/gistxlog.c | 17 +++++++++++++++
src/backend/access/rmgrdesc/gistdesc.c | 5 +++++
src/include/access/gist_private.h | 2 ++
src/include/access/gistxlog.h | 1 +
5 files changed, 47 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..eebc1a9647 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1011,21 +1011,35 @@ gistproperty(Oid index_oid, int attno,
XLogRecPtr
gistGetFakeLSN(Relation rel)
{
- static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
- /*
- * XXX before commit fix this. This is not correct for
- * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
- */
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
- || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
{
+ static XLogRecPtr counter = FirstNormalUnloggedLSN;
/*
* Temporary relations are only accessible in our session, so a simple
* backend-local counter will do.
*/
return counter++;
}
+ else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ {
+ /*
+ * WAL-logging on this relation will start after commit, so the fake
+ * LSNs must be distinct numbers smaller than the LSN at the next
+ * commit. Emit a dummy WAL record if the insert LSN hasn't advanced
+ * since the last call.
+ */
+ static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+ XLogRecPtr currlsn = GetXLogInsertRecPtr();
+
+ Assert(!RelationNeedsWAL(rel));
+
+ /* No need for an actual record if we already have a distinct LSN */
+ if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+ currlsn = gistXLogAssignLSN();
+
+ lastlsn = currlsn;
+ return currlsn;
+ }
else
{
/*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3b28f54646..cc63c17aba 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
case XLOG_GIST_PAGE_DELETE:
gistRedoPageDelete(record);
break;
+ case XLOG_GIST_ASSIGN_LSN:
+ /* nop. See gistGetFakeLSN(). */
+ break;
default:
elog(PANIC, "gist_redo: unknown op code %u", info);
}
@@ -592,6 +595,20 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
return recptr;
}
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+ int dummy = 0;
+
+ XLogBeginInsert();
+ XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+ XLogRegisterData((char*) &dummy, sizeof(dummy));
+ return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
/*
* Write XLOG record about reuse of a deleted page.
*/
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index eccb6fd942..48cda40ac0 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
case XLOG_GIST_PAGE_DELETE:
out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
break;
+ case XLOG_GIST_ASSIGN_LSN:
+ /* No details to write out */
+ break;
}
}
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
break;
case XLOG_GIST_PAGE_DELETE:
id = "PAGE_DELETE";
+ break;
+ case XLOG_GIST_ASSIGN_LSN:
+ id = "ASSIGN_LSN";
break;
}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index a409975db1..3455dd242d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
BlockNumber origrlink, GistNSN oldnsn,
Buffer leftchild, bool markfollowright);
+extern XLogRecPtr gistXLogAssignLSN(void);
+
/* gistget.c */
extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index e44922d915..1eae06c0fb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
/* #define XLOG_GIST_INSERT_COMPLETE 0x40 */ /* not used anymore */
/* #define XLOG_GIST_CREATE_INDEX 0x50 */ /* not used anymore */
#define XLOG_GIST_PAGE_DELETE 0x60
+#define XLOG_GIST_ASSIGN_LSN 0x70 /* nop, assign a new LSN */
/*
* Backup Blk 0: updated page.
--
2.23.0
v25-0005-Sync-files-shrinked-by-truncation.patch (text/x-patch; charset=us-ascii)
From 656b739e60f5c07e4eb91ec2ba016abf1db39e69 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 26 Nov 2019 21:25:09 +0900
Subject: [PATCH v25 5/5] Sync files shrunk by truncation
If a truncation leaves a WAL-skipped file smaller at commit than its
maximum size during the transaction, the file must not be WAL-logged at
commit and must be synced instead.
---
src/backend/access/transam/xact.c | 5 +-
src/backend/catalog/storage.c | 155 ++++++++++++++++++-----------
src/backend/utils/cache/relcache.c | 1 +
src/include/catalog/storage.h | 2 +-
4 files changed, 102 insertions(+), 61 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 750f95c482..f681cd3a23 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2114,7 +2114,7 @@ CommitTransaction(void)
* transaction. This must happen before AtEOXact_RelationMap(), so that we
* don't see committed-but-broken files after a crash.
*/
- smgrDoPendingSyncs();
+ smgrDoPendingSyncs(true);
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2354,7 +2354,7 @@ PrepareTransaction(void)
* transaction. This must happen before EndPrepare(), so that we don't see
* committed-but-broken files after a crash and COMMIT PREPARED.
*/
- smgrDoPendingSyncs();
+ smgrDoPendingSyncs(true);
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2674,6 +2674,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ smgrDoPendingSyncs(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 65811b2a9e..ea499490b8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -62,11 +62,17 @@ typedef struct PendingRelDelete
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
- bool sync; /* whether to fsync at commit */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+typedef struct pendingSync
+{
+ RelFileNode rnode;
+ BlockNumber max_truncated;
+} pendingSync;
+
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB *pendingSyncHash = NULL;
/*
* RelationCreateStorage
@@ -119,11 +125,39 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->sync =
- relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * If the relation needs at-commit sync, we also need to track the largest
+ * size it reached before any truncation, which smgrDoPendingSyncs uses to
+ * decide whether we can WAL-log the contents or must sync the file.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ pendingSync *pending;
+ bool found;
+
+ /* we sync only permanent relations */
+ Assert(backend == InvalidBackendId);
+
+ if (!pendingSyncHash)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(pendingSync);
+ ctl.hcxt = TopTransactionContext;
+ pendingSyncHash =
+ hash_create("max truncatd block hash",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+ Assert(!found);
+ pending->max_truncated = InvalidBlockNumber;
+ }
+
return srel;
}
@@ -162,7 +196,6 @@ RelationDropStorage(Relation rel)
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->sync = false;
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -320,6 +353,21 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
if (fsm || vm)
XLogFlush(lsn);
}
+ else if (pendingSyncHash)
+ {
+ pendingSync *pending;
+
+ pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+ HASH_FIND, NULL);
+ if (pending)
+ {
+ BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+ if (!BlockNumberIsValid(pending->max_truncated) ||
+ pending->max_truncated < nblocks)
+ pending->max_truncated = nblocks;
+ }
+ }
/* Do the real work to truncate relation forks */
smgrtruncate(rel->rd_smgr, forks, nforks, blocks);
@@ -430,18 +478,17 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
bool
RelFileNodeSkippingWAL(RelFileNode rnode)
{
- PendingRelDelete *pending;
-
if (XLogIsNeeded())
return false; /* no permanent relfilenode skips WAL */
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
- {
- if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
- return true;
- }
+ if (!pendingSyncHash)
+ return false; /* we don't have a to-be-synced relation */
- return false;
+ /* the relation is not tracked as to-be-synced */
+ if (hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+ return false;
+
+ return true;
}
/*
@@ -529,72 +576,60 @@ smgrDoPendingDeletes(bool isCommit)
* failure prevents commit.
*/
void
-smgrDoPendingSyncs(void)
+smgrDoPendingSyncs(bool isCommit)
{
PendingRelDelete *pending;
- HTAB *delhash = NULL;
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ HASH_SEQ_STATUS scan;
+ pendingSync *pendingsync;
if (XLogIsNeeded())
return; /* no relation can use this */
Assert(GetCurrentTransactionNestLevel() == 1);
+
+ if (!pendingSyncHash)
+ return; /* no relation needs sync */
+
+ /* Just throw away all pending syncs if any at rollback */
+ if (!isCommit)
+ {
+ if (pendingSyncHash)
+ {
+ hash_destroy(pendingSyncHash);
+ pendingSyncHash = NULL;
+ }
+ return;
+ }
+
AssertPendingSyncs_RelationCache();
/*
* Pending syncs on the relation that are to be deleted in this
- * transaction-end should be ignored. Collect pending deletes that will
- * happen in the following call to smgrDoPendingDeletes().
+ * transaction-end should be ignored. Remove sync hash entries entries for
+ * relations that will be deleted in the following call to
+ * smgrDoPendingDeletes().
*/
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- bool found PG_USED_FOR_ASSERTS_ONLY;
-
if (!pending->atCommit)
continue;
- /* create the hash if not yet */
- if (delhash == NULL)
- {
- HASHCTL hash_ctl;
-
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(RelFileNode);
- hash_ctl.hcxt = CurrentMemoryContext;
- delhash =
- hash_create("pending del temporary hash", 8, &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- }
-
- (void) hash_search(delhash, (void *) &pending->relnode,
- HASH_ENTER, &found);
- Assert(!found);
+ (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+ HASH_REMOVE, NULL);
}
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ hash_seq_init(&scan, pendingSyncHash);
+ while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
{
- bool to_be_removed = false;
- ForkNumber fork;
- BlockNumber nblocks[MAX_FORKNUM + 1];
- BlockNumber total_blocks = 0;
- SMgrRelation srel;
-
- if (!pending->sync)
- continue;
- Assert(!pending->atCommit);
-
- /* don't sync relnodes that is being deleted */
- if (delhash)
- hash_search(delhash, (void *) &pending->relnode,
- HASH_FIND, &to_be_removed);
- if (to_be_removed)
- continue;
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ SMgrRelation srel;
- /* Now the time to sync the rnode */
- srel = smgropen(pending->relnode, pending->backend);
+ srel = smgropen(pendingsync->rnode, InvalidBackendId);
/*
* We emit newpage WAL records for smaller relations.
@@ -622,9 +657,12 @@ smgrDoPendingSyncs(void)
/*
* Sync file or emit WAL record for the file according to the total
- * size.
+ * size. Do a file sync if the size exceeds the threshold, or if a
+ * truncation has made the file smaller than its maximum in-transaction size.
*/
- if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024 ||
+ (BlockNumberIsValid(pendingsync->max_truncated) &&
+ smgrnblocks(srel, MAIN_FORKNUM) < pendingsync->max_truncated))
{
/* relations to sync are passed to smgrdosyncall at once */
@@ -666,8 +704,9 @@ smgrDoPendingSyncs(void)
}
}
- if (delhash)
- hash_destroy(delhash);
+ Assert (pendingSyncHash);
+ hash_destroy(pendingSyncHash);
+ pendingSyncHash = NULL;
if (nrels > 0)
{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index f3831f0077..ea11ceb4d3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3619,6 +3619,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
void
RelationAssumeNewRelfilenode(Relation relation)
{
+ elog(LOG, "ASSUME: %d", relation->rd_node.relNode);
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 108115a023..bf076657e7 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,7 +35,7 @@ extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
-extern void smgrDoPendingSyncs(void);
+extern void smgrDoPendingSyncs(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
--
2.23.0
At Tue, 26 Nov 2019 21:37:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail>
It is not fully checked. I haven't merged it or measured performance yet,
but I'm posting the status-quo patch for now.
It was actually an inconsistency caused by swap_relation_files.
1. rd_createSubid of the relcache entry for r2 is not turned off. This
prevents the relcache entry from being flushed. Commit processes
pendingSyncs and leaves the relcache entry with rd_createSubid !=
Invalid, which is an inconsistency.
2. relation_open(r1) returns a relcache entry whose relfilenode still
has the old value (relfilenode1), since the command counter has not
been incremented. On the other hand, if it is incremented just before,
AssertPendingSyncConsistency() aborts because of the inconsistency
between relfilenode and rd_firstRel*.
As a result, I have come back to thinking that we need to update both
relcache entries with the right relfilenode.
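For context, here is a minimal SQL sketch of the kind of sequence that reaches
swap_relation_files() for a WAL-skipping relfilenode (assuming wal_level =
minimal; the table name and row count are only illustrative):
    BEGIN;
    CREATE TABLE t (c int PRIMARY KEY);             -- new relfilenode, skips WAL
    INSERT INTO t SELECT generate_series(1, 1000);
    CLUSTER t USING t_pkey;                         -- rewrite swaps relfilenodes via swap_relation_files()
    COMMIT;                                         -- pending syncs must see consistent relcache state
Here r1 would be t and r2 the transient heap built by the rewrite.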
I once thought that taking AccessExclusiveLock in the function would
have no side effects, but the code path is also executed when wal_level
is replica or higher. And, as I mentioned upthread, we can even get
there without taking any lock on r1, or sometimes with only ShareLock.
So upgrading to AccessExclusiveLock emits a Standby/LOCK WAL record and
propagates it to standbys. In the end I'd like to take the weakest lock
(AccessShareLock) there.
Attached is the new version of the patchset.
- v26-0001-version-nm24.patch
Same as v24.
- v26-0002-change-swap_relation_files.patch
Changes to swap_relation_files as mentioned above.
- v26-0003-Improve-the-performance-of-relation-syncs.patch
Do multiple pending syncs in a single scan of shared_buffers.
- v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch
v26-0003 makes the function useless. Remove it.
- v26-0005-Fix-gistGetFakeLSN.patch
gistGetFakeLSN fix.
- v26-0006-Sync-files-shrinked-by-truncation.patch
Fix the problem of a commit-time FPI after a truncation that follows a
checkpoint; see the sketch after this list. I'm not sure this is the
right direction, but pendingSyncHash is removed from the pendingDeletes
list again.
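As a rough, hypothetical illustration of the shrink-by-truncation case this
patch targets (assuming wal_level = minimal; the table name and row counts
are only illustrative):
    BEGIN;
    CREATE TABLE t (c int);                            -- relfilenode skips WAL
    INSERT INTO t SELECT generate_series(1, 100000);   -- file grows to its in-transaction maximum
    TRUNCATE t;                                        -- file becomes smaller than that maximum
    INSERT INTO t VALUES (1);
    COMMIT;                                            -- the file must be fsync'd at commit, not WAL-logged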
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v26-0001-version-nm24.patch (text/x-patch; charset=us-ascii)
From ee96bb1e14969823eab79ab1531d68e8aadc1915 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v26 1/6] version nm24
Noah Misch's version 24.
---
doc/src/sgml/config.sgml | 43 +++--
doc/src/sgml/perform.sgml | 47 ++----
src/backend/access/gist/gistutil.c | 7 +-
src/backend/access/heap/heapam.c | 45 +-----
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/heap/rewriteheap.c | 21 +--
src/backend/access/nbtree/nbtsort.c | 41 ++---
src/backend/access/transam/README | 47 +++++-
src/backend/access/transam/xact.c | 14 ++
src/backend/access/transam/xloginsert.c | 10 +-
src/backend/access/transam/xlogutils.c | 17 +-
src/backend/catalog/heap.c | 4 +
src/backend/catalog/storage.c | 198 +++++++++++++++++++++--
src/backend/commands/cluster.c | 11 ++
src/backend/commands/copy.c | 58 +------
src/backend/commands/createas.c | 11 +-
src/backend/commands/matview.c | 12 +-
src/backend/commands/tablecmds.c | 11 +-
src/backend/storage/buffer/bufmgr.c | 37 +++--
src/backend/storage/smgr/md.c | 9 +-
src/backend/utils/cache/relcache.c | 122 ++++++++++----
src/backend/utils/misc/guc.c | 13 ++
src/include/access/heapam.h | 3 -
src/include/access/rewriteheap.h | 2 +-
src/include/access/tableam.h | 18 +--
src/include/catalog/storage.h | 5 +
src/include/storage/bufmgr.h | 5 +
src/include/utils/rel.h | 57 +++++--
src/include/utils/relcache.h | 8 +-
src/test/regress/pg_regress.c | 2 +
30 files changed, 551 insertions(+), 349 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
levels. This parameter can only be set at server start.
</para>
<para>
- In <literal>minimal</literal> level, WAL-logging of some bulk
- operations can be safely skipped, which can make those
- operations much faster (see <xref linkend="populate-pitr"/>).
- Operations in which this optimization can be applied include:
- <simplelist>
- <member><command>CREATE TABLE AS</command></member>
- <member><command>CREATE INDEX</command></member>
- <member><command>CLUSTER</command></member>
- <member><command>COPY</command> into tables that were created or truncated in the same
- transaction</member>
- </simplelist>
- But minimal WAL does not contain enough information to reconstruct the
- data from a base backup and the WAL logs, so <literal>replica</literal> or
- higher must be used to enable WAL archiving
- (<xref linkend="guc-archive-mode"/>) and streaming replication.
+ In <literal>minimal</literal> level, no information is logged for
+ tables or indexes for the remainder of a transaction that creates or
+ truncates them. This can make bulk operations much faster (see
+ <xref linkend="populate-pitr"/>). But minimal WAL does not contain
+ enough information to reconstruct the data from a base backup and the
+ WAL logs, so <literal>replica</literal> or higher must be used to
+ enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+ streaming replication.
</para>
<para>
In <literal>logical</literal> level, the same information is logged as
@@ -2889,6 +2882,26 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+ <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When <varname>wal_level</varname> is <literal>minimal</literal> and a
+ transaction commits after creating or rewriting a permanent table,
+ materialized view, or index, this setting determines how to persist
+ the new data. If the data is smaller than this setting, write it to
+ the WAL log; otherwise, use an fsync of the data file. Depending on
+ the properties of your storage, raising or lowering this value might
+ help if such commits are slowing concurrent transactions. The default
+ is 64 kilobytes (<literal>64kB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-commit-delay" xreflabel="commit_delay">
<term><varname>commit_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 715aff63c8..fcc60173fb 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1605,8 +1605,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
needs to be written, because in case of an error, the files
containing the newly loaded data will be removed anyway.
However, this consideration only applies when
- <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
- non-partitioned tables as all commands must write WAL otherwise.
+ <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+ as all commands must write WAL otherwise.
</para>
</sect2>
@@ -1706,42 +1706,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
</para>
<para>
- Aside from avoiding the time for the archiver or WAL sender to
- process the WAL data,
- doing this will actually make certain commands faster, because they
- are designed not to write WAL at all if <varname>wal_level</varname>
- is <literal>minimal</literal>. (They can guarantee crash safety more cheaply
- by doing an <function>fsync</function> at the end than by writing WAL.)
- This applies to the following commands:
- <itemizedlist>
- <listitem>
- <para>
- <command>CREATE TABLE AS SELECT</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>CREATE INDEX</command> (and variants such as
- <command>ALTER TABLE ADD PRIMARY KEY</command>)
- </para>
- </listitem>
- <listitem>
- <para>
- <command>ALTER TABLE SET TABLESPACE</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>CLUSTER</command>
- </para>
- </listitem>
- <listitem>
- <para>
- <command>COPY FROM</command>, when the target table has been
- created or truncated earlier in the same transaction
- </para>
- </listitem>
- </itemizedlist>
+ Aside from avoiding the time for the archiver or WAL sender to process the
+ WAL data, doing this will actually make certain commands faster, because
+ they do not write WAL at all if <varname>wal_level</varname>
+ is <literal>minimal</literal> and the current subtransaction (or top-level
+ transaction) created or truncated the table or index they change. (They
+ can guarantee crash safety more cheaply by doing
+ an <function>fsync</function> at the end than by writing WAL.)
</para>
</sect2>
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 553a6d67b1..66c52d6dd6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,12 @@ gistGetFakeLSN(Relation rel)
{
static XLogRecPtr counter = FirstNormalUnloggedLSN;
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+ /*
+ * XXX before commit fix this. This is not correct for
+ * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
+ */
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
+ || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
{
/*
* Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..be19c34cbd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
* heap_multi_insert - insert multiple tuples into a relation
* heap_delete - delete a tuple from a relation
* heap_update - replace a tuple in a relation with another tuple
- * heap_sync - sync heap, for when no WAL has been written
*
* NOTES
* This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
MarkBufferDirty(buffer);
/* XLOG stuff */
- if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+ if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
- needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ needwal = RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
}
}
-/*
- * heap_sync - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction. This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched. (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
- /* non-WAL-logged tables never need fsync */
- if (!RelationNeedsWAL(rel))
- return;
-
- /* main heap */
- FlushRelationBuffers(rel);
- /* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
- /* FSM is not critical, don't bother syncing it */
-
- /* toast heap, if any */
- if (OidIsValid(rel->rd_rel->reltoastrelid))
- {
- Relation toastrel;
-
- toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
- FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
- table_close(toastrel, AccessShareLock);
- }
-}
-
/*
* Mask a heap page before performing consistency checks on it.
*/
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 92073fec54..07fe717faa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
return result;
}
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
- /*
- * If we skipped writing WAL, then we need to sync the heap (but not
- * indexes since those use WAL anyway / don't go through tableam)
- */
- if (options & HEAP_INSERT_SKIP_WAL)
- heap_sync(relation);
-}
-
/* ------------------------------------------------------------------------
* DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
- bool use_wal;
bool is_system_catalog;
Tuplesortstate *tuplesort;
TupleDesc oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
is_system_catalog = IsSystemRelation(OldHeap);
/*
- * We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a WAL-logged rel.
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
*/
- use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
- /* use_wal off requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
/* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff, use_wal);
+ *multi_cutoff);
/* Set up sorting if wanted */
@@ -2515,7 +2500,6 @@ static const TableAmRoutine heapam_methods = {
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_lock = heapam_tuple_lock,
- .finish_bulk_insert = heapam_finish_bulk_insert,
.tuple_fetch_row_version = heapam_fetch_row_version,
.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d285b1f390..3e564838fa 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
Page rs_buffer; /* page currently being built */
BlockNumber rs_blockno; /* block where page will go */
bool rs_buffer_valid; /* T if any tuples in buffer */
- bool rs_use_wal; /* must we WAL-log inserts? */
bool rs_logical_rewrite; /* do we need to do logical rewriting */
TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine
* tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
- * use_wal should the inserts to the new heap be WAL-logged?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi,
- bool use_wal)
+ TransactionId freeze_xid, MultiXactId cutoff_multi)
{
RewriteState state;
MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
/* new_heap needn't be empty, just locked */
state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
state->rs_buffer_valid = false;
- state->rs_use_wal = use_wal;
state->rs_oldest_xmin = oldest_xmin;
state->rs_freeze_xid = freeze_xid;
state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
/* Write the last page, if any */
if (state->rs_buffer_valid)
{
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
}
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too.
- *
- * It's obvious that we must do this when not WAL-logging. It's less
- * obvious that we have to do it even if we did WAL-log the pages. The
+ * When we WAL-logged rel pages, we must nonetheless fsync them. The
* reason is the same as in storage.c's RelationCopyStorage(): we're
* writing data that's not in shared buffers, and so a CHECKPOINT
* occurring during the rewriteheap operation won't have fsync'd data we
* wrote before the checkpoint.
*/
if (RelationNeedsWAL(state->rs_new_rel))
- heap_sync(state->rs_new_rel);
+ smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
logical_end_heap_rewrite(state);
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
{
int options = HEAP_INSERT_SKIP_FSM;
- if (!state->rs_use_wal)
- options |= HEAP_INSERT_SKIP_WAL;
-
/*
* While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
* for the TOAST table are not logically decoded. The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
/* Doesn't fit, so write out the existing page */
/* XLOG stuff */
- if (state->rs_use_wal)
+ if (RelationNeedsWAL(state->rs_new_rel))
log_newpage(&state->rs_new_rel->rd_node,
MAIN_FORKNUM,
state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..b61692aefc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
* them. They will need to be re-read into shared buffers on first use after
* the build finishes.
*
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build. After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build. However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL. Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
* This code isn't concerned about the FSM at all. The caller is responsible
* for initializing that.
*
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
- /*
- * We need to log index creation in WAL iff WAL archiving/streaming is
- * enabled UNLESS the index isn't WAL-logged anyway.
- */
- wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+ wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
/* reserve the metapage */
wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
_bt_uppershutdown(wstate, state);
/*
- * If the index is WAL-logged, we must fsync it down to disk before it's
- * safe to commit the transaction. (For a non-WAL-logged index we don't
- * care since the index will be uninteresting after a crash anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the build. It's
- * less obvious that we have to do it even if we did WAL-log the index
- * pages. The reason is that since we're building outside shared buffers,
- * a CHECKPOINT occurring during the build has no way to flush the
- * previously written data to disk (indeed it won't know the index even
- * exists). A crash later on would replay WAL from the checkpoint,
- * therefore it wouldn't replay our earlier WAL entries. If we do not
- * fsync those pages here, they might still not be on disk when the crash
- * occurs.
+ * When we WAL-logged index pages, we must nonetheless fsync index files.
+ * Since we're building outside shared buffers, a CHECKPOINT occurring
+ * during the build has no way to flush the previously written data to
+ * disk (indeed it won't know the index even exists). A crash later on
+ * would replay WAL from the checkpoint, therefore it wouldn't replay our
+ * earlier WAL entries. If we do not fsync those pages here, they might
+ * still not be on disk when the crash occurs.
*/
- if (RelationNeedsWAL(wstate->index))
+ if (wstate->btws_use_wal)
{
RelationOpenSmgr(wstate->index);
smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..641809cfda 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,40 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that
+RollbackAndReleaseCurrentSubTransaction() would unlink, in-tree access methods
+write no WAL for that change. For any access method, CommitTransaction()
+writes and fsyncs affected blocks before recording the commit. This skipping
+is mandatory; if a WAL-writing change preceded a WAL-skipping change for the
+same block, REDO could overwrite the WAL-skipping change. Code that writes
+WAL without calling RelationNeedsWAL() must check for this case.
+
+If skipping were not mandatory, a related problem would arise. Suppose, under
+full_page_writes=off, a WAL-writing change follows a WAL-skipping change.
+When a WAL record contains no full-page image, REDO expects the page to match
+its contents from just before record insertion. A WAL-skipping change may not
+reach disk at all, violating REDO's expectation.
+
+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode. It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE. Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while
+the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
Asynchronous Commit
-------------------
@@ -820,13 +854,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe. In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update. However, all these paths are designed to write data that
-no other transaction can see until after T1 commits. The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe. In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock. However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits. The situation is thus not different from ordinary
+WAL-logged updates.
Transaction Emulation during Recovery
-------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5c0d0f2af0..750f95c482 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before AtEOXact_RelationMap(), so that we
+ * don't see committed-but-broken files after a crash.
+ */
+ smgrDoPendingSyncs();
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
*/
PreCommit_on_commit_actions();
+ /*
+ * Synchronize files that are created and not WAL-logged during this
+ * transaction. This must happen before EndPrepare(), so that we don't see
+ * committed-but-broken files after a crash and COMMIT PREPARED.
+ */
+ smgrDoPendingSyncs();
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0036..dda1dea08b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
BlockNumber startblk, BlockNumber endblk,
bool page_std)
{
+ int flags;
BlockNumber blkno;
+ flags = REGBUF_FORCE_IMAGE;
+ if (page_std)
+ flags |= REGBUF_STANDARD;
+
/*
* Iterate over all the pages in the range. They are collected into
* batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
nbufs = 0;
while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
{
- Buffer buf = ReadBuffer(rel, blkno);
+ Buffer buf = ReadBufferExtended(rel, forkNum, blkno,
+ RBM_NORMAL, NULL);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
START_CRIT_SECTION();
for (i = 0; i < nbufs; i++)
{
- XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+ XLogRegisterBuffer(i, bufpack[i], flags);
MarkBufferDirty(bufpack[i]);
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 446760ed6e..9561e30b08 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
* fields related to physical storage, like rd_rel, are initialized, so the
* fake entry is only usable in low-level operations like ReadBuffer().
*
+ * This is also used for syncing WAL-skipped files.
+ *
* Caller must free the returned entry with FreeFakeRelcacheEntry().
*/
Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
FakeRelCacheEntry fakeentry;
Relation rel;
- Assert(InRecovery);
-
/* Allocate the Relation struct and all related space in one block. */
fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
rel = (Relation) fakeentry;
rel->rd_rel = &fakeentry->pgc;
rel->rd_node = rnode;
- /* We will never be working with temp rels during recovery */
+ /*
+ * We will never be working with temp rels during recovery or while
+ * syncing WAL-skipped files.
+ */
rel->rd_backend = InvalidBackendId;
- /* It must be a permanent table if we're in recovery. */
+ /* It must be a permanent table here */
rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
/* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +575,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
/*
* We set up the lockRelId in case anything tries to lock the dummy
* relation. Note that this is fairly bogus since relNode may be
- * different from the relation's OID. It shouldn't really matter though,
- * since we are presumably running by ourselves and can't have any lock
- * conflicts ...
+ * different from the relation's OID. It shouldn't really matter though.
+ * In recovery, we are running by ourselves and can't have any lock
+ * conflicts. While syncing, we already hold AccessExclusiveLock.
*/
rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index b7bcdd9d0f..293ea9a9dd 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -440,6 +440,10 @@ heap_create(const char *relname,
break;
}
}
+ else
+ {
+ rel->rd_createSubid = InvalidSubTransactionId;
+ }
return rel;
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 056ea3d5d3..51c233dac6 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* GUC variables */
+int wal_skip_threshold = 64; /* in kilobytes */
+
/*
* We keep a list of all relations (represented as RelFileNode values)
* that have been created or deleted in the current transaction. When
@@ -58,6 +62,7 @@ typedef struct PendingRelDelete
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
+ bool sync; /* whether to fsync at commit */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -114,6 +119,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->sync =
+ relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -155,6 +162,7 @@ RelationDropStorage(Relation rel)
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->sync = false;
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -355,7 +363,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
/*
* We need to log the copied data in WAL iff WAL archiving/streaming is
- * enabled AND it's a permanent relation.
+ * enabled AND it's a permanent relation. This gives the same answer as
+ * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+ * current operation created a new relfilenode.
*/
use_wal = XLogIsNeeded() &&
(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +407,43 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
}
/*
- * If the rel is WAL-logged, must fsync before commit. We use heap_sync
- * to ensure that the toast table gets fsync'd too. (For a temp or
- * unlogged rel we don't care since the data will be gone after a crash
- * anyway.)
- *
- * It's obvious that we must do this when not WAL-logging the copy. It's
- * less obvious that we have to do it even if we did WAL-log the copied
- * pages. The reason is that since we're copying outside shared buffers, a
- * CHECKPOINT occurring during the copy has no way to flush the previously
- * written data to disk (indeed it won't know the new rel even exists). A
- * crash later on would replay WAL from the checkpoint, therefore it
- * wouldn't replay our earlier WAL entries. If we do not fsync those pages
- * here, they might still not be on disk when the crash occurs.
+ * When we WAL-logged rel pages, we must nonetheless fsync them. The
+ * reason is that since we're copying outside shared buffers, a CHECKPOINT
+ * occurring during the copy has no way to flush the previously written
+ * data to disk (indeed it won't know the new rel even exists). A crash
+ * later on would replay WAL from the checkpoint, therefore it wouldn't
+ * replay our earlier WAL entries. If we do not fsync those pages here,
+ * they might still not be on disk when the crash occurs.
*/
- if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+ if (use_wal || copying_initfork)
smgrimmedsync(dst, forkNum);
}
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ * Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ * New RelFileNode" in src/backend/access/transam/README. Though it is
+ * known from Relation efficiently, this function is intended for the code
+ * paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+ PendingRelDelete *pending;
+
+ if (XLogIsNeeded())
+ return false; /* no permanent relfilenode skips WAL */
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
+ return true;
+ }
+
+ return false;
+}
+
/*
* smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
*
@@ -492,6 +521,145 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingSyncs() -- Take care of relation syncs at commit.
+ *
+ * This should be called before smgrDoPendingDeletes() at every commit or
+ * prepare. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ */
+void
+smgrDoPendingSyncs(void)
+{
+ PendingRelDelete *pending;
+ HTAB *delhash = NULL;
+
+ if (XLogIsNeeded())
+ return; /* no relation can use this */
+
+ Assert(GetCurrentTransactionNestLevel() == 1);
+ AssertPendingSyncs_RelationCache();
+
+ /*
+ * Pending syncs on the relation that are to be deleted in this
+ * transaction-end should be ignored. Collect pending deletes that will
+ * happen in the following call to smgrDoPendingDeletes().
+ */
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ if (!pending->atCommit)
+ continue;
+
+ /* create the hash if not yet */
+ if (delhash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(RelFileNode);
+ hash_ctl.hcxt = CurrentMemoryContext;
+ delhash =
+ hash_create("pending del temporary hash", 8, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ (void) hash_search(delhash, (void *) &pending->relnode,
+ HASH_ENTER, &found);
+ Assert(!found);
+ }
+
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ bool to_be_removed = false; /* don't sync if aborted */
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ SMgrRelation srel;
+
+ if (!pending->sync)
+ continue;
+ Assert(!pending->atCommit);
+
+ /* don't sync relnodes that are being deleted */
+ if (delhash)
+ hash_search(delhash, (void *) &pending->relnode,
+ HASH_FIND, &to_be_removed);
+ if (to_be_removed)
+ continue;
+
+ /* Now the time to sync the rnode */
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /*
+ * We emit newpage WAL records for smaller relations.
+ *
+ * Small WAL records have a chance to be emitted along with other
+ * backends' WAL records. We emit WAL records instead of syncing for
+ * files that are smaller than a certain threshold, expecting faster
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
+ */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ if (smgrexists(srel, fork))
+ {
+ BlockNumber n = smgrnblocks(srel, fork);
+
+ /* we shouldn't come here for unlogged relations */
+ Assert(fork != INIT_FORKNUM);
+
+ nblocks[fork] = n;
+ total_blocks += n;
+ }
+ else
+ nblocks[fork] = InvalidBlockNumber;
+ }
+
+ /*
+ * Sync file or emit WAL record for the file according to the total
+ * size.
+ */
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ {
+ /* Flush all buffers then sync the file */
+ FlushRelationBuffersWithoutRelcache(srel, false);
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ {
+ if (smgrexists(srel, fork))
+ smgrimmedsync(srel, fork);
+ }
+ }
+ else
+ {
+ /* Emit WAL records for all blocks. The file is small enough. */
+ for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+ {
+ int n = nblocks[fork];
+ Relation rel;
+
+ if (!BlockNumberIsValid(n))
+ continue;
+
+ /*
+ * Emit WAL for the whole file. Unfortunately we don't know
+ * what kind of a page this is, so we have to log the full
+ * page including any unused space. ReadBufferExtended()
+ * counts some pgstat events; unfortunately, we discard them.
+ */
+ rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+ log_newpage_range(rel, fork, 0, n, false);
+ FreeFakeRelcacheEntry(rel);
+ }
+ }
+ }
+
+ if (delhash)
+ hash_destroy(delhash);
+}
+
/*
* smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b8c349f245..093fff8c5c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
relfilenode2;
Oid swaptemp;
char swptmpchr;
+ Relation rel1;
/* We need writable copies of both pg_class tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1039,6 +1040,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
*/
Assert(!target_is_pg_class);
+ /* swap relfilenodes, reltablespaces, relpersistence */
swaptemp = relform1->relfilenode;
relform1->relfilenode = relform2->relfilenode;
relform2->relfilenode = swaptemp;
@@ -1173,6 +1175,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
CacheInvalidateRelcacheByTuple(reltup2);
}
+ /*
+ * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+ * subtransaction. Since the next step for rel2 is deletion, don't bother
+ * recording the newness of its relfilenode.
+ */
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);
+ relation_close(rel1, NoLock);
+
/*
* Post alter hook for modified relations. The change to r2 is always
* internal, but r1 depends on the invocation context.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b67d..607e2558a3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2711,63 +2711,15 @@ CopyFrom(CopyState cstate)
RelationGetRelationName(cstate->rel))));
}
- /*----------
- * Check to see if we can avoid writing WAL
- *
- * If archive logging/streaming is not enabled *and* either
- * - table was created in same transaction as this COPY
- * - data is being written to relfilenode created in this transaction
- * then we can skip writing WAL. It's safe because if the transaction
- * doesn't commit, we'll discard the table (or the new relfilenode file).
- * If it does commit, we'll have done the table_finish_bulk_insert() at
- * the bottom of this routine first.
- *
- * As mentioned in comments in utils/rel.h, the in-same-transaction test
- * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
- * can be cleared before the end of the transaction. The exact case is
- * when a relation sets a new relfilenode twice in same transaction, yet
- * the second one fails in an aborted subtransaction, e.g.
- *
- * BEGIN;
- * TRUNCATE t;
- * SAVEPOINT save;
- * TRUNCATE t;
- * ROLLBACK TO save;
- * COPY ...
- *
- * Also, if the target file is new-in-transaction, we assume that checking
- * FSM for free space is a waste of time, even if we must use WAL because
- * of archiving. This could possibly be wrong, but it's unlikely.
- *
- * The comments for table_tuple_insert and RelationGetBufferForTuple
- * specify that skipping WAL logging is only safe if we ensure that our
- * tuples do not go into pages containing tuples from any other
- * transactions --- but this must be the case if we have a new table or
- * new relfilenode, so we need no additional work to enforce that.
- *
- * We currently don't support this optimization if the COPY target is a
- * partitioned table as we currently only lazily initialize partition
- * information when routing the first tuple to the partition. We cannot
- * know at this stage if we can perform this optimization. It should be
- * possible to improve on this, but it does mean maintaining heap insert
- * option flags per partition and setting them when we first open the
- * partition.
- *
- * This optimization is not supported for relation types which do not
- * have any physical storage, with foreign tables and views using
- * INSTEAD OF triggers entering in this category. Partitioned tables
- * are not supported as per the description above.
- *----------
+ /*
+ * If the target file is new-in-transaction, we assume that checking FSM
+ * for free space is a waste of time. This could possibly be wrong, but
+ * it's unlikely.
*/
- /* createSubid is creation check, newRelfilenodeSubid is truncation check */
if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
- cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
- {
+ cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
ti_options |= TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
- }
/*
* Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bf7083719..20225dc62f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
myState->rel = intoRelationDesc;
myState->reladdr = intoRelationAddr;
myState->output_cid = GetCurrentCommandId(true);
+ myState->ti_options = TABLE_INSERT_SKIP_FSM;
+ myState->bistate = GetBulkInsertState();
/*
- * We can skip WAL-logging the insertions, unless PITR or streaming
- * replication is in use. We can skip the FSM in any case.
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
*/
- myState->ti_options = TABLE_INSERT_SKIP_FSM |
- (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
- myState->bistate = GetBulkInsertState();
-
- /* Not using WAL requires smgr_targblock be initially invalid */
Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..ae809c9801 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*/
myState->transientrel = transientrel;
myState->output_cid = GetCurrentCommandId(true);
-
- /*
- * We can skip WAL-logging the insertions, unless PITR or streaming
- * replication is in use. We can skip the FSM in any case.
- */
myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
- if (!XLogIsNeeded())
- myState->ti_options |= TABLE_INSERT_SKIP_WAL;
myState->bistate = GetBulkInsertState();
- /* Not using WAL requires smgr_targblock be initially invalid */
+ /*
+ * Valid smgr_targblock implies something already wrote to the relation.
+ * This may be harmless, but this function hasn't planned for it.
+ */
Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5440eb9015..0e2f5f4259 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4770,19 +4770,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
newrel = NULL;
/*
- * Prepare a BulkInsertState and options for table_tuple_insert. Because
- * we're building a new heap, we can skip WAL-logging and fsync it to disk
- * at the end instead (unless WAL-logging is required for archiving or
- * streaming replication). The FSM is empty too, so don't bother using it.
+ * Prepare a BulkInsertState and options for table_tuple_insert. The FSM
+ * is empty, so don't bother using it.
*/
if (newrel)
{
mycid = GetCurrentCommandId(true);
bistate = GetBulkInsertState();
-
ti_options = TABLE_INSERT_SKIP_FSM;
- if (!XLogIsNeeded())
- ti_options |= TABLE_INSERT_SKIP_WAL;
}
else
{
@@ -12462,6 +12457,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
table_close(pg_class, RowExclusiveLock);
+ RelationAssumeNewRelfilenode(rel);
+
relation_close(rel, NoLock);
/* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad10736d5..746ce477fc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,20 +3203,27 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- int i;
- BufferDesc *bufHdr;
-
- /* Open rel at the smgr level if not already done */
RelationOpenSmgr(rel);
- if (RelationUsesLocalBuffers(rel))
+ FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
+ RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
+{
+ RelFileNode rnode = smgr->smgr_rnode.node;
+ int i;
+ BufferDesc *bufHdr;
+
+ if (islocal)
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3233,7 +3240,7 @@ FlushRelationBuffers(Relation rel)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(rel->rd_smgr,
+ smgrwrite(smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3263,18 +3270,18 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
@@ -3484,13 +3491,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
(pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
{
/*
- * If we're in recovery we cannot dirty a page because of a hint.
- * We can set the hint, just not dirty the page as a result so the
- * hint is lost when we evict the page or shutdown.
+ * If we must not write WAL, due to a relfilenode-specific
+ * condition or being in recovery, don't dirty the page. We can
+ * set the hint, just not dirty the page as a result so the hint
+ * is lost when we evict the page or shutdown.
*
* See src/backend/storage/page/README for longer discussion.
*/
- if (RecoveryInProgress())
+ if (RecoveryInProgress() ||
+ RelFileNodeSkippingWAL(bufHdr->tag.rnode))
return;
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8a9eaf6430..1d408c339c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
* During replay, we would delete the file and then recreate it, which is fine
* if the contents of the file were repopulated by subsequent WAL entries.
* But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever. By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever. By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
*
* We do not need to go through this dance for temp relations, though, because
* we never make WAL entries for temp rels, and so a temp rel poses no threat
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ad1ff01b32..f3831f0077 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -262,6 +262,9 @@ static void RelationReloadIndexInfo(Relation relation);
static void RelationReloadNailed(Relation relation);
static void RelationFlushRelation(Relation relation);
static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
static void AtEOXact_cleanup(Relation relation, bool isCommit);
static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1095,6 +1098,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_isnailed = false;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
case RELPERSISTENCE_UNLOGGED:
@@ -1828,6 +1832,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_isnailed = true;
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
relation->rd_backend = InvalidBackendId;
relation->rd_islocaltemp = false;
@@ -2035,6 +2040,12 @@ RelationIdGetRelation(Oid relationId)
rd = RelationBuildDesc(relationId, true);
if (RelationIsValid(rd))
RelationIncrementReferenceCount(rd);
+
+#ifdef USE_ASSERT_CHECKING
+ if (!XLogIsNeeded() && RelationIsValid(rd))
+ AssertPendingSyncConsistency(rd);
+#endif
+
return rd;
}
@@ -2093,7 +2104,7 @@ RelationClose(Relation relation)
#ifdef RELCACHE_FORCE_RELEASE
if (RelationHasReferenceCountZero(relation) &&
relation->rd_createSubid == InvalidSubTransactionId &&
- relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
RelationClearRelation(relation, false);
#endif
}
@@ -2509,13 +2520,13 @@ RelationClearRelation(Relation relation, bool rebuild)
* problem.
*
* When rebuilding an open relcache entry, we must preserve ref count,
- * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also
- * attempt to preserve the pg_class entry (rd_rel), tupledesc,
- * rewrite-rule, partition key, and partition descriptor substructures
- * in place, because various places assume that these structures won't
- * move while they are working with an open relcache entry. (Note:
- * the refcount mechanism for tupledescs might someday allow us to
- * remove this hack for the tupledesc.)
+ * rd_*Subid, and rd_toastoid state. Also attempt to preserve the
+ * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+ * and partition descriptor substructures in place, because various
+ * places assume that these structures won't move while they are
+ * working with an open relcache entry. (Note: the refcount
+ * mechanism for tupledescs might someday allow us to remove this hack
+ * for the tupledesc.)
*
* Note that this process does not touch CurrentResourceOwner; which
* is good because whatever ref counts the entry may have do not
@@ -2599,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
/* creation sub-XIDs must be preserved */
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+ SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2666,7 +2678,7 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2751,11 +2763,10 @@ RelationCacheInvalidateEntry(Oid relationId)
* relation cache and re-read relation mapping data.
*
* This is currently used only to recover from SI message buffer overflow,
- * so we do not touch new-in-transaction relations; they cannot be targets
- * of cross-backend SI updates (and our own updates now go through a
- * separate linked list that isn't limited by the SI message buffer size).
- * Likewise, we need not discard new-relfilenode-in-transaction hints,
- * since any invalidation of those would be a local event.
+ * so we do not touch relations having new-in-transaction relfilenodes; they
+ * cannot be targets of cross-backend SI updates (and our own updates now go
+ * through a separate linked list that isn't limited by the SI message
+ * buffer size).
*
* We do this in two phases: the first pass deletes deletable items, and
* the second one rebuilds the rebuildable items. This is essential for
@@ -2806,7 +2817,7 @@ RelationCacheInvalidate(void)
* pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -2918,6 +2929,40 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
}
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+ bool relcache_verdict =
+ relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+ ((relation->rd_createSubid != InvalidSubTransactionId &&
+ RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+ relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+ Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ * Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL. It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry. It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+ HASH_SEQ_STATUS status;
+ RelIdCacheEnt *idhentry;
+
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ AssertPendingSyncConsistency(idhentry->reldesc);
+}
+#endif
+
/*
* AtEOXact_RelationCache
*
@@ -3029,10 +3074,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
*
* During commit, reset the flag to zero, since we are now out of the
* creating transaction. During abort, simply delete the relcache entry
- * --- it isn't interesting any longer. (NOTE: if we have forgotten the
- * new-ness of a new relation due to a forced cache flush, the entry will
- * get deleted anyway by shared-cache-inval processing of the aborted
- * pg_class insertion.)
+ * --- it isn't interesting any longer.
*/
if (relation->rd_createSubid != InvalidSubTransactionId)
{
@@ -3060,9 +3102,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
}
/*
- * Likewise, reset the hint about the relfilenode being new.
+ * Likewise, reset any record of the relfilenode being new.
*/
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
}
/*
@@ -3154,7 +3197,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
}
/*
- * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+ * Likewise, update or drop any new-relfilenode-in-subtransaction.
*/
if (relation->rd_newRelfilenodeSubid == mySubid)
{
@@ -3163,6 +3206,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstRelfilenodeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstRelfilenodeSubid = parentSubid;
+ else
+ relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+ }
}
@@ -3252,6 +3303,7 @@ RelationBuildLocalRelation(const char *relname,
/* it's being created in this transaction */
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
/*
* create a new tuple descriptor from the one passed in. We do this
@@ -3549,14 +3601,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
*/
CommandCounterIncrement();
- /*
- * Mark the rel as having been given a new relfilenode in the current
- * (sub) transaction. This is a hint that can be used to optimize later
- * operations on the rel in the same transaction.
- */
+ RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this. The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode. See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+ relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
- /* Flag relation as needing eoxact cleanup (to remove the hint) */
+ /* Flag relation as needing eoxact cleanup (to clear these fields) */
EOXactListAdd(relation);
}
@@ -5591,6 +5658,7 @@ load_relcache_init_file(bool shared)
rel->rd_fkeylist = NIL;
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+ rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..eecaf398c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
#include "access/xlog_internal.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
+#include "catalog/storage.h"
#include "commands/async.h"
#include "commands/prepare.h"
#include "commands/trigger.h"
@@ -2651,6 +2652,18 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+ gettext_noop("Size of new file to fsync instead of writing WAL."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &wal_skip_threshold,
+ 64,
+ 0, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
{
{"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..22916e8e0e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL
#define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM
#define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN
#define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
extern void simple_heap_update(Relation relation, ItemPointer otid,
HeapTuple tup);
-extern void heap_sync(Relation relation);
-
extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
ItemPointerData *items,
int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
TransactionId OldestXmin, TransactionId FreezeXid,
- MultiXactId MultiXactCutoff, bool use_wal);
+ MultiXactId MultiXactCutoff);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 64022917e2..aca88d0620 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
} TM_FailureData;
/* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL 0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM 0x0002
#define TABLE_INSERT_FROZEN 0x0004
#define TABLE_INSERT_NO_LOGICAL 0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
/*
* Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * may for example be used to flush the relation, when the
- * TABLE_INSERT_SKIP_WAL option was used.
+ * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+ * access methods ceased to use this.
*
* Typically callers of tuple_insert and multi_insert will just pass all
* the flags that apply to them, and each AM has to decide which of them
@@ -1087,10 +1086,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
* The options bitmask allows the caller to specify options that may change the
* behaviour of the AM. The AM will ignore options that it does not support.
*
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
* If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
* free space in the relation. This can save some cycles when we know the
* relation is new and doesn't contain useful amounts of free space.
@@ -1309,10 +1304,9 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
}
/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Perform operations necessary to complete insertions made via tuple_insert
+ * and multi_insert with a BulkInsertState specified. In-tree access methods
+ * ceased to use this.
*/
static inline void
table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..108115a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,23 @@
#include "storage/smgr.h"
#include "utils/relcache.h"
+/* GUC variables */
+extern int wal_skip_threshold;
+
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
/*
* These functions used to be in storage/smgr/smgr.c, which explains the
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(void);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..8097d5ab22 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
@@ -189,6 +192,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
+ bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a10e..9db3d23897 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,22 +63,40 @@ typedef struct RelationData
* rd_replidindex) */
bool rd_statvalid; /* is rd_statlist valid? */
- /*
+ /*----------
* rd_createSubid is the ID of the highest subtransaction the rel has
- * survived into; or zero if the rel was not created in the current top
- * transaction. This can be now be relied on, whereas previously it could
- * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
- * the ID of the highest subtransaction the relfilenode change has
- * survived into, or zero if not changed in the current transaction (or we
- * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
- * when a relation has multiple new relfilenodes within a single
- * transaction, with one of them occurring in a subsequently aborted
- * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
- * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+ * survived into or zero if the rel was not created in the current top
+ * transaction. rd_firstRelfilenodeSubid is the ID of the highest
+ * subtransaction an rd_node change has survived into or zero if rd_node
+ * matches the value it had at the start of the current top transaction.
+ * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+ * would restore rd_node to the value it had at the start of the current
+ * top transaction. Rolling back any lower subtransaction would not.)
+ * Their accuracy is critical to RelationNeedsWAL().
+ *
+ * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+ * most-recent relfilenode change has survived into or zero if not changed
+ * in the current transaction (or we have forgotten changing it). This
+ * field is accurate when non-zero, but it can be zero when a relation has
+ * multiple new relfilenodes within a single transaction, with one of them
+ * occurring in a subsequently aborted subtransaction, e.g.
+ * BEGIN;
+ * TRUNCATE t;
+ * SAVEPOINT save;
+ * TRUNCATE t;
+ * ROLLBACK TO save;
+ * -- rd_newRelfilenodeSubid is now forgotten
+ *
+ * These fields are read-only outside relcache.c. Other files trigger
+ * rd_node changes by updating pg_class.reltablespace and/or
+ * pg_class.relfilenode. They must call RelationAssumeNewRelfilenode() to
+ * update these fields.
*/
SubTransactionId rd_createSubid; /* rel was created in current xact */
- SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
- * current xact */
+ SubTransactionId rd_newRelfilenodeSubid; /* highest subxact changing
+ * rd_node to current value */
+ SubTransactionId rd_firstRelfilenodeSubid; /* highest subxact changing
+ * rd_node to any value */
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
@@ -520,9 +538,16 @@ typedef struct ViewOptions
/*
* RelationNeedsWAL
* True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
- ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction. See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation) \
+ ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+ (XLogIsNeeded() || \
+ (relation->rd_createSubid == InvalidSubTransactionId && \
+ relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
/*
* RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f2ace35b0..d3e8348c1b 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -105,9 +105,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
char relkind);
/*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
*/
extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
@@ -120,6 +121,11 @@ extern void RelationCacheInvalidate(void);
extern void RelationCloseSmgrByOid(Oid relationId);
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
extern void AtEOXact_RelationCache(bool isCommit);
extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fbd6f..1ddde3ecce 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2354,6 +2354,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
fputs("log_lock_waits = on\n", pg_conf);
fputs("log_temp_files = 128kB\n", pg_conf);
fputs("max_prepared_transactions = 2\n", pg_conf);
+ fputs("wal_level = minimal\n", pg_conf); /* XXX before commit remove */
+ fputs("max_wal_senders = 0\n", pg_conf);
for (sl = temp_configs; sl != NULL; sl = sl->next)
{
--
2.23.0
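As a quick illustration of what the first patch changes, here is a hypothetical
psql session against a patched server running with wal_level = minimal and
max_wal_senders = 0 (the table name and row count are only illustrative, and
wal_skip_threshold exists only once the patch is applied):

postgres=# SET wal_skip_threshold = '64kB';   -- the patch's default
postgres=# BEGIN;
postgres=# CREATE TABLE t (a int);
postgres=# INSERT INTO t SELECT generate_series(1, 100000);
postgres=# COMMIT;

With the patch, RelationNeedsWAL() reports false for the new relfilenode, so
the inserts write no WAL; at COMMIT the relation is either fsync'ed or
WAL-logged page by page, depending on whether its total size exceeds
wal_skip_threshold.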
Attachment: v26-0002-change-swap_relation_files.patch (text/x-patch)
From 6b69e19bdae8b282a75ebf373573cdb96adeef06 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 27 Nov 2019 07:38:46 -0500
Subject: [PATCH v26 2/6] change swap_relation_files
The current patch doesn't adjust the new relation in
swap_relation_files, which prevents the relcache entry from being
invalidated. Adjust the relcache entry for the new relfilenode, and
change the lock level used for the relcache adjustment.
---
src/backend/commands/cluster.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 093fff8c5c..af7733eef4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1015,6 +1015,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
Oid swaptemp;
char swptmpchr;
Relation rel1;
+ Relation rel2;
/* We need writable copies of both pg_class tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1177,12 +1178,31 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
/*
* Recognize that rel1's relfilenode (swapped from rel2) is new in this
- * subtransaction. Since the next step for rel2 is deletion, don't bother
- * recording the newness of its relfilenode.
+ * subtransaction. However, the next step for rel2 is deletion, so we need
+ * to turn off the newness of its relfilenode, which allows the relcache
+ * entry to be flushed. The required lock must already be held before we
+ * get here, so we take only AccessShareLock in case no lock is acquired.
+ * Since the command counter has not been advanced, the relcache entries
+ * still hold the contents from before the above updates; rather than
+ * incrementing it, we swap their contents directly.
+ */
+ rel1 = relation_open(r1, AccessShareLock);
+ rel2 = relation_open(r2, AccessShareLock);
+
+ /* swap relfilenodes */
+ rel1->rd_node.relNode = relfilenode2;
+ rel2->rd_node.relNode = relfilenode1;
+
+ /*
+ * Adjust newness flags. relfilenode2 is already added to EOXact array so
+ * we don't need to do that again here. We assume the new file is created
+ * in the current subtransaction.
*/
- rel1 = relation_open(r1, AccessExclusiveLock);
RelationAssumeNewRelfilenode(rel1);
- relation_close(rel1, NoLock);
+ rel2->rd_createSubid = InvalidSubTransactionId;
+
+ relation_close(rel1, AccessShareLock);
+ relation_close(rel2, AccessShareLock);
/*
* Post alter hook for modified relations. The change to r2 is always
--
2.23.0
Attachment: v26-0003-Improve-the-performance-of-relation-syncs.patch (text/x-patch)
From 061e02878dcb3e2a6a54afb591dfec2f3ef88550 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH v26 3/6] Improve the performance of relation syncs.
We can improve the performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans over the whole of
shared_buffers from one per synced relation to just one.
---
src/backend/catalog/storage.c | 28 +++++--
src/backend/storage/buffer/bufmgr.c | 113 ++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 37 +++++++++
src/include/storage/bufmgr.h | 1 +
src/include/storage/smgr.h | 1 +
5 files changed, 174 insertions(+), 6 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51c233dac6..65811b2a9e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,6 +533,9 @@ smgrDoPendingSyncs(void)
{
PendingRelDelete *pending;
HTAB *delhash = NULL;
+ int nrels = 0,
+ maxrels = 0;
+ SMgrRelation *srels = NULL;
if (XLogIsNeeded())
return; /* no relation can use this */
@@ -573,7 +576,7 @@ smgrDoPendingSyncs(void)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- bool to_be_removed = false; /* don't sync if aborted */
+ bool to_be_removed = false;
ForkNumber fork;
BlockNumber nblocks[MAX_FORKNUM + 1];
BlockNumber total_blocks = 0;
@@ -623,14 +626,21 @@ smgrDoPendingSyncs(void)
*/
if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
{
- /* Flush all buffers then sync the file */
- FlushRelationBuffersWithoutRelcache(srel, false);
+ /* relations to sync are passed to smgrdosyncall at once */
- for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
{
- if (smgrexists(srel, fork))
- smgrimmedsync(srel, fork);
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
}
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
else
{
@@ -658,6 +668,12 @@ smgrDoPendingSyncs(void)
if (delhash)
hash_destroy(delhash);
+
+ if (nrels > 0)
+ {
+ smgrdosyncall(srels, nrels);
+ pfree(srels);
+ }
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..e0c0b825e9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
int index;
} CkptTsStatus;
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelFileNodesAllBuffers shares the same comparator function with
+ * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must
+ * be compatible.
+ */
+typedef struct SMgrSortArray
+{
+ RelFileNode rnode; /* This must be the first member */
+ SMgrRelation srel;
+} SMgrSortArray;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -3290,6 +3303,106 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
}
}
+/* ---------------------------------------------------------------------
+ * FlushRelFileNodesAllBuffers
+ *
+ * This function flushes out of the buffer pool all pages of all
+ * forks of the specified smgr relations. It's equivalent to
+ * calling FlushRelationBuffers once per relation, but it takes
+ * SMgrRelations rather than Relations as its parameter.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+ int i;
+ SMgrSortArray *srels;
+ bool use_bsearch;
+
+ if (nrels == 0)
+ return;
+
+ /* fill-in array for qsort */
+ srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ {
+ Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+ srels[i].rnode = smgrs[i]->smgr_rnode.node;
+ srels[i].srel = smgrs[i];
+ }
+
+ /*
+ * Save the bsearch overhead when only a small number of relations are
+ * to be synced. See DropRelFileNodesAllBuffers for details; the
+ * DROP_* name of the threshold is historical.
+ */
+ use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD;
+
+ /* sort the list of SMgrRelations if necessary */
+ if (use_bsearch)
+ pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+ /* Make sure we can handle the pin inside the loop */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ SMgrSortArray *srelent = NULL;
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ /*
+ * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+ * and saves some cycles.
+ */
+
+ if (!use_bsearch)
+ {
+ int j;
+
+ for (j = 0; j < nrels; j++)
+ {
+ if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+ {
+ srelent = &srels[j];
+ break;
+ }
+ }
+
+ }
+ else
+ {
+ srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+ srels, nrels, sizeof(SMgrSortArray),
+ rnode_comparator);
+ }
+
+ /* buffer doesn't belong to any of the given relfilenodes; skip it */
+ if (srelent == NULL)
+ continue;
+
+ /* Ensure there's a free array slot for PinBuffer_Locked */
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+ if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+ (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ FlushBuffer(bufHdr, srelent->srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+
+ pfree(srels);
+}
+
/* ---------------------------------------------------------------------
* FlushDatabaseBuffers
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b50c69b438..191b52ab43 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
}
+/*
+ * smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ * All forks of all given relations are synced out to storage.
+ *
+ * This is equivalent to calling FlushRelationBuffers for each smgr
+ * relation and then calling smgrimmedsync for all forks of each smgr
+ * relation, but it's significantly quicker, so it should be preferred
+ * when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+ int i = 0;
+ ForkNumber forknum;
+
+ if (nrels == 0)
+ return;
+
+ /* We need to flush all buffers for the relations before sync. */
+ FlushRelFileNodesAllBuffers(rels, nrels);
+
+ /*
+ * Sync the physical file(s).
+ */
+ for (i = 0; i < nrels; i++)
+ {
+ int which = rels[i]->smgr_which;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ if (smgrsw[which].smgr_exists(rels[i], forknum))
+ smgrsw[which].smgr_immedsync(rels[i], forknum);
+ }
+ }
+}
+
/*
* smgrdounlinkall() -- Immediately unlink all forks of all given relations
*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..558bac7e05 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -197,6 +197,7 @@ extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1543d8d870..31a5ecd059 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
--
2.23.0
Attachment: v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch (text/x-patch)
From 25aa85b8b0c0b329de6b84942759797bfc912461 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH v26 4/6] Revert FlushRelationBuffersWithoutRelcache.
The previous patch makes the function useless. Revert it.
---
src/backend/storage/buffer/bufmgr.c | 27 ++++++++++-----------------
src/include/storage/bufmgr.h | 2 --
2 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e0c0b825e9..56314653ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3216,27 +3216,20 @@ PrintPinnedBufs(void)
void
FlushRelationBuffers(Relation rel)
{
- RelationOpenSmgr(rel);
-
- FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
- RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
- RelFileNode rnode = smgr->smgr_rnode.node;
- int i;
+ int i;
BufferDesc *bufHdr;
- if (islocal)
+ /* Open rel at the smgr level if not already done */
+ RelationOpenSmgr(rel);
+
+ if (RelationUsesLocalBuffers(rel))
{
for (i = 0; i < NLocBuffer; i++)
{
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3253,7 +3246,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
- smgrwrite(smgr,
+ smgrwrite(rel->rd_smgr,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
@@ -3283,18 +3276,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
- FlushBuffer(bufHdr, smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 558bac7e05..3f85e8c6fe 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
- bool islocal);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
--
2.23.0
Attachment: v26-0005-Fix-gistGetFakeLSN.patch (text/x-patch)
From 29af080eb433af96baf0e64de0dcbded7a128263 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 16:12:03 +0900
Subject: [PATCH v26 5/6] Fix gistGetFakeLSN()
GiST needs to set page LSNs to monotonically increasing numbers on
updates even if the index is not WAL-logged at all. We use a simple
counter for UNLOGGED/TEMP relations, but for WAL-skipped relations the
numbers must stay below the LSN at the next commit. The WAL insertion
pointer works in most cases, but we sometimes need to emit a WAL record
to generate a unique LSN for an update. This patch adds a new WAL
record kind, XLOG_GIST_ASSIGN_LSN, which conveys no substantial content
and is emitted when needed.
---
src/backend/access/gist/gistutil.c | 38 ++++++++++++++++++--------
src/backend/access/gist/gistxlog.c | 21 ++++++++++++++
src/backend/access/rmgrdesc/gistdesc.c | 5 ++++
src/include/access/gist_private.h | 2 ++
src/include/access/gistxlog.h | 1 +
5 files changed, 56 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..8347673c5e 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,28 +1004,44 @@ gistproperty(Oid index_oid, int attno,
}
/*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Temporary, unlogged GiST and WAL-skipped indexes are not WAL-logged, but we
+ * need LSNs to detect concurrent page splits anyway. This function provides a
+ * fake sequence of LSNs for that purpose.
*/
XLogRecPtr
gistGetFakeLSN(Relation rel)
{
- static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
- /*
- * XXX before commit fix this. This is not correct for
- * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
- */
- if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
- || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
{
/*
* Temporary relations are only accessible in our session, so a simple
* backend-local counter will do.
*/
+ static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
return counter++;
}
+ else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ {
+ /*
+ * WAL-logging of this relation will start after commit, so the fake
+ * LSNs must be distinct numbers smaller than the LSN at the next
+ * commit. Emit a dummy WAL record if the insert LSN hasn't advanced
+ * since the last call.
+ */
+ static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+ XLogRecPtr currlsn = GetXLogInsertRecPtr();
+
+ /* Shouldn't be called for WAL-logging relations */
+ Assert(!RelationNeedsWAL(rel));
+
+ /* No need for an actual record if we already have a distinct LSN */
+ if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+ currlsn = gistXLogAssignLSN();
+
+ lastlsn = currlsn;
+ return currlsn;
+ }
else
{
/*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3b28f54646..ce17bc9dc3 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
case XLOG_GIST_PAGE_DELETE:
gistRedoPageDelete(record);
break;
+ case XLOG_GIST_ASSIGN_LSN:
+ /* nop. See gistGetFakeLSN(). */
+ break;
default:
elog(PANIC, "gist_redo: unknown op code %u", info);
}
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
return recptr;
}
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+ int dummy = 0;
+
+ /*
+ * Records other than SWITCH_WAL must have content. We use an integer 0 to
+ * follow the restriction.
+ */
+ XLogBeginInsert();
+ XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+ XLogRegisterData((char*) &dummy, sizeof(dummy));
+ return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
/*
* Write XLOG record about reuse of a deleted page.
*/
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index eccb6fd942..48cda40ac0 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
case XLOG_GIST_PAGE_DELETE:
out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
break;
+ case XLOG_GIST_ASSIGN_LSN:
+ /* No details to write out */
+ break;
}
}
@@ -104,6 +107,9 @@ gist_identify(uint8 info)
break;
case XLOG_GIST_PAGE_DELETE:
id = "PAGE_DELETE";
+ break;
+ case XLOG_GIST_ASSIGN_LSN:
+ id = "ASSIGN_LSN";
break;
}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index a409975db1..3455dd242d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
BlockNumber origrlink, GistNSN oldnsn,
Buffer leftchild, bool markfollowright);
+extern XLogRecPtr gistXLogAssignLSN(void);
+
/* gistget.c */
extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index e44922d915..1eae06c0fb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
/* #define XLOG_GIST_INSERT_COMPLETE 0x40 */ /* not used anymore */
/* #define XLOG_GIST_CREATE_INDEX 0x50 */ /* not used anymore */
#define XLOG_GIST_PAGE_DELETE 0x60
+#define XLOG_GIST_ASSIGN_LSN 0x70 /* nop, assign an new LSN */
/*
* Backup Blk 0: updated page.
--
2.23.0
Attachment: v26-0006-Sync-files-shrinked-by-truncation.patch (text/x-patch)
From 70d8236c375c6dc115e6023707b8a53a28f0b872 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 26 Nov 2019 21:25:09 +0900
Subject: [PATCH v26 6/6] Sync files shrinked by truncation
If truncation leaves a WAL-skipped file smaller at commit than its
maximum size during the transaction, the file must not be WAL-logged
at commit and must be synced instead.
---
src/backend/access/transam/xact.c | 5 +-
src/backend/catalog/storage.c | 161 +++++++++++++++++++-----------
src/include/catalog/storage.h | 2 +-
3 files changed, 106 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 750f95c482..f681cd3a23 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2114,7 +2114,7 @@ CommitTransaction(void)
* transaction. This must happen before AtEOXact_RelationMap(), so that we
* don't see committed-but-broken files after a crash.
*/
- smgrDoPendingSyncs();
+ smgrDoPendingSyncs(true);
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2354,7 +2354,7 @@ PrepareTransaction(void)
* transaction. This must happen before EndPrepare(), so that we don't see
* committed-but-broken files after a crash and COMMIT PREPARED.
*/
- smgrDoPendingSyncs();
+ smgrDoPendingSyncs(true);
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2674,6 +2674,7 @@ AbortTransaction(void)
*/
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
+ smgrDoPendingSyncs(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 65811b2a9e..aa68c77d44 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -62,11 +62,17 @@ typedef struct PendingRelDelete
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
- bool sync; /* whether to fsync at commit */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+typedef struct pendingSync
+{
+ RelFileNode rnode;
+ BlockNumber max_truncated;
+} pendingSync;
+
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB *pendingSyncHash = NULL;
/*
* RelationCreateStorage
@@ -119,11 +125,39 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->sync =
- relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /*
+ * If the relation needs an at-commit sync, we also need to track the
+ * maximum unsynced truncated block, which is used to decide whether we can
+ * WAL-log the file or must sync it in smgrDoPendingSyncs.
+ */
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ pendingSync *pending;
+ bool found;
+
+ /* we sync only permanent relations */
+ Assert(backend == InvalidBackendId);
+
+ if (!pendingSyncHash)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(RelFileNode);
+ ctl.entrysize = sizeof(pendingSync);
+ ctl.hcxt = TopTransactionContext;
+ pendingSyncHash =
+ hash_create("max truncatd block hash",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+ Assert(!found);
+ pending->max_truncated = InvalidBlockNumber;
+ }
+
return srel;
}
@@ -162,7 +196,6 @@ RelationDropStorage(Relation rel)
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->sync = false;
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -320,6 +353,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
if (fsm || vm)
XLogFlush(lsn);
}
+ else if (pendingSyncHash)
+ {
+ pendingSync *pending;
+
+ /* Record largest maybe-unsynced block of files under tracking */
+ pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+ HASH_FIND, NULL);
+ if (pending)
+ {
+ BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+ if (!BlockNumberIsValid(pending->max_truncated) ||
+ pending->max_truncated < nblocks)
+ pending->max_truncated = nblocks;
+ }
+ }
/* Do the real work to truncate relation forks */
smgrtruncate(rel->rd_smgr, forks, nforks, blocks);
@@ -430,18 +479,17 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
bool
RelFileNodeSkippingWAL(RelFileNode rnode)
{
- PendingRelDelete *pending;
-
if (XLogIsNeeded())
return false; /* no permanent relfilenode skips WAL */
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
- {
- if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
- return true;
- }
+ if (!pendingSyncHash)
+ return false; /* we don't have a to-be-synced relation */
- return false;
+ /* the relation is not tracked as to-be-synced */
+ if (hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+ return false;
+
+ return true;
}
/*
@@ -529,13 +577,14 @@ smgrDoPendingDeletes(bool isCommit)
* failure prevents commit.
*/
void
-smgrDoPendingSyncs(void)
+smgrDoPendingSyncs(bool isCommit)
{
PendingRelDelete *pending;
- HTAB *delhash = NULL;
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ HASH_SEQ_STATUS scan;
+ pendingSync *pendingsync;
if (XLogIsNeeded())
return; /* no relation can use this */
@@ -543,58 +592,44 @@ smgrDoPendingSyncs(void)
Assert(GetCurrentTransactionNestLevel() == 1);
AssertPendingSyncs_RelationCache();
+ if (!pendingSyncHash)
+ return; /* no relation needs sync */
+
+ /* Just throw away all pending syncs if any at rollback */
+ if (!isCommit)
+ {
+ if (pendingSyncHash)
+ {
+ hash_destroy(pendingSyncHash);
+ pendingSyncHash = NULL;
+ }
+ return;
+ }
+
/*
* Pending syncs on the relation that are to be deleted in this
- * transaction-end should be ignored. Collect pending deletes that will
- * happen in the following call to smgrDoPendingDeletes().
+ * transaction-end should be ignored. Remove sync hash entries for
+ * relations that will be deleted in the following call to
+ * smgrDoPendingDeletes().
*/
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- bool found PG_USED_FOR_ASSERTS_ONLY;
-
if (!pending->atCommit)
continue;
- /* create the hash if not yet */
- if (delhash == NULL)
- {
- HASHCTL hash_ctl;
-
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(RelFileNode);
- hash_ctl.hcxt = CurrentMemoryContext;
- delhash =
- hash_create("pending del temporary hash", 8, &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- }
-
- (void) hash_search(delhash, (void *) &pending->relnode,
- HASH_ENTER, &found);
- Assert(!found);
+ (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+ HASH_REMOVE, NULL);
}
- for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ hash_seq_init(&scan, pendingSyncHash);
+ while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
{
- bool to_be_removed = false;
- ForkNumber fork;
- BlockNumber nblocks[MAX_FORKNUM + 1];
- BlockNumber total_blocks = 0;
- SMgrRelation srel;
-
- if (!pending->sync)
- continue;
- Assert(!pending->atCommit);
-
- /* don't sync relnodes that is being deleted */
- if (delhash)
- hash_search(delhash, (void *) &pending->relnode,
- HASH_FIND, &to_be_removed);
- if (to_be_removed)
- continue;
+ ForkNumber fork;
+ BlockNumber nblocks[MAX_FORKNUM + 1];
+ BlockNumber total_blocks = 0;
+ SMgrRelation srel;
- /* Now the time to sync the rnode */
- srel = smgropen(pending->relnode, pending->backend);
+ srel = smgropen(pendingsync->rnode, InvalidBackendId);
/*
* We emit newpage WAL records for smaller relations.
@@ -622,9 +657,12 @@ smgrDoPendingSyncs(void)
/*
* Sync file or emit WAL record for the file according to the total
- * size.
+ * size. Sync the file if its size exceeds the threshold or if
+ * truncation may have left blocks beyond the current size.
*/
- if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+ if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024 ||
+ (BlockNumberIsValid(pendingsync->max_truncated) &&
+ smgrnblocks(srel, MAIN_FORKNUM) < pendingsync->max_truncated))
{
/* relations to sync are passed to smgrdosyncall at once */
@@ -644,7 +682,11 @@ smgrDoPendingSyncs(void)
}
else
{
- /* Emit WAL records for all blocks. The file is small enough. */
+ /*
+ * Emit WAL records for all blocks. We don't emit
+ * XLOG_SMGR_TRUNCATE record because the past truncations haven't
+ * left unlogged pages here.
+ */
for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
{
int n = nblocks[fork];
@@ -666,8 +708,9 @@ smgrDoPendingSyncs(void)
}
}
- if (delhash)
- hash_destroy(delhash);
+ Assert (pendingSyncHash);
+ hash_destroy(pendingSyncHash);
+ pendingSyncHash = NULL;
if (nrels > 0)
{
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 108115a023..bf076657e7 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,7 +35,7 @@ extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
* naming
*/
extern void smgrDoPendingDeletes(bool isCommit);
-extern void smgrDoPendingSyncs(void);
+extern void smgrDoPendingSyncs(bool isCommit);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
--
2.23.0
I measured the performance with the latest patch set.
1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output (a minimal driver sketch follows below).
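For anyone who wants to reproduce the protocol by hand, a driver along the following lines should approximate steps 2-5. It is only a sketch, not the actual bench.sh: the database name ("bench"), the log file name, the DDL_COUNT default, and the particular DDL transaction (a create-and-load in one transaction, which is the case the patch optimizes) are illustrative assumptions.
#!/bin/bash
# Rough, hypothetical driver for the protocol above; not the actual bench.sh.
# Usage: ./ddl_bench_sketch.sh [DDL_COUNT]  (tune DDL_COUNT so step 4 lasts ~60 s)
DDL_COUNT=${1:-300}
DB=bench
# Step 2: background pgbench with per-second, timestamped progress output.
pgbench -rP1 --progress-timestamp -T180 -c10 -j10 "$DB" > pgbench.log 2>&1 &
PGBENCH_PID=$!
sleep 10                       # Step 3: let pgbench settle.
# Steps 4-5: run the DDL transactions and record wall-clock start/end.
DDL_START=$(date +%s.%N)
for i in $(seq "$DDL_COUNT"); do
    psql -qX "$DB" <<EOF
BEGIN;
CREATE TABLE ddl_bench (id int);
-- load the table in the same transaction that created it
INSERT INTO ddl_bench SELECT generate_series(1, 1000);
COMMIT;
DROP TABLE ddl_bench;
EOF
done
DDL_END=$(date +%s.%N)
wait "$PGBENCH_PID"
echo "DDL start=$DDL_START end=$DDL_END"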
I did the following benchmarking.
1. Initialize bench database
$ pgbench -i -s 20
2. Start the server with wal_level = replica (all other variables
unchanged), then run the attached ./bench.sh
$ ./bench.sh <count> <pages> <mode>
where count is the number of repetitions, pages is the number of pages
to write in a run, and mode is "s" (sync) or "w" (WAL). The <mode>
has no effect if wal_level = replica. The script shows output like the
following.
| before: tps 240.2, lat 44.087 ms (29 samples)
| during: tps 109.1, lat 114.887 ms (14 samples)
| after : tps 269.9, lat 39.557 ms (107 samples)
| DDL time = 13965 ms
| # transaction type: <builtin: TPC-B (sort of)>
before: mean numbers before "the DDL" starts.
during: mean numbers while "the DDL" is running.
after : mean numbers after "the DDL" ends.
DDL time: the time it took to run "the DDL" (see the sketch after this list).
3. Restart server with wal_level = replica then run the bench.sh
twice.
$ ./bench.sh <count> <pages> s
$ ./bench.sh <count> <pages> w
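For reference, the before/during/after figures can be recomputed from the raw pgbench progress log and the two DDL timestamps. This is only a crude sketch, not the summarization script used for the numbers below; it assumes the hypothetical driver above and pgbench's usual "progress: <epoch> s, <tps> tps, lat <ms> ms ..." line format produced by -P1 --progress-timestamp.
# Crude, hypothetical summarizer; assumes pgbench.log and the DDL_START /
# DDL_END epoch timestamps printed by the driver sketch above.
awk -v start="$DDL_START" -v end="$DDL_END" '
    /^progress:/ {
        t = $2 + 0                  # epoch seconds from --progress-timestamp
        phase = (t < start) ? "before" : (t <= end) ? "during" : "after"
        tps[phase] += $4; lat[phase] += $7; n[phase]++
    }
    END {
        for (p in n)
            printf "%-6s: tps %.1f, lat %.3f ms (%d samples)\n",
                   p, tps[p] / n[p], lat[p] / n[p], n[p]
    }' pgbench.log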
Finally I got three graphs. (attached 1, 2, 3. PNGs)
* Graph 1 - The effect of the DDL on pgbench's TPS
The vertical axis is "during TPS" / "before TPS" in %. Larger is
better. The horizontal axis is the table size in pages.
Replica and Minimal-sync are almost flat. Minimal-WAL gets worse
as the table size increases. 500 pages seems to be the crossover point.
* Graph 2 - The effect of the DDL on pgbench's latency.
The vertical axis is "during latency" / "before latency" in
%. Smaller is better. As with TPS, but the WAL latency gets worse
more quickly as the table size increases. The crossover point seems
to be 300 pages or so.
* Graph 3 - The effect of pgbench's workload on DDL runtime.
The vertical axis is "time the DDL takes to run with pgbench" /
"time the DDL takes to run alone". Smaller is better. Replica and
Minimal-sync show a similar tendency. On Minimal-WAL the DDL runs
quite fast with small tables. The crossover point seems to be about
2500 pages.
Seeing this, I have become worried that the optimization might give a
far smaller advantage than expected. Putting that aside, it seems to
me that the default value for the threshold would be 500-1000, the
same as the previous benchmark showed.
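If that 500-1000 figure is in pages, then with 8 kB blocks it corresponds to roughly 4-8 MB. Assuming the GUC keeps the wal_skip_threshold name and kilobyte units used in the patch above, a site could then tune it like this; the '4MB' value is purely illustrative.
# Illustrative only: pick a threshold near the observed crossover
# (500-1000 pages * 8 kB = 4-8 MB), assuming the GUC is wal_skip_threshold
# measured in kilobytes as in the patch above.
psql -c "ALTER SYSTEM SET wal_skip_threshold = '4MB';"
psql -c "SELECT pg_reload_conf();"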
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
I measured the performance with the latest patch set.
1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
If you have the raw data requested in (5), please share them here so folks
have the option to reproduce your graphs and calculations.
I did the following benchmarking.
1. Initialize bench database
$ pgbench -i -s 20
2. Start server with wal_level = replica (all other variables are not
changed) then run the attached ./bench.sh
The bench.sh attachment was missing; please attach it. Please give the output
of this command:
select name, setting from pg_settings where setting <> boot_val;
3. Restart server with wal_level = replica then run the bench.sh
twice.
I assume this is wal_level=minimal, not wal_level=replica.
Hello.
At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <noah@leadboat.com> wrote in
On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
I measured the performance with the latest patch set.
1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
If you have the raw data requested in (5), please share them here so folks
have the option to reproduce your graphs and calculations.
Sorry, I forgot to attach the scripts. The raw data vanished due to an
unstable connection and the steps were quite crude; I prioritized
showing some numbers at the time. I have revised the scripts into a
more automated form and will take the numbers again.
2. Start server with wal_level = replica (all other variables are not
changed) then run the attached ./bench.sh
The bench.sh attachment was missing; please attach it. Please give the output
of this command:
select name, setting from pg_settings where setting <> boot_val;
(I intentionally show all the results..)
=# select name, setting from pg_settings where setting<> boot_val;
name | setting
----------------------------+--------------------
application_name | psql
archive_command | (disabled)
client_encoding | UTF8
data_directory_mode | 0700
default_text_search_config | pg_catalog.english
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
log_checkpoints | on
log_file_mode | 0600
log_timezone | Asia/Tokyo
max_stack_depth | 2048
max_wal_senders | 0
max_wal_size | 10240
server_encoding | UTF8
shared_buffers | 16384
TimeZone | Asia/Tokyo
unix_socket_permissions | 0777
wal_buffers | 512
wal_level | minimal
(23 rows)
The results for the "replica" setting in the benchmark script are used
as the base numbers (the denominator of the percentages).
3. Restart server with wal_level = replica then run the bench.sh
twice.
I assume this is wal_level=minimal, not wal_level=replica.
Oops! That's wrong; I ran once with replica, then twice with minimal.
Anyway, I revised the benchmarking scripts and attached them. The
parameters written in benchmain.sh were chosen so that ./bench2.pl 5
<count> <pages> s on a wal_level=minimal server takes around 60
seconds.
I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of a preliminary test only with pages=100 and 250 (on HDD).
The attached files are:
benchmain.sh - main script
bench2.sh - run a benchmark with a single set of parameters
bench1.pl - benchmark client program
summarize.pl - script to summarize benchmain.sh's output
f.txt.gz - result only for pages=100, DDL count = 2200 (not 2250)
How to run:
$ /..unpatched_path../initdb -D <unpatched_datadir>
(wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ /..patched_path../initdb -D <patched_datadir>
(wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ ./benchmain.sh > <result_file> # output raw data
$ ./summarize.pl [-v] < <result_file> # show summary
With the attached f.txt, summarize.pl gives the following output.
WAL wins at these page counts.
$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
(% are relative to "before")
before: tps 262.3 (100.0%), lat 39.840 ms (100.0%) (29 samples)
during: tps 120.7 ( 46.0%), lat 112.508 ms (282.4%) (35 samples)
after: tps 106.3 ( 40.5%), lat 163.492 ms (410.4%) (86 samples)
DDL time: 34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
(% are relative to "before")
before: tps 226.3 (100.0%), lat 48.091 ms (100.0%) (29 samples)
during: tps 83.0 ( 36.7%), lat 184.942 ms (384.6%) (100 samples)
after: tps 82.6 ( 36.5%), lat 196.863 ms (409.4%) (21 samples)
DDL time: 99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
(% are relative to "before")
before: tps 240.3 (100.0%), lat 44.686 ms (100.0%) (29 samples)
during: tps 129.6 ( 53.9%), lat 113.585 ms (254.2%) (31 samples)
after: tps 124.5 ( 51.8%), lat 141.992 ms (317.8%) (90 samples)
DDL time: 30392 ms ( 87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
(% are relative to "before")
before: tps 266.3 (100.0%), lat 45.884 ms (100.0%) (29 samples)
during: tps 87.9 ( 33.0%), lat 148.433 ms (323.5%) (54 samples)
after: tps 105.6 ( 39.6%), lat 153.216 ms (333.9%) (67 samples)
DDL time: 53176 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=250 count=258 scale=20
(% are relative to "before")
before: tps 225.1 (100.0%), lat 47.705 ms (100.0%) (29 samples)
during: tps 93.7 ( 41.6%), lat 143.231 ms (300.2%) (83 samples)
after: tps 93.8 ( 41.7%), lat 186.097 ms (390.1%) (38 samples)
DDL time: 82104 ms ( 154.4% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=250 count=258 scale=20
(% are relative to "before")
before: tps 230.2 (100.0%), lat 48.472 ms (100.0%) (29 samples)
during: tps 90.3 ( 39.2%), lat 183.365 ms (378.3%) (48 samples)
after: tps 123.9 ( 53.8%), lat 131.129 ms (270.5%) (73 samples)
DDL time: 47660 ms ( 89.6% relative to mode=none)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 03 Dec 2019 20:51:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of a preliminary test only with pages=100 and 250 (on HDD).
The attached files are the latest set of the test scripts and the result:
benchmark_scripts.tar.gz
benchmain.sh - main script
bench2.sh - run a benchmark with a single set of parameters
bench1.pl - benchmark client program
summarize.pl - script to summarize benchmain.sh's output
graph.xlsx - MS-Excel file for the graph below.
result.txt.gz - raw result of benchmain.sh
summary.txt.gz - cooked result by summarize.pl -s
graph.png - graphs
summarize.pl [-v|-s|-d]
-s: print summary table for spreadsheets (TSV)
-v: show pgbench summary
-d: debug print
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
result.txt.gz (application/octet-stream)