Unlogged tables cleanup

Started by Konstantin Knizhnikover 9 years ago44 messageshackers
Jump to latest
#1Konstantin Knizhnik
k.knizhnik@postgrespro.ru

Hi, hackers

I wonder if such behavior can be considered as a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,100000));
INSERT 0 100000

Now simulate server crash using using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR: could not open file
"pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So all relation directory is removed!
It happens only for first table created in tablespace.
If you create table in Postgres data directory everything is ok: first
segment of relation is truncated but not deleted.
Also if you create one more unlogged table in tablespace it is truncated
correctly:

postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,100000));
INSERT 0 100000
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw------- 1 knizhnik knizhnik 0 Nov 9 19:52 fs/PG_10_201611041/12289/32768
-rw------- 1 knizhnik knizhnik 0 Nov 9 19:52
fs/PG_10_201611041/12289/32768_init

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Robert Haas
robertmhaas@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: Unlogged tables cleanup

On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Now simulate server crash using using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR: could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So all relation directory is removed!
It happens only for first table created in tablespace.
If you create table in Postgres data directory everything is ok: first
segment of relation is truncated but not deleted.

Whoa. There should be an _init fork that doesn't get removed, and
without removing the _init fork you shouldn't be able to remove the
directory that contains it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#2)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Now simulate server crash using using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR: could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So all relation directory is removed!
It happens only for first table created in tablespace.
If you create table in Postgres data directory everything is ok: first
segment of relation is truncated but not deleted.

Whoa. There should be an _init fork that doesn't get removed, and
without removing the _init fork you shouldn't be able to remove the
directory that contains it.

Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
have locally a standby pointing as well to this tablespace?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Michael Paquier (#3)
Re: Unlogged tables cleanup

On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:

Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
have locally a standby pointing as well to this tablespace?

No, it is latest sources from Postgres repository.
Please notice that you should create new database and tablespace to reproduce this issue.
So actually the whole sequence is

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,100000));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael@paquier.xyz
In reply to: Konstantin Knizhnik (#4)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
<k.knizhnik@postgrespro.ru> wrote:

No, it is latest sources from Postgres repository.
Please notice that you should create new database and tablespace to reproduce this issue.
So actually the whole sequence is

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,100000));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

OK, I understood what I was missing. This can be reproduced with
wal_level = minimal. When using hot_standby things are working
properly.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#5)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 4:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
<k.knizhnik@postgrespro.ru> wrote:

No, it is latest sources from Postgres repository.
Please notice that you should create new database and tablespace to reproduce this issue.
So actually the whole sequence is

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,100000));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

OK, I understood what I was missing. This can be reproduced with
wal_level = minimal. When using hot_standby things are working
properly.

Okay, so what happens is that the CREATE TABLESPACE record removes the
tablespace directory and recreates a fresh one, but as no CREATE
records are created for unlogged tables the init fork is not
re-created. It seems to me that we should log a record to recreate the
INIT fork, and that heap_create_with_catalog() is missing something.
Generating a record in RelationCreateStorage() is the more direct way,
and that actually fixes the issue. Now the comments at the top of it
mention that RelationCreateStorage() should only be used to create the
MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
the end of heap_create()?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#6)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Okay, so what happens is that the CREATE TABLESPACE record removes the
tablespace directory and recreates a fresh one, but as no CREATE
records are created for unlogged tables the init fork is not
re-created. It seems to me that we should log a record to recreate the
INIT fork, and that heap_create_with_catalog() is missing something.
Generating a record in RelationCreateStorage() is the more direct way,
and that actually fixes the issue. Now the comments at the top of it
mention that RelationCreateStorage() should only be used to create the
MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
the end of heap_create()?

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.
--
Michael

Attachments:

unlogged-tbspace-fix.patchtext/plain; charset=US-ASCII; name=unlogged-tbspace-fix.patchDownload+1-2
#8Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#7)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Okay, so what happens is that the CREATE TABLESPACE record removes the
tablespace directory and recreates a fresh one, but as no CREATE
records are created for unlogged tables the init fork is not
re-created. It seems to me that we should log a record to recreate the
INIT fork, and that heap_create_with_catalog() is missing something.
Generating a record in RelationCreateStorage() is the more direct way,
and that actually fixes the issue. Now the comments at the top of it
mention that RelationCreateStorage() should only be used to create the
MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
the end of heap_create()?

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.

I think that we should also update other *buildempty() functions as well.
For example, if the table has a primary key, we'll encounter the error again
for btree index.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Michael Paquier
michael@paquier.xyz
In reply to: Kuntal Ghosh (#8)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.

I think that we should also update other *buildempty() functions as well.
For example, if the table has a primary key, we'll encounter the error again
for btree index.

Good point. I just had a look at all the AMs: bloom, btree and SP-gist
are plainly broken. The others are fine. Attached is an updated patch.

Here are the tests I have done with wal_level = minimal to test all the AMs:
\! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
set default_tablespace = popo;
create unlogged table foo (a int);
insert into foo values (generate_series(1,10000));
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
insert into foo_geo values ('(1,2),(2,3)');
create index foo_spgist ON foo_geo using spgist(a);

-- crash PG
-- Insert some data
insert into foo values (1000000);
insert into foo_geo values ('(50,12),(312,3)');

This should create 8 INIT forks, 6 for the indexes and 2 for the
tables. On HEAD, only 3 are getting created and with the patch all of
them are.
--
Michael

Attachments:

unlogged-tbspace-fix-v2.patchtext/x-patch; charset=US-ASCII; name=unlogged-tbspace-fix-v2.patchDownload+14-20
#10Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#9)
Re: Unlogged tables cleanup

On Thu, Nov 10, 2016 at 9:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.

I think that we should also update other *buildempty() functions as well.
For example, if the table has a primary key, we'll encounter the error again
for btree index.

Good point. I just had a look at all the AMs: bloom, btree and SP-gist
are plainly broken. The others are fine. Attached is an updated patch.

Here are the tests I have done with wal_level = minimal to test all the AMs:
\! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
set default_tablespace = popo;
create unlogged table foo (a int);
insert into foo values (generate_series(1,10000));
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
insert into foo_geo values ('(1,2),(2,3)');
create index foo_spgist ON foo_geo using spgist(a);

-- crash PG
-- Insert some data
insert into foo values (1000000);
insert into foo_geo values ('(50,12),(312,3)');

This should create 8 INIT forks, 6 for the indexes and 2 for the
tables. On HEAD, only 3 are getting created and with the patch all of
them are.

The header comment for heap_create_init_fork() says this:

/*
* Set up an init fork for an unlogged table so that it can be correctly
* reinitialized on restart. Since we're going to do an immediate sync, we
* only need to xlog this if archiving or streaming is enabled. And the
* immediate sync is required, because otherwise there's no guarantee that
* this will hit the disk before the next checkpoint moves the redo pointer.
*/

Your patch causes the code not to match the comment any more. And the
comment explains why at the time I wrote this code I thought it was OK
to have the XLogIsNeeded() test in there, so it clearly needs
updating.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#10)
Re: Unlogged tables cleanup

On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The header comment for heap_create_init_fork() says this:

/*
* Set up an init fork for an unlogged table so that it can be correctly
* reinitialized on restart. Since we're going to do an immediate sync, we
* only need to xlog this if archiving or streaming is enabled. And the
* immediate sync is required, because otherwise there's no guarantee that
* this will hit the disk before the next checkpoint moves the redo pointer.
*/

Your patch causes the code not to match the comment any more. And the
comment explains why at the time I wrote this code I thought it was OK
to have the XLogIsNeeded() test in there, so it clearly needs
updating.

Indeed I missed this comment block. Please let me suggest the following instead:
 /*
  * Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
  */
-- 
Michael

Attachments:

unlogged-tbspace-fix-v3.patchapplication/x-patch; name=unlogged-tbspace-fix-v3.patchDownload+18-24
#12Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#11)
Re: Unlogged tables cleanup

On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The header comment for heap_create_init_fork() says this:

/*
* Set up an init fork for an unlogged table so that it can be correctly
* reinitialized on restart. Since we're going to do an immediate sync, we
* only need to xlog this if archiving or streaming is enabled. And the
* immediate sync is required, because otherwise there's no guarantee that
* this will hit the disk before the next checkpoint moves the redo pointer.
*/

Your patch causes the code not to match the comment any more. And the
comment explains why at the time I wrote this code I thought it was OK
to have the XLogIsNeeded() test in there, so it clearly needs
updating.

Indeed I missed this comment block. Please let me suggest the following instead:
/*
* Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
*/

Hmm. Well, that deletes the comment that's no longer true, but it
doesn't replace it with any explanation of why we also need to WAL-log
it unconditionally, and I think that explanation is not entirely
trivial?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#12)
Re: Unlogged tables cleanup

On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Indeed I missed this comment block. Please let me suggest the following instead:
/*
* Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
*/

Hmm. Well, that deletes the comment that's no longer true, but it
doesn't replace it with any explanation of why we also need to WAL-log
it unconditionally, and I think that explanation is not entirely
trivial?

OK, the original code does not give any special reason either
regarding why doing so is safe for archiving or streaming :)

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#13)
Re: Unlogged tables cleanup

On Wed, Nov 16, 2016 at 11:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Indeed I missed this comment block. Please let me suggest the following instead:
/*
* Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
*/

Hmm. Well, that deletes the comment that's no longer true, but it
doesn't replace it with any explanation of why we also need to WAL-log
it unconditionally, and I think that explanation is not entirely
trivial?

OK, the original code does not give any special reason either
regarding why doing so is safe for archiving or streaming :)

Yeah, but surely it's obvious that if you don't WAL log it, it's not
going to work for archiving or streaming. It's a lot less obvious why
you have to WAL log it when you're not doing either of those things if
you've already written it and fsync'd it locally. How is it going to
disappear if it's been fsync'd, one wonders?

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?

Yes, that's about what I think we need to explain. Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere. We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#14)
Re: Unlogged tables cleanup

On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, but surely it's obvious that if you don't WAL log it, it's not
going to work for archiving or streaming. It's a lot less obvious why
you have to WAL log it when you're not doing either of those things if
you've already written it and fsync'd it locally. How is it going to
disappear if it's been fsync'd, one wonders?

That's not obvious that replaying a WAL record for a database creation
or tablespace creation would cause that for sure! I didn't know that
redo was wiping them out with rmtree() at replay before looking at
this bug. One more thing to recall when fixing an issue in this area
in the future.

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?

Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+    * Write the page and log it unconditionally.  This is important
+    * particularly for indexes created on tablespaces and databases
+    * whose creation happened after the last redo pointer as recovery
+    * removes any of their existing content when the corresponding
+    * create records are replayed.
+    */
I have not been able to use the word "create" less than that. The
patch is really repetitive, but I think that we had better mention the
need of logging the content in each code path and not touch
log_newpage() to keep it a maximum flexible.

Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere. We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.
--
Michael

Attachments:

unlogged-tbspace-fix-v4.patchapplication/x-download; name=unlogged-tbspace-fix-v4.patchDownload+39-24
#16Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#15)
Re: Unlogged tables cleanup

Michael, your greetings were passed on to me with a request that I
look at this thread.

On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?

Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+    * Write the page and log it unconditionally.  This is important
+    * particularly for indexes created on tablespaces and databases
+    * whose creation happened after the last redo pointer as recovery
+    * removes any of their existing content when the corresponding
+    * create records are replayed.
+    */
I have not been able to use the word "create" less than that. The
patch is really repetitive, but I think that we had better mention the
need of logging the content in each code path and not touch
log_newpage() to keep it a maximum flexible.

In blinsert.c, nbtree.c, etc. how about:

Write the page and log it. It might seem that an immediate sync would
be sufficient to guarantee that the file exists on disk, but recovery
itself might remove it while replaying, for example, an
XLOG_DBASE_CREATE record. Therefore, we need this even when
wal_level=minimal.

Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere. We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.

Right (I think). If we set and clear delayChkpt around this work, we
don't need the immediate sync.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#16)
Re: Unlogged tables cleanup

On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Michael, your greetings were passed on to me with a request that I
look at this thread.

Thanks for showing up!

On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?

Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+    * Write the page and log it unconditionally.  This is important
+    * particularly for indexes created on tablespaces and databases
+    * whose creation happened after the last redo pointer as recovery
+    * removes any of their existing content when the corresponding
+    * create records are replayed.
+    */
I have not been able to use the word "create" less than that. The
patch is really repetitive, but I think that we had better mention the
need of logging the content in each code path and not touch
log_newpage() to keep it a maximum flexible.

In blinsert.c, nbtree.c, etc. how about:

Write the page and log it. It might seem that an immediate sync would
be sufficient to guarantee that the file exists on disk, but recovery
itself might remove it while replaying, for example, an
XLOG_DBASE_CREATE record. Therefore, we need this even when
wal_level=minimal.

OK, I rewrote a bit the patch as attached. What do you think?

Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere. We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.

Right (I think). If we set and clear delayChkpt around this work, we
don't need the immediate sync.

My point is a bit different than what you mean I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines because other transactions
would not refer to it until this transaction has committed. So I am
arguing about just removing the sync phase.
--
Michael

Attachments:

unlogged-tbspace-fix-v5.patchtext/plain; charset=US-ASCII; name=unlogged-tbspace-fix-v5.patchDownload+38-24
#18Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#17)
Re: Unlogged tables cleanup

On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

OK, I rewrote a bit the patch as attached. What do you think?

Committed and back-patched all the way back to 9.2.

Right (I think). If we set and clear delayChkpt around this work, we
don't need the immediate sync.

My point is a bit different than what you mean I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines because other transactions
would not refer to it until this transaction has committed. So I am
arguing about just removing the sync phase.

That doesn't sound right; see the comment for heap_create_init_fork.
Suppose the transaction creating the unlogged table commits, a
checkpoint happens, and then the operating system crashes. Without
the immediate sync, the operating system crash can cause the un-sync'd
file to crash, and because of the checkpoint the WAL record that
creates it isn't replayed either. So the file's just gone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#18)
Re: Unlogged tables cleanup

On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

OK, I rewrote a bit the patch as attached. What do you think?

Committed and back-patched all the way back to 9.2.

Thanks!

Right (I think). If we set and clear delayChkpt around this work, we
don't need the immediate sync.

My point is a bit different than what you mean I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines because other transactions
would not refer to it until this transaction has committed. So I am
arguing about just removing the sync phase.

That doesn't sound right; see the comment for heap_create_init_fork.
Suppose the transaction creating the unlogged table commits, a
checkpoint happens, and then the operating system crashes. Without
the immediate sync, the operating system crash can cause the un-sync'd
file to crash, and because of the checkpoint the WAL record that
creates it isn't replayed either. So the file's just gone.

Doh. That would have made sense if the checkpoint was actually
flushing the page if it was in shared buffers. But that's not the
case.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#19)
Re: [HACKERS] Unlogged tables cleanup

On Thu, Dec 8, 2016 at 7:49 PM Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

OK, I rewrote a bit the patch as attached. What do you think?

Committed and back-patched all the way back to 9.2.

Thanks!

This thread resulted in commit
fa0f466d5329e10b16f3b38c8eaf5306f7e234e8, and today I had cause to
look at that commit again. The code is now in
heapam_relation_set_new_filenode. I noticed a few things that seem
like they are not quite right.

1. The comment added in that commit says "Recover may as well remove
it while replaying..." but what is really meant is "Recovery may well
remove it well replaying..." The phrase "may as well" means that
there isn't really any reason not to do it even if it's not strictly
necessary. The phrase "may well" means that it's entirely possible.
The latter meaning is the one we want here.

2. The comment as adjusted in that commit refers to needing an
immediate sync "even if the page has been logged, because the write
did not go through shared_buffers," but there is no page and no write,
because an empty heap is just an empty file. That logic makes sense
for index AMs that bypass shared buffers to write a metapage, e.g.
nbtree, as opposed to others which go through shared_buffers and don't
have the immediate sync, e.g. brin. However, the heap writes no pages
either through shared buffers or bypassing shared buffers, so either
I'm confused or the comment makes no sense.

3. Before that commit, the comment said that "the immediate sync is
required, because otherwise there's no guarantee that this will hit
the disk before the next checkpoint moves the redo pointer." However,
that justification seems to apply to *anything* that does smgrcreate +
log_smgrcreate would also need to do smgrimmedsync, and
RelationCreateStorage doesn't. Unless, for some reason that I'm not
thinking of right now, the init fork has stronger durability
requirements that the main fork, it's hard to understand why
heapam_relation_set_new_filenode can call RelationCreateStorage to do
smgrcreate + log_smgrcreate for the main fork and that's OK, but when
it does the same thing itself for the init fork, it now needs
smgrimmedsync as well.

My guess, just shooting from the hip, is that the smgrimmedsync call
can be removed here. If that's wrong, then we need a better
explanation for why it's needed, and we possibly need to add it to
every single place that does smgrcreate that doesn't have it already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#21)
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#22)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#23)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#24)
#29Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#25)
#30Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#27)
#31Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#30)
#32Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#20)
#33Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#28)
#34Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#31)
#35Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#33)
#36Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#36)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#39)
#41Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#41)
#43Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#43)