hash index on unlogged tables doesn't behave as expected
While discussing the behavior of hash indexes with Bruce in the nearby
thread [1], it has been noticed that hash indexes on unlogged tables
don't behave as expected. Prior to 10, they had a different set of
problems (mainly because hash indexes were not WAL-logged), which were
discussed on that thread [1]; however, when I checked, it doesn't work
even for 10. Below are the steps to reproduce the problem.
1. Set up a master and a standby.
2. On the master, create an unlogged table and a hash index:
2A. Create unlogged table t1(c1 int);
2B. Create index idx_t1_hash on t1 using hash(c1);
3. On the standby, try selecting data:
select * from t1;
ERROR: cannot access temporary or unlogged relations during recovery
---Till here everything works as expected
4. Promote the standby to master (I simply stopped the standby and the
master and removed the recovery.conf file from the standby's data
directory) and try to start the new master; it reports the error below
and fails to start.
FATAL: could not create file "base/12700/16387": File exists
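For convenience, here is the same sequence condensed into a rough script.
The data directory names and the promotion method are only illustrative,
not the exact commands I ran:

-- on the master
CREATE UNLOGGED TABLE t1 (c1 int);
CREATE INDEX idx_t1_hash ON t1 USING hash (c1);

-- on the standby, while it is still in recovery
SELECT * FROM t1;
-- ERROR:  cannot access temporary or unlogged relations during recovery

# "promote" by stopping both servers, removing recovery.conf from the
# standby's data directory, and starting the standby again
pg_ctl -D standby_data stop
pg_ctl -D master_data stop
rm standby_data/recovery.conf
pg_ctl -D standby_data start
# FATAL:  could not create file "base/12700/16387": File exists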
The basic issue was that the WAL logging for the Create Index operation
was oblivious to the fact that for unlogged tables only the INIT fork
needs to be logged. Another point we need to consider is that while
replaying the WAL for the create index operation, we need to flush the
buffer if it is for the init fork. This needs to be done only for pages
that can be part of the init fork file (like the metapage and bitmap page).
The attached patch fixes the issue.
Another approach to fix the issue could be to always log complete
pages for the create index operation on unlogged tables (in
hashbuildempty). However, the logic for the initial hash index pages
requires us to perform certain other actions (like updating the metapage
after the creation of the bitmap page), which makes it difficult to
follow that approach.
I think this should be considered a PostgreSQL 10 open item.
[1]: /messages/by-id/20170630005634.GA4448@momjian.us
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_unlogged_hash_index_issue_v1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 0ea11b2..8dd4cd1 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -33,6 +33,7 @@ hash_xlog_init_meta_page(XLogReaderState *record)
XLogRecPtr lsn = record->EndRecPtr;
Page page;
Buffer metabuf;
+ ForkNumber forknum;
xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
@@ -44,6 +45,15 @@ hash_xlog_init_meta_page(XLogReaderState *record)
page = (Page) BufferGetPage(metabuf);
PageSetLSN(page, lsn);
MarkBufferDirty(metabuf);
+
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
+
/* all done */
UnlockReleaseBuffer(metabuf);
}
@@ -60,6 +70,7 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
Page page;
HashMetaPage metap;
uint32 num_buckets;
+ ForkNumber forknum;
xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
@@ -70,6 +81,14 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
PageSetLSN(BufferGetPage(bitmapbuf), lsn);
MarkBufferDirty(bitmapbuf);
+
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(bitmapbuf);
UnlockReleaseBuffer(bitmapbuf);
/* add the new bitmap page to the metapage's list of bitmaps */
@@ -90,6 +109,10 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
PageSetLSN(page, lsn);
MarkBufferDirty(metabuf);
+
+ XLogRecGetBlockTag(record, 1, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
}
if (BufferIsValid(metabuf))
UnlockReleaseBuffer(metabuf);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 4544889..b4bf8ab 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -345,6 +345,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
int32 ffactor;
uint32 num_buckets;
uint32 i;
+ bool use_wal;
/* safety check */
if (RelationGetNumberOfBlocksInFork(rel, forkNum) != 0)
@@ -352,6 +353,14 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
RelationGetRelationName(rel));
/*
+ * WAL log creation of pages if the relation is persistent, or this is the
+ * init fork of an unlogged relation.
+ */
+ use_wal = RelationNeedsWAL(rel) ||
+ (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM);
+
+ /*
* Determine the target fill factor (in tuples per bucket) for this index.
* The idea is to make the fill factor correspond to pages about as full
* as the user-settable fillfactor parameter says. We can compute it
@@ -384,7 +393,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
metap = HashPageGetMeta(pg);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (use_wal)
{
xl_hash_init_meta_page xlrec;
XLogRecPtr recptr;
@@ -427,11 +436,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
MarkBufferDirty(buf);
- log_newpage(&rel->rd_node,
- forkNum,
- blkno,
- BufferGetPage(buf),
- true);
+ if (use_wal)
+ log_newpage(&rel->rd_node,
+ forkNum,
+ blkno,
+ BufferGetPage(buf),
+ true);
_hash_relbuf(rel, buf);
}
@@ -459,7 +469,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
MarkBufferDirty(metabuf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (use_wal)
{
xl_hash_init_bitmap_page xlrec;
XLogRecPtr recptr;
Hi,
On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
While discussing the behavior of hash indexes with Bruce in the nearby
thread [1], it has been noticed that hash index on unlogged tables
doesn't behave as expected. Prior to 10, it has the different set of
problems (mainly because hash indexes are not WAL-logged) which were
discussed on that thread [1], however when I checked, it doesn't work
even for 10. Below are steps to reproduce the problem.
1. Setup master and standby
2. On the master, create unlogged table and hash index.
2A. Create unlogged table t1(c1 int);
2B. Create hash index idx_t1_hash on t1 using hash(c1);
3. On Standby, try selecting data,
select * from t1;
ERROR: cannot access temporary or unlogged relations during recovery
---Till here everything works as expected
4. Promote standby to master (I have just stopped the standby and
master and removed recovery.conf file from the standby database
location) and try starting the new master, it
gives below error and doesn't get started.
FATAL: could not create file "base/12700/16387": File exists
The basic issue was that the WAL logging for Create Index operation
was oblivion of the fact that for unlogged tables only INIT forks need
to be logged. Another point which we need to consider is that while
replaying the WAL for the create index operation, we need to flush the
buffer if it is for init fork. This needs to be done only for pages
that can be part of init fork file (like metapage, bitmappage).
Attached patch fixes the issue.
Another approach to fix the issue could be that always log complete
pages for the create index operation on unlogged tables (in
hashbuildempty). However, the logic for initial hash index pages
needs us to perform certain other actions (like update metapage after
the creation of bitmappage) which make it difficult to follow that
approach.
Thanks for sharing the steps to reproduce the issue in detail. I could
easily reproduce this issue. I also had a quick look at the patch, and the
fix looks fine to me. However, it would be good if we could add at least one
test for an unlogged table with a hash index to the regression test suite.
Apart from that, I have tested the patch and it seems to be working
fine. Here are the steps I followed to verify the fix:
With Patch:
========
postgres=# SELECT pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/14915/16384
(1 row)
postgres=# SELECT pg_relation_filepath('idx_t1_hash');
pg_relation_filepath
----------------------
base/14915/16387
(1 row)
Master:
=====
[ashu@localhost bin]$ ls -l ../master/base/14915/1638*
-rw-------. 1 ashu ashu 18128896 Jul 3 14:29 ../master/base/14915/16384
-rw-------. 1 ashu ashu 24576 Jul 3 14:29 ../master/base/14915/16384_fsm
-rw-------. 1 ashu ashu 0 Jul 3 14:28 ../master/base/14915/16384_init
-rw-------. 1 ashu ashu 22339584 Jul 3 14:29 ../master/base/14915/16387
-rw-------. 1 ashu ashu 32768 Jul 3 14:28 ../master/base/14915/16387_init
Standby:
======
[ashu@localhost bin]$ ls -l ../standby/base/14915/1638*
-rw-------. 1 ashu ashu 0 Jul 3 14:28 ../standby/base/14915/16384_init
-rw-------. 1 ashu ashu 32768 Jul 3 14:28 ../standby/base/14915/16387_init
Without patch:
==========
postgres=# SELECT pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/14915/16384
(1 row)
postgres=# SELECT pg_relation_filepath('idx_t1_hash');
pg_relation_filepath
----------------------
base/14915/16387
(1 row)
Master:
=====
[ashu@localhost bin]$ ls -l ../master/base/14915/1638*
-rw-------. 1 ashu ashu 18128896 Jul 3 14:36 ../master/base/14915/16384
-rw-------. 1 ashu ashu 24576 Jul 3 14:36 ../master/base/14915/16384_fsm
-rw-------. 1 ashu ashu 0 Jul 3 14:35 ../master/base/14915/16384_init
-rw-------. 1 ashu ashu 22339584 Jul 3 14:36 ../master/base/14915/16387
-rw-------. 1 ashu ashu 32768 Jul 3 14:35 ../master/base/14915/16387_init
Standby:
======
[ashu@localhost bin]$ ls -l ../standby/base/14915/1638*
-rw-------. 1 ashu ashu 0 Jul 3 14:35 ../standby/base/14915/16384_init
*-rw-------. 1 ashu ashu 73728 Jul 3 14:35 ../standby/base/14915/16387*
-rw-------. 1 ashu ashu 24576 Jul 3 14:35 ../standby/base/14915/16387_init
As seen from the test results above, it is clear that without the patch both
the main fork and the init fork were getting WAL-logged, since the create
hash index WAL logging was done without checking the fork type.
_hash_init()
{
    ...
    log_newpage(&rel->rd_node,
                forkNum,
                blkno,
                BufferGetPage(buf),
                true);
    _hash_relbuf(rel, buf);
    ...
}
I have also run the WAL consistency check tool to verify things and
didn't find any issues. Thanks.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Mon, Jul 3, 2017 at 3:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
The basic issue was that the WAL logging for Create Index operation
was oblivion of the fact that for unlogged tables only INIT forks need
to be logged. Another point which we need to consider is that while
replaying the WAL for the create index operation, we need to flush the
buffer if it is for init fork. This needs to be done only for pages
that can be part of init fork file (like metapage, bitmappage).
Attached patch fixes the issue.
Another approach to fix the issue could be that always log complete
pages for the create index operation on unlogged tables (in
hashbuildempty). However, the logic for initial hash index pages
needs us to perform certain other actions (like update metapage after
the creation of bitmappage) which make it difficult to follow that
approach.
Thanks for sharing the steps to reproduce the issue in detail. I could
easily reproduce this issue. I also had a quick look into the patch and the
fix looks fine to me. However, it would be good if we can add at least one
test for unlogged table with hash index in the regression test suite.
There is already such a test, see create_index.sql.
CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
int4_ops);
Do you have something else in mind?
I think the problem mentioned in this thread can be caught only if we
promote the standby and restart it. We might be able to write it
using the TAP framework, but I think such a test should exist for other
index types as well, or in general for unlogged tables. I am not
opposed to writing such a test if you and/or others feel so.
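For what it's worth, a minimal TAP sketch of such a test could look roughly
like the following (untested; the method names are from the current
src/test/perl PostgresNode module as I remember them, and the node, table,
and index names are made up):

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Set up a master with streaming replication and a standby from its backup.
my $master = get_new_node('master');
$master->init(allows_streaming => 1);
$master->start;
$master->backup('bkp');

my $standby = get_new_node('standby');
$standby->init_from_backup($master, 'bkp', has_streaming => 1);
$standby->start;

# Create an unlogged table with a hash index on the master.
$master->safe_psql('postgres',
	'CREATE UNLOGGED TABLE t1 (c1 int); '
	. 'CREATE INDEX idx_t1_hash ON t1 USING hash (c1);');
$master->wait_for_catchup($standby, 'replay', $master->lsn('insert'));

# Promote the standby, restart it (this is where the unpatched code
# failed to start), and check the now-empty unlogged table is usable.
$standby->promote;
$standby->restart;
$standby->safe_psql('postgres', 'INSERT INTO t1 VALUES (1);');
my $count = $standby->safe_psql('postgres',
	'SELECT count(*) FROM t1 WHERE c1 = 1;');
is($count, '1', 'unlogged table with hash index usable after promotion');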
Apart from that i have tested the patch and the patch seems to be working
fine.
Thanks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On Mon, Jul 3, 2017 at 4:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jul 3, 2017 at 3:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
The basic issue was that the WAL logging for Create Index operation
was oblivion of the fact that for unlogged tables only INIT forks need
to be logged. Another point which we need to consider is that while
replaying the WAL for the create index operation, we need to flush the
buffer if it is for init fork. This needs to be done only for pages
that can be part of init fork file (like metapage, bitmappage).
Attached patch fixes the issue.
Another approach to fix the issue could be that always log complete
pages for the create index operation on unlogged tables (in
hashbuildempty). However, the logic for initial hash index pages
needs us to perform certain other actions (like update metapage after
the creation of bitmappage) which make it difficult to follow that
approach.
Thanks for sharing the steps to reproduce the issue in detail. I could
easily reproduce this issue. I also had a quick look into the patch and the
fix looks fine to me. However, it would be good if we can add at least one
test for unlogged table with hash index in the regression test suite.
There is already such a test, see create_index.sql.
CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
int4_ops);
Do you have something else in mind?
Yes, that's what I had in mind. But it is good to know that we already
have a test case with a hash index created on an unlogged table.
I think the problem mentioned in this thread can be caught only if we
promote the standby and restart it. We might be able to write it
using TAP framework, but I think such a test should exist for other
index types as well or in general for unlogged tables. I am not
opposed to writing such a test if you and or others feel so.
I think it would be a good thing to add such test cases using the TAP
framework, but as you said, it would be good to get others' opinions on
this as well. Thanks.
Apart from that i have tested the patch and the patch seems to be working
fine.
Thanks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
There is already such a test, see create_index.sql.
CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
int4_ops);
Do you have something else in mind?
I think the problem mentioned in this thread can be caught only if we
promote the standby and restart it. We might be able to write it
using TAP framework, but I think such a test should exist for other
index types as well or in general for unlogged tables. I am not
opposed to writing such a test if you and or others feel so.
There have been many requests for something like that. This overlaps a
lot with the existing make check though, so a buildfarm module using
wal_consistency_checking may be a better idea than something in core.
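(For reference, that would essentially mean running the existing regression
suite on a node that generates WAL with something like the following in
postgresql.conf, and a standby or crash recovery replaying it; 'hash'
restricts the checking to the hash AM, 'all' covers every resource manager:)

# postgresql.conf on the WAL-generating node
wal_consistency_checking = 'hash'    # or 'all'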
Apart from that i have tested the patch and the patch seems to be working
fine.
Thanks.
It seems to me that this qualifies as an open item for 10. WAL-logging
is new for hash indexes. Amit, could you add one?
+ use_wal = RelationNeedsWAL(rel) ||
+ (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM);
In access AMs, empty() routines are always only called for unlogged
relations, so you could relax that to check for INIT_FORKNUM only.
Looking at the patch, I think that you are right to do the logging
directly in _hash_init() and not separately in hashbuildempty().
Compared to other relations the init data is more dynamic.
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
I find those forced syncs a bit surprising. The buffer is already
marked as dirty, so even if there is a concurrent checkpoint they
would be flushed. If recovery comes again here, then they would just
be replayed. I am unclear why XLogReadBufferForRedoExtended is not
enough to replay a FPW of that.
--
Michael
On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
It seems to me that this qualifies as an open item for 10. WAL-logging
is new for hash tables. Amit, could you add one?
Added.
+ use_wal = RelationNeedsWAL(rel) ||
+ (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM);
In access AMs, empty() routines are always only called for unlogged
relations, so you could relax that to check for INIT_FORKNUM only.
Yeah, but _hash_init() is an exposed API, so I am not sure it is a
good idea to make such an assumption at that level of code. Now, as
there is no current user which wants to use it any other way, we can
make such an assumption, but does it serve any purpose?
Looking at the patch, I think that you are right to do the logging
directly in _hash_init() and not separately in hashbuildempty().
Compared to other relations the init data is more dynamic.
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
I find those forced syncs a bit surprising. The buffer is already
marked as dirty, so even if there is a concurrent checkpoint they
would be flushed.
What if there is no concurrent checkpoint activity and the server is
shut down? Remember this replay happens on the standby, and we will just
try to perform a restartpoint; there is no guarantee that it will
flush the buffers. If the buffers are not flushed at shutdown, then
the init fork will be empty on restart and the hash index will be
corrupt.
If recovery comes again here, then they would just
be replayed. I am unclear why XLogReadBufferForRedoExtended is not
enough to replay a FPW of that.
Where does FPW come into the picture for a replay of metapage or bitmappage?
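To make the distinction concrete, this is roughly the relevant replay-side
logic, paraphrased from memory rather than quoted verbatim from xlogutils.c:
when a record carries a full-page image, XLogReadBufferForRedoExtended
itself flushes init-fork pages after restoring the image, but the hash
metapage and bitmap page are rebuilt from custom init records without an
FPW, so their redo routines have to do the equivalent flush themselves:

/* Inside XLogReadBufferForRedoExtended(), FPW-restore path (paraphrased): */
if (XLogRecHasBlockImage(record, block_id))
{
    /* ... restore the full-page image into the buffer ... */
    MarkBufferDirty(*buf);

    /*
     * The init fork must be on disk at the end of recovery, because
     * unlogged relations are reset from it without going through
     * shared buffers.
     */
    if (forknum == INIT_FORKNUM)
        FlushOneBuffer(*buf);
}

/*
 * hash_xlog_init_meta_page() and hash_xlog_init_bitmap_page() rebuild the
 * page from the record's data instead of restoring an FPW, so the patch
 * adds the same INIT_FORKNUM check plus FlushOneBuffer in those routines.
 */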
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 03, 2017 at 09:12:01AM +0530, Amit Kapila wrote:
While discussing the behavior of hash indexes with Bruce in the nearby
thread [1], it has been noticed that hash index on unlogged tables
doesn't behave as expected. Prior to 10, it has the different set of
problems (mainly because hash indexes are not WAL-logged) which were
discussed on that thread [1], however when I checked, it doesn't work
even for 10. Below are steps to reproduce the problem.
1. Setup master and standby
2. On the master, create unlogged table and hash index.
2A. Create unlogged table t1(c1 int);
2B. Create hash index idx_t1_hash on t1 using hash(c1);
3. On Standby, try selecting data,
select * from t1;
ERROR: cannot access temporary or unlogged relations during recovery
---Till here everything works as expected
4. Promote standby to master (I have just stopped the standby and
master and removed recovery.conf file from the standby database
location) and try starting the new master, it
gives below error and doesn't get started.
FATAL: could not create file "base/12700/16387": File exists
The basic issue was that the WAL logging for Create Index operation
was oblivion of the fact that for unlogged tables only INIT forks need
to be logged. Another point which we need to consider is that while
replaying the WAL for the create index operation, we need to flush the
buffer if it is for init fork. This needs to be done only for pages
that can be part of init fork file (like metapage, bitmappage).
Attached patch fixes the issue.
Another approach to fix the issue could be that always log complete
pages for the create index operation on unlogged tables (in
hashbuildempty). However, the logic for initial hash index pages
needs us to perform certain other actions (like update metapage after
the creation of bitmappage) which make it difficult to follow that
approach.
I think this should be considered a PostgreSQL 10 open item.
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.
[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
Hello,
At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+SYqCmA7ioTBpJHcO-B-rf8A=N9Gr1-RP3RhwecB5E-A@mail.gmail.com>
On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
It seems to me that this qualifies as an open item for 10. WAL-logging
is new for hash tables. Amit, could you add one?
Added.
+ use_wal = RelationNeedsWAL(rel) ||
+ (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM);
In access AMs, empty() routines are always only called for unlogged
relations, so you could relax that to check for INIT_FORKNUM only.
Yeah, but _hash_init() is an exposed API, so I am not sure it is a
good idea to make such an assumption at that level of code. Now, as
there is no current user which wants to use it any other way, we can
make such an assumption, but does it serve any purpose?
A check for INIT_FORKNUM appears both accompanied and not
accompanied by a check for RELPERSISTENCE_UNLOGGED, so I'm not sure which
is the convention here.
The difference between the two is the init fork of a TEMP
relation. However, since I believe an init fork is by definition
part of an unlogged relation, it seems to me that it ought not to
be WAL-logged if we had one. From this viewpoint, the middle line
makes sense.
Looking at the patch, I think that you are right to do the logging
directly in _hash_init() and not separately in hashbuildempty().
Compared to other relations the init data is more dynamic.
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
I find those forced syncs a bit surprising. The buffer is already
marked as dirty, so even if there is a concurrent checkpoint they
would be flushed.
What if there is no concurrent checkpoint activity and the server is
shutdown? Remember this replay happens on standby and we will just
try to perform Restartpoint and there is no guarantee that it will
flush the buffers. If the buffers are not flushed at shutdown, then
the Init fork will be empty on restart and the hash index will be
corrupt.
XLogReadBufferForRedoExtended() automatically forces the flush for
an XLOG record on the init fork that has FPW images. And
_hash_init seems to have emitted such an XLOG record using
log_newpage.
If recovery comes again here, then they would just
be replayed. I am unclear why XLogReadBufferForRedoExtended is not
enough to replay a FPW of that.
Where does FPW come into the picture for a replay of metapage or bitmappage?
Since the required FPW seems to be emitted and the flush should be
done automatically, the issuer side (_hash_init) may attach wrong
FPW images if hash_xlog_init_meta_page requires additional
flushes. But I don't see such a flaw there. I think that is the
kind of thing Michael wants to say.
By the way, the comment on the function ReadBufferWithoutRelcache
has the following sentence.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
FWIW..
At Thu, 06 Jul 2017 18:54:47 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170706.185447.256482539.horiguchi.kyotaro@lab.ntt.co.jp>
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
I find those forced syncs a bit surprising. The buffer is already
marked as dirty, so even if there is a concurrent checkpoint they
would be flushed.
What if there is no concurrent checkpoint activity and the server is
shutdown? Remember this replay happens on standby and we will just
try to perform Restartpoint and there is no guarantee that it will
flush the buffers. If the buffers are not flushed at shutdown, then
the Init fork will be empty on restart and the hash index will be
corrupt.
XLogReadBufferForRedoExtended() automatically force the flush for
a XLOG record on init fork that having FPW imaeges. And
_hash_init seems to have emitted such a XLOG record using
log_newpage.
If recovery comes again here, then they would just
be replayed. I am unclear why XLogReadBufferForRedoExtended is not
enough to replay a FPW of that.
Where does FPW come into the picture for a replay of metapage or bitmappage?
Since required FPW seems to be emitted and the flush should be
done automatically, the issuer side (_hash_init) may attach wrong
"may attach" -> "may have attached"
FPW images if hash_xlog_init_meta_page requires additional
flushes. But I don't see such a flaw there. I think Michael wants
to say such a kind of thing.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jul 6, 2017 at 3:24 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+SYqCmA7ioTBpJHcO-B-rf8A=N9Gr1-RP3RhwecB5E-A@mail.gmail.com>
On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
It seems to me that this qualifies as an open item for 10. WAL-logging
is new for hash tables. Amit, could you add one?
Added.
+ use_wal = RelationNeedsWAL(rel) ||
+ (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM);
In access AMs, empty() routines are always only called for unlogged
relations, so you could relax that to check for INIT_FORKNUM only.
Yeah, but _hash_init() is an exposed API, so I am not sure it is a
good idea to make such an assumption at that level of code. Now, as
there is no current user which wants to use it any other way, we can
make such an assumption, but does it serve any purpose?
Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.
The difference of the two is an init fork of TEMP
relations. However I believe that init fork is by definition a
part of an unlogged relation, it seems to me that it ought not to
be wal-logged if we had it. From this viewpoint, the middle line
makes sense.
What is the middle line? Are you okay with the current check, or do you
see any problem with it, or do you prefer to write it differently?
Looking at the patch, I think that you are right to do the logging
directly in _hash_init() and not separately in hashbuildempty().
Compared to other relations the init data is more dynamic.
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
I find those forced syncs a bit surprising. The buffer is already
marked as dirty, so even if there is a concurrent checkpoint they
would be flushed.
What if there is no concurrent checkpoint activity and the server is
shutdown? Remember this replay happens on standby and we will just
try to perform Restartpoint and there is no guarantee that it will
flush the buffers. If the buffers are not flushed at shutdown, then
the Init fork will be empty on restart and the hash index will be
corrupt.
XLogReadBufferForRedoExtended() automatically force the flush for
a XLOG record on init fork that having FPW imaeges. And
_hash_init seems to have emitted such a XLOG record using
log_newpage.
Sure, but log_newpage is not used for metapage and bitmappage.
If recovery comes again here, then they would just
be replayed. I am unclear why XLogReadBufferForRedoExtended is not
enough to replay a FPW of that.
Where does FPW come into the picture for a replay of metapage or bitmappage?
Since required FPW seems to be emitted and the flush should be
done automatically, the issuer side (_hash_init) may attach wrong
FPW images if hash_xlog_init_meta_page requires additional
flushes. But I don't see such a flaw there. I think Michael wants
to say such a kind of thing.
I am not able to understand what you want to say, but I think you
don't see any problem with the patch as such.
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.
I don't think we need anything with respect to this patch because
ReadBufferWithoutRelcache is used in case of FPW replay of INIT_FORK
pages as well. Do you have something else in mind which I am not able
to see?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 5, 2017 at 11:02 PM, Noah Misch <noah@leadboat.com> wrote:
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.
I don't intend to rush this in before the beta2 wrap, although some
other committer is welcome to do so if they feel confident in the fix.
I will update again by July 17th.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.
Checking only for INIT_FORKNUM seems logical to me. Also checking for
RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit. I
guess Amit copied the test from ATExecSetTablespace, which does it as
he did, but it seems unnecessarily long-winded.
The difference of the two is an init fork of TEMP
relations. However I believe that init fork is by definition a
part of an unlogged relation, it seems to me that it ought not to
be wal-logged if we had it. From this viewpoint, the middle line
makes sense.
Actually, the init fork of an unlogged relation *must* be WAL-logged.
All other forks are removed on a crash (and the main fork is copied
anew from the init fork). But the init fork itself must survive -
therefore it, and only it, must be WAL-logged and thus durable.
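A quick way to see that behavior in a psql session against any local
cluster (a sketch; object names are made up): after a crash the main fork
is rebuilt from the init fork, so the data is gone but the table and its
indexes remain usable:

CREATE UNLOGGED TABLE u (c1 int);
CREATE INDEX u_hash ON u USING hash (c1);
INSERT INTO u SELECT generate_series(1, 1000);

-- simulate a crash, e.g. pg_ctl -D <datadir> stop -m immediate, then start

SELECT count(*) FROM u;   -- 0: the main fork was recreated from the init fork
INSERT INTO u VALUES (1); -- still works, index included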
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.
Checking only for INIT_FORKNUM seems logical to me. Also checking for
RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit. I
guess Amit copied the test from ATExecSetTablespace, which does it as
he did, but it seems unnecessarily long-winded.
Okay. If you and Michael feel the check that way is better, I will
change and submit the patch.
The difference of the two is an init fork of TEMP
relations. However I believe that init fork is by definition a
part of an unlogged relation, it seems to me that it ought not to
be wal-logged if we had it. From this viewpoint, the middle line
makes sense.
Actually, the init fork of an unlogged relation *must* be WAL-logged.
All other forks are removed on a crash (and the main fork is copied
anew from the init fork). But the init fork itself must survive -
therefore it, and only it, must be WAL-logged and thus durable.
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, ...
True, although maybe hash indexes are the only way that happens today?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, ...
True, although maybe hash indexes are the only way that happens today?
No, I think it will happen for all other indexes as well. Basically,
we log new page for init forks of other indexes and then while
restoring that full page image, it will fall through the same path.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 8, 2017 at 9:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.
Checking only for INIT_FORKNUM seems logical to me. Also checking for
RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit. I
guess Amit copied the test from ATExecSetTablespace, which does it as
he did, but it seems unnecessarily long-winded.
Okay. If you and Michael feel the check that way is better, I will
change and submit the patch.
Changed as per suggestion.
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.
Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_unlogged_hash_index_issue_v2.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 429af11..4b81dd2 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -33,6 +33,7 @@ hash_xlog_init_meta_page(XLogReaderState *record)
XLogRecPtr lsn = record->EndRecPtr;
Page page;
Buffer metabuf;
+ ForkNumber forknum;
xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
@@ -44,6 +45,15 @@ hash_xlog_init_meta_page(XLogReaderState *record)
page = (Page) BufferGetPage(metabuf);
PageSetLSN(page, lsn);
MarkBufferDirty(metabuf);
+
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
+
/* all done */
UnlockReleaseBuffer(metabuf);
}
@@ -60,6 +70,7 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
Page page;
HashMetaPage metap;
uint32 num_buckets;
+ ForkNumber forknum;
xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
@@ -70,6 +81,14 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
PageSetLSN(BufferGetPage(bitmapbuf), lsn);
MarkBufferDirty(bitmapbuf);
+
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(bitmapbuf);
UnlockReleaseBuffer(bitmapbuf);
/* add the new bitmap page to the metapage's list of bitmaps */
@@ -90,6 +109,10 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
PageSetLSN(page, lsn);
MarkBufferDirty(metabuf);
+
+ XLogRecGetBlockTag(record, 1, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
}
if (BufferIsValid(metabuf))
UnlockReleaseBuffer(metabuf);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 1cb18a7..3b0ee50 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -345,6 +345,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
int32 ffactor;
uint32 num_buckets;
uint32 i;
+ bool use_wal;
/* safety check */
if (RelationGetNumberOfBlocksInFork(rel, forkNum) != 0)
@@ -352,6 +353,13 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
RelationGetRelationName(rel));
/*
+ * WAL log creation of pages if the relation is persistent, or this is the
+ * init fork. Init forks for unlogged relations always needs to be WAL
+ * logged.
+ */
+ use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM;
+
+ /*
* Determine the target fill factor (in tuples per bucket) for this index.
* The idea is to make the fill factor correspond to pages about as full
* as the user-settable fillfactor parameter says. We can compute it
@@ -384,7 +392,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
metap = HashPageGetMeta(pg);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (use_wal)
{
xl_hash_init_meta_page xlrec;
XLogRecPtr recptr;
@@ -427,11 +435,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
MarkBufferDirty(buf);
- log_newpage(&rel->rd_node,
- forkNum,
- blkno,
- BufferGetPage(buf),
- true);
+ if (use_wal)
+ log_newpage(&rel->rd_node,
+ forkNum,
+ blkno,
+ BufferGetPage(buf),
+ true);
_hash_relbuf(rel, buf);
}
@@ -459,7 +468,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
MarkBufferDirty(metabuf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (use_wal)
{
xl_hash_init_bitmap_page xlrec;
XLogRecPtr recptr;
fix_comment_atop_ReadBufferWithoutRelcache_v1.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 15795b0..9026870 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -673,10 +673,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
* ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
* a relcache entry for the relation.
*
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay. If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations and
+ * init fork of an unlogged relation, which is OK, because we only use it
+ * during XLOG replay. If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters.
*/
Buffer
ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw@mail.gmail.com>
On Sat, Jul 8, 2017 at 9:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.
Checking only for INIT_FORKNUM seems logical to me. Also checking for
RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit. I
guess Amit copied the test from ATExecSetTablespace, which does it as
he did, but it seems unnecessarily long-winded.
Okay. If you and Michael feel the check that way is better, I will
change and submit the patch.
Changed as per suggestion.
On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, ...
True, although maybe hash indexes are the only way that happens today?
No, I think it will happen for all other indexes as well. Basically,
we log new page for init forks of other indexes and then while
restoring that full page image, it will fall through the same path.
(Sorry, I didn't notice that the hash metapage is not using log_newpage)
For example, (bt|gin|gist|spgist|brin)buildempty use
log_newpage to log INIT_FORK metapages. I believe that the xlog
flow from log_newpage to XLogReadBufferForRedoExtended is
designed so that pages in init forks are surely flushed during
recovery, so if that path doesn't work we should fix it instead of
adding other flushes. Am I looking at the wrong place? (I think it is
working.)
If I'm understanding correctly here, hashbuild and hashbuildempty
should be refactored to use the same structure as the other *build
and *buildempty functions. Specifically, the metapage initialization
subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
or ...) would do only in-memory initialization and not emit WAL;
then *build and *buildempty would emit WAL in their required way.
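For comparison, the pattern being described is roughly what the other AMs'
*buildempty functions do today. The following is a from-memory sketch of
the nbtree flavor, not a verbatim copy of btbuildempty: build the metapage
locally, write it to the init fork, WAL-log it with log_newpage, and sync:

/* Rough shape of btbuildempty()-style init-fork creation (sketch only). */
void
sketch_buildempty(Relation index)
{
	Page	metapage = (Page) palloc(BLCKSZ);

	/* Build the metapage in local memory only; no WAL emitted here. */
	_bt_initmetapage(metapage, P_NONE, 0);

	/* Write it straight to the init fork, then WAL-log the full page. */
	RelationOpenSmgr(index);
	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
			  (char *) metapage, true);
	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
				BTREE_METAPAGE, metapage, true);

	/* The write bypassed shared buffers, so force it to disk. */
	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
}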
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.
Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
+ * NB: At present, this function may only be used on permanent relations and
+ * init fork of an unlogged relation, which is OK, because we only use it
+ * during XLOG replay. If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters.
*I* think the target of the function is permanent relations and
init forks, not unlogged relations. And I'd like to have an
additional comment about RELPERSISTENCE_PERMANENT. Something like
the attached.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
fix_comment_atop_ReadBufferWithoutRelcache_v2.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 15795b0..cc3980c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -673,10 +673,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
* ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
* a relcache entry for the relation.
*
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay. If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on main fork pages of
+ * permanent relations and init fork pages, which is OK, because we only use
+ * it during XLOG replay. If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters.
*/
Buffer
ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -689,6 +689,10 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
Assert(InRecovery);
+ /*
+ * Pages of init fork are permanent so RELPERSISTENCE_PERMANENT is ok
+ * here.
+ */
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
}
On Mon, Jul 10, 2017 at 10:39 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw@mail.gmail.com>
On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, ...
True, although maybe hash indexes are the only way that happens today?
No, I think it will happen for all other indexes as well. Basically,
we log new page for init forks of other indexes and then while
restoring that full page image, it will fall through the same path.
(Sorry, I didn't noticed that hash metapage is not using log_newpgae)
The bitmappage is also not using log_newpage.
For example, (bt|gin|gist|spgist|brin)buildempty are using
log_newpage to log INIT_FORK metapages. I believe that the xlog
flow from log_newpage to XLogReadBufferForRedoExtended is
developed so that pages in init forks are surely flushed during
recovery, so we should fix it instead adding other flushes if the
path doesn't work. Am I looking wrong place? (I think it is
working.)
You are looking at the right place.
If I'm understanding correctly here, hashbild and hashbuildempty
should be refactored using the same structure with other *build
and *buildempty functions. Specifically metapages initialization
subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
or...) does only on-memory initialization and does not emit WAL,
then *build and *buildempty emits WAL in their required way.
I have considered this way of doing it as well; see my first e-mail [1]
in this thread, "Another approach to fix the issue ....". It is not
that we can't change the code of hashbuildempty to make it log new
pages for all kinds of pages (metapage, bitmap page and data pages), but
I am not sure there is value in going in that direction. If we
have to go that way, we would need to hard-code some of the values
like hashm_nmaps and hashm_mapp in the metapage rather than initializing
them after bitmap page creation. Before going in that direction, I
think we should also take the opinion of other people, especially some
committer, as we might need to maintain two different ways of doing
almost the same thing.
I am also not 100% comfortable with adding a flush at two new places,
but that makes the code for the fix simpler, and fundamentally there is no
problem in doing so.
There is a small downside to always logging new pages, which is that it
will always log a full page image in WAL, and that has the potential to
increase WAL volume if there are many unlogged tables and indexes.
Considering that the init fork generally has very few pages, it is not
a big concern, but avoiding the full page image still has some value.
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.
Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
+ * NB: At present, this function may only be used on permanent relations and
+ * init fork of an unlogged relation, which is OK, because we only use it
+ * during XLOG replay. If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters.
*I* think the target of the funcion is permanent relations and
init forks, not unlogged relations. And I'd like to have an
additional comment about RELPERSISTENCE_PERMANENT. Something like
the attached.
Okay, let's leave this for the committer to decide.
[1]: /messages/by-id/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
On Mon, Jul 10, 2017 at 10:39 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw@mail.gmail.com>
On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, ...
True, although maybe hash indexes are the only way that happens today?
No, I think it will happen for all other indexes as well. Basically,
we log new page for init forks of other indexes and then while
restoring that full page image, it will fall through the same path.
(Sorry, I didn't notice that the hash metapage is not using log_newpage.)
The bitmappage is also not using log_newpage.
For example, (bt|gin|gist|spgist|brin)buildempty are using
log_newpage to log INIT_FORK metapages. I believe that the xlog
flow from log_newpage to XLogReadBufferForRedoExtended is
designed so that pages in init forks are surely flushed during
recovery, so we should fix it instead of adding other flushes if the
path doesn't work. Am I looking at the wrong place? (I think it is
working.)
You are looking at the right place.
If I'm understanding correctly here, hashbuild and hashbuildempty
should be refactored using the same structure as the other *build
and *buildempty functions. Specifically, the metapage initialization
subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
or...) do only in-memory initialization and do not emit WAL,
then *build and *buildempty emit WAL in their required way.
I have considered this way of doing it as well; read my first e-mail [1]
in this thread "Another approach to fix the issue ....". It is not
that we can't change the code of hashbuildempty to make it log new
pages for all kinds of pages (metapage, bitmappage and datapages), but
I am not sure there is value in going in that direction. If we
have to go that way, we need to hard-code some of the values
like hashm_nmaps and hashm_mapp in the metapage rather than initializing
them after bitmappage creation. Before going in that direction, I
think we should also take the opinion of other people, especially some
committer, as we might need to maintain two different ways of doing
almost the same thing.
Thanks for the explanation and the pointer (to the start of this
thread.. sorry).
I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.
I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.
There is a small downside to always logging the new page, which is that it
will always log a full page image in WAL, and that has the potential to
increase WAL volume if there are many unlogged tables and indexes.
Considering that the init fork generally has very few pages, it is not
a big concern, but avoiding the full page image still has some value.
Perhaps a more effective mechanism will be developed at some point.
By the way the comment of the function ReadBufferWithoutRelcache
has the following sentence.
| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.
and does
| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
| mode, strategy, &hit);
This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.
Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.
I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.
Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
+ * NB: At present, this function may only be used on permanent relations and
+ * init fork of an unlogged relation, which is OK, because we only use it
+ * during XLOG replay. If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters.
*I* think the target of the function is permanent relations and
init forks, not unlogged relations. And I'd like to have an
additional comment about RELPERSISTENCE_PERMANENT. Something like
the attached.
Okay, let's leave this for committer to decide.
Agreed, thanks.
[1] - /messages/by-id/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w@mail.gmail.com
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Jul 10, 2017 at 3:27 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hi,
At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.
I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.
I have modified the comment in hash_xlog_init_meta_page and a
corresponding function for bitmap page. However, I think adding
anything about not using log_newpage in _hash_init doesn't sound good
to me. I think you have suggested it so that we don't forget the
reason for not using log_newpage, but I think that is overkill. Let
me know if you have any other concerns?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_unlogged_hash_index_issue_v3.patchapplication/octet-stream; name=fix_unlogged_hash_index_issue_v3.patchDownload
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 429af11..f3eb87e 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -33,6 +33,7 @@ hash_xlog_init_meta_page(XLogReaderState *record)
XLogRecPtr lsn = record->EndRecPtr;
Page page;
Buffer metabuf;
+ ForkNumber forknum;
xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
@@ -44,6 +45,17 @@ hash_xlog_init_meta_page(XLogReaderState *record)
page = (Page) BufferGetPage(metabuf);
PageSetLSN(page, lsn);
MarkBufferDirty(metabuf);
+
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended. We need a
+ * special handling for init forks as create index operation doesn't log
+ * the full page image of metapage.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
+
/* all done */
UnlockReleaseBuffer(metabuf);
}
@@ -60,6 +72,7 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
Page page;
HashMetaPage metap;
uint32 num_buckets;
+ ForkNumber forknum;
xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
@@ -70,6 +83,16 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
PageSetLSN(BufferGetPage(bitmapbuf), lsn);
MarkBufferDirty(bitmapbuf);
+
+ /*
+ * Force the on-disk state of init forks to always be in sync with the
+ * state in shared buffers. See XLogReadBufferForRedoExtended. We need a
+ * special handling for init forks as create index operation doesn't log
+ * the full page image of bitmappage.
+ */
+ XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(bitmapbuf);
UnlockReleaseBuffer(bitmapbuf);
/* add the new bitmap page to the metapage's list of bitmaps */
@@ -90,6 +113,10 @@ hash_xlog_init_bitmap_page(XLogReaderState *record)
PageSetLSN(page, lsn);
MarkBufferDirty(metabuf);
+
+ XLogRecGetBlockTag(record, 1, NULL, &forknum, NULL);
+ if (forknum == INIT_FORKNUM)
+ FlushOneBuffer(metabuf);
}
if (BufferIsValid(metabuf))
UnlockReleaseBuffer(metabuf);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 1cb18a7..3b0ee50 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -345,6 +345,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
int32 ffactor;
uint32 num_buckets;
uint32 i;
+ bool use_wal;
/* safety check */
if (RelationGetNumberOfBlocksInFork(rel, forkNum) != 0)
@@ -352,6 +353,13 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
RelationGetRelationName(rel));
/*
+ * WAL log creation of pages if the relation is persistent, or this is the
+ * init fork. Init forks for unlogged relations always needs to be WAL
+ * logged.
+ */
+ use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM;
+
+ /*
* Determine the target fill factor (in tuples per bucket) for this index.
* The idea is to make the fill factor correspond to pages about as full
* as the user-settable fillfactor parameter says. We can compute it
@@ -384,7 +392,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
metap = HashPageGetMeta(pg);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (use_wal)
{
xl_hash_init_meta_page xlrec;
XLogRecPtr recptr;
@@ -427,11 +435,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
MarkBufferDirty(buf);
- log_newpage(&rel->rd_node,
- forkNum,
- blkno,
- BufferGetPage(buf),
- true);
+ if (use_wal)
+ log_newpage(&rel->rd_node,
+ forkNum,
+ blkno,
+ BufferGetPage(buf),
+ true);
_hash_relbuf(rel, buf);
}
@@ -459,7 +468,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
MarkBufferDirty(metabuf);
/* XLOG stuff */
- if (RelationNeedsWAL(rel))
+ if (use_wal)
{
xl_hash_init_bitmap_page xlrec;
XLogRecPtr recptr;
At Mon, 10 Jul 2017 18:37:34 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1KGK9pvW=Hn5yd51weEqWxGiKFC8HG8Bs1Ls1Pvfo5kwQ@mail.gmail.com>
On Mon, Jul 10, 2017 at 3:27 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hi,
At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.
I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.
I have modified the comment in hash_xlog_init_meta_page and a
corresponding function for bitmap page. However, I think adding
anything about not using log_newpage in _hash_init doesn't sound good
to me. I think you have suggested it so that we don't forget the
reason for not using log_newpage, but I think that is overkill. Let
me know if you have any other concerns?
The modified comment looks perfect. Thanks!
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
(catching up finally with this thread)
On Mon, Jul 10, 2017 at 11:57 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.
I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.
It seems to me that we should really consider logging a full page of
the bitmap and meta pages for init forks as this is the common
practice used by all the other AMs. I would go as far as spreading a
method similar to ginbuildempty() to all the AMs as delayed
checkpoints guarantee that buffers are correctly flushed when marked
as dirty.
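For readers not familiar with that pattern, a buildempty function in the style
Michael describes looks roughly like the paraphrase of ginbuildempty() below
(details trimmed; see src/backend/access/gin for the real thing): the init-fork
pages are set up in shared buffers and then logged wholesale with
log_newpage_buffer(), whose redo path takes care of flushing init-fork pages.

    void
    ginbuildempty(Relation index)
    {
        Buffer      MetaBuffer,
                    RootBuffer;

        /* An empty GIN index has two pages, both created in the init fork. */
        MetaBuffer =
            ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
        LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
        RootBuffer =
            ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
        LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);

        /* Initialize both pages and WAL-log them as full page images. */
        START_CRIT_SECTION();
        GinInitMetabuffer(MetaBuffer);
        MarkBufferDirty(MetaBuffer);
        log_newpage_buffer(MetaBuffer, true);
        GinInitBuffer(RootBuffer, GIN_LEAF);
        MarkBufferDirty(RootBuffer);
        log_newpage_buffer(RootBuffer, false);
        END_CRIT_SECTION();

        UnlockReleaseBuffer(MetaBuffer);
        UnlockReleaseBuffer(RootBuffer);
    }

The hash code cannot quite follow this shape today because the metapage
contents depend on the bitmap page created after it, which is the trade-off
Amit describes upthread.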
--
Michael
On Fri, Jul 14, 2017 at 6:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
(catching up finally with this thread)
On Mon, Jul 10, 2017 at 11:57 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.
I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.
It seems to me that we should really consider logging a full page of
the bitmap and meta pages for init forks as this is the common
practice used by all the other AMs.
I think if we really want to go in that direction then it is better to
write code separately for hashbuildempty, rather than adding special
purpose logic in _hash_init for init forks. As to the comparison
with other indexes, the logic of hash indexes is slightly tricky (as I
have already explained in an email upthread) as compared to other
indexes, so we should not force ourselves to do it that way if it can be
integrated with the logged index creation path. I am not against doing
it that way as it has some merit, but the advantage seems to be thin.
Let's not argue endlessly on this point, because it is not that I have
not considered doing it the way you are saying (in fact I have
mentioned that in my first e-mail). Let us wait for an opinion from
others and/or from committers.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 15, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 14, 2017 at 6:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
(catching up finally with this thread)
On Mon, Jul 10, 2017 at 11:57 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.
I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.
It seems to me that we should really consider logging a full page of
the bitmap and meta pages for init forks as this is the common
practice used by all the other AMs.
I think if we really want to go in that direction then it is better to
write code separately for hashbuildempty, rather than adding special
purpose logic in _hash_init for init forks. As to the comparison
with other indexes, the logic of hash indexes is slightly tricky (as I
have already explained in an email upthread) as compared to other
indexes, so we should not force ourselves to do it that way if it can be
integrated with the logged index creation path. I am not against doing
it that way as it has some merit, but the advantage seems to be thin.
Let's not argue endlessly on this point, because it is not that I have
not considered doing it the way you are saying (in fact I have
mentioned that in my first e-mail). Let us wait for an opinion from
others and/or from committers.
I do agree with Amit. I think hash index is slightly trickier (in
terms of how it is organised) than other indexes and that could be the
reason for maintaining common code for hashbuild and hashbuildempty.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
On Sat, Jul 15, 2017 at 6:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
I do agree with Amit. I think hash index is slightly trickier (in
terms of how it is organised) than other indexes and that could be the
reason for maintaining common code for hashbuild and hashbuildempty.
Well, you both and Robert worked more on this code for PG10 than I
did, so I am fine to rely on your judgement for the final result.
Still I find this special handling quite surprising. All other AMs
just always log FPWs for the init fork pages so I'd rather not break
this treaty, but that's one against the world as things stand
currently on this thread ;)
--
Michael
On Sat, Jul 15, 2017 at 4:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Sat, Jul 15, 2017 at 6:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
I do agree with Amit. I think hash index is slightly trickier (in
terms of how it is organised) than other indexes and that could be the
reason for maintaining common code for hashbuild and hashbuildempty.
Well, you both and Robert worked more on this code for PG10 than I
did, so I am fine to rely on your judgement for the final result.
Still I find this special handling quite surprising. All other AMs
just always log FPWs for the init fork pages so I'd rather not break
this treaty, but that's one against the world as things stand
currently on this thread ;)
It seems to me that this area might benefit from a broader rethink.
It's not very nice to impose a restriction that init forks can only be
constructed using log_newpage(); on the other hand, it is also not
very nice that Amit's patch needs to recapitulate the mysterious
incantation used by XLogReadBufferForRedoExtended in several more
places. The basic problem here is that it would be really easy for
the next person who adds an index AM to make the exact same mistake
that Amit made here and that I failed to spot during code review. It
would be nice to come up with some more generic solution to the
problem rather than relying on each AM to insert this
FlushOneBuffer() incantation in each place where it's needed. I think
there are probably several ways to do that; one idea might be to flush
all init-fork buffers in bulk at the end of recovery.
However, it doesn't really seem urgent to tinker with that; while I
think the fact that existing AMs use log_newpage() to populate the
init fork is mostly just luck, it might well be 10 or 20 years before
somebody adds another AM that happens to use any other technique.
Moreover, at the moment, we're trying to get a release out the door,
and to me that argues for keeping structural changes to a minimum.
Amit's patch seems like a pretty surgical fix to this problem with
minimal chance of breaking anything new, and that seems like the right
kind of fix for July. So I plan to commit what Amit proposed (with a
few grammar corrections).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 17, 2017 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that this area might benefit from a broader rethink.
It's not very nice to impose a restriction that init forks can only be
constructed using log_newpage(); on the other hand, it is also not
very nice that Amit's patch needs to recapitulate the mysterious
incantation used by XLogReadBufferForRedoExtended in several more
places. The basic problem here is that it would be really easy for
the next person who adds an index AM to make the exact same mistake
that Amit made here and that I failed to spot during code review. It
would be nice to come up with some more generic solution to the
problem rather than relying on each AM to insert this
FlushOneBuffer() incantation in each place where it's needed. I think
there are probably several ways to do that; one idea might be to flush
all init-fork buffers in bulk at the end of recovery.
Things are centralized in XLogReadBufferForRedoExtended() for init
fork flushes, which is not a bad thing in itself, as that is normally
the one place doing this work for the other AMs. And I have to admit that
the current coding for hash index WAL has the advantage of creating
less WAL, as the bitmap and metadata pages get initialized using
the data of the records themselves. One idea I can think of would be
to create a README in src/backend/access for index AMs that summarizes
a basic set of recommendations for each routine used.
However, it doesn't really seem urgent to tinker with that; while I
think the fact that existing AMs use log_newpage() to populate the
init fork is mostly just luck, it might well be 10 or 20 years before
somebody adds another AM that happens to use any other technique.
Moreover, at the moment, we're trying to get a release out the door,
and to me that argues for keeping structural changes to a minimum.
Amit's patch seems like a pretty surgical fix to this problem with
minimal chance of breaking anything new, and that seems like the right
kind of fix for July. So I plan to commit what Amit proposed (with a
few grammar corrections).
No objections to that.
--
Michael
On Mon, Jul 17, 2017 at 10:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jul 17, 2017 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that this area might benefit from a broader rethink.
It's not very nice to impose a restriction that init forks can only be
constructed using log_newpage(); on the other hand, it is also not
very nice that Amit's patch needs to recapitulate the mysterious
incantation used by XLogReadBufferForRedoExtended in several more
places. The basic problem here is that it would be really easy for
the next person who adds an index AM to make the exact same mistake
that Amit made here and that I failed to spot during code review. It
would be nice to come up with some more generic solution to the
problem rather than relying on each AM to insert this
FlushOneBuffer() incantation in each place where it's needed. I think
there are probably several ways to do that; one idea might be to flush
all init-fork buffers in bulk at the end of recovery.
Things are centralized in XLogReadBufferForRedoExtended() for init
fork flushes, which is not a bad thing in itself, as that is normally
the one place doing this work for the other AMs. And I have to admit that
the current coding for hash index WAL has the advantage of creating
less WAL, as the bitmap and metadata pages get initialized using
the data of the records themselves. One idea I can think of would be
to create a README in src/backend/access for index AMs that summarizes
a basic set of recommendations for each routine used.
We already have quite a decent amount of information about indexes in
our docs [1][2]. We might want to extend that as well.
However, it doesn't really seem urgent to tinker with that; while I
think the fact that existing AMs use log_newpage() to populate the
init fork is mostly just luck, it might well be 10 or 20 years before
somebody adds another AM that happens to use any other technique.
Moreover, at the moment, we're trying to get a release out the door,
and to me that argues for keeping structural changes to a minimum.
Amit's patch seems like a pretty surgical fix to this problem with
minimal chance of breaking anything new, and that seems like the right
kind of fix for July. So I plan to commit what Amit proposed (with a
few grammar corrections).
Thanks. Do you have any suggestion for back-branches? As of now, it
fails badly with the below kind of error:
test=> SELECT * FROM t_u_hash;
ERROR: could not open file "base/16384/16392": No such file or directory
It is explained in another thread [3] where it has been found that the
reason for such an error is that hash indexes are not WAL logged prior
to 10. Now, we can claim that we don't recommend hash indexes to be
used prior to 10 in production, so such an error is okay even if
no crash has happened in the system.
[1]: https://www.postgresql.org/docs/devel/static/indexam.html
[2]: https://www.postgresql.org/docs/devel/static/index-functions.html
[3]: /messages/by-id/20170630005634.GA4448@momjian.us
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 18, 2017 at 4:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. Do you have any suggestion for back-branches? As of now, it
fails badly with the below kind of error:
test=> SELECT * FROM t_u_hash;
ERROR: could not open file "base/16384/16392": No such file or directory
It is explained in another thread [3] where it has been found that the
reason for such an error is that hash indexes are not WAL logged prior
to 10. Now, we can claim that we don't recommend hash indexes to be
used prior to 10 in production, so such an error is okay even if
no crash has happened in the system.
There are a couple of approaches:
1) Marking such indexes as invalid at recovery and log information
about the switch done.
2) Error at creation of hash indexes on unlogged tables.
3) Leave it as-is, because there is already a WARNING at creation.
I don't mind seeing 3) per the amount of work done lately to support
WAL on hash indexes.
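For what it's worth, option 2 would be a change of the same shape as the
existing WARNING in DefineIndex() that is quoted later in this thread; the
sketch below is only an illustration of that option, not a patch anyone has
posted:

    /* Hypothetical shape of option 2 in DefineIndex(); illustration only. */
    if (strcmp(accessMethodName, "hash") == 0 &&
        rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("hash indexes on unlogged tables are not supported")));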
--
Michael
Hello,
Following up on a somewhat older thread.
At Tue, 18 Jul 2017 08:33:07 +0200, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSQDmz+PKewNN9r_7jC4WKf9f31Gkf=DzVGA3q+GsgJEQ@mail.gmail.com>
On Tue, Jul 18, 2017 at 4:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks. Do you have any suggestion for back-branches? As of now, it
fails badly with the below kind of error:
test=> SELECT * FROM t_u_hash;
ERROR: could not open file "base/16384/16392": No such file or directory
It is explained in another thread [3] where it has been found that the
reason for such an error is that hash indexes are not WAL logged prior
to 10. Now, we can claim that we don't recommend hash indexes to be
used prior to 10 in production, so such an error is okay even if
no crash has happened in the system.
There are a couple of approaches:
1) Marking such indexes as invalid at recovery and log information
about the switch done.
2) Error at creation of hash indexes on unlogged tables.
3) Leave it as-is, because there is already a WARNING at creation.
I don't mind seeing 3) per the amount of work done lately to support
WAL on hash indexes.
I overlooked that, but (3) is true as long as the table is
a *logged* one.
postgres=# create table test (id int primary key, v text);
postgres=# create index on test using hash (id);
WARNING: hash indexes are not WAL-logged and their use is discouraged
But not for unlogged tables.
postgres=# create unlogged table test (id int primary key, v text);
postgres=# create index on test using hash (id);
postgres=# (no warning)
And fails on promotion in the same way.
postgres=# select * from test;
ERROR: could not open file "base/13324/16446": No such file or directory
indexcmds.c@965:503
if (strcmp(accessMethodName, "hash") == 0 &&
RelationNeedsWAL(rel))
ereport(WARNING,
(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
Using !RelationUsesLocalBuffers instead fixes that, and the
attached patch is for 9.6. I'm a bit unsure about relying on the
logical meaning of the macro, but what it does fits there.
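For context, the two macros differ roughly as follows in
src/include/utils/rel.h (9.6 definitions, slightly abridged), which is why the
negated test also covers unlogged relations:

    #define RelationNeedsWAL(relation) \
        ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)

    #define RelationUsesLocalBuffers(relation) \
        ((relation)->rd_rel->relpersistence == RELPERSISTENCE_TEMP)

So RelationNeedsWAL() is false for an unlogged table, which is why the warning
is skipped there, while !RelationUsesLocalBuffers() is true for both permanent
and unlogged relations.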
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
hashidx_pre_96_is_not_recommended_for_unlogged.patchtext/x-patch; charset=us-asciiDownload
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
***************
*** 501,507 **** DefineIndex(Oid relationId,
amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
if (strcmp(accessMethodName, "hash") == 0 &&
! RelationNeedsWAL(rel))
ereport(WARNING,
(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
--- 501,507 ----
amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
if (strcmp(accessMethodName, "hash") == 0 &&
! !RelationUsesLocalBuffers(rel))
ereport(WARNING,
(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
On Thu, Sep 21, 2017 at 4:14 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
postgres=# create table test (id int primary key, v text);
postgres=# create index on test using hash (id);
WARNING: hash indexes are not WAL-logged and their use is discouraged
But not for unlogged tables.
postgres=# create unlogged table test (id int primary key, v text);
postgres=# create index on test using hash (id);
postgres=# (no warning)
And fails on promotion in the same way.
postgres=# select * from test;
ERROR: could not open file "base/13324/16446": No such file or directoryindexcmds.c@965:503
if (strcmp(accessMethodName, "hash") == 0 &&
RelationNeedsWAL(rel))
ereport(WARNING,
(errmsg("hash indexes are not WAL-logged and their use is discouraged")));Using !RelationUsesLocalBuffers instead fixes that and the
attached patch is for 9.6. I'm a bit unconfident on the usage of
logical meaning of the macro but what it does fits there.
I think giving an error message like "hash indexes are not WAL-logged
and .." for unlogged tables doesn't seem like a good behavior.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sep 21, 2017, at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think giving an error message like "hash indexes are not WAL-logged
and .." for unlogged tables doesn't seem like a good behavior.
+1. This seems like deliberate behavior, not a bug.
...Robert
At Thu, 21 Sep 2017 16:19:27 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <694CB417-EF2C-4760-863B-AEC4530C21E3@gmail.com>
On Sep 21, 2017, at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think giving an error message like "hash indexes are not WAL-logged
and .." for unlogged tables doesn't seem like a good behavior.+1. This seems like deliberate behavior, not a bug.
Though I don't see it as a bug and agree that the message is not
proper, currently we can create hash indexes on unlogged tables
without any warning, and that causes a problem with replication.
The point here is that if we leave the behavior (failure on the
standby) as it is, on the grounds that we see a warning at index
creation, then a similar message ought to be emitted for unlogged
tables as well. Otherwise, our decision amounts to another option:
(4) Even though we see no warning on hash index creation on
unlogged tables, it has been a problem all along and we don't
mind.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 21, 2017 at 8:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Though I don't see it as a bug and agree that the message is not
proper, currently we can create hash indexes on unlogged tables
without any warning, and that causes a problem with replication.
That's true, but I don't believe it's a sufficient reason to make a change.
Before 84aa8ba128a08e6fdebb2497c7a79ebf18093e12 (2014), we didn't
issue a warning about hash indexes in any case whatsoever; we relied
on people reading the documentation to find out about the limitations
of hash indexes. They can still do that in any cases that the warning
doesn't adequately cover. I really don't think it's worth kibitzing
the cases where this message is emitted in released branches, or the
text of the message, just as we didn't back-patch the message itself
into older releases that are still supported. We need a compelling
reason to change things in stable branches, and the fact that a
warning message added in 2014 doesn't cover every limitation of a
pre-1996 hash index implementation is not an emergency. Let's save
back-patching for actual bugs, or we'll forever be fiddling with
things in stable branches that would be better left alone.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Thu, 21 Sep 2017 20:35:01 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobXYq1ht8R76RTvun0pY85-=Oov8EY2Fv8nhNnM7Gdzxg@mail.gmail.com>
On Thu, Sep 21, 2017 at 8:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Though I don't see it as a bug and agree that the message is not
proper, currently we can create hash indexes on unlogged tables
without any warning, and that causes a problem with replication.
That's true, but I don't believe it's a sufficient reason to make a change.
Before 84aa8ba128a08e6fdebb2497c7a79ebf18093e12 (2014), we didn't
issue a warning about hash indexes in any case whatsoever; we relied
on people reading the documentation to find out about the limitations
of hash indexes. They can still do that in any cases that the warning
doesn't adequately cover. I really don't think it's worth kibitzing
the cases where this message is emitted in released branches, or the
text of the message, just as we didn't back-patch the message itself
into older releases that are still supported. We need a compelling
reason to change things in stable branches, and the fact that a
warning message added in 2014 doesn't cover every limitation of a
pre-1996 hash index implementation is not an emergency. Let's save
back-patching for actual bugs, or we'll forever be fiddling with
things in stable branches that would be better left alone.
Sorry for the noise, and thank you. I agree with that, now that I
know the reason is not precisely (3) (that we already have a
WARNING for the problematic operations).
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center