Improving replay of XLOG_BTREE_VACUUM records
Hi all.
There are situations in which vacuuming a big btree index causes WAL replay on hot standby servers to get stuck for quite a long time. I’ve described the problem in more detail in this thread [0]. Later in that thread Kevin Grittner proposed a good way to improve btree scans so that the btree vacuuming logic could be seriously simplified. Since I don’t know when that may happen, I’ve written a patch that makes some improvement right now. If Kevin or someone else expands [1] to handle all types of btree scans, I suppose my patch could be thrown away and the vacuuming logic rewritten from scratch.
The idea of the patch is to avoid reading pages from disk just to make sure they are unpinned (which is what btree_xlog_vacuum does right now). This is done by creating a new ReadBufferMode that returns a locked buffer without setting the BM_VALID flag on it. I don’t know if that is the right way of doing it, but it seems to work well.
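In short, the caller-visible contract looks like this (just a sketch; the actual change is in the attached patch):

    buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                    RBM_ZERO_NO_BM_VALID);

    /*
     * The buffer comes back pinned with its content lock held. If the
     * page was not in shared_buffers, no disk read happens: a zeroed
     * buffer without BM_VALID is returned, so no one else can pin it.
     */
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);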
Testing in my environment gives a good win [2]: green is the unpatched replica, blue is the replica with the patch, and the two spikes are the results of manually vacuuming a big table. Since the picture could become unavailable, I’ll note here that:
1. replication delay was reduced from ~1250 MB to 200 MB of replay_location lag,
2. the patched replica caught up with the master in less than a minute, versus 12 minutes for the unpatched replica,
3. physical I/O load on the patched replica didn’t change compared with the normal workload, while the unpatched replica did lots of reads from PGDATA during the spikes.
WAL replay still stalls, but for a much shorter time than it does now. This change also doesn’t seem to affect any other scenarios.
I’ll add it to the 2015-06 commitfest.
[0]: /messages/by-id/058C9D59-9200-45FD-A565-0E4431A6F1E3@simply.name
[1]: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
[2]: https://yadi.sk/i/l13PZUNhgNB8u
--
May the force be with you…
https://simply.name
Attachments:
btree_vacuum_v1.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2f85867..05d690e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -423,23 +423,18 @@ btree_xlog_vacuum(XLogReaderState *record)
for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
{
/*
- * We use RBM_NORMAL_NO_LOG mode because it's not an error
- * condition to see all-zero pages. The original btvacuumpage
- * scan would have skipped over all-zero pages, noting them in FSM
- * but not bothering to initialize them just yet; so we mustn't
- * throw an error here. (We could skip acquiring the cleanup lock
- * if PageIsNew, but it's probably not worth the cycles to test.)
- *
- * XXX we don't actually need to read the block, we just need to
- * confirm it is unpinned. If we had a special call into the
- * buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned.
+ * We use RBM_ZERO_NO_BM_VALID mode because we don't actually
+ * need to read the block, we just need to confirm it is unpinned.
+ * In this mode the buffer is returned if it is already in shared
+ * buffers. If not, a zeroed page is returned without the BM_VALID flag
+ * so that no one can pin it. In both cases the returned buffer
+ * is pinned and its content lock is held, so we just
+ * call UnlockReleaseBuffer on it.
*/
buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
- RBM_NORMAL_NO_LOG);
+ RBM_ZERO_NO_BM_VALID);
if (BufferIsValid(buffer))
{
- LockBufferForCleanup(buffer);
UnlockReleaseBuffer(buffer);
}
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index fa98b82..3c53c0e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -459,8 +459,11 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
{
if (buffer != InvalidBuffer)
{
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID)
+ {
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ }
ReleaseBuffer(buffer);
}
buffer = ReadBufferWithoutRelcache(rnode, forknum,
@@ -470,8 +473,11 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
/* Handle the corner case that P_NEW returns non-consecutive pages */
if (BufferGetBlockNumber(buffer) != blkno)
{
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID)
+ {
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ }
ReleaseBuffer(buffer);
buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
mode, NULL);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a68eae8..4ea00dc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -527,6 +527,9 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
* RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
* a cleanup-strength lock on the page.
*
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning the buffer, so that no one can pin it.
+ *
* RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
*
* If strategy is not NULL, a nondefault buffer access strategy is used.
@@ -674,7 +677,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
if (!isLocalBuf)
{
- if (mode == RBM_ZERO_AND_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_NO_BM_VALID)
LWLockAcquire(bufHdr->content_lock, LW_EXCLUSIVE);
else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
@@ -761,8 +764,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Read in the page, unless the caller intends to overwrite it and
* just wants us to allocate a buffer.
*/
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID)
+ {
MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
{
instr_time io_start,
@@ -813,8 +819,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* (Note that we cannot use LockBuffer() of LockBufferForCleanup() here,
* because they assert that the buffer is already valid.)
*/
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
+ if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID) && !isLocalBuf)
{
LWLockAcquire(bufHdr->content_lock, LW_EXCLUSIVE);
}
@@ -826,8 +832,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
else
{
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID);
+ /*
+ * Set BM_VALID (in all modes except RBM_ZERO_NO_BM_VALID),
+ * terminate IO, and wake up any waiters.
+ */
+ if (mode == RBM_ZERO_NO_BM_VALID)
+ TerminateBufferIO(bufHdr, false, 0);
+ else
+ TerminateBufferIO(bufHdr, false, BM_VALID);
}
VacuumPageMiss++;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..62c0348 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -40,6 +40,8 @@ typedef enum
* initialize. Also locks the page. */
RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page
* in "cleanup" mode */
+ RBM_ZERO_NO_BM_VALID, /* Like RBM_ZERO_AND_LOCK, but do not set BM_VALID
+ * bit before returning buffer */
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
On 5/1/15 11:19 AM, Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index causes WAL
replay on hot standby servers to get stuck for quite a long time. I’ve
described the problem in more detail in this thread [0]. Later in that
thread Kevin Grittner proposed a good way to improve btree scans so
that the btree vacuuming logic could be seriously simplified. Since I
don’t know when that may happen, I’ve written a patch that makes some
improvement right now. If Kevin or someone else expands [1] to handle
all types of btree scans, I suppose my patch could be thrown away and
the vacuuming logic rewritten from scratch.
This looks like a good way to address this until the more significant
work can be done.
I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
the code; I see the logic to NO_BM_VALID...
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that noone could pin it.
It would be better to explain why we want that mode. How about:
RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
BM_VALID before returning the buffer. This is done to ensure that no one
can pin the buffer without actually reading the buffer contents in. This
is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
+ if (mode == RBM_ZERO_NO_BM_VALID)
+ TerminateBufferIO(bufHdr, false, 0);
+ else
+ TerminateBufferIO(bufHdr, false, BM_VALID);
Simply passing in a 0 seems a bit odd to me; is there anywhere else we
do that?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi, Jim.
Thanks for review.
On May 2, 2015, at 2:10, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:
On 5/1/15 11:19 AM, Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index causes WAL
replay on hot standby servers to get stuck for quite a long time. I’ve
described the problem in more detail in this thread [0]. Later in that
thread Kevin Grittner proposed a good way to improve btree scans so
that the btree vacuuming logic could be seriously simplified. Since I
don’t know when that may happen, I’ve written a patch that makes some
improvement right now. If Kevin or someone else expands [1] to handle
all types of btree scans, I suppose my patch could be thrown away and
the vacuuming logic rewritten from scratch.
This looks like a good way to address this until the more significant work can be done.
I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID? or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on the code; I see the logic to NO_BM_VALID…
Perhaps RBM_ZERO_NO_BM_VALID is not so good (it makes it harder to grep for BM_VALID in the code), but I don’t actually like BM_INVALID or BM_NOT_VALID, sorry :( I also don’t insist on NO_BM_VALID; any other suggestions?
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that no one could pin it.
It would be better to explain why we want that mode. How about:
RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set BM_VALID before returning the buffer. This is done to ensure that no one can pin the buffer without actually reading the buffer contents in. This is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
Good point, fixed in attached patch.
+ if (mode == RBM_ZERO_NO_BM_VALID)
+ TerminateBufferIO(bufHdr, false, 0);
+ else
+ TerminateBufferIO(bufHdr, false, BM_VALID);
Simply passing in a 0 seems a bit odd to me; is there anywhere else we do that?
Yes, it is done the same way in the FlushBuffer function [0]. Also, the comments before TerminateBufferIO say that 0 is an expected value for the third argument.
[0]: http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/storage/buffer/bufmgr.c;h=a68eae81695f3f3b711d35d6c910e46b031f1cbc;hb=HEAD#l2426
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
May the force be with you…
https://simply.name
Attachments:
btree_vacuum_v2.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2f85867..50f2dd5 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -423,23 +423,18 @@ btree_xlog_vacuum(XLogReaderState *record)
for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
{
/*
- * We use RBM_NORMAL_NO_LOG mode because it's not an error
- * condition to see all-zero pages. The original btvacuumpage
- * scan would have skipped over all-zero pages, noting them in FSM
- * but not bothering to initialize them just yet; so we mustn't
- * throw an error here. (We could skip acquiring the cleanup lock
- * if PageIsNew, but it's probably not worth the cycles to test.)
- *
- * XXX we don't actually need to read the block, we just need to
- * confirm it is unpinned. If we had a special call into the
- * buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned.
+ * We use RBM_ZERO_NO_BM_VALID mode because we don't actually
+ * need to read the block, we just need to confirm it is unpinned.
+ * In this mode the buffer is returned if it is already in shared
+ * buffers. If not, a zeroed page is returned without the BM_VALID flag
+ * so that no one can pin it. In both cases the returned buffer
+ * is pinned and its content lock is held, so we just
+ * call UnlockReleaseBuffer on it.
*/
buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
- RBM_NORMAL_NO_LOG);
+ RBM_ZERO_NO_BM_VALID);
if (BufferIsValid(buffer))
{
- LockBufferForCleanup(buffer);
UnlockReleaseBuffer(buffer);
}
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index fa98b82..3c53c0e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -459,8 +459,11 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
{
if (buffer != InvalidBuffer)
{
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID)
+ {
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ }
ReleaseBuffer(buffer);
}
buffer = ReadBufferWithoutRelcache(rnode, forknum,
@@ -470,8 +473,11 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
/* Handle the corner case that P_NEW returns non-consecutive pages */
if (BufferGetBlockNumber(buffer) != blkno)
{
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID)
+ {
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ }
ReleaseBuffer(buffer);
buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
mode, NULL);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a68eae8..3c300c3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -527,6 +527,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
* RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
* a cleanup-strength lock on the page.
*
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * the BM_VALID bit before returning the buffer. This is done to ensure that
+ * no one can pin the buffer without actually reading its contents in. It is
+ * useful when the caller does not need to read the page contents (since this
+ * saves I/O) but is also not going to change the page contents.
+ *
* RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
*
* If strategy is not NULL, a nondefault buffer access strategy is used.
@@ -674,7 +680,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
if (!isLocalBuf)
{
- if (mode == RBM_ZERO_AND_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_NO_BM_VALID)
LWLockAcquire(bufHdr->content_lock, LW_EXCLUSIVE);
else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
@@ -761,8 +767,11 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Read in the page, unless the caller intends to overwrite it and
* just wants us to allocate a buffer.
*/
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID)
+ {
MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
{
instr_time io_start,
@@ -813,8 +822,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* (Note that we cannot use LockBuffer() of LockBufferForCleanup() here,
* because they assert that the buffer is already valid.)
*/
- if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
- !isLocalBuf)
+ if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+ mode == RBM_ZERO_NO_BM_VALID) && !isLocalBuf)
{
LWLockAcquire(bufHdr->content_lock, LW_EXCLUSIVE);
}
@@ -826,8 +835,14 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
else
{
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID);
+ /*
+ * Set BM_VALID (in all modes except RBM_ZERO_NO_BM_VALID),
+ * terminate IO, and wake up any waiters.
+ */
+ if (mode == RBM_ZERO_NO_BM_VALID)
+ TerminateBufferIO(bufHdr, false, 0);
+ else
+ TerminateBufferIO(bufHdr, false, BM_VALID);
}
VacuumPageMiss++;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..62c0348 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -40,6 +40,8 @@ typedef enum
* initialize. Also locks the page. */
RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page
* in "cleanup" mode */
+ RBM_ZERO_NO_BM_VALID, /* Like RBM_ZERO_AND_LOCK, but do not set BM_VALID
+ * bit before returning buffer */
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
On 05/02/2015 02:10 AM, Jim Nasby wrote:
This looks like a good way to address this until the more significant
work can be done.
I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
the code; I see the logic to NO_BM_VALID...
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that no one could pin it.
It would be better to explain why we want that mode. How about:
RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
BM_VALID before returning the buffer. This is done to ensure that no one
can pin the buffer without actually reading the buffer contents in. This
is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
That's a rather strange mode. The BM_VALID flag is internal to the
buffer manager - if you don't know how the buffer manager works, you
cannot understand that description. I couldn't quite understand what
that means myself. What can or can't you do with a buffer that's not
marked as BM_VALID? I'd suggest a mode like this instead:
In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer
cache already. If it's not, it is not read from disk, and InvalidBuffer
is returned instead.
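A caller could then do something like this (untested sketch of the proposed mode):

    buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                    RBM_IF_IN_CACHE);
    if (BufferIsValid(buffer))
    {
        /* The page is cached and might be pinned; wait until it's not. */
        LockBufferForCleanup(buffer);
        UnlockReleaseBuffer(buffer);
    }
    /* else the page wasn't in cache, so no one holds a pin on it */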
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 7/24/15 1:53 AM, Heikki Linnakangas wrote:
On 05/02/2015 02:10 AM, Jim Nasby wrote:
This looks like a good way to address this until the more significant
work can be done.
I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
the code; I see the logic to NO_BM_VALID...
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that no one could pin it.
It would be better to explain why we want that mode. How about:
RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
BM_VALID before returning the buffer. This is done to ensure that no one
can pin the buffer without actually reading the buffer contents in. This
is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
That's a rather strange mode. The BM_VALID flag is internal to the
buffer manager - if you don't know how the buffer manager works, you
cannot understand that description. I couldn't quite understand what
that means myself. What can or can't you do with a buffer that's not
marked as BM_VALID? I'd suggest a mode like this instead:
In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer
cache already. If it's not, it is not read from disk, and InvalidBuffer
is returned instead.
+1, though I think it's still worth mentioning why we're doing this (so
we know no one else can pin the buffer).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
On 05/02/2015 02:10 AM, Jim Nasby wrote:
This looks like a good way to address this until the more significant
work can be done.
I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
the code; I see the logic to NO_BM_VALID...
+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that no one could pin it.
It would be better to explain why we want that mode. How about:
RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
BM_VALID before returning the buffer. This is done to ensure that no one
can pin the buffer without actually reading the buffer contents in. This
is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
That's a rather strange mode. The BM_VALID flag is internal to the buffer
manager - if you don't know how the buffer manager works, you cannot
understand that description. I couldn't quite understand what that means
myself. What can or can't you do with a buffer that's not marked as
BM_VALID? I'd suggest a mode like this instead:
In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer cache
already. If it's not, it is not read from disk, and InvalidBuffer is
returned instead.
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like
bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}
and then fall back to the normal ReadBuffer() when it's in cache.
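In btree_xlog_vacuum() the per-block loop body could then become something like (again, only a sketch):

    if (!BufferInCache(thisrnode, MAIN_FORKNUM, blkno))
        continue;           /* not in shared_buffers, hence not pinned */

    /* It is cached, so do the full read and take the cleanup lock. */
    buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                    RBM_NORMAL_NO_LOG);
    if (BufferIsValid(buffer))
    {
        LockBufferForCleanup(buffer);
        UnlockReleaseBuffer(buffer);
    }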
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like
bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}
and then fall back to the normal ReadBuffer() when it's in cache.
Patch marked as returned with feedback, as input from the author has
been awaited for some time now.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Aug 25, 2015, at 16:03, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like
bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}
and then fall back to the normal ReadBuffer() when it's in cache.
Yep, sounds good. Patch with implementation attached.
Patch marked as returned with feedback, as input from the author has
been awaited for some time now.
Sorry for the delay, I will link it to the current commitfest.
--
Michael
--
May the force be with you…
https://simply.name
Attachments:
btree_vacuum_v3.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index da28e21..33fc022 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -423,6 +423,15 @@ btree_xlog_vacuum(XLogReaderState *record)
for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
{
/*
+ * First we check whether the page is already in the buffer cache.
+ * If not, it is definitely unpinned. If it is there, we read it
+ * below to take the cleanup lock.
+ *
+ */
+ if (!BufferInCache(thisrnode, MAIN_FORKNUM, blkno))
+ continue;
+
+ /*
* We use RBM_NORMAL_NO_LOG mode because it's not an error
* condition to see all-zero pages. The original btvacuumpage
* scan would have skipped over all-zero pages, noting them in FSM
@@ -430,10 +439,6 @@ btree_xlog_vacuum(XLogReaderState *record)
* throw an error here. (We could skip acquiring the cleanup lock
* if PageIsNew, but it's probably not worth the cycles to test.)
*
- * XXX we don't actually need to read the block, we just need to
- * confirm it is unpinned. If we had a special call into the
- * buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned.
*/
buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
RBM_NORMAL_NO_LOG);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd3aaad..eb74da3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3640,3 +3640,34 @@ rnode_comparator(const void *p1, const void *p2)
else
return 0;
}
+
+/*
+ * BufferInCache -- returns true if buffer is already in buffer cache and
+ * false otherwise.
+ *
+ * No locks are held either at entry or exit.
+ */
+bool
+BufferInCache(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag Tag; /* identity of requested block */
+ uint32 Hash; /* hash value for Tag */
+ LWLock *PartitionLock; /* buffer partition lock for it */
+ int buf_id;
+
+ SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(Tag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ Hash = BufTableHashCode(&Tag);
+ PartitionLock = BufMappingPartitionLock(Hash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(PartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&Tag, Hash);
+ LWLockRelease(PartitionLock);
+
+ return buf_id != -1;
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..ced564f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,8 @@ extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+extern bool BufferInCache(RelFileNode rnode, ForkNumber forkNum,
+ BlockNumber blockNum);
extern void InitBufferPool(void);
extern void InitBufferPoolAccess(void);
On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like
bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}
and then fall back to the normal ReadBuffer() when it's in cache.
I guess I'm missing something, but how is this API useful? As soon as
it's returned, the result might be invalid, since before you actually
make use of the return value another process could come and read in
and pin the page in question. Is this part of some interlock where you
only have to know it was unpinned at some point in time since some
other event?
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-09-02 22:46:59 +0100, Greg Stark wrote:
On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like
bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}
and then fall back to the normal ReadBuffer() when it's in cache.
I guess I'm missing something, but how is this API useful? As soon as
it's returned, the result might be invalid, since before you actually
make use of the return value another process could come and read in
and pin the page in question. Is this part of some interlock where you
only have to know it was unpinned at some point in time since some
other event?
Yes. We're talking about this block:
for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
{
    /*
     * We use RBM_NORMAL_NO_LOG mode because it's not an error
     * condition to see all-zero pages. The original btvacuumpage
     * scan would have skipped over all-zero pages, noting them in FSM
     * but not bothering to initialize them just yet; so we mustn't
     * throw an error here. (We could skip acquiring the cleanup lock
     * if PageIsNew, but it's probably not worth the cycles to test.)
     *
     * XXX we don't actually need to read the block, we just need to
     * confirm it is unpinned. If we had a special call into the
     * buffer manager we could optimise this so that if the block is
     * not in shared_buffers we confirm it as unpinned.
     */
    buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                    RBM_NORMAL_NO_LOG);
    if (BufferIsValid(buffer))
    {
        LockBufferForCleanup(buffer);
        UnlockReleaseBuffer(buffer);
    }
}
}
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 3, 2015 at 6:10 AM, Vladimir Borodin wrote:
Patch with implementation attached.
Sorry for the delay, I will link it to the current commitfest.
This patch looks correct and does what it claims to do. It is
also not really worth refactoring the bit of code in PrefetchBuffer
that does a similar operation, so I am marking it as ready for
committer.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index causes WAL
replay on hot standby servers to get stuck for quite a long time. I’ve
described the problem in more detail in this thread [0]. Later in
that thread Kevin Grittner proposed a good way to improve btree
scans so that the btree vacuuming logic could be seriously simplified.
Since I don’t know when that may happen, I’ve written a patch that
makes some improvement right now. If Kevin or someone else expands [1]
to handle all types of btree scans, I suppose my patch could be
thrown away and the vacuuming logic rewritten from scratch.
You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.
Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
/messages/by-id/CANP8+jJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA@mail.gmail.com
(Note there's an updated patch a few emails down the thread.)
If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon's.
Thanks!
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index causes WAL
replay on hot standby servers to get stuck for quite a long time. I’ve
described the problem in more detail in this thread [0]. Later in
that thread Kevin Grittner proposed a good way to improve btree
scans so that the btree vacuuming logic could be seriously simplified.
Since I don’t know when that may happen, I’ve written a patch that
makes some improvement right now. If Kevin or someone else expands [1]
to handle all types of btree scans, I suppose my patch could be
thrown away and the vacuuming logic rewritten from scratch.
You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.
Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
/messages/by-id/CANP8+jJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA@mail.gmail.com
(Note there's an updated patch a few emails down the thread.)
If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon's.
Simon's patch (justly) does not update lastBlockVacuumed in the case
of toast indexes, but Vladimir's patch would still optimize this case,
no?
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Jan 7, 2016, at 5:26, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index causes WAL
replay on hot standby servers to get stuck for quite a long time. I’ve
described the problem in more detail in this thread [0]. Later in
that thread Kevin Grittner proposed a good way to improve btree
scans so that the btree vacuuming logic could be seriously simplified.
Since I don’t know when that may happen, I’ve written a patch that
makes some improvement right now. If Kevin or someone else expands [1]
to handle all types of btree scans, I suppose my patch could be
thrown away and the vacuuming logic rewritten from scratch.
You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.
Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
/messages/by-id/CANP8+jJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA@mail.gmail.com
(Note there's an updated patch a few emails down the thread.)
If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon’s.
Ok, I’ll try this patch with my use case. Basically, it’s not so easy now since I’ve partitioned that big table to avoid such problems, but there is a way to reproduce it once again. If it helps, I agree that mine should be rejected in favor of Simon’s patch, because my patch just reduces replication lag while Simon’s seems to remove the lag entirely.
Simon's patch (justly) does not update lastBlockVacuumed in the case
of toast indexes, but Vladimir's patch would still optimize this case,
no?
I suppose you mean in the case of _non_-toast indexes. But yes, it seems my patch should help in that case too.
--
Michael
--
May the force be with you…
https://simply.name
Vladimir Borodin wrote:
On Jan 7, 2016, at 5:26, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
/messages/by-id/CANP8+jJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA@mail.gmail.com
(Note there's an updated patch a few emails down the thread.)
If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon’s.
Ok, I’ll try this patch with my use case. Basically, it’s not so easy
now since I’ve partitioned that big table to avoid such problems,
but there is a way to reproduce it once again. If it helps, I agree
that mine should be rejected in favor of Simon’s patch, because my
patch just reduces replication lag while Simon’s seems to remove the
lag entirely.
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:
On Jan 7, 2016, at 5:26, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
/messages/by-id/CANP8+jJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA@mail.gmail.com
(Note there's an updated patch a few emails down the thread.)
If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon’s.
Ok, I’ll try this patch with my use case. Basically, it’s not so easy
now since I’ve partitioned that big table to avoid such problems,
but there is a way to reproduce it once again. If it helps, I agree
that mine should be rejected in favor of Simon’s patch, because my
patch just reduces replication lag while Simon’s seems to remove the
lag entirely.
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
The exclusion of toast indexes is something we can remove also, I have
recently discovered. When we access toast data we ignore MVCC, but we still
have the toast pointer and chunkid to use for rechecking our scan results.
So a later patch will add some rechecks.
So I don't think it is worth applying this patch now. I should add that I
also had a patch that did this, posted earlier IIRC. That is not the reason
to reject this, just me pointing out that I'm effectively rejecting my own
earlier patch also.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs wrote:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
The exclusion of toast indexes is something we can remove also, I have
recently discovered. When we access toast data we ignore MVCC, but we still
have the toast pointer and chunkid to use for rechecking our scan results.
So a later patch will add some rechecks.
Ah, interesting, glad to hear. I take it you're pushing your patch
soon, then?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 1/8/16 9:34 AM, Alvaro Herrera wrote:
Simon Riggs wrote:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
The exclusion of toast indexes is something we can remove also, I have
recently discovered. When we access toast data we ignore MVCC, but we still
have the toast pointer and chunkid to use for rechecking our scan results.
So a later patch will add some rechecks.
Ah, interesting, glad to hear. I take it you're pushing your patch
soon, then?
ISTM that this patch should be "returned with feedback" or "rejected"
based on the thread. I'm marking it "waiting for author" for the time
being.
Thanks,
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
On 1/8/16 9:34 AM, Alvaro Herrera wrote:
Simon Riggs wrote:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
The exclusion of toast indexes is something we can remove also, I have
recently discovered. When we access toast data we ignore MVCC, but we
still have the toast pointer and chunkid to use for rechecking our
scan results. So a later patch will add some rechecks.
Ah, interesting, glad to hear. I take it you're pushing your patch
soon, then?
ISTM that this patch should be "returned with feedback" or "rejected"
based on the thread. I'm marking it "waiting for author" for the time
being.
I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10 March 2016 at 06:27, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
On 1/8/16 9:34 AM, Alvaro Herrera wrote:
Simon Riggs wrote:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
The exclusion of toast indexes is something we can remove also, I have
recently discovered. When we access toast data we ignore MVCC, but we
still have the toast pointer and chunkid to use for rechecking our
scan results. So a later patch will add some rechecks.
Ah, interesting, glad to hear. I take it you're pushing your patch
soon, then?
ISTM that this patch should be "returned with feedback" or "rejected"
based on the thread. I'm marking it "waiting for author" for the time
being.
I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?
Yes, I have done the research, so I think the patch should be rejected now.
Thanks to everyone for their input. It's good to have alternate approaches.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mar 10, 2016, at 11:50, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 March 2016 at 06:27, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
On 1/8/16 9:34 AM, Alvaro Herrera wrote:
Simon Riggs wrote:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
I would agree except for the observation on toast indexes. I think
that's an important enough use case that perhaps we should have both.
The exclusion of toast indexes is something we can remove also, I have
recently discovered. When we access toast data we ignore MVCC, but we
still have the toast pointer and chunkid to use for rechecking our
scan results. So a later patch will add some rechecks.
Ah, interesting, glad to hear. I take it you're pushing your patch
soon, then?
ISTM that this patch should be "returned with feedback" or "rejected"
based on the thread. I'm marking it "waiting for author" for the time
being.
I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?
Yes, I have done the research, so I think the patch should be rejected now.
Let’s do it immediately after you send a new version of your patch? Or even better, after testing your patch? Don’t get me wrong, but rejecting my patch without tangible work on your patch may lead to forgetting about the problem before the 9.6 freeze.
Thanks to everyone for their input. It's good to have alternate approaches.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
May the force be with you…
https://simply.name
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
Let’s do it immediately after you send a new version of your patch? Or
even better, after testing your patch? Don’t get me wrong, but rejecting my
patch without tangible work on your patch may lead to forgetting about the
problem before the 9.6 freeze.
This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name>
wrote:
Let’s do it immediately after you send a new version of your patch? Or
even better, after testing your patch? Don’t get me wrong, but rejecting my
patch without tangible work on your patch may lead to forgetting about the
problem before the 9.6 freeze.
This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.
I attach 2 patches.
avoid_pin_scan_always.v1.patch
Takes the approach that we generate the same WAL records as in 9.5; we just
choose not to do anything with that information. This is possible because
we don't care anymore whether it is toast or other relations. So it
effectively reverts parts of the earlier patch.
This could be back-patched easily.
toast_recheck.v1.patch
Adds recheck code for toast access. I'm not certain this is necessary, but
here it is. No problems found with it.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
avoid_pin_scan_always.v1.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3c70394..067d15c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -525,8 +525,12 @@ MVCC scans is not required on standby nodes. That is because
HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
ever used during write transactions, which cannot exist on the standby.
-This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast(), so
-HeapTupleSatisfiesToast() is the only non-MVCC scan type used on standbys.
+This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast().
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visible (live) tuple the corresponding toast rows
+will also be visible, so we do not need to recheck MVCC on them.
There is one minor exception, which is that the optimizer sometimes
looks at the boundaries of value ranges using SnapshotDirty, which
could result in returning a newer value for query statistics; this
@@ -536,13 +540,6 @@ in the index, so the scan retrieves a tid then immediately uses it
to look in the heap. It is unlikely that the tid could have been
deleted, vacuumed and re-inserted in the time taken to look in the heap
via direct tid access. So we ignore that scan type as a problem.
-This means if we re-check the results of any scan of a toast index we
-will be able to completely avoid performing the "pin scan" operation
-during replay of VACUUM WAL records.
-
-XXX FIXME: Toast re-checks are not yet added, so we still perform the
-pin scan when replaying vacuum records of toast indexes.
-
Other Things That Are Handy to Know
-----------------------------------
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..bf8ade3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,7 +22,6 @@
#include "access/relscan.h"
#include "access/xlog.h"
#include "catalog/index.h"
-#include "catalog/pg_namespace.h"
#include "commands/vacuum.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
@@ -833,8 +832,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/*
* Check to see if we need to issue one final WAL record for this index,
* which may be needed for correctness on a hot standby node when
- * non-MVCC index scans could take place. This now only occurs when we
- * perform a TOAST scan, so only occurs for TOAST indexes.
+ * non-MVCC index scans could take place.
*
* If the WAL is replayed in hot standby, the replay process needs to get
* cleanup locks on all index leaf pages, just as we've been doing here.
@@ -846,7 +844,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* against the last leaf page in the index, if that one wasn't vacuumed.
*/
if (XLogStandbyInfoActive() &&
- rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE &&
vstate.lastBlockVacuumed < vstate.lastBlockLocked)
{
Buffer buf;
@@ -1045,25 +1042,14 @@ restart:
*/
if (ndeletable > 0)
{
- BlockNumber lastBlockVacuumed = InvalidBlockNumber;
-
- /*
- * We may need to record the lastBlockVacuumed for use when
- * non-MVCC scans might be performed on the index on a
- * hot standby. See explanation in btree_xlog_vacuum().
- *
- * On a hot standby, a non-MVCC scan can only take place
- * when we access a Toast Index, so we need only record
- * the lastBlockVacuumed if we are vacuuming a Toast Index.
- */
- if (rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE)
- lastBlockVacuumed = vstate->lastBlockVacuumed;
-
/*
- * Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
- * instruction to the replay code to get cleanup lock on all pages
- * between the previous lastBlockVacuumed and this page. This
- * ensures that WAL replay locks all leaf pages at some point.
+ * Notice that the issued XLOG_BTREE_VACUUM WAL record includes all
+ * information to the replay code to allow it to get a cleanup lock
+ * on all pages between the previous lastBlockVacuumed and this page.
+ * This ensures that WAL replay locks all leaf pages at some point,
+ * which is important should non-MVCC scans be requested.
+ * This is currently unused on standby, but we record it anyway, so
+ * that the WAL contains the required information.
*
* Since we can visit leaf pages out-of-order when recursing,
* replay might end up locking such pages an extra time, but it
@@ -1071,7 +1057,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
- lastBlockVacuumed);
+ vstate->lastBlockVacuumed);
/*
* Remember highest leaf page number we've issued a
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0d094ca..85fdd25 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -390,12 +390,16 @@ btree_xlog_vacuum(XLogReaderState *record)
Page page;
BTPageOpaque opaque;
+#ifdef UNUSED
/*
+ * This section of code is thought to be no longer needed, after
+ * analysis of the calling paths. It is retained to allow the code
+ * to be reinstated if a flaw is revealed in that thinking.
+ *
* If we are running non-MVCC scans using this index we need to do some
* additional work to ensure correctness, which is known as a "pin scan"
* described in more detail in next paragraphs. We used to do the extra
- * work in all cases, whereas we now avoid that work except when the index
- * is a toast index, since toast scans aren't fully MVCC compliant.
+ * work in all cases, whereas we now avoid that work in most cases.
* If lastBlockVacuumed is set to InvalidBlockNumber then we skip the
* additional work required for the pin scan.
*
@@ -458,6 +462,7 @@ btree_xlog_vacuum(XLogReaderState *record)
}
}
}
+#endif
/*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
toast_recheck.v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 4cffd21..93fcd7b 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -70,6 +70,7 @@ static void toast_delete_datum(Relation rel, Datum value);
static Datum toast_save_datum(Relation rel, Datum value,
struct varlena * oldexternal, int options);
static bool toastrel_valueid_exists(Relation toastrel, Oid valueid);
+static void toast_valueid_check_tuple(Relation toastrel, HeapTuple ttup, Oid valueid);
static bool toastid_valueid_exists(Oid toastrelid, Oid valueid);
static struct varlena *toast_fetch_datum(struct varlena * attr);
static struct varlena *toast_fetch_datum_slice(struct varlena * attr,
@@ -1700,6 +1701,8 @@ toast_delete_datum(Relation rel, Datum value)
SnapshotToast, 1, &toastkey);
while ((toasttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
+ toast_valueid_check_tuple(toastrel, toasttup, toast_pointer.va_valueid);
+
/*
* Have a chunk, delete it
*/
@@ -1730,6 +1733,7 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)
int num_indexes;
int validIndex;
Relation *toastidxs;
+ HeapTuple ttup;
/* Fetch a valid index relation */
validIndex = toast_open_indexes(toastrel,
@@ -1752,8 +1756,12 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)
RelationGetRelid(toastidxs[validIndex]),
true, SnapshotToast, 1, &toastkey);
- if (systable_getnext(toastscan) != NULL)
+ ttup = systable_getnext(toastscan);
+ if (ttup != NULL)
+ {
+ toast_valueid_check_tuple(toastrel, ttup, valueid);
result = true;
+ }
systable_endscan(toastscan);
@@ -1763,6 +1771,21 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)
return result;
}
+static void
+toast_valueid_check_tuple(Relation toastrel, HeapTuple ttup, Oid valueid)
+{
+ Oid checkOid;
+ bool isnull;
+ TupleDesc toasttupDesc = toastrel->rd_att;
+
+ checkOid = DatumGetObjectId(fastgetattr(ttup, 1, toasttupDesc, &isnull));
+ Assert(!isnull);
+
+ if (checkOid != valueid)
+ elog(ERROR, "toasted value has unexpected Oid (searched for %u, found %u)",
+ valueid, checkOid);
+}
+
/* ----------
* toastid_valueid_exists -
*
@@ -1863,6 +1886,8 @@ toast_fetch_datum(struct varlena * attr)
SnapshotToast, 1, &toastkey);
while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
+ toast_valueid_check_tuple(toastrel, ttup, toast_pointer.va_valueid);
+
/*
* Have a chunk, extract the sequence number and the data
*/
@@ -2087,6 +2112,8 @@ toast_fetch_datum_slice(struct varlena * attr, int32 sliceoffset, int32 length)
SnapshotToast, nscankeys, toastkey);
while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
+ toast_valueid_check_tuple(toastrel, ttup, toast_pointer.va_valueid);
+
/*
* Have a chunk, extract the sequence number and the data
*/
On 10 March 2016 at 11:38, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name>
wrote:
Let’s do it immediately after you send a new version of your patch? Or
even better, after testing your patch? Don’t get me wrong, but rejecting my
patch without tangible work on your patch may lead to forgetting about
the problem before the 9.6 freeze.
This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.
I attach 2 patches.
I'll wait for review, so setting status back to Needs Review.
Assuming review, I will commit the week w/c 21 March.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mar 10, 2016, at 14:38, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
Let’s do it immediately after you send a new version of your patch? Or
even better, after testing your patch? Don’t get me wrong, but rejecting my
patch without tangible work on your patch may lead to forgetting about the
problem before the 9.6 freeze.
This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.
I attach 2 patches.
avoid_pin_scan_always.v1.patch
Takes the approach that we generate the same WAL records as in 9.5; we just choose not to do anything with that information. This is possible because we don't care anymore whether it is toast or other relations. So it effectively reverts parts of the earlier patch.
This could be back-patched easily.
toast_recheck.v1.patch
Adds recheck code for toast access. I'm not certain this is necessary, but here it is. No problems found with it.
JFYI, I’m preparing a test environment to reproduce the initial problem and I hope to finish testing this week.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
May the force be with you…
https://simply.name
Hi Vladimir,
On 3/14/16 2:15 PM, Vladimir Borodin wrote:
JFYI, I’m preparing a test environment to reproduce the initial problem and I
hope to finish testing this week.
Do you know when you'll have the results from the testing you were going
to do? It seems this patch is currently waiting on that to be finished.
Thanks,
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mar 25, 2016, at 19:11, David Steele <david@pgmasters.net> wrote:
Hi Vladimir,
On 3/14/16 2:15 PM, Vladimir Borodin wrote:
JFYI, I’m preparing a test environment to reproduce the initial problem and I
hope to finish testing this week.
Do you know when you'll have the results from the testing you were going to do? It seems this patch is currently waiting on that to be finished.
I couldn’t reproduce the problem on a pgbench database with scale factor 50000 last week. The test case was quite simple:
1. On master I was adding data to pgbench_accounts table.
2. On standby I was doing the following:
postgres@pgload01d ~ $ cat /tmp/query
\set naccounts 100000 * :scale
SELECT aid FROM pgbench_accounts WHERE aid = :naccounts;
postgres@pgload01d ~ $ /usr/pgsql-9.6/bin/pgbench -M prepared -f /tmp/query -c 1 -j 1 -T 3600 -P 10 -S -n pgbench
3. On master I was sometimes calling VACUUM pgbench_accounts.
Without applying the patches there weren’t huge replication lags on standbys. It seems I’m doing something wrong… I’m doing my best right now to find the reason, but I can’t give you any time estimate :(
Thanks,
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
May the force be with you…
https://simply.name