optimizing vacuum truncation scans
After a large bulk load aborted near the end, I decided to vacuum the main
table so as not to leave a huge chunk of free space in the middle of it
before re-running the bulk load. This vacuum took a frustratingly long
time, as the backwards scan over the table to truncate the space did not
trigger read-ahead (or read-behind) in the file system, which was ext4.
In hindsight, the best thing to do would have been to open a psql session
and take an access-share lock on the main table, so that it would prevent
the truncation from occurring, leaving cleaned, reusable space at the end
of the table to be refilled on the next (successful) run of the bulk load.
But it would be nice if the system could do some of this optimization for
me, as hindsight never comes in time.
I did literally the simplest thing I could think of as a proof of concept
patch, to see if it would actually fix things. I just jumped back a
certain number of blocks occasionally and prefetched them forward, then
resumed the regular backward scan. The patch and driving script are
attached.
It was very successful in this test. I've reported the fastest time to do
the truncation for each setting of the prefetch size, out of about 10 runs
each, after removing a handful of cases where buffer pins prevented a
full-size truncation. (I used the minimum because the server has other
duties that would occasionally kick in and cause a spike in time for a few
runs.) A setting of 0 disables the feature.
JJFETCH   trunc_time(s)
      0          355.78
      1          350.5
      2          277.61
      4          213.9
      8          172.79
     16          138.87
     32          133.02
     64           76.25
    128           47.63
    256           35.28
    512           20.44
   1024           18.6
   2048           15.63
   4096           15.34
   8192           14.72
  16384           13.45
  32768           13.13
For context, the forward-scan part of this vacuum took about 80 sec (IQR 75
to 88).
Do we want something like this at all? If so, what would we have to do to
implement it for real? I thought that maybe the back-off amount could
start at one block and double each time, until it hits a maximum. And that
the maximum would be either seg_size, or 1/8 of effective_cache_size,
whichever is lower. (I don't think we would want to prefetch more than a
fraction of effective_cache_size, or else it could have already been
displaced by the time the scan gets back to it).
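For concreteness, here is a minimal sketch of what that doubling back-off
might look like (the helper name, the in/out window parameter, and the exact
cap are illustrative assumptions, not code from the attached patch):

/*
 * Sketch only: prefetch the window of blocks just below blkno during the
 * backward truncation scan, doubling the window each time up to a cap
 * such as Min(RELSEG_SIZE, effective_cache_size / 8).
 */
static void
truncate_scan_prefetch(Relation onerel, BlockNumber blkno,
                       BlockNumber *window, BlockNumber cap)
{
    if (*window > 0 && blkno >= *window && (blkno % *window) == 0)
    {
        BlockNumber p;

        /* Issue the prefetches in ascending order so the kernel sees a
         * forward sequential run. */
        for (p = blkno - *window; p < blkno; p++)
            PrefetchBuffer(onerel, MAIN_FORKNUM, p);

        if (*window < cap)
            *window = Min(*window * 2, cap);
    }
}

The caller would start with *window = 1 and invoke this once per block of
the backward scan, just before the ReadBufferExtended() call.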
Also, what would be the best way to drill a hole through bufmgr.c into md.c
so that the prefetch could specify an entire range, rather than looping
over each individual block?
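As a strawman, the bufmgr-facing side might be no more than a range wrapper
whose (blockNum, nblocks) pair could later be pushed down through the smgr
layer; PrefetchBufferRange() does not exist today, this is only a sketch of
the shape of the hole:

/*
 * Hypothetical bufmgr entry point.  For now it just loops, but the range
 * form is what would let the (blockNum, nblocks) pair travel through
 * smgrprefetch() into md.c, which could then issue one advisory call per
 * contiguous run instead of one per block.
 */
void
PrefetchBufferRange(Relation reln, ForkNumber forkNum,
                    BlockNumber blockNum, int nblocks)
{
    int         i;

    for (i = 0; i < nblocks; i++)
        PrefetchBuffer(reln, forkNum, blockNum + i);
}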
What would have to be done to detect people running on SSD and disable the
feature, if anything?
I'll add this to the next commitfest as a WIP patch.
Cheers,
Jeff
Attachments:
jjprefetch.patch (application/octet-stream)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index cd5ca4c..215f53c
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 97,102 ****
--- 97,104 ----
*/
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)
+ int JJFETCH=0;
+
typedef struct LVRelStats
{
/* hasindex = true means two-pass strategy; false means one-pass */
*************** count_nondeletable_pages(Relation onerel
*** 1581,1586 ****
--- 1583,1594 ----
* keep the number of system calls and actual shared lock table
* lookups to a minimum.
*/
+ if (JJFETCH && (blkno % JJFETCH) == 0 && blkno >=JJFETCH)
+ {
+ int i;
+ for (i=-JJFETCH; i<0; i++)
+ PrefetchBuffer(onerel, MAIN_FORKNUM,blkno+i);
+ }
if ((blkno % 32) == 0)
{
instr_time currenttime;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index b8a0f9f..df25c47
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** extern char *default_tablespace;
*** 111,116 ****
--- 111,117 ----
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+ extern int JJFETCH;
#ifdef TRACE_SORT
extern bool trace_sort;
*************** static struct config_bool ConfigureNames
*** 1637,1642 ****
--- 1638,1652 ----
static struct config_int ConfigureNamesInt[] =
{
{
+ {"JJFETCH", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Do forward read prefetching during truncation"),
+ NULL,
+ },
+ &JJFETCH,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+ {
{"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING,
gettext_noop("Forces a switch to the next xlog file if a "
"new file has not been started within N seconds."),
On 4/19/15 9:09 PM, Jeff Janes wrote:
I did literally the simplest thing I could think of as a proof of
concept patch, to see if it would actually fix things. I just jumped
back a certain number of blocks occasionally and prefetched them
forward, then resumed the regular backward scan. The patch and driving
script are attached.
Shouldn't completely empty pages be set as all-visible in the VM? If so,
can't we just find the largest not-all-visible page and move forward
from there, instead of moving backwards like we currently do?
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why don't
we just scan forward? I suspect that eons ago we didn't have that and
just blindly reverse-scanned until we finally hit a non-empty buffer...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Apr 19, 2015 at 10:38 PM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:
On 4/19/15 9:09 PM, Jeff Janes wrote:
I did literally the simplest thing I could think of as a proof of
concept patch, to see if it would actually fix things. I just jumped
back a certain number of blocks occasionally and prefetched them
forward, then resumed the regular backward scan. The patch and driving
script are attached.
Shouldn't completely empty pages be set as all-visible in the VM? If so,
can't we just find the largest not-all-visible page and move forward from
there, instead of moving backwards like we currently do?
If the entire table is all-visible, we would be starting from the
beginning, even though the beginning of the table still has read only
tuples present.
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why don't we
just scan forward? I suspect that eons ago we didn't have that and just
blindly reverse-scanned until we finally hit a non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the two?
If it is above nonempty_pages, then anyone who wrote into the page after
vacuum passed it must have cleared the VM bit. And currently I think no one
but vacuum ever sets VM bit back on, so once cleared it would stay cleared.
In any event nonempty_pages could be used to set the guess as to how many
pages (if any) might be worth prefetching, as that is not needed for
correctness.
Cheers,
Jeff
On 4/20/15 1:50 AM, Jeff Janes wrote:
Shouldn't completely empty pages be set as all-visible in the VM? If
so, can't we just find the largest not-all-visible page and move
forward from there, instead of moving backwards like we currently do?
If the entire table is all-visible, we would be starting from the
beginning, even though the beginning of the table still has read only
tuples present.
Except we'd use it in conjunction with nonempty_pages. IIRC, that's set
to the last page that vacuum saw data on. If any page after that got
written to after vacuum visited it, the VM bit would be cleared. So
after we acquire the exclusive lock, AFAICT it's safe to just scan the
VM starting with nonempty_pages.
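For illustration, the forward VM scan might look roughly like this (a sketch
using visibilitymap_test() and the existing LVRelStats fields; not code from
any attached patch):

/*
 * Sketch: scan the visibility map forward from nonempty_pages and return
 * the first block from which everything to the end of the relation is
 * still all-visible (and, under the assumptions above, empty).
 */
static BlockNumber
first_truncatable_block(Relation onerel, LVRelStats *vacrelstats)
{
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber first_truncatable = vacrelstats->nonempty_pages;
    BlockNumber blkno;

    for (blkno = vacrelstats->nonempty_pages;
         blkno < vacrelstats->rel_pages; blkno++)
    {
        /* A clear bit means the page may hold data; keep it. */
        if (!visibilitymap_test(onerel, blkno, &vmbuffer))
            first_truncatable = blkno + 1;
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
    return first_truncatable;
}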
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.
Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.
Yeah, but I think we'd do a LOT better with the VM idea, because we
could immediately truncate without scanning anything.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Apr 19, 2015 at 7:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
I did literally the simplest thing I could think of as a proof of concept patch, to see if it would actually fix things. I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan.
IIRC, this patch introduces optional prefetch for the backward scan,
where file system prefetch may not work well. The result is promising!
Also, what would be the best way to drill a hole through bufmgr.c into md.c so that the prefetch could specify an entire range, rather than looping over each individual block?
Unlike mdread(), which requires a buffer argument, extending mdprefetch()
looks natural and safe to me. This would also lay a foundation for other
scan-style optimizations.
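A range-aware variant might look roughly like the following (sketch only:
mdprefetchrange() does not exist, and it assumes md.c internals such as
_mdfd_getseg() and FilePrefetch() keep their current shapes):

/*
 * Hypothetical md.c routine: prefetch nblocks starting at blocknum,
 * issuing one FilePrefetch() per segment-contiguous chunk rather than
 * one per block.
 */
void
mdprefetchrange(SMgrRelation reln, ForkNumber forknum,
                BlockNumber blocknum, BlockNumber nblocks)
{
#ifdef USE_PREFETCH
    while (nblocks > 0)
    {
        MdfdVec    *v = _mdfd_getseg(reln, forknum, blocknum,
                                     false, EXTENSION_FAIL);
        BlockNumber segoff = blocknum % ((BlockNumber) RELSEG_SIZE);
        BlockNumber chunk = Min(nblocks,
                                (BlockNumber) RELSEG_SIZE - segoff);

        (void) FilePrefetch(v->mdfd_vfd,
                            (off_t) segoff * BLCKSZ,
                            (int) (chunk * BLCKSZ));

        blocknum += chunk;
        nblocks -= chunk;
    }
#endif   /* USE_PREFETCH */
}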
What would have to be done to detect people running on SSD and disable the feature, if anything?
There are other call sites of prefetch, so this is a topic not just for
this optimization.
Regards,
Qingqing
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:
On 4/20/15 1:50 AM, Jeff Janes wrote:
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...
nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.
Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.
Yeah, but I think we'd do a LOT better with the VM idea, because we could
immediately truncate without scanning anything.
Right now all the interlocks to make this work seem to be in place (only
vacuum and startup can set visibility map bits, and only one vacuum can be
in a table at a time). But as far as I can tell, those assumptions are not
"baked in" and we have pondered loosening them before.
For example, letting HOT cleanup mark a page as all-visible if it finds it
to be such. Now in that specific case it would be OK, as HOT cleanup would
not cause a page to become empty (or could it? If an insert on a table
with no indexes was rolled back, and HOT cleanup found it and cleaned it
up, it could conceptually become empty--unless we add special code to
prevent it), and so the page would have to be below nonempty_pages. But
there may be other cases.
And I know other people have mentioned making VACUUM concurrent (although I
don't see the value in that myself).
So doing it this way would be hard to beat (scanning a bitmap vs the table
itself), but it would also introduce a modularity violation that I am not
sure is worth it.
Of course this could always be reverted if its requirements became a
problem for a more important change (assuming, of course, that we detected
the problem).
Cheers,
Jeff
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:On 4/20/15 1:50 AM, Jeff Janes wrote:
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.Yeah, but I think we'd do a LOT better with the VM idea, because we could
immediately truncate without scanning anything.Right now all the interlocks to make this work seem to be in place (only
vacuum and startup can set visibility map bits, and only one vacuum can be
in a table at a time). But as far as I can tell, those assumption are not
"baked in" and we have pondered loosening them before.For example, letting HOT clean up mark a page as all-visible if it finds
it be such. Now in that specific case it would be OK, as HOT cleanup would
not cause a page to become empty (or could it? If an insert on a table
with no indexes was rolled back, and hot clean up found it and cleaned it
up, it could conceptually become empty--unless we make special code to
prevent it) , and so the page would have to be below nonempty_pages. But
there may be other cases.And I know other people have mentioned making VACUUM concurrent (although
I don't see the value in that myself).So doing it this way would be hard to beat (scanning a bitmap vs the table
itself), but it would also introduce a modularity violation that I am not
sure is worth it.Of course this could always be reverted if its requirements became a
problem for a more important change (assuming of course that we detected
the problem)
The fatal problem here is that nonempty_pages is unreliable. If vacuum
skips all-visible pages, it doesn't necessarily increment nonempty_pages
beyond that skippage. So if you just rely on nonempty_pages, you will
truncate away pages that were already all visible but are not empty. If we
changed it so that it did increment nonempty_pages past the skipped ones,
then pages which were all empty and got marked as all-visible without being
truncated (say, because a lock could not be acquired, or because there was
a non-empty page after them which later became empty) would never get
truncated away.
As it is currently, it is not clear what purpose nonempty_pages serves. It
is a guardian value which doesn't seem to actually guard anything. At best
it prevents you from needing to inspect one page (the page of the guardian
value itself) to see if that page is actually empty, and finding that it is
not. That hardly seems worthwhile.
We could adopt two nonempty_pages counters, one that fails low on skipped
all-visible pages and one that fails high on them, and then fast-truncate
down to the high one and do the current page-by-page scan between the low
and the high. That seems rather grotesque.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
As it is currently, it is not clear what purpose nonempty_pages serves.
It's to allow an early exit at vacuumlazy.c:271 in case there is no chance
of deleting a useful number of pages.
regards, tom lane
On Tue, May 26, 2015 at 12:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 20, 2015 at 10:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:On 4/20/15 1:50 AM, Jeff Janes wrote:
For that matter, why do we scan backwards anyway? The comments don't
explain it, and we have nonempty_pages as a starting point, so why
don't we just scan forward? I suspect that eons ago we didn't have
that and just blindly reverse-scanned until we finally hit a
non-empty buffer...nonempty_pages is not concurrency safe, as the pages could become used
after vacuum passed them over but before the access exclusive lock was
grabbed before the truncation scan. But maybe the combination of the
two? If it is above nonempty_pages, then anyone who wrote into the page
after vacuum passed it must have cleared the VM bit. And currently I
think no one but vacuum ever sets VM bit back on, so once cleared it
would stay cleared.Right.
In any event nonempty_pages could be used to set the guess as to how
many pages (if any) might be worth prefetching, as that is not needed
for correctness.Yeah, but I think we'd do a LOT better with the VM idea, because we could
immediately truncate without scanning anything.Right now all the interlocks to make this work seem to be in place (only
vacuum and startup can set visibility map bits, and only one vacuum can be
in a table at a time). But as far as I can tell, those assumption are not
"baked in" and we have pondered loosening them before.For example, letting HOT clean up mark a page as all-visible if it finds
it be such. Now in that specific case it would be OK, as HOT cleanup would
not cause a page to become empty (or could it? If an insert on a table
with no indexes was rolled back, and hot clean up found it and cleaned it
up, it could conceptually become empty--unless we make special code to
prevent it) , and so the page would have to be below nonempty_pages. But
there may be other cases.And I know other people have mentioned making VACUUM concurrent (although
I don't see the value in that myself).So doing it this way would be hard to beat (scanning a bitmap vs the table
itself), but it would also introduce a modularity violation that I am not
sure is worth it.Of course this could always be reverted if its requirements became a
problem for a more important change (assuming of course that we detected
the problem)
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering
the page nonempty is required to clear the vm bit, and no other process
can set the bit again during the vacuum's lifetime. So if the bit is still
set, the page is still empty without needing to inspect it.
There is still the case of pages which had their visibility bit set by a
prior vacuum and then were not inspected by the current one. Once the
truncation scan runs into these pages, it falls back to the previous
behavior of reading block by block backwards. So there could still be
reason to optimize that fallback using forward-reading prefetch.
Using the previously shown test case, this patch reduces the truncation
part of the vacuum to 2 seconds.
Cheers,
Jeff
Attachments:
vac_trunc_trust_vm.patch (application/octet-stream)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index a01cfb4..d2cce18
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
*************** typedef struct LVRelStats
*** 113,118 ****
--- 113,119 ----
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber skipped_pages; /* actually, last skipped page + 1 */
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
*************** lazy_scan_heap(Relation onerel, LVRelSta
*** 475,480 ****
--- 476,482 ----
vacrelstats->rel_pages = nblocks;
vacrelstats->scanned_pages = 0;
vacrelstats->nonempty_pages = 0;
+ vacrelstats->skipped_pages = 0;
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
*************** lazy_scan_heap(Relation onerel, LVRelSta
*** 568,574 ****
--- 570,579 ----
{
/* Current block is all-visible */
if (skipping_all_visible_blocks && !scan_all)
+ {
+ vacrelstats->skipped_pages = blkno + 1;
continue;
+ }
all_visible_according_to_vm = true;
}
*************** static BlockNumber
*** 1559,1568 ****
--- 1564,1583 ----
count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
{
BlockNumber blkno;
+ BlockNumber trust_vm;
instr_time starttime;
+ Buffer vmbuffer = InvalidBuffer;
/* Initialize the starttime if we check for conflicting lock requests */
INSTR_TIME_SET_CURRENT(starttime);
+
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
/* Strange coding of loop control is needed because blkno is unsigned */
blkno = vacrelstats->rel_pages;
*************** count_nondeletable_pages(Relation onerel
*** 1600,1605 ****
--- 1615,1622 ----
RelationGetRelationName(onerel))));
vacrelstats->lock_waiter_detected = true;
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno;
}
starttime = currenttime;
*************** count_nondeletable_pages(Relation onerel
*** 1615,1620 ****
--- 1632,1648 ----
blkno--;
+ if (blkno >= trust_vm)
+ {
+ if (!visibilitymap_test(onerel,blkno,&vmbuffer))
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+ return blkno + 1;
+ }
+ continue;
+ }
+
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
*************** count_nondeletable_pages(Relation onerel
*** 1657,1663 ****
--- 1685,1695 ----
/* Done scanning if we found a tuple here */
if (hastup)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno + 1;
+ }
}
/*
*************** count_nondeletable_pages(Relation onerel
*** 1665,1670 ****
--- 1697,1704 ----
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return vacrelstats->nonempty_pages;
}
On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering the
page nonempty are required to clear the vm bit, and no other process can set
the bit again during the vacuum's lifetime. So if the bit is still set, the
page is still empty without needing to inspect it.
The patch looks good and it improves the vacuum truncation speed significantly.
There is still the case of pages which had their visibility bit set by a
prior vacuum and then were not inspected by the current one. Once the
truncation scan runs into these pages, it falls back to the previous
behavior of reading block by block backwards. So there could still be
reason to optimize that fallback using forward-reading prefetch.
The case I didn't understand is how the current vacuum misses a page
which was set by the prior vacuum?
The page should be counted either in skipped_pages or in nonempty_pages.
Is it possible that a page doesn't come under either of these two
categories and is also not empty?
If the above doesn't exist, then we can directly truncate the relation
from the higher of nonempty_pages and skipped_pages to the end of the
relation.
Regards,
Hari Babu
Fujitsu Australia
On Wed, Jul 8, 2015 at 9:46 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must havebeen
observed to be empty by the current vacuum. Any other process rendering
the
page nonempty are required to clear the vm bit, and no other process can
set
the bit again during the vacuum's lifetime. So if the bit is still set,
the
page is still empty without needing to inspect it.
The patch looks good and it improves the vacuum truncation speed
significantly.There is still the case of pages which had their visibility bit set by a
prior vacuum and then were not inspected by the current one. Once the
truncation scan runs into these pages, it falls back to the previous
behavior of reading block by block backwards. So there could still be
reason to optimize that fallback using forward-reading prefetch.The case, I didn't understand is that, how the current vacuum misses
the page which
was set by the prior vacuum?
The prior vacuum set them to all-visible but then didn't remove them,
either because it got interrupted or because there were still some pages
after them (at the time) that were not all empty.
The current vacuum skips them entirely on the forward scan because it
thinks it is a waste of time to vacuum all-visible pages. Since it skips
them, it doesn't know whether they are empty as well as all-visible.
There is no permanent indicator of the pages being all-empty; there is only
the inference based on the current vacuum's counters, protected by the
lock held on the table.
The page should be counted either in skipped_pages or in
nonempty_pages. Is it possible
that a page doesn't comes under these two categories and it is not empty
also?If the above doesn't exist, then we can directly truncate the relation
from the highest block
number of either nonempty_pages or skipped_pages to till the end of
the relation.
Right, and that is what this does (provided the vm bit is still set, so it
does still have to loop over the vm to verify that it is still set, while
it holds the access exclusive lock).
The problem is that the pages between the two counters are not known to be
empty, and also not known to be nonempty. Someone has to be willing to go
check on those pages at some point, or else they will never be eligible for
truncation.
Cheers,
Jeff
On Thu, Jul 9, 2015 at 4:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 8, 2015 at 9:46 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:On Mon, Jun 29, 2015 at 3:54 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
There is still the case of pages which had their visibility bit set by a
prior vacuum and then were not inspected by the current one. Once the
truncation scan runs into these pages, it falls back to the previous
behavior of reading block by block backwards. So there could still be
reason to optimize that fallback using forward-reading prefetch.The case, I didn't understand is that, how the current vacuum misses
the page which
was set by the prior vacuum?The prior vacuum set them to all visible, but then doesn't delete them.
Either because it got interrupted or because there were still some pages
after them (at the time) that were not all empty.The current vacuum skips them entirely on the forward scan because it thinks
it is waste of time to vacuum all visible pages. Since it skips them, it
doesn't know if they are empty as well as being all-visible. There is no
permanent indicator of the pages being all-empty, there is only the
inference based on the current vacuum's counters and protected by the lock
held on the table.The page should be counted either in skipped_pages or in
nonempty_pages. Is it possible
that a page doesn't comes under these two categories and it is not empty
also?If the above doesn't exist, then we can directly truncate the relation
from the highest block
number of either nonempty_pages or skipped_pages to till the end of
the relation.Right, and that is what this does (provided the vm bit is still set, so it
does still have to loop over the vm to verify that it is still set, while it
holds the access exclusive lock).
Thanks, I got it. To re-verify the vm bit of a page after getting the
access exclusive lock, we have to do the backward scan.
I also feel that your prefetch buffer patch is needed to improve
performance, because there is still a need for a backward scan to
re-verify the vm bits.
I will do some performance tests and send you the results.
The problem is that the pages between the two counters are not known to be
empty, and also not known to be nonempty. Someone has to be willing to go
check on those pages at some point, or else they will never be eligible for
truncation.
Yes, there is no way to identify whether a page is empty or not without
reading the page.
Regards,
Hari Babu
Fujitsu Australia
On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
I will do some performance tests and send you the results.
Here are the performance results tested on my machine.
                 Head       vm patch    vm+prefetch patch
First vacuum     120 sec    <1 sec      <1 sec
Second vacuum    180 sec    180 sec     30 sec
I made some modifications in the code so that the first vacuum command
skips the truncation. This way I collected the time taken by the second
vacuum.
I just combined your vm and prefetch patches into a single vm+prefetch
patch without a GUC. I kept the prefetch size at 32 and ran the
performance test. I chose 32 based on the current buffer access strategy,
which is presently 32 for vacuum, instead of making it a user option.
Here I have attached the modified patch with both the vm and prefetch
logic.
I will do some tests on a machine with an SSD and let you know the
results. Based on those results, we can decide whether we need a GUC,
depending on the impact of prefetch on SSD machines.
Regards,
Hari Babu
Fujitsu Australia
Attachments:
vac_trunc_trust_vm_and_prefetch.patch (application/octet-stream)
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 97,102 ****
--- 97,109 ----
*/
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)
+ /*
+ * Macro used to specify the number of pages to be prefetched before
+ * reading the page, used for truncating a relation by verifing the page
+ * is empty or not.
+ */
+ #define VACUUM_TRUNCATE_PREFETCH 32
+
typedef struct LVRelStats
{
/* hasindex = true means two-pass strategy; false means one-pass */
***************
*** 113,118 **** typedef struct LVRelStats
--- 120,126 ----
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber skipped_pages; /* actually, last skipped page + 1 */
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
***************
*** 475,480 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 483,489 ----
vacrelstats->rel_pages = nblocks;
vacrelstats->scanned_pages = 0;
vacrelstats->nonempty_pages = 0;
+ vacrelstats->skipped_pages = 0;
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
***************
*** 568,574 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 577,586 ----
{
/* Current block is all-visible */
if (skipping_all_visible_blocks && !scan_all)
+ {
+ vacrelstats->skipped_pages = blkno + 1;
continue;
+ }
all_visible_according_to_vm = true;
}
***************
*** 1559,1568 **** static BlockNumber
--- 1571,1590 ----
count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
{
BlockNumber blkno;
+ BlockNumber trust_vm;
instr_time starttime;
+ Buffer vmbuffer = InvalidBuffer;
/* Initialize the starttime if we check for conflicting lock requests */
INSTR_TIME_SET_CURRENT(starttime);
+
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
/* Strange coding of loop control is needed because blkno is unsigned */
blkno = vacrelstats->rel_pages;
***************
*** 1600,1605 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1622,1629 ----
RelationGetRelationName(onerel))));
vacrelstats->lock_waiter_detected = true;
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno;
}
starttime = currenttime;
***************
*** 1615,1620 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1639,1662 ----
blkno--;
+ if (blkno >= trust_vm)
+ {
+ if (!visibilitymap_test(onerel,blkno,&vmbuffer))
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+ return blkno + 1;
+ }
+ continue;
+ }
+
+ if ((blkno % VACUUM_TRUNCATE_PREFETCH) == 0 && blkno >= VACUUM_TRUNCATE_PREFETCH)
+ {
+ int i;
+ for (i = -VACUUM_TRUNCATE_PREFETCH; i<0; i++)
+ PrefetchBuffer(onerel, MAIN_FORKNUM, blkno + i);
+ }
+
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
***************
*** 1657,1663 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1699,1709 ----
/* Done scanning if we found a tuple here */
if (hastup)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno + 1;
+ }
}
/*
***************
*** 1665,1670 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1711,1718 ----
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return vacrelstats->nonempty_pages;
}
On Mon, Jul 13, 2015 at 12:06 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
I will do some performance tests and send you the results.
Here are the performance results tested on my machine.
Head vm patch vm+prefetch patch
First vacuum 120sec <1sec <1sec
second vacuum 180 sec 180 sec 30 secI did some modifications in the code to skip the vacuum truncation by
the first vacuum command.
This way I collected the second vacuum time taken performance.I just combined your vm and prefetch patch into a single patch
vm+prefetch patch without a GUC.
I kept the prefetch as 32 and did the performance test. I chosen
prefetch based on the current
buffer access strategy, which is 32 for vacuum presently instead of an
user option.
Here I attached the modified patch with both vm+prefetch logic.I will do some tests on a machine with SSD and let you know the
results. Based on these results,
we can decide whether we need a GUC or not? based on the impact of
prefetch on ssd machines.
Following are the performance readings on a machine with an SSD.
I increased the pgbench scale factor to 1000 in the test instead of 500
to show better performance numbers.
                 Head        vm patch     vm+prefetch patch
First vacuum     6.24 sec    2.91 sec     2.91 sec
Second vacuum    6.66 sec    6.66 sec     7.19 sec
There is a small performance impact on SSD with prefetch.
Regards,
Hari Babu
Fujitsu Australia
On Mon, Jul 13, 2015 at 5:16 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
On Mon, Jul 13, 2015 at 12:06 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:On Thu, Jul 9, 2015 at 5:36 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
I will do some performance tests and send you the results.
Here are the performance results tested on my machine.
Head vm patch vm+prefetch patch
First vacuum 120sec <1sec <1sec
second vacuum 180 sec 180 sec 30 secI did some modifications in the code to skip the vacuum truncation by
the first vacuum command.
This way I collected the second vacuum time taken performance.I just combined your vm and prefetch patch into a single patch
vm+prefetch patch without a GUC.
I kept the prefetch as 32 and did the performance test. I chosen
prefetch based on the current
buffer access strategy, which is 32 for vacuum presently instead of an
user option.
Here I attached the modified patch with both vm+prefetch logic.I will do some tests on a machine with SSD and let you know the
results. Based on these results,
we can decide whether we need a GUC or not? based on the impact of
prefetch on ssd machines.Following are the performance readings on a machine with SSD.
I increased the pgbench scale factor to 1000 in the test instead of 500
to show a better performance numbers.Head vm patch vm+prefetch patch
First vacuum 6.24 sec 2.91 sec 2.91 sec
second vacuum 6.66 sec 6.66 sec 7.19 secThere is a small performance impact on SSD with prefetch.
The above prefetch overhead is observed with prefetching of 1639345 pages.
I feel this overhead is small.
Hi Jeff,
If you are fine with the earlier attached patch, then I will mark it as
ready for committer, to get a committer's view on the patch.
Regards,
Hari Babu
Fujitsu Australia
On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering
the page nonempty are required to clear the vm bit, and no other process
can set the bit again during the vacuum's lifetime. So if the bit is still
set, the page is still empty without needing to inspect it.
One case where this patch can behave differently than the current code is
when, before the Exclusive lock on the relation is taken for truncation,
some backend clears the vm bit and then inserts into, deletes from, and
prunes that page. I think with the patch it will not consider truncating
such pages, whereas the current code will allow truncating them as it
re-verifies each page. I think such a case would be rare and we might not
want to bother about it, but it is still worth thinking about once.
Some minor points about patch:
count_nondeletable_pages()
{
..
if (PageIsNew(page) || PageIsEmpty(page))
{
/* PageIsNew probably shouldn't happen... */
UnlockReleaseBuffer(buf);
continue;
}
..
}
Why is vmbuffer not freed in the above loop?
count_nondeletable_pages()
{
..
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
..
}
I think it is better to have spaces in the above assignment statement
(near '=' and near '>')
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering the
page nonempty are required to clear the vm bit, and no other process can set
the bit again during the vacuum's lifetime. So if the bit is still set, the
page is still empty without needing to inspect it.One case where this patch can behave differently then current
code is, when before taking Exclusive lock on relation for truncation,
if some backend clears the vm bit and then inserts-deletes-prune that
page. I think with patch it will not consider to truncate pages whereas
current code will allow to truncate it as it re-verifies each page. I think
such a case would be rare and we might not want to bother about it,
but still worth to think about it once.
Thanks for your review.
I corrected the code: instead of returning the blkno after a visibility
map check failure, the code now just continues to verify the page
contents, as in the earlier approach.
Some minor points about patch:
count_nondeletable_pages()
{
..
if (PageIsNew(page) || PageIsEmpty(page))
{
/* PageIsNew probably shouldn't happen... */
UnlockReleaseBuffer(buf);
continue;
}
..
}
Why vmbuffer is not freed in above loop?
In the above check, we are just continuing to the next blkno; at the
end of the loop the vmbuffer is freed.
count_nondeletable_pages()
{
..
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
..
}
I think it is better to have spaces in above assignment statement
(near '=' and near '>')
corrected.
Here I attached updated patches,
1) without prefetch logic.
2) with combined vm and prefetch logic.
I marked the patch as ready for committer.
Regards,
Hari Babu
Fujitsu Australia
Attachments:
vac_trunc_trust_vm_v2.patch (application/octet-stream)
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 113,118 **** typedef struct LVRelStats
--- 113,119 ----
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber skipped_pages; /* actually, last skipped page + 1 */
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
***************
*** 475,480 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 476,482 ----
vacrelstats->rel_pages = nblocks;
vacrelstats->scanned_pages = 0;
vacrelstats->nonempty_pages = 0;
+ vacrelstats->skipped_pages = 0;
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
***************
*** 568,574 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 570,579 ----
{
/* Current block is all-visible */
if (skipping_all_visible_blocks && !scan_all)
+ {
+ vacrelstats->skipped_pages = blkno + 1;
continue;
+ }
all_visible_according_to_vm = true;
}
***************
*** 1559,1568 **** static BlockNumber
--- 1564,1583 ----
count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
{
BlockNumber blkno;
+ BlockNumber trust_vm;
instr_time starttime;
+ Buffer vmbuffer = InvalidBuffer;
/* Initialize the starttime if we check for conflicting lock requests */
INSTR_TIME_SET_CURRENT(starttime);
+
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm = vacrelstats->nonempty_pages > vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
/* Strange coding of loop control is needed because blkno is unsigned */
blkno = vacrelstats->rel_pages;
***************
*** 1600,1605 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1615,1622 ----
RelationGetRelationName(onerel))));
vacrelstats->lock_waiter_detected = true;
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno;
}
starttime = currenttime;
***************
*** 1614,1620 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
CHECK_FOR_INTERRUPTS();
blkno--;
!
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
--- 1631,1644 ----
CHECK_FOR_INTERRUPTS();
blkno--;
!
! /*
! * Verify the page is empty or not based on visibility map test, instead
! * of verifying from the page contents.
! */
! if ((blkno >= trust_vm) && visibilitymap_test(onerel, blkno, &vmbuffer))
! continue;
!
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
***************
*** 1657,1663 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1681,1691 ----
/* Done scanning if we found a tuple here */
if (hastup)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno + 1;
+ }
}
/*
***************
*** 1665,1670 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1693,1700 ----
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return vacrelstats->nonempty_pages;
}
vac_trunc_trust_vm_and_prefetch_v2.patch (application/octet-stream)
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 97,102 ****
--- 97,109 ----
*/
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)
+ /*
+ * Macro used to specify the number of pages to be prefetched before
+ * reading the page, used for truncating a relation by verifing the page
+ * is empty or not.
+ */
+ #define VACUUM_TRUNCATE_PREFETCH 32
+
typedef struct LVRelStats
{
/* hasindex = true means two-pass strategy; false means one-pass */
***************
*** 113,118 **** typedef struct LVRelStats
--- 120,126 ----
BlockNumber pages_removed;
double tuples_deleted;
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+ BlockNumber skipped_pages; /* actually, last skipped page + 1 */
/* List of TIDs of tuples we intend to delete */
/* NB: this list is ordered by TID address */
int num_dead_tuples; /* current # of entries */
***************
*** 475,480 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 483,489 ----
vacrelstats->rel_pages = nblocks;
vacrelstats->scanned_pages = 0;
vacrelstats->nonempty_pages = 0;
+ vacrelstats->skipped_pages = 0;
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
***************
*** 568,574 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 577,586 ----
{
/* Current block is all-visible */
if (skipping_all_visible_blocks && !scan_all)
+ {
+ vacrelstats->skipped_pages = blkno + 1;
continue;
+ }
all_visible_according_to_vm = true;
}
***************
*** 1559,1568 **** static BlockNumber
--- 1571,1590 ----
count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
{
BlockNumber blkno;
+ BlockNumber trust_vm;
instr_time starttime;
+ Buffer vmbuffer = InvalidBuffer;
/* Initialize the starttime if we check for conflicting lock requests */
INSTR_TIME_SET_CURRENT(starttime);
+
+ /*
+ * Pages that were inspected and found to be empty by this vacuum run
+ * must still be empty if the vm bit is still set. Only vacuum sets
+ * vm bits, and only one vacuum can exist in a table at one time.
+ */
+ trust_vm = vacrelstats->nonempty_pages > vacrelstats->skipped_pages ?
+ vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
/* Strange coding of loop control is needed because blkno is unsigned */
blkno = vacrelstats->rel_pages;
***************
*** 1600,1605 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1622,1629 ----
RelationGetRelationName(onerel))));
vacrelstats->lock_waiter_detected = true;
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno;
}
starttime = currenttime;
***************
*** 1614,1619 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1638,1657 ----
CHECK_FOR_INTERRUPTS();
blkno--;
+
+ /*
+ * Verify the page is empty or not based on visibility map test, instead
+ * of verifying from the page contents.
+ */
+ if ((blkno >= trust_vm) && visibilitymap_test(onerel, blkno, &vmbuffer))
+ continue;
+
+ if ((blkno % VACUUM_TRUNCATE_PREFETCH) == 0 && blkno >= VACUUM_TRUNCATE_PREFETCH)
+ {
+ int i;
+ for (i = -VACUUM_TRUNCATE_PREFETCH; i<0; i++)
+ PrefetchBuffer(onerel, MAIN_FORKNUM, blkno + i);
+ }
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
***************
*** 1657,1663 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1695,1705 ----
/* Done scanning if we found a tuple here */
if (hastup)
+ {
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return blkno + 1;
+ }
}
/*
***************
*** 1665,1670 **** count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
--- 1707,1714 ----
* pages still are; we need not bother to look at the last known-nonempty
* page.
*/
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
return vacrelstats->nonempty_pages;
}
On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
One case where this patch can behave differently then current
code is, when before taking Exclusive lock on relation for truncation,
if some backend clears the vm bit and then inserts-deletes-prune that
page. I think with patch it will not consider to truncate pages whereas
current code will allow to truncate it as it re-verifies each page. I
think
such a case would be rare and we might not want to bother about it,
but still worth to think about it once.Thanks for your review.
corrected the code as instead of returning the blkno after visibility map
check failure, the code just continue to verify the contents as
earlier approach.
Okay, I think this should work.
Here I attached updated patches,
1) without prefetch logic.
2) with combined vm and prefetch logic.
I think it is better to just get the first patch in, as that in itself is
a clear win, and then we can evaluate the second one (prefetching during
truncation) separately. I think after the first patch the use case for
prefetching seems less beneficial, and it could hurt by loading unneeded
pages into the OS buffer cache (say you have prefetched 32 pages and,
after inspecting the first one, the loop quits because that page is
non-empty; or the prefetch was done just because one page had its
visibility map bit cleared while the bits for the others are still set).
Having said that, I am not against prefetching in this scenario, as it
can help in more cases than it could hurt, but I think it will be better
to evaluate that separately.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 16, 2015 at 10:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:Here I attached updated patches,
1) without prefetch logic.
2) with combined vm and prefetch logic.I think it is better to just get the first patch in as that in itself is a
clear win and then we can evaluate the second one (prefetching during
truncation) separately. I think after first patch, the use case for doing
prefetch seems to be less beneficial and I think it could hurt by loading
not required pages (assume you have prefetched 32 pages and after
inspecting first page, it decides to quit the loop as that is non-empty page
and other is when it has prefetched just because for page the visibility map
bit was cleared, but for others it is set) in OS buffer cache.
Yes, in the above cases the prefetch is an overhead. I am not sure how
frequently such scenarios occur in practice.
Having said that, I am not against prefetching in this scenario as that
can help in more cases then it could hurt, but I think it will be better
to evaluate that separately.
Yes, the prefetch patch works better mainly in cases where the first
vacuum skips the truncation because of lock waiters. If such cases are
common in practice, then the prefetch patch is needed.
Regards,
Hari Babu
Fujitsu Australia
On 16 July 2015 at 06:51, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
I marked the patch as ready for committer.
The most recent results seem to indicate that the prefetch isn't worth
pursuing, but the vm test is. Can someone repeat the perf tests on
something larger so we can see when/if there is a benefit?
Jeff, can you add detailed comments to explain the theory of operation of
these patches? The patches add the code, but don't say why. We will
forget, so let's put the detail in there now, please. Thanks.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering the
page nonempty are required to clear the vm bit, and no other process can set
the bit again during the vacuum's lifetime. So if the bit is still set, the
page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit. At the least we'd better document that carefully
so that nobody breaks it later. But I wonder if there isn't some
better approach, because I would certainly rather that we didn't
foreclose the possibility of doing something like that in the future.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must havebeen
observed to be empty by the current vacuum. Any other process rendering
the
page nonempty are required to clear the vm bit, and no other process can
set
the bit again during the vacuum's lifetime. So if the bit is still set,
the
page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if
something more important came along that conflicted with it. I don't think
this change would grow tentacles across the code that make it hard to
revert, you would just have to take the performance hit (and by that time,
maybe HDD will truly be dead anyway and so we don't care anymore). But yes,
that is definitely a downside. HOT pruning is one example, but also one
could envision having someone (bgwriter?) set vm bits on unindexed tables.
Or if we invent some efficient way to know that no expiring tids for a
certain block range are stored in indexes, other jobs could also set the vm
bit on indexed tables. Or parallel vacuums in the same table, not that I
really see a reason to have those.
At the least we'd better document that carefully
so that nobody breaks it later. But I wonder if there isn't some
better approach, because I would certainly rather that we didn't
foreclose the possibility of doing something like that in the future.
But where do we document it (other than in-place)? README.HOT doesn't seem
sufficient, and there is no README.vm.
I guess add an "Assert(InRecovery || running_a_vacuum);" to
the visibilitymap_set with a comment there, except that I don't know how to
implement running_a_vacuum so that it covers manual vacs as well as
autovac. Perhaps assert that we hold a SHARE UPDATE EXCLUSIVE on rel?
The advantage of the other approach, just forcing kernel read-ahead to work
for us, is that it doesn't impose any of these restrictions on future
development. The disadvantage is that I don't know how to auto-tune it, or
auto-disable it for SSD, and it will never be quite as efficient.
Cheers,
Jeff
On Wed, Jul 22, 2015 at 12:11 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if
something more important came along that conflicted with it.
Well, the people who want this change would likely object to reverting it later.
I don't think
this change would grow tentacles across the code that make it hard to
revert, you would just have to take the performance hit (and by that time,
maybe HDD will truly be dead anyway and so we don't care anymore). But yes,
that is definitely a downside. HOT pruning is one example, but also one
could envision having someone (bgwriter?) set vm bits on unindexed tables.
Or if we invent some efficient way to know that no expiring tids for a
certain block range are stored in indexes, other jobs could also set the vm
bit on indexed tables. Or parallel vacuums in the same table, not that I
really see a reason to have those.
Yes, good points. (I think we will need parallel vacuum to cope with
TB-sized tables, but that's another story.)
At the least we'd better document that carefully
so that nobody breaks it later. But I wonder if there isn't some
better approach, because I would certainly rather that we didn't
foreclose the possibility of doing something like that in the future.
But where do we document it (other than in-place)? README.HOT doesn't seem
sufficient, and there is no README.vm.
visibilitymap_set?
I guess add an "Assert(InRecovery || running_a_vacuum);" to the
visibilitymap_set with a comment there, except that I don't know how to
implement running_a_vacuum so that it covers manual vacs as well as autovac.
Perhaps assert that we hold a SHARE UPDATE EXCLUSIVE on rel?
I care more about the comment than I do about the Assert().
The advantage of the other approach, just forcing kernel read-ahead to work
for us, is that it doesn't impose any of these restrictions on future
development. The disadvantage is that I don't know how to auto-tune it, or
auto-disable it for SSD, and it will never be quite as efficient.
I have to say that I like the prefetch approach. I don't know what to
do about the fact that it loses to the VM-bit based approach, but I
think it's a bad bet that we will never care about setting visibility
map bits anyplace other than VACUUM.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-07-23 11:06:25 -0400, Robert Haas wrote:
I don't know what to do about the fact that it loses to the VM-bit
based approach, but I think it's a bad bet that we will never care
about setting visibility map bits anyplace other than VACUUM.
+1
A background process that tries to hint all pages shortly before they're
written out would be truly awesome. And would very likely set hint
bits...
Andres
On 7/22/15 11:11 AM, Jeff Janes wrote:
I guess add an "Assert(InRecovery || running_a_vacuum);" to
the visibilitymap_set with a comment there, except that I don't know how
to implement running_a_vacuum so that it covers manual vacs as well as
autovac.
static bool RunningVacuum; ? It only needs to be per-backend, right?
If it is that simple then my $0.02 is we should have the assert, not
just a comment.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 22 July 2015 at 14:59, Robert Haas <robertmhaas@gmail.com> wrote:
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
What is the reason why we don't do that already? Surely it's a one-liner?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jul 25, 2015 at 4:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 22 July 2015 at 14:59, Robert Haas <robertmhaas@gmail.com> wrote:
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
What is the reason why we don't do that already? Surely it's a one-liner?
I wish!
It actually doesn't look simple to me to modify heap_page_prune to
know whether any not-all-visible items remain at the end. It's
definitely true that, in order for marking the page all-visible to be
a possibility, we need to reach heap_page_prune_execute() with ndead
== 0. But that's not enough: we also need to know that any tuple that
survived the prune operation (that is, it wasn't redirected or marked
unused) has a new-enough xmin, which isn't tracked anywhere.
vacuumlazy.c solves that problem by making a second pass over all the
tuples on the page to determine whether the page is all-visible, but
that seems pretty inefficient and I think heap_page_prune() is already
a CPU hog on some workloads. (Didn't Heikki have a patch to improve
that?)
Another sticky wicket is that, as things stand today, we'd have to
insert a second WAL record to mark the page all-visible, which would
add probably-significant overhead. We could probably modify the
format of the record we're already emitting for pruning to carry a
flag indicating whether to additionally set PD_ALL_VISIBLE.
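Roughly what that could look like, purely as a sketch: the 9.5-era pruning
record is xl_heap_clean, and the flag name and layout below are made up for
illustration.

/* Existing fields of xl_heap_clean, plus a hypothetical flags byte. */
typedef struct xl_heap_clean
{
    TransactionId latestRemovedXid;
    uint16      nredirected;
    uint16      ndead;
    uint8       flags;          /* new, hypothetical */
} xl_heap_clean;

#define XLHC_SET_ALL_VISIBLE    0x01    /* hypothetical */

/*
 * heap_page_prune() would set the flag once it has proven the page
 * all-visible, and replay would then set PD_ALL_VISIBLE (and the VM bit)
 * from the pruning record instead of requiring a second WAL record.
 */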
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering the
page nonempty is required to clear the vm bit, and no other process can set
the bit again during the vacuum's lifetime. So if the bit is still set, the
page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if
something more important came along that conflicted with it.
I think what is being said here is that someone is already using this
technique, or if not, then we actively want to encourage them to do so as
an extension or as a submission to core.
In that case, I think the rely-on-VM technique sinks again, sorry Jim,
Jeff. Probably needs code comments added.
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of
stride? Thanks.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 7/27/15 10:39 AM, Robert Haas wrote:
But that's not enough: we also need to know that any tuple that
survived the prune operation (that is, it wasn't redirected or marked
unused) has a new-enough xmin, which isn't tracked anywhere.
Wouldn't that be old-enough xmin?
heap_prune_chain() is already calling HeapTupleSatisfiesVacuum, so it
should be able to figure out if the page is all visible without a lot of
extra work, and pass that info back to heap_page_prune (which would then
pass it down to _execute()).
Definitely not a one-liner though.
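As a rough sketch of that idea (field and variable names are illustrative,
and the real PruneState and heap_prune_chain() carry more state than shown):

/* PruneState would grow two fields: */
bool        all_visible;                /* page still qualifies as all-visible? */
TransactionId visibility_cutoff_xid;    /* newest committed xmin seen on page */

/*
 * In the per-tuple loop where heap_prune_chain() already calls
 * HeapTupleSatisfiesVacuum():
 */
switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
{
    case HEAPTUPLE_LIVE:
        {
            TransactionId xmin = HeapTupleHeaderGetXmin(tup.t_data);

            /* Live, but all-visible only if every snapshot sees its inserter. */
            if (!TransactionIdPrecedes(xmin, OldestXmin))
                prstate->all_visible = false;
            else if (TransactionIdFollows(xmin, prstate->visibility_cutoff_xid))
                prstate->visibility_cutoff_xid = xmin;
            break;
        }
    default:
        /* dead, recently dead, or in progress: cannot mark all-visible */
        prstate->all_visible = false;
        break;
}

/*
 * heap_page_prune() could then pass prstate.all_visible down to
 * heap_page_prune_execute(), or use it directly to set PD_ALL_VISIBLE.
 */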
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering the
page nonempty is required to clear the vm bit, and no other process can set
the bit again during the vacuum's lifetime. So if the bit is still set, the
page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if
something more important came along that conflicted with it.
I think what is being said here is that someone is already using this
technique, or if not, then we actively want to encourage them to do so as
an extension or as a submission to core.
In that case, I think the rely-on-VM technique sinks again, sorry Jim,
Jeff. Probably needs code comments added.
Sure, that sounds like the consensus. The VM method was very efficient,
but I agree it is pretty fragile and restricting.
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of
stride? Thanks.
I probably won't get back to it this commit fest, so it can be set to
returned with feedback. But if anyone has good ideas for how to set the
stride (or detect that it is on SSD and so is pointless to try) I'd love to
hear about them anytime.
Cheers,
Jeff
On 3 August 2015 at 17:18, Jeff Janes <jeff.janes@gmail.com> wrote:
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of
stride? Thanks.
I probably won't get back to it this commit fest, so it can be set to
returned with feedback. But if anyone has good ideas for how to set the
stride (or detect that it is on SSD and so is pointless to try) I'd love to
hear about them anytime.
I've Returned With Feedback, as you suggest.
Given your earlier numbers, I'd just pick a constant value of 128, to keep
it simple. That balances out the various factors of small/large tables and
the uncertainty that we will be able to truncate the whole chunk of blocks.
I'd like to see tests on SSD before commit, but I hope and expect the
slightly negative results with SSD won't be substantiated on larger tests.
Kept simple, it's a patch with a clear win in a restricted use case and I
would be happy to commit that sometime in the future.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 4, 2015 at 2:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 22 July 2015 at 17:11, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Attached is a patch that implements the vm scan for truncation. It
introduces a variable to hold the last blkno which was skipped during the
forward portion. Any blocks after both this blkno and after the last
inspected nonempty page (which the code is already tracking) must have been
observed to be empty by the current vacuum. Any other process rendering the
page nonempty is required to clear the vm bit, and no other process can set
the bit again during the vacuum's lifetime. So if the bit is still set, the
page is still empty without needing to inspect it.
Urgh. So if we do this, that forever precludes having HOT pruning set
the all-visible bit.
I wouldn't say forever, as it would be easy to revert the change if
something more important came along that conflicted with it.
I think what is being said here is that someone is already using this
technique, or if not, then we actively want to encourage them to do so as an
extension or as a submission to core.
In that case, I think the rely-on-VM technique sinks again, sorry Jim,
Jeff. Probably needs code comments added.
Sure, that sounds like the consensus. The VM method was very efficient, but
I agree it is pretty fragile and restricting.
That does still leave the prefetch technique, so all is not lost.
Can we see a patch with just prefetch, probably with a simple choice of
stride? Thanks.
I probably won't get back to it this commit fest, so it can be set to
returned with feedback. But if anyone has good ideas for how to set the
stride (or detect that it is on SSD and so is pointless to try) I'd love to
hear about them anytime.
I found the following way to determine whether a table's data file resides
on a disk or an SSD.
1. Get the device that the table's data file is mapped to, using the
following or something similar:
df -P "filename" | awk 'NR==2{print $1}'
2. If the device is of the form /dev/sd*, treat it as a disk system and
proceed with the prefetch optimization.
3. If we are not able to find the device details directly, then we need to
get the information from the device-mapper. Usually the devices map like
the following:
/dev/mapper/v** points to ../dm-*
4. Get the name of the "dm-*" device from the above details and check
whether it is an SSD or not by reading:
/sys/block/dm-*/queue/rotational
5. If the value is 0 then it is an SSD, 1 means a rotating disk (a small C
sketch of this check follows below).
The procedure described above works only for Linux. I haven't checked other
operating systems yet.
Is it worth considering?
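For what it's worth, steps 4 and 5 reduce to a few lines of C. This is a
standalone, Linux-only sketch; the hard part, mapping a relation's data file
to a kernel device name such as "sda" or "dm-0", is not shown, and the
function name is illustrative:

#include <stdio.h>

/*
 * Return 1 if the named block device reports itself as rotational
 * (spinning disk), 0 if not (SSD/NVMe), -1 if the attribute is unavailable.
 */
static int
device_is_rotational(const char *devname)
{
    char        path[256];
    FILE       *fp;
    int         value = -1;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", devname);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* no such device or no sysfs attribute */

    if (fscanf(fp, "%d", &value) != 1)
        value = -1;

    fclose(fp);
    return value;
}

For example, device_is_rotational("sda") would typically return 1 for a
spinning disk and 0 for an SSD, though virtualized and layered storage can
report misleading values.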
Regards,
Hari Babu
Fujitsu Australia
On Mon, Sep 28, 2015 at 2:05 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
I found the following way to determine whether a table's data file resides
on a disk or an SSD.
1. Get the device that the table's data file is mapped to, using the
following or something similar:
df -P "filename" | awk 'NR==2{print $1}'
2. If the device is of the form /dev/sd*, treat it as a disk system and
proceed with the prefetch optimization.
3. If we are not able to find the device details directly, then we need to
get the information from the device-mapper. Usually the devices map like
the following:
/dev/mapper/v** points to ../dm-*
4. Get the name of the "dm-*" device from the above details and check
whether it is an SSD or not by reading:
/sys/block/dm-*/queue/rotational
5. If the value is 0 then it is an SSD, 1 means a rotating disk.
The procedure described above works only for Linux. I haven't checked other
operating systems yet.
Is it worth considering?
No. If we need to have the behavior depend on the hardware, it should
be a GUC or tablespace option or reloption, not some kind of crazy
OS-dependent discovery.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
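For illustration, the GUC route Robert mentions would amount to one variable
plus one table entry in guc.c; the name, default, grouping, and limits below
are invented for the example, and a reloption or tablespace option would
follow the same shape via the reloptions.c tables instead:

/* In vacuumlazy.c (or wherever the truncation scan lives): */
int         vacuum_truncate_prefetch = 0;   /* 0 disables prefetching */

/* In guc.c, added to the ConfigureNamesInt[] table: */
{
    {"vacuum_truncate_prefetch", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
        gettext_noop("Number of blocks prefetched ahead of the backward "
                     "scan used to truncate a table at the end of VACUUM."),
        gettext_noop("Zero disables prefetching; set it to zero on SSDs.")
    },
    &vacuum_truncate_prefetch,
    0, 0, INT_MAX / 2,
    NULL, NULL, NULL
},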