Removing PD_ALL_VISIBLE
Follow-up to discussion:
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php
I worked out a patch that replaces PD_ALL_VISIBLE with calls to
visibilitymap_test. It rips out a lot of complexity, with a net drop of
about 300 lines (not a lot, but some of that code is pretty complex).
The patch is quite rough, and I haven't yet added the code to keep the
VM buffer pins around longer, but I still don't see any major problems.
I'm fairly certain that scans will not be adversely affected, and I
think that inserts/updates/deletes should be fine as well.
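To illustrate the shape of the change (a sketch only, assuming the stock 9.2-era buffer and visibility map APIs; the wrapper name is made up, this is not the patch itself): where the scan code used to look at the page-level hint bit, it instead asks the visibility map, reusing a caller-supplied VM buffer pin.

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Sketch only: decide whether the tuples on a heap page can skip
 * per-tuple visibility checks.  Before: consult the PD_ALL_VISIBLE
 * page header flag.  After: consult the visibility map directly,
 * with the pin cached in *vmbuffer across pages.
 */
static bool
page_is_all_visible(Relation rel, Buffer heapbuf, Buffer *vmbuffer)
{
    /* Old way: return PageIsAllVisible(BufferGetPage(heapbuf)); */

    /* New way: one VM lookup per scanned page */
    return visibilitymap_test(rel, BufferGetBlockNumber(heapbuf), vmbuffer);
}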
The main worry I have with inserts/updates/deletes is that they have less
spatial locality than scans when allocating new buffers or modifying
existing ones. So keeping a pinned VM buffer might not help as much,
because the next operation might need to pin a different VM page anyway.
Adding to January commitfest, as it's past 11/15.
Regards,
Jeff Davis
On Wed, 2012-11-21 at 18:25 -0800, Jeff Davis wrote:
Follow-up to discussion:
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00817.php
I worked out a patch that replaces PD_ALL_VISIBLE with calls to
visibilitymap_test. It rips out a lot of complexity, with a net drop of
about 300 lines (not a lot, but some of that code is pretty complex).
Updated patch attached.
Now it tries to keep the VM buffer pinned during scans, inserts,
updates, and deletes. This should avoid increased contention pinning the
VM pages, but performance tests are required.
For updates, it currently only tries to hold a pin on the VM buffer for
the page of the original tuple. For HOT updates, that's always the same
as the new buffer anyway. For cold updates, we could also try to keep a
pin on the buffer for the new tuple, but right now I don't see an
obvious need for that complexity. It may plausibly be a problem when
doing a bulk update on a freshly-loaded table.
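The pin-caching follows the pattern visibilitymap_pin() already supports: the caller hangs on to one VM buffer and only swaps it when a heap block falls under a different VM page. A rough sketch (names simplified; not the actual patch):

/*
 * Sketch of caching a VM buffer pin across operations.
 * visibilitymap_pin() re-reads the VM page only if heapBlk maps to a
 * different VM page than the one already held in *vmbuffer.
 */
static void
example_touch_block(Relation rel, BlockNumber heapBlk, Buffer *vmbuffer)
{
    visibilitymap_pin(rel, heapBlk, vmbuffer);

    /* ... do the insert/update/delete, clearing the VM bit if needed ... */
}

static void
example_finish(Buffer *vmbuffer)
{
    /* drop the cached pin once the scan or statement is done */
    if (BufferIsValid(*vmbuffer))
    {
        ReleaseBuffer(*vmbuffer);
        *vmbuffer = InvalidBuffer;
    }
}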
It occurred to me that it might be difficult to test this patch without
a fairly large test case. A big assumption of my patch is that there
will be locality of access (and the VM page you already have a pin on is
likely to be needed the next time), which is obvious during a scan but
not so obvious during I/U/D. But a single 8K VM page represents some 60K
pages, or about 500MB of data. So anything less than that means that
there is only one VM page, and locality is trivial... it seems like any
test on a table less than 5GB would not be fair.
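For reference, the arithmetic behind those figures, assuming the default 8 KB
block size and one VM bit per heap page (the real capacity is slightly lower
because of the VM page header):

    8192 bytes * 8 bits/byte = 65536 heap pages
    65536 pages * 8 KB/page  = 512 MB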
Then again, if a 5GB table is being randomly accessed, an extra pin is
unlikely to matter. Also, without locality, the contention would not be
nearly as bad either. I'm still pretty unclear what the "worst case" for
this patch is supposed to look like.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
Now it tries to keep the VM buffer pinned during scans, inserts,
updates, and deletes. This should avoid increased contention pinning the
VM pages, but performance tests are required.
...
Then again, if a 5GB table is being randomly accessed, an extra pin is
unlikely to matter. Also, without locality, the contention would not be
nearly as bad either. I'm still pretty unclear what the "worst case" for
this patch is supposed to look like.
I'd be worried about the case of a lot of sessions touching a lot of
unrelated tables. This change implies doubling the number of buffers
that are held pinned by any given query, and the distributed overhead
from that (eg, adding cycles to searches for free buffers) is what you
ought to be afraid of.
Another possibly important point is that reducing the number of
pin/unpin cycles for a given VM page might actually hurt the chances of
it being found in shared buffers, because IIRC the usage_count is bumped
once per pin/unpin. That algorithm is based on the assumption that
buffer pins are not drastically different in lifespan, but I think you
just broke that for pins on VM pages.
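(For readers not steeped in bufmgr.c, the mechanism Tom is referring to works
roughly like the sketch below; the struct and names are simplified stand-ins,
not the real buffer descriptors.)

#define MAX_USAGE_COUNT 5       /* simplified stand-in for the real cap */

typedef struct
{
    int refcount;               /* current pins */
    int usage_count;            /* popularity estimate, capped */
} FakeBufferDesc;

static void
pin(FakeBufferDesc *buf)
{
    buf->refcount++;
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;     /* bumped per pin, not per unit of time held */
}

static bool
clock_sweep_can_evict(FakeBufferDesc *buf)
{
    if (buf->refcount > 0)
        return false;           /* pinned buffers are never evicted */
    if (buf->usage_count > 0)
    {
        buf->usage_count--;     /* survives another lap of the clock sweep */
        return false;
    }
    return true;
}

So a page that is pinned once and held for a long time accrues less "usage
credit" than one that is pinned and unpinned on every access, which is the
asymmetry Tom is pointing at.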
I'm not particularly concerned about devising solutions for these
problems, though, because I think this idea is a loser from the get-go;
the increase in contention for VM pages is alone going to destroy any
possible benefit.
regards, tom lane
On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote:
I'd be worried about the case of a lot of sessions touching a lot of
unrelated tables. This change implies doubling the number of buffers
that are held pinned by any given query, and the distributed overhead
from that (eg, adding cycles to searches for free buffers) is what you
ought to be afraid of.
That's a good point. "Doubling" might be an exaggeration if indexes are
involved, but it's still a concern. The cost of this might be difficult
to measure though.
Another possibly important point is that reducing the number of
pin/unpin cycles for a given VM page might actually hurt the chances of
it being found in shared buffers, because IIRC the usage_count is bumped
once per pin/unpin. That algorithm is based on the assumption that
buffer pins are not drastically different in lifespan, but I think you
just broke that for pins on VM pages.
If doing a bunch of simple key lookups using an index, then the root page
of the index is only pinned once per query, but we expect that to stay
in shared buffers. I see the VM page as about the same: one pin per
query (or maybe a couple for large tables).
I don't see how the lifetime of the pin matters a whole lot in this
case; it's more about the rate of pins/unpins, right?
I'm not particularly concerned about devising solutions for these
problems, though, because I think this idea is a loser from the get-go;
the increase in contention for VM pages is alone going to destroy any
possible benefit.
Your intuition here is better than mine, but I am still missing
something here. If we keep the buffer pinned, then there will be very
few pin/unpin cycles here, so I don't see where the contention would
come from (any more than there is contention pinning the root of an
index).
Do you still think I need a shared lock here or something? If so, then
index-only scans have a pretty big problem right now, too.
I'll try to quantify some of these effects you've mentioned, and see how
the numbers turn out. I'm worried that I'll need more than 4 cores to
show anything though, so perhaps someone with a many-core box would be
interested to test this out?
Regards,
Jeff Davis
On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis <pgsql@j-davis.com> wrote:
Your intuition here is better than mine, but I am still missing
something here. If we keep the buffer pinned, then there will be very
few pin/unpin cycles here, so I don't see where the contention would
come from (any more than there is contention pinning the root of an
index).
Based on previous measurements, I think there *is* contention pinning
the root of an index. Currently, I believe it's largely overwhelmed
by contention from other sources, such as the buffer manager lwlocks
and the very-evil ProcArrayLock. However, I believe that as we fix
those problems, this will start to percolate up towards the top of the
heap.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Jeff Davis <pgsql@j-davis.com> writes:
On Sun, 2012-11-25 at 22:30 -0500, Tom Lane wrote:
Another possibly important point is that reducing the number of
pin/unpin cycles for a given VM page might actually hurt the chances of
it being found in shared buffers, because IIRC the usage_count is bumped
once per pin/unpin.
If doing a bunch of simple key lookups using an index, then the root page
of the index is only pinned once per query, but we expect that to stay
in shared buffers. I see the VM page as about the same: one pin per
query (or maybe a couple for large tables).
Hmmm ... that seems like a valid analogy. I may be worried about
nothing as far as this point goes.
Do you still think I need a shared lock here or something? If so, then
index-only scans have a pretty big problem right now, too.
There's still the issue of whether the IOS code is safe in machines with
weak memory ordering. I think that it probably is safe, but the
argument for it in the current code comment is wrong; most likely, a
correct argument has to depend on read/write barriers associated with
taking snapshots. I think what you ought to do is work through that,
fix the existing comment, and then see whether the same argument works
for what you want to do.
regards, tom lane
On Mon, 2012-11-26 at 16:10 -0500, Tom Lane wrote:
There's still the issue of whether the IOS code is safe in machines with
weak memory ordering. I think that it probably is safe, but the
argument for it in the current code comment is wrong; most likely, a
correct argument has to depend on read/write barriers associated with
taking snapshots. I think what you ought to do is work through that,
fix the existing comment, and then see whether the same argument works
for what you want to do.
As a part of the patch, I did change the comment, and here's what I came
up with:
* Note on Memory Ordering Effects: visibilitymap_test does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
* because the tuple is present in the index page but not visible. The
* reading of the TID by this scan (using a shared lock on the index
* buffer) is serialized with the insert of the TID into the index
* (using an exclusive lock on the index buffer). Because the VM bit is
* cleared before updating the index, and locking/unlocking of the
* index page acts as a full memory barrier, we are sure to see the
* cleared bit if we see a recently-inserted TID.
*
* Deletes do not update the index page (only VACUUM will clear out the
* TID), so the clearing of the VM bit by a delete is not serialized
* with this test below, and we may see a value that is significantly
* stale. However, we don't care about the delete right away, because
* the tuple is still visible until the deleting transaction commits or
* the statement ends (if it's our transaction). In either case, the
* lock on the VM buffer will have been released (acting as a write
* barrier) after clearing the bit. And for us to have a snapshot that
* includes the deleting transaction (making the tuple invisible), we
* must have acquired ProcArrayLock after that time, acting as a read
* barrier.
*
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
And I updated the comment in visibilitymap.c as well (reformatted for
this email):
"To test a bit in the visibility map, most callers should have a pin on
the VM buffer, and at least a shared lock on the data buffer. Any
process that clears the VM bit must have an exclusive lock on the data
buffer, so that will serialize access to the appropriate bit. Because
lock acquisition and release are full memory barriers, then there is no
danger of seeing the state of the bit before it was last cleared.
Callers that don't have the data buffer yet, such as an index only scan
or a VACUUM that is skipping pages, must handle the concurrency as
appropriate."
Regards,
Jeff Davis
On Mon, Nov 26, 2012 at 3:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 26, 2012 at 3:29 PM, Jeff Davis <pgsql@j-davis.com> wrote:
Your intuition here is better than mine, but I am still missing
something here. If we keep the buffer pinned, then there will be very
few pin/unpin cycles here, so I don't see where the contention would
come from (any more than there is contention pinning the root of an
index).
Based on previous measurements, I think there *is* contention pinning
the root of an index. Currently, I believe it's largely overwhelmed
by contention from other sources, such as the buffer manager lwlocks
and the very-evil ProcArrayLock. However, I believe that as we fix
those problems, this will start to percolate up towards the top of the
heap.
Yup -- it (buffer pin contention on high-traffic buffers) has been caught
in the wild -- just maintaining the pin count was enough to do it in
at least one documented case. Pathological workloads demonstrate
contention today, and there's no good reason to assume it's limited to
index root nodes -- I'm strongly suspicious that buffer spinlock issues
are behind some other malfeasance we've seen recently.
merlin
On Mon, 2012-11-26 at 16:55 -0600, Merlin Moncure wrote:
Based on previous measurements, I think there *is* contention pinning
the root of an index. Currently, I believe it's largely overwhelmed
by contention from other sources, such as the buffer manager lwlocks
and the very-evil ProcArrayLock. However, I believe that as we fix
those problems, this will start to percolate up towards the top of the
heap.
Yup -- it (buffer pin contention on high-traffic buffers) has been caught
in the wild -- just maintaining the pin count was enough to do it in
at least one documented case. Pathological workloads demonstrate
contention today, and there's no good reason to assume it's limited to
index root nodes -- I'm strongly suspicious that buffer spinlock issues
are behind some other malfeasance we've seen recently.
I tried for quite a while to show any kind of performance difference
between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24
if you count HT).
Three patches in question:
1. Current unpatched master
2. patch that naively always checks the VM page, pinning and unpinning
each time
3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in
this patch though -- new version forthcoming)
I tested from 1 to 64 concurrent connections.
One test was just concurrent scans of a ~400MB table.
The other test was a DELETE FROM foo WHERE ctid BETWEEN '(N,0)' AND
'(N,256)' where N is the worker number in the test program. The table
here is only about 2 MB. The idea is that the scan will happen quickly,
leading to many quick deletes, but the deletes won't actually touch the
same pages (aside from the VM page). So, this was designed to be
uncontended except for pinning the VM page.
On the scan test, it was really hard to see any difference above the test
noise, but I may have seen about a 3-4% degradation between patch #1 and
patch #2 at higher concurrencies. It was difficult for me to reproduce
this result -- it usually wouldn't show up. I didn't see any difference
between patch #1 and patch #3.
On the delete test I detected no difference between #1 and #2 at all.
I think someone with access to a larger box may need to test this. Or,
if someone has a more specific suggestion about how I can create a worst
case, then let me know.
Regards,
Jeff Davis
On Fri, 2012-11-30 at 13:16 -0800, Jeff Davis wrote:
I tried for quite a while to show any kind of performance difference
between checking the VM and checking PD_ALL_VISIBLE on a 12-core box (24
if you count HT).
Three patches in question:
1. Current unpatched master
2. patch that naively always checks the VM page, pinning and unpinning
each time
3. Same as #2, but tries to keep buffers pinned (I had to fix a bug in
this patch though -- new version forthcoming)
New patch attached.
Nathan Boley kindly lent me access to a 64-core box, and that shows a
much more interesting result. The previous test (on the 12-core)
basically showed no difference between any of the patches.
Now I see why: on the 64-core box, the interesting region seems to be
around 32 concurrent connections.
The left column is the concurrency, and the right is the runtime. This
test was for concurrent scans of a 350MB table (each process did 4 scans
and quit). Test program attached.
Patch 1 (scan test):
001 004.299533
002 004.434378
004 004.708533
008 004.518470
012 004.487033
016 004.513915
024 004.765459
032 006.425780
048 007.089146
064 007.908850
072 009.461419
096 013.098646
108 015.278592
128 019.797206
Patch 2 (scan test):
001 004.385206
002 004.596340
004 004.616684
008 004.832248
012 004.858336
016 004.689959
024 005.016797
032 006.857642
048 012.049407
064 025.774772
072 032.680710
096 059.147500
108 083.654806
128 120.350200
Patch 3 (scan test):
001 004.464991
002 004.555595
004 004.562364
008 004.649633
012 004.628159
016 004.518748
024 004.768348
032 004.834177
048 007.003305
064 008.242714
072 009.732261
096 013.231056
108 014.996977
128 020.488570
As you can see, patch #2 starts to show a difference at around 32 and
completely falls over by 48 connections. This is expected because it's
the naive approach that pins the VM page every time it needs it.
Patches #1 and #3 are effectively the same; subsequent runs (and with more
measurements around concurrency 32) show that the differences are just
noise (which seems to be greater around the inflection point of 32). All
of the numbers that seem to show any difference can end up with patch #1
better or patch #3 better, depending on the run.
I tried the delete test, too, but I still couldn't see any difference.
(I neglected to mention in my last email: I aborted after each delete so
that it would be repeatable). The inflection point there is
significantly lower, so I assume it must be contending over something
else. I tried making the table unlogged to see if that would change
things, but it didn't change much. This test only scales linearly to
about 8 or so. Or, there could be something wrong with my test.
So, I conclude that contention is certainly a problem for scans for
patch #2, but patch #3 seems to fix that completely by holding the
buffer pins. The deletes are somewhat inconclusive, but I just can't see
a difference.
Holding more pins does have a distributed cost in theory, as Tom points
out, but I don't know where to begin testing that. We'll have to make a
decision between (a) maintaining the extra complexity and doing the
extra page writes involved with PD_ALL_VISIBLE; or (b) holding onto one
extra pin per table being scanned. Right now, if PD_ALL_VISIBLE did not
exist, it would be pretty hard to justify putting it in as far as I can
tell.
Regards,
Jeff Davis
Rebased patch attached. No significant changes.
Regards,
Jeff Davis
On 17 January 2013 03:02, Jeff Davis <pgsql@j-davis.com> wrote:
Rebased patch attached. No significant changes.
Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them? That will help decide whether
to commit.
Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 January 2013 03:02, Jeff Davis <pgsql@j-davis.com> wrote:
Rebased patch attached. No significant changes.
Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them? That will help decide whether
to commit.
+1. On another thread, "Set visibility map bit after HOT prune", Robert
mentioned that it's not such a good idea to remove the PD_ALL_VISIBLE
flag because it helps us to reduce contention on the VM page,
especially when we need to clear the VM bit. Here is an excerpt from
Robert's comment on that thread:
"Sure, but you're zipping rather blithely past the disadvantages of
such an approach. Jeff Davis recently proposed getting rid of
PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism
about that; this proposal has the same problems. One of the major
benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts,
updates, and deletes to the page can ignore the visibility map. That
means that a server under heavy concurrency is much less likely to
encounter contention on the visibility map blocks. Now, maybe that's
not really a problem, but I sure haven't seen enough evidence to make
me believe it. If it's really true that PD_ALL_VISIBLE needn't fill
this role, then Heikki wasted an awful lot of time implementing it,
and I wasted an awful lot of time keeping it working when I made the
visibility map crash-safe for IOS. That could be true, but I tend to
think it isn't."
Maybe you've already addressed that concern with proven
performance numbers, but I'm not sure.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
At 2013-01-17 08:41:37 +0000, simon@2ndQuadrant.com wrote:
Jeff, can you summarise/collate why we're doing this, what concerns it
raises and how you've dealt with them?
Since I was just looking at the original patch and discussion, and since
Pavan has posted an excerpt from one objection to it, here's an excerpt
from Jeff's original post titled "Do we need so many hint bits?"
/messages/by-id/1353026577.14335.91.camel@sussancws0025
Also, I am wondering about PD_ALL_VISIBLE. It was originally
introduced in the visibility map patch, apparently as a way to know
when to clear the VM bit when doing an update. It was then also used
for scans, which showed a significant speedup. But I wonder: why not
just use the visibilitymap directly from those places? It can be
used for the scan because it is crash safe now (not possible
before). And since it's only one lookup per scanned page, I don't
think it would be a measurable performance loss there.
Inserts/updates/deletes also do a significant amount of work, so
again, I doubt it's a big drop in performance there -- maybe under a
lot of concurrency or something.
The benefit of removing PD_ALL_VISIBLE would be significantly
higher. It's quite common to load a lot of data, and then do some
reads for a while (setting hint bits and flushing them to disk), and
then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing
all of the pages again. Also, if I remember correctly, Robert went
to significant effort when making the VM crash-safe to keep the
PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed
before?
There was considerable discussion after this (accessible through the
archives link above), which I won't attempt to summarise.
-- Abhijit
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
There was considerable discussion after this (accessible through the
archives link above), which I won't attempt to summarise.
I thought Robert made those comments after considerable discussion of
Jeff's approach. So he probably still stands by his objections, or at least
is not satisfied with (or hasn't seen) the numbers.
Now that I look at the patch, I wonder if there is another fundamental
issue with the patch. Since the patch removes WAL logging for the VM set
operation, this can happen:
1. Vacuum kicks in and clears all dead tuples in a page and decides that
it's all-visible
2. Vacuum WAL-logs the cleanup activity and marks the page dirty
3. Vacuum sets the visibility bit and marks the VM page dirty
4. Say the VM page gets written to disk. Neither the heap page nor the
WAL corresponding to the cleanup operation has been written yet
5. CRASH
After recovery, the VM bit will remain set because the VM page got written
before the crash. But since the heap page's cleanup WAL did not make it to
disk, those operations won't be replayed. The heap page will be left with
not-all-visible tuples in that case, and that's not a good state to be in.
The original code does not have this problem because the VM-set WAL gets
written after the heap page cleanup WAL. So it's guaranteed that the VM bit
will be set during recovery only if the cleanup WAL is replayed too (there
is more magic than meets the eye, and I think it's not fully documented).
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Maybe you've already addressed that concern with proven
performance numbers, but I'm not sure.
It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.
Jeff's approach of holding the VM pins for longer certainly mitigates
some of the damage, in the sense that it reduces buffer lookups and
pin/unpin cycles - and might be worth doing independently of the rest
of the patch if we think it's a win. Index-only scans already use a
similar optimization, so extending it to inserts, updates, and deletes
is surely worth considering. The main question in my mind is whether
there are any negative consequences to holding a VM buffer pin for
that long without interruption. The usual consideration - namely,
blocking vacuum - doesn't apply here because vacuum does not take a
cleanup lock on the visibility map page, only the heap page, but I'm
not sure if there are others.
The other part of the issue is cache pressure. I don't think I can say
it better than what Tom already wrote:
# I'd be worried about the case of a lot of sessions touching a lot of
# unrelated tables. This change implies doubling the number of buffers
# that are held pinned by any given query, and the distributed overhead
# from that (eg, adding cycles to searches for free buffers) is what you
# ought to be afraid of.
I agree that we ought to be afraid of that. A pgbench test isn't
going to find a problem in this area because there you have a bunch of
sessions all accessing the same small group of tables. To find a
problem of the type above, you'd need lots of backends accessing lots
of different, small tables. That's not the use case we usually
benchmark, but I think there are a fair number of people doing things
like that - for example, imagine a hosting provider or web application
with many databases or schemas on a single instance. AFAICS, Jeff
hasn't tried to test this scenario.
Now, on the flip side, we should also be thinking about what we would
gain from this patch, and what other ways there might be to achieve
the same goals. As far as I can see, the main gain is that if you
bulk-load a table, don't vacuum it right away, get all the hint bits
set by some other mechanism, and then vacuum the table, you'll only
read the whole table instead of rewriting the whole table. So ISTM
that, for example, if we adopted the idea of making HOT prune set
visibility-map bits, most of the benefit of this change evaporates,
but whatever costs it may have will remain. There are other possible
ways of getting the same benefit as well - for example, I've been
thinking for a while now that we should try to find some judicious way
of vacuuming insert-only tables, perhaps only in small chunks when
there's nothing else going on. That wouldn't as clearly obsolete this
patch, but it would address a very similar use case and would also
preset hint bits, which would help a lot of people. Some of the ideas
we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if
we can do an early freeze without losing forensic information, we can
roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the
page into a single write.
All of which is to say that I think this patch is premature. If we
adopt something like this, we're likely never going to revert back to
the way we do it now, and whatever cache-pressure or other costs this
approach carries will be here to stay - so we had better think awfully
carefully before we do that. And, FWIW, I don't believe that there is
sufficient time in this release cycle to carefully test this patch and
the other alternative designs that aim toward the same ends. Even if
there were, this is exactly the sort of thing that should be committed
at the beginning of a release cycle, not the end, so as to allow
adequate time for discovery of unforeseen consequences before the code
ends up out in the wild.
Of course, there's another issue here too, which is that as Pavan
points out, the patch throws crash-safety out the window, which breaks
index-only scans (if you have a crash). HEAP_XLOG_VISIBLE is intended
principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch
rips it out even though its purpose is to remove the latter, not the
former. Removing PD_ALL_VISIBLE eliminates the need to keep the
visibility-map bit consistent with PD_ALL_VISIBLE, but it does not
eliminate the need to keep PD_ALL_VISIBLE consistent with the page
contents.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17.01.2013 16:53, Robert Haas wrote:
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Maybe you've already addressed that concern with proven
performance numbers, but I'm not sure.
It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.
The idea was to avoid clearing the bit in the VM page on every update,
when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is
not set. I assumed the traffic and contention on the VM page would be a
killer otherwise. I don't remember if I ever actually tested that
though. Maybe I was worrying about nothing and hitting the VM page on
every update is ok.
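In code terms, the guard Heikki is describing looks roughly like this
(simplified from the insert/update/delete paths; the wrapper is illustrative,
not the exact heapam code):

static void
clear_all_visible_if_needed(Relation rel, Buffer heapbuf, Buffer vmbuffer)
{
    Page        page = BufferGetPage(heapbuf);
    BlockNumber blkno = BufferGetBlockNumber(heapbuf);

    if (PageIsAllVisible(page))
    {
        PageClearAllVisible(page);
        visibilitymap_clear(rel, blkno, vmbuffer);
    }
    /*
     * else: the writer never touches the VM page at all, which is the
     * contention-avoidance benefit of keeping PD_ALL_VISIBLE.
     */
}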
- Heikki
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
Now that I look at the patch, I wonder if there is another fundamental
issue with the patch. Since the patch removes WAL logging for the VM
set operation, this can happen:
Thank you. I think I was confused by this comment here:
"When we *set* a visibility map during VACUUM, we must write WAL. This
may seem counterintuitive, since the bit is basically a hint: if it is
clear, it may still be the case that every tuple on the page is visible
to all transactions; we just don't know that for certain. The
difficulty is that there are two bits which are typically set together:
the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit.
If a crash occurs after the visibility map page makes it to disk and
before the updated heap page makes it to disk, redo must set the bit on
the heap page. Otherwise, the next insert, update, or delete on the heap
page will fail to realize that the visibility map bit must be cleared,
possibly causing index-only scans to return wrong answers."
Which led me to believe that I could just rip out the WAL-related code
if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness
doesn't have to do with the WAL directly; it's because the VM page's LSN
is not bumped past the LSN of the related heap page cleanup, so the VM
page can be written out too early.
Of course, the way to bump the LSN is to write WAL for the
visibilitymap_set operation. But that would be a very simple WAL
routine, rather than the complex one that exists without the patch.
I suppose we could even try to bump the LSN without writing WAL somehow,
but it doesn't seem worth reasoning through that (setting a VM bit is
rare enough).
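For concreteness, the "very simple WAL routine" could look something like the
sketch below. log_vm_bit_set() is a hypothetical helper, not an existing
function; the point is only that emitting any WAL record for the bit-set and
stamping the VM page with its LSN keeps the VM page from being flushed ahead
of the heap-cleanup record.

static void
vm_set_and_log(Relation rel, BlockNumber heapBlk, Buffer vmbuffer)
{
    XLogRecPtr  recptr;

    LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
    START_CRIT_SECTION();

    /* set the bit in the VM page here, as visibilitymap_set() does */
    MarkBufferDirty(vmbuffer);

    /* hypothetical minimal record describing just the VM change */
    recptr = log_vm_bit_set(rel, heapBlk, vmbuffer);
    PageSetLSN(BufferGetPage(vmbuffer), recptr);

    END_CRIT_SECTION();
    LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
}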
Regards,
Jeff Davis
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote:
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote:
Now that I look at the patch, I wonder if there is another fundamental
issue with the patch. Since the patch removes WAL logging for the VM
set operation, this can happen:Thank you.
New patch attached with simple WAL logging.
Regards,
Jeff Davis
On 17 January 2013 15:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 17.01.2013 16:53, Robert Haas wrote:
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Maybe you've already addressed that concern with proven
performance numbers, but I'm not sure.
It would be nice to hear what Heikki's reasons were for adding
PD_ALL_VISIBLE in the first place.
The idea was to avoid clearing the bit in the VM page on every update, when
the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set.
I assumed the traffic and contention on the VM page would be a killer
otherwise. I don't remember if I ever actually tested that though. Maybe I
was worrying about nothing and hitting the VM page on every update is ok.
Presumably we remember the state of the VM so we can skip the re-visit
after every write?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services