A design for amcheck heapam verification
It seems like a good next step for amcheck would be to add
functionality that verifies that heap tuples have matching index
tuples, and that heap pages are generally sane. I've been thinking
about a design for this for a while now, and would like to present
some tentative ideas before I start writing code.
Using a bloom filter built with hashed index tuples, and then matching
that against the heap is an approach to index/heap consistency
checking that has lots of upsides, but at least one downside. The
downside is pretty obvious: we introduce a low though quantifiable
risk of false negatives (failure to detect corruption due to the
actual absence of an appropriate index tuple). That's not great,
especially because it occurs non-deterministically, but the upsides to
this approach are considerable. Perhaps it's worth it.
Here is the general approach that I have in mind right now:
* Size the bloom filter based on the pg_class.reltuples of the index
to be verified, weighing a practical tolerance for false negatives,
capping with work_mem.
- We might throw an error immediately if it's impossible to get a
reasonably low probability of false negatives -- for some value of
"reasonable".
* Perform existing verification checks for a B-Tree. While scanning
the index, hash index tuples, including their heap TID pointer. Build
the bloom filter with hashed values as we go.
As we scan the heap, we:
* Verify that HOT safety was correctly assessed in respect of the
index (or indexes) being verified.
* Test the visibility map, and sanity check MultiXacts [1].
* Probe index/heap match check (uses bloom filter):
If a heap tuple meets the following conditions:
- Is not a HOT update tuple.
- Is committed, and committed before RecentGlobalXmin.
- Satisfies any predicate (for partial index case).
Then:
- Build a would-be index tuple value, perhaps reusing CREATE INDEX code.
- Hash that in the same manner as in the original index pass.
- Raise an error if the bloom filter indicates it is not present in the index.
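To make the shape of that protocol concrete, here is a minimal, self-contained sketch of build-then-probe against a toy fixed-size bloom filter. Everything here is illustrative: the names bloom_add/bloom_probe, the seeded FNV-1a hash, and the hard-coded sizes are stand-ins, not the proposed amcheck code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy bloom filter: fixed 1024-bit bitset, 3 probes per element.
 * Illustrative only -- names and sizes are not the proposed amcheck code.
 */
#define BLOOM_BITS 1024
#define BLOOM_K    3

typedef struct
{
	unsigned char bitset[BLOOM_BITS / 8];
} bloom_filter;

/* Seeded FNV-1a: stands in for whatever hash the real patch would use */
static uint32_t
hash_seeded(const unsigned char *elem, size_t len, uint32_t seed)
{
	uint32_t	h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++)
	{
		h ^= elem[i];
		h *= 16777619u;
	}
	return h;
}

/* Index pass: hash each (normalized) index tuple into the filter */
static void
bloom_add(bloom_filter *filter, const unsigned char *elem, size_t len)
{
	for (uint32_t k = 0; k < BLOOM_K; k++)
	{
		uint32_t	bit = hash_seeded(elem, len, k) % BLOOM_BITS;

		filter->bitset[bit / 8] |= (unsigned char) (1 << (bit % 8));
	}
}

/* Heap pass: 0 means "definitely absent" (corruption!), 1 means "probably present" */
static int
bloom_probe(const bloom_filter *filter, const unsigned char *elem, size_t len)
{
	for (uint32_t k = 0; k < BLOOM_K; k++)
	{
		uint32_t	bit = hash_seeded(elem, len, k) % BLOOM_BITS;

		if (!(filter->bitset[bit / 8] & (1 << (bit % 8))))
			return 0;
	}
	return 1;
}
```

A "definitely absent" result from the heap pass is what gets reported as corruption; "probably present" is where the quantifiable false negative risk lives.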
Seems like we should definitely go to the index first, because the
heap is where we'll find visibility information.
If I wanted to go about implementing the same index/heap checks in a
way that does not have the notable downside around false negatives, I
suppose I'd have to invent a new, internal mechanism that performed a
kind of special merge join of an index against the heap. That would be
complicated, and require a lot of code; in other words, it would be
bug-prone. I wouldn't say that amcheck is simple today, but at least
the complexity owes everything to how B-Trees already work, as opposed
to how a bunch of custom infrastructure we had to invent works. The
amount of code in amcheck's verify_nbtree.c file is pretty low, and
that's a good goal to stick with. The very detailed comment that
argues for the correctness of amcheck's bt_right_page_check_scankey()
function is, to a large degree, also arguing for the correctness of a
bunch of code within nbtree. amcheck verification should be
comprehensive, but also simple and minimal, which IMV is a good idea
for much the same reason that it's a good idea when writing unit
tests.
The merge join style approach would also make verification quite
expensive, particularly when one table has multiple indexes. A tool
whose overhead can only really be justified when we're pretty sure
that there is already a problem is *significantly* less useful. And,
it ties verification to the executor, which could become a problem if
we make the functionality into a library that is usable by backup
tools that don't want to go through the buffer manager (or any SQL
interface).
Apart from the low memory overhead of using a bloom filter, resource
management is itself made a lot easier. We won't be sensitive to
misestimations, because we only need an estimate of the number of
tuples in the index, which will tend to be accurate enough in the vast
majority of cases. reltuples is needed to size the bloom filter bitmap
up-front. It doesn't matter how wide individual index tuples turn out
to be, because we simply hash everything, including even the heap TID
contained within the index.
Using a bloom filter makes verification "stackable" in a way that
might become important later. For example, we can later improve
amcheck to verify a table in parallel, by having parallel workers
each verify one index, with bloom filters built in fixed-size shared
memory buffers. A parallel heap scan then has workers concurrently
verify heap pages, and concurrently probe each final bloom filter.
Similarly, everything works the same if we add the option of scanning
a B-Tree in physical order (at the expense of not doing cross-page
verification). And, while I'd start with nbtree, you can still pretty
easily generalize the approach to building the bloom filter across
AMs. All index AMs other than BRIN have index tuples that are
essentially some values that are either the original heap values
themselves, or values that are a function of those original values,
plus a heap TID pointer. So, GIN might compress the TIDs, and
deduplicate the values, and SP-GiST might use its own compression, but
it's pretty easy to build a conditioned IndexTuple binary string that
normalizes away these differences, which is what we actually hash.
When WARM introduces the idea of a RED or BLUE tuple, it may be
possible to add that to the conditioning logic.
(BTW, the more advanced requirements are why I think that at least
some parts of amcheck should eventually end up in core -- most of
these are a long way off, but seem worth thinking about now.)
We can add a random seed value to the mix when hashing, so that any
problem that is masked by hash collisions won't stay masked on repeat
verification attempts. Errors from corruption display this value,
which lets users reliably recreate the original error by explicitly
setting the seed the next time around. Say, when users need to verify
with total confidence that a particular issue has been fixed within a
very large table in production.
I'd like to hear feedback on the general idea, and what the
user-visible interface ought to look like. The non-deterministic false
negatives may need to be considered by the user-visible interface,
which is the main reason I mention it.
[1]: postgr.es/m/20161017014605.GA1220186@tornado.leadboat.com
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Apr 28, 2017 at 9:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I'd like to hear feedback on the general idea, and what the
user-visible interface ought to look like. The non-deterministic false
negatives may need to be considered by the user-visible interface,
which is the main reason I mention it.
Bloom filters are one of those things that come up on this mailing
list incredibly frequently but rarely get used in committed code; thus
far, contrib/bloom is the only example we've got, and not for lack of
other proposals. One problem is that Bloom filters assume you can get
n independent hash functions for a given value, which we have not got.
That problem would need to be solved somehow. If you only have one
hash function, the size of the required bloom filter probably gets
very large.
When hashing index and heap tuples, do you propose to include the heap
TID in the data getting hashed? I think that would be a good idea,
because otherwise you're only verifying that every heap tuple has an
index pointer pointing at something, not that every heap tuple has an
index tuple pointing at the right thing.
I wonder if it's also worth having a zero-error mode, even if it runs
for a long time. Scan the heap, and probe the index for the value
computed from each heap tuple. Maybe that's so awful that nobody
would ever use it, but I'm not sure. It might actually be simpler to
implement than what you have in mind.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 1, 2017 at 12:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Bloom filters are one of those things that come up on this mailing
list incredibly frequently but rarely get used in committed code; thus
far, contrib/bloom is the only example we've got, and not for lack of
other proposals.
They certainly are a fashionable data structure, but it's not as if
they're a new idea. The math behind them is very well understood. They
solve one narrow class of problem very well.
One problem is that Bloom filters assume you can get
n independent hash functions for a given value, which we have not got.
That problem would need to be solved somehow. If you only have one
hash function, the size of the required bloom filter probably gets
very large.
I don't think that that's a problem, because you really only need 2
hash functions [1], which we have already (recall that Andres added
Murmur hash to Postgres 10). It's an area that I'd certainly need to
do more research on if I'm to go forward with bloom filters, but I'm
pretty confident that there is a robust solution to the practical
problem of not having arbitrary many hash functions at hand. (I think
that you rarely need all that many independent hash functions, in any
case.)
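The claim that two hash functions suffice refers to the double-hashing scheme g_i(x) = h1(x) + i*h2(x) (mod m), which simulates k independent hash functions from just two. A minimal sketch, where h1/h2 are arbitrary toy mixers standing in for whatever Postgres would actually use:

```c
#include <assert.h>
#include <stdint.h>

/* Toy mixers standing in for two independent hash functions */
static uint32_t
h1(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	return x;
}

static uint32_t
h2(uint32_t x)
{
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x | 1;			/* force odd: visits every slot when m is a power of two */
}

/* i-th simulated hash function, for a filter with m slots */
static uint32_t
g(uint32_t x, uint32_t i, uint32_t m)
{
	return (h1(x) + i * h2(x)) % m;
}
```

The filter then sets/tests bits g(x, 0, m) through g(x, k - 1, m), with no loss in the asymptotic false positive rate compared to k truly independent hash functions.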
It isn't that hard to evaluate whether or not an implementation has
things right, at least for a variety of typical cases. We know how to
objectively evaluate a hash function while making only some pretty
mild assumptions.
When hashing index and heap tuples, do you propose to include the heap
TID in the data getting hashed? I think that would be a good idea,
because otherwise you're only verifying that every heap tuple has an
index pointer pointing at something, not that every heap tuple has an
index tuple pointing at the right thing.
Yes -- I definitely want to hash the heap TID from each IndexTuple.
I wonder if it's also worth having a zero-error mode, even if it runs
for a long time. Scan the heap, and probe the index for the value
computed from each heap tuple. Maybe that's so awful that nobody
would ever use it, but I'm not sure. It might actually be simpler to
implement than what you have in mind.
It's easy if you don't mind that the implementation will be an ad-hoc
nested loop join. I guess I could do that too, if only because it
won't be that hard, and that's really what you want when you know you
have corruption. Performance will probably be prohibitively poor when
verification needs to be run in any kind of routine way, which is a
problem if that's the only way it can work. My sense is that
verification needs to be reasonably low overhead, and it needs to
perform pretty consistently, even if you only use it for
stress-testing new features.
To reiterate, another thing that makes a bloom filter attractive is
how it simplifies resource management relative to an approach
involving sorting or a hash table. There are a bunch of edge cases
that I don't have to worry about around resource management (e.g., a
subset of very wide outlier IndexTuples, or two indexes that are of
very different sizes associated with the same table that need to
receive an even share of memory).
As I said, even if I was totally willing to duplicate the effort that
went into respecting work_mem as a budget within places like
tuplesort.c, having as little infrastructure code as possible is a
specific goal for amcheck.
[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Fri, Apr 28, 2017 at 6:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
- Is committed, and committed before RecentGlobalXmin.
Actually, I guess amcheck would need to use its own scan's snapshot
xmin instead. This is true because it cares about visibility in a way
that's "backwards" relative to existing code that tests something
against RecentGlobalXmin. Is there any existing thing that works that
way?
If it's not clear what I mean: existing code that cares about
RecentGlobalXmin is using it as a *conservative* point before which
every snapshot sees every transaction as committed/aborted (and
therefore nobody can care if that other backend hot prunes dead tuples
from before then, or whatever it is). Whereas, amcheck needs to care
about the possibility that *anyone else* decided that pruning or
whatever is okay, based on generic criteria, and not what amcheck
happened to see as RecentGlobalXmin during snapshot acquisition.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On 1 May 2017 at 20:46, Robert Haas <robertmhaas@gmail.com> wrote:
One problem is that Bloom filters assume you can get
n independent hash functions for a given value, which we have not got.
That problem would need to be solved somehow. If you only have one
hash function, the size of the required bloom filter probably gets
very large.
There's a simple formula to calculate the optimal number of hash
functions and size of the filter given a target false positive rate.
But I don't think this is as big of a problem as you imagine.
a) we don't really only have one hash function, we have a 32-bit hash
function and we could expand that to a larger bit size if we wanted.
Bloom filters are never 2^32-bit arrays, for obvious reasons. A
1kbit bloom filter needs only 10 bits to index it, so a single 32-bit
hash value already gives you three fully independent hash functions.
If we changed to a 64-bit or 128-bit hash function then you could have
enough bits available to have a larger set of hash functions and a
larger array.
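Point (a) can be shown in a few lines: a 1024-slot (2^10) filter consumes only 10 bits per probe, so one 32-bit hash value already carries three independent indexes (the function name here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * A 1024-slot (2^10) filter needs 10 bits per probe, so one 32-bit
 * hash value yields three fully independent 10-bit indexes.
 */
static void
three_indexes_from_one_hash(uint32_t hash, uint32_t out[3])
{
	out[0] = hash & 0x3FF;			/* bits 0..9 */
	out[1] = (hash >> 10) & 0x3FF;	/* bits 10..19 */
	out[2] = (hash >> 20) & 0x3FF;	/* bits 20..29 */
}
```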
b) you can get a poor man's universal hash out of hash_any or hash_int
by just tweaking the input value in a way that doesn't interact in a
simple way with the hash function. Even something as simple as xoring
it with a random number (i.e. a vector of random numbers that identify
your randomly chosen distinct "hash functions") seems to work fine.
However, for future-proofing and security hardening I think Postgres should
really implement a real mathematically rigorous Universal Hashing
scheme which provides a family of hash functions from which to pick
randomly. This prevents users from being able to generate data that
would intentionally perform poorly in hash data structures for
example. But it also means you have a whole family of hash functions
to pick for bloom filters.
--
greg
On Mon, May 1, 2017 at 2:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
Actually, I guess amcheck would need to use its own scan's snapshot
xmin instead. This is true because it cares about visibility in a way
that's "backwards" relative to existing code that tests something
against RecentGlobalXmin. Is there any existing thing that works that
way?
Looks like pg_visibility has a similar set of concerns, and so
sometimes calls GetOldestXmin() to "recompute" what it calls
OldestXmin (which I gather is like RecentGlobalXmin, but comes from
calling GetOldestXmin() at least once). This happens within
pg_visibility's collect_corrupt_items(). So, I could either follow
that approach, or, more conservatively, call GetOldestXmin()
immediately after each "amcheck whole index scan" finishes, for use
later on, when we go to the heap. Within the heap, we expect that any
committed tuple whose xmin precedes FooIndex.OldestXmin should be
present in that index's bloom filter. Of course, when there are
multiple indexes, we might only arrive at the heap much later. (I
guess we'd also want to check if the MVCC Snapshot's xmin preceded
FooIndex.OldestXmin, and set that as FooIndex.OldestXmin when that
happened to be the case.)
Anyone have an opinion on any of this? Offhand, I think that calling
GetOldestXmin() once per index when its "amcheck whole index scan"
finishes would be safe, and yet provide appreciably better test
coverage than only expecting things visible to our original MVCC
snapshot to be present in the index. I don't see a great reason to be
more aggressive and call GetOldestXmin() more often than once per
whole index scan, though.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Mon, May 1, 2017 at 4:28 PM, Peter Geoghegan <pg@bowt.ie> wrote:
Anyone have an opinion on any of this? Offhand, I think that calling
GetOldestXmin() once per index when its "amcheck whole index scan"
finishes would be safe, and yet provide appreciably better test
coverage than only expecting things visible to our original MVCC
snapshot to be present in the index. I don't see a great reason to be
more aggressive and call GetOldestXmin() more often than once per
whole index scan, though.
Wait, that's wrong, because in general RecentGlobalXmin may advance at
any time as new snapshots are acquired by other backends. The only
thing that we know for sure is that our MVCC snapshot is an interlock
against things being recycled that the snapshot needs to see
(according to MVCC semantics). And, we don't just have heap pruning to
worry about -- we also have nbtree's LP_DEAD based recycling to worry
about, before and during the amcheck full index scan (actually, this
is probably the main source of problematic recycling for our
verification protocol).
So, I think that we could call GetOldestXmin() once, provided we were
willing to recheck in the style of pg_visibility if and when there was
an apparent violation that might be explained as caused by concurrent
LP_DEAD recycling within nbtree. That seems complicated enough that
I'll never be able to convince myself that it's worthwhile before
actually trying to write the code.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
Peter Geoghegan <pg@bowt.ie> writes:
If it's not clear what I mean: existing code that cares about
RecentGlobalXmin is using it as a *conservative* point before which
every snapshot sees every transaction as committed/aborted (and
therefore nobody can care if that other backend hot prunes dead tuples
from before then, or whatever it is). Whereas, amcheck needs to care
about the possibility that *anyone else* decided that pruning or
whatever is okay, based on generic criteria, and not what amcheck
happened to see as RecentGlobalXmin during snapshot acquisition.
ISTM if you want to do that you have an inherent race condition.
That is, no matter what you do, the moment after you look the currently
oldest open transaction could commit, allowing some other session's
view of RecentGlobalXmin to move past what you think it is, so that
that session could start pruning stuff.
Maybe you can fix this by assuming that your own session's advertised xmin
is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
sure if that rule does what you want.
regards, tom lane
On Mon, May 1, 2017 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe you can fix this by assuming that your own session's advertised xmin
is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
sure if that rule does what you want.
That's what you might ultimately need to fall back on (that, or
perhaps repeated calls to GetOldestXmin() to recheck, in the style of
pg_visibility). It's useful to do rechecking, rather than just
starting with the MVCC snapshot's xmin because you might be able to
determine that the absence of some index tuple in the index (by which
I mean its bloom filter) *still* cannot be explained by concurrent
recycling. The conclusion that there is a real problem might never
have been reached without this extra complexity.
I'm not saying that it's worthwhile to add this complexity, rather
than just starting with the MVCC snapshot's xmin in the first place --
I really don't have an opinion either way just yet.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Mon, May 1, 2017 at 9:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
ISTM if you want to do that you have an inherent race condition.
That is, no matter what you do, the moment after you look the currently
oldest open transaction could commit, allowing some other session's
view of RecentGlobalXmin to move past what you think it is, so that
that session could start pruning stuff.
It can't prune the stuff we care about if we've got a shared content
lock on the target buffer. That's the trick pg_visibility uses:
/*
 * Time has passed since we computed OldestXmin, so it's
 * possible that this tuple is all-visible in reality even
 * though it doesn't appear so based on our
 * previously-computed value.  Let's compute a new value so we
 * can be certain whether there is a problem.
 *
 * From a concurrency point of view, it sort of sucks to
 * retake ProcArrayLock here while we're holding the buffer
 * exclusively locked, but it should be safe against
 * deadlocks, because surely GetOldestXmin() should never take
 * a buffer lock.  And this shouldn't happen often, so it's
 * worth being careful so as to avoid false positives.
 */
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 1, 2017 at 6:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, May 1, 2017 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe you can fix this by assuming that your own session's advertised xmin
is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
sure if that rule does what you want.

That's what you might ultimately need to fall back on (that, or
perhaps repeated calls to GetOldestXmin() to recheck, in the style of
pg_visibility).
I spent only a few hours writing a rough prototype, and came up with
something that does an IndexBuildHeapScan() scan following the
existing index verification steps. Its amcheck callback does an
index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
all), and tests it for membership of a bloom filter generated as part
of the main B-Tree verification phase. The IndexTuple memory is freed
immediately following hashing.
The general idea here is that whatever IndexTuples ought to be in the
index following a fresh REINDEX also ought to have been found in the
index already. IndexBuildHeapScan() takes care of almost all of the
details for us.
I think I can do this correctly when only an AccessShareLock is
acquired on heap + index, provided I also do a separate
GetOldestXmin() before even the index scan begins, and do a final
recheck of the saved GetOldestXmin() value against heap tuple xmin
within the new IndexBuildHeapScan() callback (if we still think that
it should have been found by the index scan, then actually throw a
corruption related error). When there is only a ShareLock (for
bt_parent_index_check() calls), the recheck isn't necessary. I think I
should probably also make the IndexBuildHeapScan()-passed indexInfo
structure "ii_Unique = false", since waiting for the outcome of a
concurrent conflicting unique index insertion isn't useful, and can
cause deadlocks.
While I haven't really made my mind up, this design is extremely
simple, and effectively tests IndexBuildHeapScan() at the same time as
everything else. The addition of the bloom filter itself isn't
trivial, but the code added to verify_nbtree.c is.
The downside of going this way is that I cannot piggyback other types
of heap verification on the IndexBuildHeapScan() scan. Still, perhaps
it's worth it. Perhaps I should implement this bloom-filter-index-heap
verification step as one extra option for the existing B-Tree
verification functions. I may later add new verification functions
that examine and verify the heap and related SLRUs alone.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I spent only a few hours writing a rough prototype, and came up with
something that does an IndexBuildHeapScan() scan following the
existing index verification steps. Its amcheck callback does an
index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
all), and tests it for membership of a bloom filter generated as part
of the main B-Tree verification phase. The IndexTuple memory is freed
immediately following hashing.
I attach a cleaned-up version of this. It has extensive documentation.
My bloom filter implementation is broken out as a separate patch,
added as core infrastructure under "lib".
I do have some outstanding concerns about V1 of the patch series:
* I'm still uncertain about the question of using IndexBuildHeapScan()
during Hot Standby. It seems safe, since we're only using the
CONCURRENTLY/AccessShareLock path when this happens, but someone might
find that objectionable on general principle. For now, in this first
version, it remains possible to call IndexBuildHeapScan() during Hot
Standby, to allow the new verification to work there.
* The bloom filter has been experimentally verified, and is based on
source material which is directly cited. It would nevertheless be
useful to have the hashing stuff scrutinized, because it's possible
that I've overlooked some subtlety.
This is only the beginning for heapam verification. Comprehensive
coverage can be added later, within routines that specifically target
some table, not some index.
While this patch series only adds index-to-heap verification for
B-Tree indexes, I can imagine someone adopting the same technique to
verifying that other access methods are consistent with their heap
relation. For example, it would be easy to do this with hash indexes.
Any other index access method where the same high-level principle that
I rely on applies can do index-to-heap verification with just a few
tweaks. I'm referring to the high-level principle that comments
specifically point out in the patch: that REINDEX leaves you with an
index structure that has exactly the same entries as the old index
structure had, though possibly with fewer dead index tuples. I like my
idea of reusing IndexBuildHeapScan() for verification. Very few new
LOCs are actually added to amcheck by this patch, and
IndexBuildHeapScan() is itself tested.
--
Peter Geoghegan
Attachments:
0002-Add-amcheck-verification-of-indexes-against-heap.patch
0001-Add-bloom-filter-data-structure-implementation.patch
On Wed, Aug 30, 2017 at 7:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I spent only a few hours writing a rough prototype, and came up with
something that does an IndexBuildHeapScan() scan following the
existing index verification steps. Its amcheck callback does an
index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
all), and tests it for membership of a bloom filter generated as part
of the main B-Tree verification phase. The IndexTuple memory is freed
immediately following hashing.

I attach a cleaned-up version of this. It has extensive documentation.
My bloom filter implementation is broken out as a separate patch,
added as core infrastructure under "lib".
Some drive-by comments on the lib patch:
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB). This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+ int v = 0;
+
+ while (target_bitset_size > 0)
+ {
+ v++;
+ target_bitset_size = target_bitset_size >> 1;
+ }
+
+ return Min(v - 1, 30);
+}
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
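The equivalence is easy to see when flsl() is written out: flsl(x) - 1 is floor(log2(x)), which is exactly what the loop in pow2_truncate() computes before the cap is applied. A portable sketch (my_flsl() is a hypothetical stand-in, since src/port only carries the int version):

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for flsl(): 1-based position of the highest set bit */
static int
my_flsl(int64_t x)
{
	int			pos = 0;

	while (x > 0)
	{
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Equivalent of the patch's pow2_truncate(): floor(log2(n)), capped at 30 */
static int
pow2_truncate(int64_t target_bitset_size)
{
	int			v = my_flsl(target_bitset_size) - 1;

	return (v < 30) ? v : 30;
}
```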
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
I see that this is used in gawk, BerkeleyDB and all over the place[2].
Nice. I understand that the point of this is to be a hash function
that is different from our usual one, for use by k_hashes(). Do you
think it belongs somewhere more common than this? It seems a bit like
our hash-related code is scattered all over the place but should be
consolidated, but I suppose that's a separate project.
Unnecessary braces here and elsewhere for single line body of for loops.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
Sorry I didn't follow up with my threat[3] to provide a central
popcount() function to replace the implementations all over the tree.
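(For what it's worth, the loop quoted above is Kernighan's bit-clearing trick, and it would factor out into a central helper trivially -- a sketch:)

```c
#include <assert.h>

/* Kernighan's trick: each iteration clears the lowest set bit */
static int
popcount_byte(unsigned char byte)
{
	int			n = 0;

	while (byte)
	{
		n++;
		byte &= (unsigned char) (byte - 1);
	}
	return n;
}
```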
[1]: /messages/by-id/25076.1489699457@sss.pgh.pa.us
[2]: http://www.cse.yorku.ca/~oz/hash.html
[3]: /messages/by-id/CAEepm=3g1_fjJGp38QGv+38BC2HHVkzUq6s69nk3mWLgPHqC3A@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Aug 30, 2017 at 8:34 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
Yes, you can complain about MSVC compilation here.
--
Michael
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Some drive-by comments on the lib patch:
I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
I'd forgotten.
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).
Ideally, we'd never use long integers, because their width is platform
dependent, and yet they are only ever used as an alternative to int
because they are wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.
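To make the win64 hazard concrete, a pow2_truncate() of the kind described (taking a bit count, using int64 throughout rather than long) might look like this. This is a hypothetical stand-in, not the patch's actual code:

```c
#include <stdint.h>

/*
 * Hypothetical pow2_truncate(): round a requested bit count down to
 * the nearest power of two.  Using int64_t sidesteps the trap that
 * "long" is only 32 bits on win64.
 */
static int64_t
pow2_truncate(int64_t target_bits)
{
    int64_t v = 1;

    while (v * 2 <= target_bits)
        v *= 2;

    return v;
}
```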
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
I see that this is used in gawk, BerkeleyDB and all over the place[2].
Nice. I understand that the point of this is to be a hash function
that is different from our usual one, for use by k_hashes().
Right. Its only job is to be a credible hash function that isn't
derivative of hash_any().
Do you
think it belongs somewhere more common than this? It seems a bit like
our hash-related code is scattered all over the place but should be
consolidated, but I suppose that's a separate project.
Unsure. In its defense, there is also a private murmurhash one-liner
within tidbitmap.c. I don't mind changing this, but it's slightly odd
to expose a hash function whose only job is to be completely unrelated
to hash_any().
Unnecessary braces here and elsewhere for single line body of for loops.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
I don't follow what you mean here.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 1:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Some drive-by comments on the lib patch:
I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.
Indeed. Thank you for working on this! To summarise a couple of
ideas that Peter and I discussed off-list a while back: (1) While
building the hash table for a hash join we could build a Bloom filter
per future batch and keep it in memory, and then while reading from
the outer plan we could skip writing tuples out to future batches if
there is no chance they'll find a match when read back in later (works
only for inner joins and only pays off in inverse proportion to the
join's selectivity); (2) We could push a Bloom filter down to scans
(many other databases do this, and at least one person has tried this
with PostgreSQL and found it to pay off[1]).
To use this for anything involving parallelism where a Bloom filter
must be shared we'd probably finish up having to create a shared
version of bloom_init() that either uses caller-provided memory and
avoids the internal pointer, or allocates DSA memory. I suppose you
could consider splitting your bloom_init() function up into
bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
getting rid of the separate pointer to bitset (ie stick char
bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?
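A two-phase interface along those lines might look like this. To be clear, this is only a sketch of the suggestion; the field names and exact signatures are hypothetical, not the patch's actual API:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct bloom_filter
{
    uint64_t      k_hash_funcs;
    uint64_t      bitset_bits;
    unsigned char bitset[];     /* flexible array member: no separate pointer */
} bloom_filter;

/* Phase 1: report how much memory the caller must supply. */
static size_t
bloom_estimate(uint64_t bitset_bits)
{
    return offsetof(bloom_filter, bitset) + bitset_bits / 8;
}

/* Phase 2: initialize caller-provided (possibly DSA/shared) memory. */
static bloom_filter *
bloom_init(void *space, uint64_t bitset_bits, uint64_t k_hash_funcs)
{
    bloom_filter *filter = (bloom_filter *) space;

    filter->k_hash_funcs = k_hash_funcs;
    filter->bitset_bits = bitset_bits;
    memset(filter->bitset, 0, bitset_bits / 8);
    return filter;
}
```

The caller would allocate bloom_estimate() bytes wherever it likes (palloc, DSA, etc.) and pass that space to bloom_init(); with the bitset inline at the end of the struct, the whole filter is relocatable.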
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
I'd forgotten.
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).
Hmm. Right.
Ideally, we'd never use long integers, because their width is platform
dependent, and yet they are only ever used as an alternative to int
because they are wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.
Agreed, "long" is terrible.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
I don't follow what you mean here.
I was just observing that there is an opportunity for code reuse here.
This function's definition would ideally be something like:
double
bloom_prop_bits_set(bloom_filter *filter)
{
return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}
This isn't an objection to the way you have it, since we don't have a
popcount function yet! We have several routines in the tree for
counting bits, though not yet the complete set from Hacker's Delight.
Your patch brings us one step closer to that goal. (The book says
that this approach is good for sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
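A centralised popcount() of the kind described might, as a rough sketch, process word-sized chunks with a compiler builtin and finish byte-at-a-time (the name and signature are assumed here; __builtin_popcountll is the GCC/Clang intrinsic, which compiles to a hardware popcnt where available):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical centralised popcount(buf, nbytes): count set bits. */
static int64_t
popcount(const unsigned char *buf, size_t nbytes)
{
    int64_t bits_set = 0;

    /* Word-sized chunks via the compiler builtin */
    while (nbytes >= sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));   /* alignment-safe load */
        bits_set += __builtin_popcountll(word);
        buf += sizeof(word);
        nbytes -= sizeof(word);
    }

    /* Remaining tail bytes one at a time */
    while (nbytes-- > 0)
        bits_set += __builtin_popcount(*buf++);

    return bits_set;
}
```

Note that this counts the same regardless of how sparse the bitset is, which is the point made above.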
+ * Test if bloom filter definitely lacks element.
I think where "Bloom filter" appears in prose it should have a capital
letter (person's name).
[1]: http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Aug 29, 2017 at 7:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Indeed. Thank you for working on this! To summarise a couple of
ideas that Peter and I discussed off-list a while back: (1) While
building the hash table for a hash join we could build a Bloom filter
per future batch and keep it in memory, and then while reading from
the outer plan we could skip writing tuples out to future batches if
there is no chance they'll find a match when read back in later (works
only for inner joins and only pays off in inverse proportion to the
join's selectivity);
Right. Hash joins do tend to be very selective, though, so I think
that this could help rather a lot. With just 8 or 10 bits per element,
you can eliminate almost all the batch write-outs on the outer side.
No per-worker synchronization for BufFiles is needed when this
happens, either. It seems like that could be very important.
To use this for anything involving parallelism where a Bloom filter
must be shared we'd probably finish up having to create a shared
version of bloom_init() that either uses caller-provided memory and
avoids the internal pointer, or allocates DSA memory. I suppose you
could consider splitting your bloom_init() function up into
bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
getting rid of the separate pointer to bitset (ie stick char
bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?
Makes sense. Not hard to add that.
I was just observing that there is an opportunity for code reuse here.
This function's definition would ideally be something like:
double
bloom_prop_bits_set(bloom_filter *filter)
{
return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}
This isn't an objection to the way you have it, since we don't have a
popcount function yet! We have several routines in the tree for
counting bits, though not yet the complete set from Hacker's Delight.
Right. I'm also reminded of the lookup tables for the visibility/freeze map.
Your patch brings us one step closer to that goal. (The book says
that this approach is good for sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
copy about 2 years ago!). pop()/popcount() does seem like a clever
algorithm, that we should probably think about adopting in some cases,
but I should point out that the current caller to my
bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
at all performance critical.
+ * Test if bloom filter definitely lacks element.
I think where "Bloom filter" appears in prose it should have a capital
letter (person's name).
Agreed.
--
Peter Geoghegan
Peter Geoghegan wrote:
Your patch brings us one step closer to that goal. (The book says
that this approach is good for sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
copy about 2 years ago!). pop()/popcount() does seem like a clever
algorithm, that we should probably think about adopting in some cases,
but I should point out that the current caller to my
bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
at all performance critical.
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
I should do that, but it's still not really noticeable.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
I should do that, but it's still not really noticeable.
Since this patch has now bit-rotted, I attach a new revision, V2.
Apart from fixing some Makefile bitrot, this revision also makes some
small tweaks as suggested by Thomas and Alvaro. The documentation is
also revised and expanded, and now discusses practical aspects of the
set membership being tested using a Bloom filter, how that relates to
maintenance_work_mem, and so on.
Note that this revision does not let the Bloom filter caller use their
own dynamic shared memory, which is something that Thomas asked about.
While that could easily be added, I think it should happen later. I
really just wanted to make sure that my Bloom filter was not in some
way fundamentally incompatible with Thomas' planned enhancements to
(parallel) hash join.
--
Peter Geoghegan