A design for amcheck heapam verification
It seems like a good next step for amcheck would be to add
functionality that verifies that heap tuples have matching index
tuples, and that heap pages are generally sane. I've been thinking
about a design for this for a while now, and would like to present
some tentative ideas before I start writing code.
Using a bloom filter built with hashed index tuples, and then matching
that against the heap is an approach to index/heap consistency
checking that has lots of upsides, but at least one downside. The
downside is pretty obvious: we introduce a low though quantifiable
risk of false negatives (failure to detect corruption that consists of
the absence of a needed index tuple). That's not great,
especially because it occurs non-deterministically, but the upsides to
this approach are considerable. Perhaps it's worth it.
Here is the general approach that I have in mind right now:
* Size the bloom filter based on the pg_class.reltuples of the index
to be verified, weighing a practical tolerance for false negatives,
and capping the allocation at work_mem (see the sizing sketch further
down).
- We might throw an error immediately if it's impossible to get a
reasonably low probability of false negatives -- for some value of
"reasonable".
* Perform existing verification checks for a B-Tree. While scanning
the index, hash index tuples, including their heap TID pointer. Build
the bloom filter with hashed values as we go.
As we scan the heap, we:
* Verify that HOT safety was correctly assessed with respect to the
index (or indexes) being verified.
* Test the visibility map, and sanity check MultiXacts [1].
* Perform the index/heap match check (probing the bloom filter):
If a heap tuple meets the following conditions:
- Is not a HOT update tuple.
- Is committed, and committed before RecentGlobalXmin.
- Satisfies the index predicate (in the partial index case).
Then:
- Build a would-be index tuple value, perhaps reusing CREATE INDEX code.
- Hash that in the same manner as in the original index pass.
- Raise an error if the bloom filter indicates it is not present in the index.
Seems like we should definitely go to the index first, because the
heap is where we'll find visibility information.
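
To make the sizing step concrete, here is a rough standalone sketch of
the arithmetic I have in mind (illustration only, not proposed code; the
64MB constant is just a stand-in for work_mem). It applies the textbook
formulas and then caps the bitset at the memory budget:

#include <math.h>
#include <stdio.h>

/*
 * Sizing sketch (illustration only).  Given an estimate of the number of
 * index tuples (reltuples) and a memory budget standing in for work_mem,
 * choose a bitset size and number of hash functions, and report the false
 * positive rate we should expect -- which, for verification purposes, is
 * the probability of failing to detect a missing index tuple.
 */
static double
false_positive_rate(double nelems, double nbits, int k)
{
    /* p = (1 - e^(-k * n / m))^k */
    return pow(1.0 - exp(-k * nelems / nbits), k);
}

int
main(void)
{
    double  reltuples = 30e6;                   /* estimated index tuples */
    double  cap_bits = 64.0 * 1024 * 1024 * 8;  /* 64MB budget, in bits */
    double  target_p = 0.02;                    /* tolerated miss probability */

    /* m = -n * ln(p) / (ln 2)^2, before applying the memory cap */
    double  want_bits = -reltuples * log(target_p) / (log(2.0) * log(2.0));
    double  nbits = want_bits < cap_bits ? want_bits : cap_bits;

    /* k = (m / n) * ln 2, clamped to at least 1 */
    int     k = (int) round(nbits / reltuples * log(2.0));

    if (k < 1)
        k = 1;

    printf("bits: %.0f, hash functions: %d, expected miss rate: %.4f\n",
           nbits, k, false_positive_rate(reltuples, nbits, k));
    return 0;
}

With reltuples = 30 million and a 64MB cap, this works out to roughly 244
million bits (about 30MB), k = 6, and an expected miss rate of about 2%;
throwing an error when the capped rate comes out unacceptably high would
be the "reasonable" test mentioned above.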
If I wanted to go about implementing the same index/heap checks in a
way that does not have the notable downside around false negatives, I
suppose I'd have to invent a new, internal mechanism that performed a
kind of special merge join of an index against the heap. That would be
complicated, and require a lot of code; in other words, it would be
bug-prone. I wouldn't say that amcheck is simple today, but at least
the complexity owes everything to how B-Trees already work, as opposed
to how a bunch of custom infrastructure we had to invent works. The
amount of code in amcheck's verify_nbtree.c file is pretty low, and
that's a good goal to stick with. The very detailed comment that
argues for the correctness of amcheck's bt_right_page_check_scankey()
function is, to a large degree, also arguing for the correctness of a
bunch of code within nbtree. amcheck verification should be
comprehensive, but also simple and minimal, which IMV is a good idea
for about the same reason that it's a good idea when writing unit
tests.
The merge join style approach would also make verification quite
expensive, particularly when one table has multiple indexes. A tool
whose overhead can only really be justified when we're pretty sure
that there is already a problem is *significantly* less useful. And,
it ties verification to the executor, which could become a problem if
we make the functionality into a library that is usable by backup
tools that don't want to go through the buffer manager (or any SQL
interface).
Apart from the low memory overhead of using a bloom filter, resource
management is itself made a lot easier. We won't be sensitive to
misestimations, because we only need an estimate of the number of
tuples in the index, which will tend to be accurate enough in the vast
majority of cases. reltuples is needed to size the bloom filter bitmap
up-front. It doesn't matter how wide individual index tuples turn out
to be, because we simply hash everything, including even the heap TID
contained within the index.
Using a bloom filter makes verification "stackable" in a way that
might become important later. For example, we can later improve
amcheck to verify a table in parallel, by having parallel workers each
verify one index, with bloom filters built in fixed-size shared
memory buffers. A parallel heap scan then has workers concurrently
verify heap pages, and concurrently probe each final bloom filter.
Similarly, everything works the same if we add the option of scanning
a B-Tree in physical order (at the expense of not doing cross-page
verification). And, while I'd start with nbtree, you can still pretty
easily generalize the approach to building the bloom filter across
AMs. All index AMs other than BRIN have index tuples that are
essentially some values that are either the original heap values
themselves, or values that are a function of those original values,
plus a heap TID pointer. So, GIN might compress the TIDs, and
deduplicate the values, and SP-GiST might use its own compression, but
it's pretty easy to build a conditioned IndexTuple binary string that
normalizes away these differences, which is what we actually hash.
When WARM introduces the idea of a RED or BLUE tuple, it may be
possible to add that to the conditioning logic.
(BTW, the more advanced requirements are why I think that at least
some parts of amcheck should eventually end up in core -- most of
these are a long way off, but seem worth thinking about now.)
We can add a random seed value to the mix when hashing, so that any
problem that is masked by hash collisions won't stay masked on repeat
verification attempts. Errors from corruption display this value,
which lets users reliably recreate the original error by explicitly
setting the seed the next time around -- say, when they need to verify
with total confidence that a particular issue has been fixed within a
very large table in production.
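
As a toy illustration of that protocol (nothing here is real PostgreSQL
code -- hash_with_seed() and the error report are made up), the point is
just that the seed feeds into every hash and gets echoed in any
corruption report, so the exact same mapping can be reproduced on demand:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy stand-in for a seeded hash function (not a real server function) */
static uint32_t
hash_with_seed(uint32_t x, uint32_t seed)
{
    x ^= seed;
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    return x;
}

int
main(int argc, char **argv)
{
    /* reuse a previously reported seed if given, else pick a fresh one */
    uint32_t    seed = (argc > 1) ? (uint32_t) strtoul(argv[1], NULL, 10)
                                  : (uint32_t) time(NULL);
    uint32_t    h = hash_with_seed(42, seed);

    /* a real corruption error would carry the seed in its message */
    fprintf(stderr,
            "corruption suspected (verification seed: %u, hash: %u)\n",
            seed, h);
    return 0;
}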
I'd like to hear feedback on the general idea, and what the
user-visible interface ought to look like. The non-deterministic false
negatives may need to be considered by the user-visible interface,
which is the main reason I mention it.
[1]: postgr.es/m/20161017014605.GA1220186@tornado.leadboat.com
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Fri, Apr 28, 2017 at 9:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I'd like to hear feedback on the general idea, and what the
user-visible interface ought to look like. The non-deterministic false
negatives may need to be considered by the user-visible interface,
which is the main reason I mention it.
Bloom filters are one of those things that come up on this mailing
list incredibly frequently but rarely get used in committed code; thus
far, contrib/bloom is the only example we've got, and not for lack of
other proposals. One problem is that Bloom filters assume you can get
n independent hash functions for a given value, which we have not got.
That problem would need to be solved somehow. If you only have one
hash function, the size of the required bloom filter probably gets
very large.
When hashing index and heap tuples, do you propose to include the heap
TID in the data getting hashed? I think that would be a good idea,
because otherwise you're only verifying that every heap tuple has an
index pointer pointing at something, not that every heap tuple has an
index tuple pointing at the right thing.
I wonder if it's also worth having a zero-error mode, even if it runs
for a long time. Scan the heap, and probe the index for the value
computed from each heap tuple. Maybe that's so awful that nobody
would ever use it, but I'm not sure. It might actually be simpler to
implement than what you have in mind.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 1, 2017 at 12:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Bloom filters are one of those things that come up on this mailing
list incredibly frequently but rarely get used in committed code; thus
far, contrib/bloom is the only example we've got, and not for lack of
other proposals.
They certainly are a fashionable data structure, but it's not as if
they're a new idea. The math behind them is very well understood. They
solve one narrow class of problem very well.
One problem is that Bloom filters assume you can get
n independent hash functions for a given value, which we have not got.
That problem would need to be solved somehow. If you only have one
hash function, the size of the required bloom filter probably gets
very large.
I don't think that that's a problem, because you really only need 2
hash functions [1], which we have already (recall that Andres added
Murmur hash to Postgres 10). It's an area that I'd certainly need to
do more research on if I'm to go forward with bloom filters, but I'm
pretty confident that there is a robust solution to the practical
problem of not having arbitrarily many hash functions at hand. (I think
that you rarely need all that many independent hash functions, in any
case.)
It isn't that hard to evaluate whether or not an implementation has
things right, at least for a variety of typical cases. We know how to
objectively evaluate a hash function while making only some pretty
mild assumptions.
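
To show what I mean, here is a standalone sketch of that double hashing
scheme. The two base hash functions used here (FNV-1a and sdbm) are just
stand-ins for whatever the server actually provides; the point is only
that the i-th probe position is derived as h1(x) + i * h2(x):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* First stand-in base hash: FNV-1a */
static uint32_t
fnv1a(const unsigned char *elem, size_t len)
{
    uint32_t    h = 2166136261u;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        h ^= elem[i];
        h *= 16777619u;
    }
    return h;
}

/* Second stand-in base hash: sdbm */
static uint32_t
sdbm(const unsigned char *elem, size_t len)
{
    uint32_t    h = 0;
    size_t      i;

    for (i = 0; i < len; i++)
        h = elem[i] + (h << 6) + (h << 16) - h;
    return h;
}

/* Derive k bitset positions for elem; positions[] must have room for k */
static void
k_hashes(uint32_t *positions, int k, uint64_t nbits,
         const unsigned char *elem, size_t len)
{
    uint32_t    h1 = fnv1a(elem, len);
    uint32_t    h2 = sdbm(elem, len);
    int         i;

    for (i = 0; i < k; i++)
        positions[i] = (uint32_t) ((h1 + (uint64_t) i * h2) % nbits);
}

int
main(void)
{
    uint32_t    pos[5];
    int         i;

    k_hashes(pos, 5, 8 * 1024 * 1024,
             (const unsigned char *) "index tuple", 11);
    for (i = 0; i < 5; i++)
        printf("probe %d -> bit %u\n", i, pos[i]);
    return 0;
}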
When hashing index and heap tuples, do you propose to include the heap
TID in the data getting hashed? I think that would be a good idea,
because otherwise you're only verifying that every heap tuple has an
index pointer pointing at something, not that every heap tuple has an
index tuple pointing at the right thing.
Yes -- I definitely want to hash the heap TID from each IndexTuple.
I wonder if it's also worth having a zero-error mode, even if it runs
for a long time. Scan the heap, and probe the index for the value
computed from each heap tuple. Maybe that's so awful that nobody
would ever use it, but I'm not sure. It might actually be simpler to
implement than what you have in mind.
It's easy if you don't mind that the implementation will be an ad-hoc
nested loop join. I guess I could do that too, if only because it
won't be that hard, and that's really what you want when you know you
have corruption. Performance will probably be prohibitively poor when
verification needs to be run in any kind of routine way, which is a
problem if that's the only way it can work. My sense is that
verification needs to be reasonably low overhead, and it needs to
perform pretty consistently, even if you only use it for
stress-testing new features.
To reiterate, another thing that makes a bloom filter attractive is
how it simplifies resource management relative to an approach
involving sorting or a hash table. There are a bunch of edge cases
that I don't have to worry about around resource management (e.g., a
subset of very wide outlier IndexTuples, or two indexes that are of
very different sizes associated with the same table that need to
receive an even share of memory).
As I said, even if I was totally willing to duplicate the effort that
went into respecting work_mem as a budget within places like
tuplesort.c, having as little infrastructure code as possible is a
specific goal for amcheck.
[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Fri, Apr 28, 2017 at 6:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
- Is committed, and committed before RecentGlobalXmin.
Actually, I guess amcheck would need to use its own scan's snapshot
xmin instead. This is true because it cares about visibility in a way
that's "backwards" relative to existing code that tests something
against RecentGlobalXmin. Is there any existing thing that works that
way?
If it's not clear what I mean: existing code that cares about
RecentGlobalXmin is using it as a *conservative* point before which
every snapshot sees every transaction as committed/aborted (and
therefore nobody can care if that other backend hot prunes dead tuples
from before then, or whatever it is). Whereas, amcheck needs to care
about the possibility that *anyone else* decided that pruning or
whatever is okay, based on generic criteria, and not what amcheck
happened to see as RecentGlobalXmin during snapshot acquisition.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On 1 May 2017 at 20:46, Robert Haas <robertmhaas@gmail.com> wrote:
One problem is that Bloom filters assume you can get
n independent hash functions for a given value, which we have not got.
That problem would need to be solved somehow. If you only have one
hash function, the size of the required bloom filter probably gets
very large.
There's a simple formula to calculate the optimal number of hash
functions and size of the filter given a target false positive rate.
But I don't think this is as big of a problem as you imagine.
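For instance, plugging into the standard formulas m = -n * ln(p) / (ln 2)^2
and k = (m / n) * ln 2: for n = 10 million elements and a 1% target false
positive rate, you need roughly 96 million bits (about 12MB) and k = 7
hash functions.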
a) We don't really have only one hash function; we have a 32-bit hash
function, and we could expand that to a larger bit size if we wanted.
Bloom filters are never 2^32-bit arrays, for obvious reasons. If you
have a 1kbit-sized bloom filter, each index into it only needs 10 bits,
so a single 32-bit hash already gives you three fully independent hash
functions. If we changed to a 64-bit or 128-bit hash function, you
could have enough bits available for a larger set of hash functions and
a larger array.
b) You can get a poor man's universal hash out of hash_any or hash_int
by just tweaking the input value in a way that doesn't interact in a
simple way with the hash function. Even something as simple as XORing
it with a random number (i.e., a vector of random numbers that identify
your randomly chosen distinct "hash functions") seems to work fine.
However, for future-proofing and security hardening, I think Postgres
should really implement a mathematically rigorous universal hashing
scheme that provides a family of hash functions to pick from randomly.
That prevents users from being able to generate data that intentionally
performs poorly in hash data structures, for example. But it also means
you have a whole family of hash functions to pick from for bloom
filters.
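
A toy example of both points (hash32() below is just a stand-in mixer,
not hash_any()): slice one 32-bit hash into three 10-bit positions for a
1kbit filter, or XOR the key with per-"function" random salts before
hashing:

#include <stdint.h>
#include <stdio.h>

/* Stand-in 32-bit mixer (not hash_any/hash_int) */
static uint32_t
hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

int
main(void)
{
    uint32_t    key = 123456;
    uint32_t    h = hash32(key);

    /* (a) slice one 32-bit hash into three 10-bit positions (1kbit filter) */
    uint32_t    sliced[3] = {h & 0x3ff, (h >> 10) & 0x3ff, (h >> 20) & 0x3ff};

    /* (b) or salt the input: each random salt picks a distinct "function" */
    uint32_t    salts[3] = {0x9e3779b9u, 0x7f4a7c15u, 0x94d049bbu};
    int         i;

    for (i = 0; i < 3; i++)
        printf("sliced: %u  salted: %u\n",
               sliced[i], hash32(key ^ salts[i]) & 0x3ff);
    return 0;
}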
--
greg
On Mon, May 1, 2017 at 2:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
Actually, I guess amcheck would need to use its own scan's snapshot
xmin instead. This is true because it cares about visibility in a way
that's "backwards" relative to existing code that tests something
against RecentGlobalXmin. Is there any existing thing that works that
way?
Looks like pg_visibility has a similar set of concerns, and so
sometimes calls GetOldestXmin() to "recompute" what it calls
OldestXmin (which I gather is like RecentGlobalXmin, but comes from
calling GetOldestXmin() at least once). This happens within
pg_visibility's collect_corrupt_items(). So, I could either follow
that approach, or, more conservatively, call GetOldestXmin()
immediately after each "amcheck whole index scan" finishes, for use
later on, when we go to the heap. Within the heap, we expect that any
committed tuple whose xmin precedes FooIndex.OldestXmin should be
present in that index's bloom filter. Of course, when there are
multiple indexes, we might only arrive at the heap much later. (I
guess we'd also want to check whether the MVCC snapshot's xmin precedes
FooIndex.OldestXmin, and use the snapshot xmin as FooIndex.OldestXmin
when that happens to be the case.)
Anyone have an opinion on any of this? Offhand, I think that calling
GetOldestXmin() once per index when its "amcheck whole index scan"
finishes would be safe, and yet provide appreciably better test
coverage than only expecting things visible to our original MVCC
snapshot to be present in the index. I don't see a great reason to be
more aggressive and call GetOldestXmin() more often than once per
whole index scan, though.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Mon, May 1, 2017 at 4:28 PM, Peter Geoghegan <pg@bowt.ie> wrote:
Anyone have an opinion on any of this? Offhand, I think that calling
GetOldestXmin() once per index when its "amcheck whole index scan"
finishes would be safe, and yet provide appreciably better test
coverage than only expecting things visible to our original MVCC
snapshot to be present in the index. I don't see a great reason to be
more aggressive and call GetOldestXmin() more often than once per
whole index scan, though.
Wait, that's wrong, because in general RecentGlobalXmin may advance at
any time as new snapshots are acquired by other backends. The only
thing that we know for sure is that our MVCC snapshot is an interlock
against things being recycled that the snapshot needs to see
(according to MVCC semantics). And, we don't just have heap pruning to
worry about -- we also have nbtree's LP_DEAD based recycling to worry
about, before and during the amcheck full index scan (actually, this
is probably the main source of problematic recycling for our
verification protocol).
So, I think that we could call GetOldestXmin() once, provided we were
willing to recheck in the style of pg_visibility if and when there was
an apparent violation that might be explained as caused by concurrent
LP_DEAD recycling within nbtree. That seems complicated enough that
I'll never be able to convince myself that it's worthwhile before
actually trying to write the code.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
Peter Geoghegan <pg@bowt.ie> writes:
If it's not clear what I mean: existing code that cares about
RecentGlobalXmin is using it as a *conservative* point before which
every snapshot sees every transaction as committed/aborted (and
therefore nobody can care if that other backend hot prunes dead tuples
from before then, or whatever it is). Whereas, amcheck needs to care
about the possibility that *anyone else* decided that pruning or
whatever is okay, based on generic criteria, and not what amcheck
happened to see as RecentGlobalXmin during snapshot acquisition.
ISTM if you want to do that you have an inherent race condition.
That is, no matter what you do, the moment after you look the currently
oldest open transaction could commit, allowing some other session's
view of RecentGlobalXmin to move past what you think it is, so that
that session could start pruning stuff.
Maybe you can fix this by assuming that your own session's advertised xmin
is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
sure if that rule does what you want.
regards, tom lane
On Mon, May 1, 2017 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe you can fix this by assuming that your own session's advertised xmin
is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
sure if that rule does what you want.
That's what you might ultimately need to fall back on (that, or
perhaps repeated calls to GetOldestXmin() to recheck, in the style of
pg_visibility). It's useful to do rechecking, rather than just
starting with the MVCC snapshot's xmin because you might be able to
determine that the absence of some index tuple in the index (by which
I mean its bloom filter) *still* cannot be explained by concurrent
recycling. The conclusion that there is a real problem might never
have been reached without this extra complexity.
I'm not saying that it's worthwhile to add this complexity, rather
than just starting with the MVCC snapshot's xmin in the first place --
I really don't have an opinion either way just yet.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Mon, May 1, 2017 at 9:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
ISTM if you want to do that you have an inherent race condition.
That is, no matter what you do, the moment after you look the currently
oldest open transaction could commit, allowing some other session's
view of RecentGlobalXmin to move past what you think it is, so that
that session could start pruning stuff.
It can't prune the stuff we care about if we've got a shared content
lock on the target buffer. That's the trick pg_visibility uses:
/*
 * Time has passed since we computed OldestXmin, so it's
 * possible that this tuple is all-visible in reality even
 * though it doesn't appear so based on our
 * previously-computed value.  Let's compute a new value so we
 * can be certain whether there is a problem.
 *
 * From a concurrency point of view, it sort of sucks to
 * retake ProcArrayLock here while we're holding the buffer
 * exclusively locked, but it should be safe against
 * deadlocks, because surely GetOldestXmin() should never take
 * a buffer lock.  And this shouldn't happen often, so it's
 * worth being careful so as to avoid false positives.
 */
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 1, 2017 at 6:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, May 1, 2017 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe you can fix this by assuming that your own session's advertised xmin
is a safe upper bound on everybody else's RecentGlobalXmin. But I'm not
sure if that rule does what you want.

That's what you might ultimately need to fall back on (that, or
perhaps repeated calls to GetOldestXmin() to recheck, in the style of
pg_visibility).
I spent only a few hours writing a rough prototype, and came up with
something that does an IndexBuildHeapScan() scan following the
existing index verification steps. Its amcheck callback does an
index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
all), and tests it for membership of a bloom filter generated as part
of the main B-Tree verification phase. The IndexTuple memory is freed
immediately following hashing.
The general idea here is that whatever IndexTuples ought to be in the
index following a fresh REINDEX also ought to have been found in the
index already. IndexBuildHeapScan() takes care of almost all of the
details for us.
I think I can do this correctly when only an AccessShareLock is
acquired on heap + index, provided I also do a separate
GetOldestXmin() before even the index scan begins, and do a final
recheck of the saved GetOldestXmin() value against heap tuple xmin
within the new IndexBuildHeapScan() callback (if we still think that
it should have been found by the index scan, then actually throw a
corruption related error). When there is only a ShareLock (for
bt_parent_index_check() calls), the recheck isn't necessary. I think I
should probably also make the IndexBuildHeapScan()-passed indexInfo
structure "ii_Unique = false", since waiting for the outcome of a
concurrent conflicting unique index insertion isn't useful, and can
cause deadlocks.
While I haven't really made my mind up, this design is extremely
simple, and effectively tests IndexBuildHeapScan() at the same time as
everything else. The addition of the bloom filter itself isn't
trivial, but the code added to verify_nbtree.c is.
The downside of going this way is that I cannot piggyback other types
of heap verification on the IndexBuildHeapScan() scan. Still, perhaps
it's worth it. Perhaps I should implement this bloom-filter-index-heap
verification step as one extra option for the existing B-Tree
verification functions. I may later add new verification functions
that examine and verify the heap and related SLRUs alone.
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I spent only a few hours writing a rough prototype, and came up with
something that does an IndexBuildHeapScan() scan following the
existing index verification steps. Its amcheck callback does an
index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
all), and tests it for membership of a bloom filter generated as part
of the main B-Tree verification phase. The IndexTuple memory is freed
immediately following hashing.
I attach a cleaned-up version of this. It has extensive documentation.
My bloom filter implementation is broken out as a separate patch,
added as core infrastructure under "lib".
I do have some outstanding concerns about V1 of the patch series:
* I'm still uncertain about the question of using IndexBuildHeapScan()
during Hot Standby. It seems safe, since we're only using the
CONCURRENTLY/AccessShareLock path when this happens, but someone might
find that objectionable on general principle. For now, in this first
version, it remains possible to call IndexBuildHeapScan() during Hot
Standby, to allow the new verification to work there.
* The bloom filter has been experimentally verified, and is based on
source material which is directly cited. It would nevertheless be
useful to have the hashing stuff scrutinized, because it's possible
that I've overlooked some subtlety.
This is only the beginning for heapam verification. Comprehensive
coverage can be added later, within routines that specifically target
some table, not some index.
While this patch series only adds index-to-heap verification for
B-Tree indexes, I can imagine someone adapting the same technique to
verify that other access methods are consistent with their heap
relation. For example, it would be easy to do this with hash indexes.
Any other index access method where the same high-level principle that
I rely on applies can do index-to-heap verification with just a few
tweaks. I'm referring to the high-level principle that comments
specifically point out in the patch: that REINDEX leaves you with an
index structure that has exactly the same entries as the old index
structure had, though possibly with fewer dead index tuples. I like my
idea of reusing IndexBuildHeapScan() for verification. Very few new
LOCs are actually added to amcheck by this patch, and
IndexBuildHeapScan() is itself tested.
--
Peter Geoghegan
Attachments:
0002-Add-amcheck-verification-of-indexes-against-heap.patch (text/x-patch)
From 48499cfb58b7bf705e93fb12cc5359ec12cd9c51 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
bloom filter is populated during the initial index verification scan.
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 ++++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 236 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 103 +++++++++++---
7 files changed, 345 insertions(+), 49 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9ae83dc..2dcb3d2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the bloom filter.
+ *
*
* Copyright (c) 2017, PostgreSQL Global Development Group
*
@@ -18,13 +23,16 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"
@@ -53,10 +61,15 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* verifying heap has no unindexed tuples? */
+ bool heapallindexed;
+ /* Oldest xmin before index examined (for !readonly + heapallindexed calls) */
+ TransactionId oldestxmin;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +85,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +114,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +143,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +192,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -205,7 +242,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
btree_index_checkable(indrel);
/* Check index */
- bt_check_every_level(indrel, parentcheck);
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +290,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,7 +312,8 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
@@ -291,7 +332,34 @@ bt_check_every_level(Relation rel, bool readonly)
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+ state->oldestxmin = InvalidTransactionId;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /*
+ * When only AccessShareLock held on heap, get oldestxmin before index
+ * is first accessed. Used for later visibility rechecks, within
+ * bt_tuple_present_callback().
+ */
+ if (!state->readonly)
+ state->oldestxmin = GetOldestXmin(state->heaprel,
+ PROCARRAY_FLAGS_VACUUM);
+
+ /* Size bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create bloom filter to fingerprint index */
+ state->filter = bloom_init(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -347,6 +415,40 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo;
+
+ elog(DEBUG1, "verifying presence of required tuples in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ indexinfo = BuildIndexInfo(state->rel);
+
+ /*
+ * Since we're not actually indexing, don't enforce uniqueness/wait for
+ * concurrent insertion to finish, even with unique indexes.
+ *
+ * Force use of MVCC snapshot (reuse CONCURRENTLY infrastructure) when
+ * only an AccessShareLock held. It seems like a good idea to not
+ * diverge from expected heap lock strength in all cases. This is
+ * needed to prevent unhelpful WARNINGs due to concurrent insertions
+ * that IndexBuildHeapScan() does not expect.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_Concurrent = !state->readonly;
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state);
+
+ elog(DEBUG1, "finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+ state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+ RelationGetRelationName(heaprel));
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -499,7 +601,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -546,6 +648,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers build their bloom filter for later
+ * verification that index had all heap tuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -589,6 +694,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* When verifying heap, record leaf items in bloom filter */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -682,8 +792,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1062,6 +1174,96 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all needed entries using bloom filter probes.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption in index, and especially in heap. The high
+ * level principle behind verification performed here is that any index tuples
+ * that should be in the index following a REINDEX should also have been there
+ * all along. This must be true because a REINDEX rebuilds the index in order
+ * to effectively remove bloat. There might be dead index tuple entries in the
+ * bloom filter, because of the lack of reliable visibility information in
+ * index structures, but that hardly matters since we're concerned about the
+ * possible absence of needed tuples. In other words, a fresh REINDEX should
+ * never affect the representation of any IndexTuple, because these are
+ * immutable for as long as heap tuple is visible to any possible snapshot
+ * (while the LP_DEAD bit is mutable, that's ItemId metadata, which we don't
+ * directly fingerprint).
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for invariant not holding is a corrupt heap page (could
+ * be logical or physical corruption), which is why heap is blamed here. Heap
+ * corruption is not always the problem, though. Only readonly callers will
+ * have verified that left links and right links are in agreement, and so it's
+ * possible that a leaf page transposition within index is actually the source
+ * of corruption detected here (for !readonly callers), in which case the
+ * user-visible diagnostic message is misleading. The checks performed only
+ * for readonly callers might more accurately frame the problem as a bogus leaf
+ * page transposition, or a cross-page invariant not holding due to recovery
+ * not replaying all WAL records. That's why the !readonly ERROR message
+ * raised here includes a HINT about trying the other variant out.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /* Must recheck visibility when only AccessShareLock held */
+ if (!state->readonly)
+ {
+ TransactionId xmin;
+
+ /*
+ * Don't test for presence in index where xmin not at least old enough
+ * that we know for sure that absence of index tuple wasn't just due to
+ * some transaction performing insertion after our verifying index
+ * traversal began. (Actually, the cut-off is based on the point
+ * before which any possible inserting transaction must have
+ * committed/aborted.)
+ *
+ * You might think that the fact that an MVCC snapshot is used by the
+ * heap scan (due to indicating that this is the first scan of a CREATE
+ * INDEX CONCURRENTLY index build) would make this test redundant.
+ * That's not quite true, because with current IndexBuildHeapScan()
+ * interface caller cannot do the MVCC snapshot acquisition itself. In
+ * this way, heap tuple coverage is similar to the coverage we could
+ * get by acquiring the MVCC snapshot ourselves at the point where
+ * GetOldestXmin() is currently called. It's easier to do this than to
+ * adopt the IndexBuildHeapScan() interface to our narrow requirements.
+ */
+ xmin = HeapTupleHeaderGetXmin(htup->t_data);
+ if (!TransactionIdPrecedes(xmin, state->oldestxmin))
+ return;
+ }
+
+ /* generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* probe bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("table \"%s\" lacks matching index tuple in index \"%s\" for tid (%u,%u)",
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel),
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid))),
+ !state->readonly ?
+ errhint("Calling bt_index_parent_check() against target \"%s\" may further isolate the inconsistency",
+ RelationGetRelationName(state->rel)) : 0 ));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index dd71dbd..3acf46e 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,7 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => false)
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,11 +83,12 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
- <function>bt_index_check</function> for every index in the
- database where verification is supported.
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples is not requested or performed.
+ Since no error is raised, all indexes tested appear to be
+ logically consistent. Naturally, this query could easily be
+ changed to call <function>bt_index_check</function> for every
+ index in the database where verification is supported.
</para>
<para>
<function>bt_index_check</function> acquires an <literal>AccessShareLock</>
@@ -95,8 +96,9 @@ ORDER BY c.relpages DESC LIMIT 10;
is the same lock mode acquired on relations by simple
<literal>SELECT</> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples in the index when
+ <parameter>heapallindexed</> is <literal>true</>. When a
routine, lightweight test for corruption is required in a live
production environment, using
<function>bt_index_check</function> often provides the best
@@ -108,7 +110,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,19 +119,22 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
- <function>bt_index_parent_check</function> can be thought of as
- a more thorough variant of <function>bt_index_check</function>:
- unlike <function>bt_index_check</function>,
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</> argument is
+ <literal>true</>, the function verifies the presence of all heap
+ tuples that should be found within the index. The checks
+ performed by <function>bt_index_parent_check</function> are a
+ superset of the checks performed by
+ <function>bt_index_check</function> when called with the same
+ options. <function>bt_index_parent_check</function> can be
+ thought of as a more thorough variant of
+ <function>bt_index_check</function>: unlike
+ <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
A <literal>ShareLock</> is required on the target index by
@@ -159,6 +164,41 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</> verification</title>
+ <para>
+ When the <parameter>heapallindexed</> argument to amcheck functions
+ is <literal>true</>, an additional phase of verification is
+ performed against the heap relation associated with the target index
+ relation. A dummy <command>CREATE INDEX</> operation checks for the
+ presence of all would-be new index tuples against a summarizing
+ structure that is built during the first, standard phase. The high
+ level principle behind this verification is that any existing index
+ should have the same entries as an equivalent new index would have.
+ <parameter>heapallindexed</> verification generally makes
+ verification take significantly longer, but does not change the
+ locking requirements.
+ </para>
+ <para>
+ The summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect the
+ absence of any particular index tuple, approximately 2 bytes of
+ memory are needed per index tuple per function call/target index.
+ This is considered an acceptable trade-off, since it limits the
+ overhead of verification while only slightly reducing the
+ probability of detecting a problem, especially over time, in
+ installations where verification is treated as a routine maintenance
+ task. In many applications, even the default
+ <varname>maintenance_work_mem</varname> setting of <literal>64MB</>
+ will be sufficient to have less than a 2% probability of overlooking
+ any single absent or corrupt tuple. This will be the case when
+ there are no indexes with more than about 30 million distinct index
+ tuples.
+ </para>
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</> effectively</title>
<para>
@@ -199,6 +239,18 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</> verification
+ is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be very
+ subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
underlying <productname>PostgreSQL</> access method code or sort
code.
@@ -242,6 +294,13 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</> is <literal>true</>, and heap
+ verification is performed, there is generally a greatly increased
+ chance of detecting single-bit errors, since strict binary
+ equality is tested, and the indexed attributes within the heap
+ are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</> can only prove the presence of
--
2.7.4
0001-Add-bloom-filter-data-structure-implementation.patch (text/x-patch)
From 59237a5f6f1303a5939e1900a30e58fa884a2d5f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 297 ++++++++++++++++++++++++++++++++++++++++++
src/include/lib/bloomfilter.h | 26 ++++
4 files changed, 327 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index f222c6c..3da4a0d 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o hyperloglog.o ilist.o knapsack.o \
- pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o hyperloglog.o ilist.o \
+ knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..b1e13a6
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,297 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Minimal bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+#include "utils/memutils.h"
+
+#define MAX_HASH_FUNCS 10
+#define NBITS(filt) ((1 << (filt)->bloom_power))
+
+typedef struct bloom_filter
+{
+ /* 2 ^ bloom_power is the size of the bitset (in bits) */
+ int bloom_power;
+ unsigned char *bitset;
+
+ /* K hash functions are used, which are randomly seeded */
+ int k_hash_funcs;
+ uint32 seed;
+} bloom_filter;
+
+static int pow2_truncate(int64 target_bitset_size);
+static int optimal_k(int64 bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ Size len);
+static uint32 sdbmhash(unsigned char *elem, Size len);
+
+/*
+ * Initialize bloom filter. This should get a false positive rate of between
+ * 1% and 2% when its bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * work_mem is sized in KB, in line with the general convention.
+ *
+ * The bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 2 ^ 30 bits, or 128MB. The implementation is therefore well
+ * suited to data synchronization problems between unordered sets, where
+ * predictable performance is more important than worst case guarantees around
+ * false positives. Another problem that the implementation is well suited for
+ * is cache filtering where good performance already relies upon having a
+ * relatively small and/or low cardinality set of things that are interesting
+ * (with perhaps many more uninteresting things that never populate the
+ * filter).
+ */
+bloom_filter *
+bloom_init(int64 total_elems, int work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int64 bitset_bytes;
+ int64 bitset_bits;
+
+ filter = palloc(sizeof(bloom_filter));
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(total_elems * 2, MaxAllocSize);
+ bitset_bytes = Min(work_mem * 1024L, bitset_bytes);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ filter->bloom_power = pow2_truncate(bitset_bytes * BITS_PER_BYTE);
+ bitset_bits = NBITS(filter);
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+ filter->bitset = palloc0(bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
+
+ return filter;
+}
+
+/*
+ * Free bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter->bitset);
+ pfree(filter);
+}
+
+/*
+ * Add element to bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, Size len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
+ * This is a useful, generic indicator of whether or not a bloom filter has
+ * summarized the set optimally within the available memory budget. If return
+ * value exceeds 0.5 significantly, then that's either because there was a
+ * dramatic underestimation of set size by the caller, or because available
+ * work_mem is very low relative to the size of the set (less than 2 bits per
+ * element).
+ *
+ * Note that the value returned here should generally be close to 0.5, even
+ * when we have more than enough memory to ensure a false positive rate within
+ * our target 1% - 2% band, since more hash functions are used as more memory
+ * is available per element.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) NBITS(filter);
+}
+
+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB). This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+ int v = 0;
+
+ while (target_bitset_size > 0)
+ {
+ v++;
+ target_bitset_size = target_bitset_size >> 1;
+ }
+
+ return Min(v - 1, 30);
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(int64 bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; "enhanced double hashing"
+ * is used to make this work. See Dillinger & Manolios for details of why
+ * that's okay. "Building a Better Bloom Filter" by Kirsch & Mitzenmacher also
+ * has detailed analysis of the algorithm.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, Size len)
+{
+ uint32 hasha,
+ hashb;
+ int i;
+
+ hasha = DatumGetUInt32(hash_any(elem, len));
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
+
+ /* Mix seed value */
+ hasha += filter->seed;
+ /* Apply "MOD m" to avoid losing bits/out-of-bounds array access */
+ hasha = hasha % NBITS(filter);
+ hashb = hashb % NBITS(filter);
+
+ /* First hash */
+ hashes[0] = hasha;
+
+ /* Subsequent hashes */
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ hasha = (hasha + hashb) % NBITS(filter);
+ hashb = (hashb + i) % NBITS(filter);
+
+ /* Accumulate hash value for caller */
+ hashes[i] = hasha;
+ }
+}
+
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..399e755
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Minimal bloom filter
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_init(int64 total_elems, int work_mem, uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ Size len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ Size len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
--
2.7.4
On Wed, Aug 30, 2017 at 7:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I spent only a few hours writing a rough prototype, and came up with
something that does an IndexBuildHeapScan() scan following the
existing index verification steps. Its amcheck callback does an
index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
all), and tests it for membership of a bloom filter generated as part
of the main B-Tree verification phase. The IndexTuple memory is freed
immediately following hashing.
I attach a cleaned-up version of this. It has extensive documentation.
My bloom filter implementation is broken out as a separate patch,
added as core infrastructure under "lib".
Some drive-by comments on the lib patch:
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB). This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+ int v = 0;
+
+ while (target_bitset_size > 0)
+ {
+ v++;
+ target_bitset_size = target_bitset_size >> 1;
+ }
+
+ return Min(v - 1, 30);
+}
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
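For illustration, a flsl()-based version might look something like this
(hypothetical and untested, and ignoring the long-vs-int64 width question
for now):

static int
pow2_truncate(int64 target_bitset_size)
{
	/* flsl() returns the 1-based position of the most significant set bit */
	int			msb = flsl((long) target_bitset_size);

	return Min(msb - 1, 30);
}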
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
I see that this is used in gawk, BerkeleyDB and all over the place[2].
Nice. I understand that the point of this is to be a hash function
that is different from our usual one, for use by k_hashes(). Do you
think it belongs somewhere more common than this? It seems a bit like
our hash-related code is scattered all over the place but should be
consolidated, but I suppose that's a separate project.
Unnecessary braces here and elsewhere for single line body of for loops.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
Sorry I didn't follow up with my threat[3] to provide a central
popcount() function to replace the implementations all over the tree.
[1]: /messages/by-id/25076.1489699457@sss.pgh.pa.us
[2]: http://www.cse.yorku.ca/~oz/hash.html
[3]: /messages/by-id/CAEepm=3g1_fjJGp38QGv+38BC2HHVkzUq6s69nk3mWLgPHqC3A@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Aug 30, 2017 at 8:34 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
Yes, you can complain about MSVC compilation here.
--
Michael
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Some drive-by comments on the lib patch:
I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.
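To be concrete about the kind of thing I mean, the extra interface could be
as small as the following (hypothetical names, only to show the shape of it;
nothing like this exists in the patch):

/* Caller supplies a hash value it has already computed for its element */
extern void bloom_add_hash(bloom_filter *filter, uint32 hash);
extern bool bloom_lacks_hash(bloom_filter *filter, uint32 hash);

The second, independent hash needed by "enhanced double hashing" could then
either be derived internally by rehashing the caller's value, or taken as a
second argument.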
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
I'd forgotten.
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).
Ideally, we'd never use long integers, because its width is platform
dependent, and yet it is only ever used as an alternative to int
because it is wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
I see that this is used in gawk, BerkeleyDB and all over the place[2].
Nice. I understand that this point of this is to be a hash function
that is different from our usual one, for use by k_hashes().
Right. Its only job is to be a credible hash function that isn't
derivative of hash_any().
Do you
think it belongs somewhere more common than this? It seems a bit like
our hash-related code is scattered all over the place but should be
consolidated, but I suppose that's a separate project.
Unsure. In its defense, there is also a private murmurhash one-liner
within tidbitmap.c. I don't mind changing this, but it's slightly odd
to expose a hash function whose only job is to be completely unrelated
to hash_any().
Unnecessary braces here and elsewhere for single line body of for loops.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
I don't follow what you mean here.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 1:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Some drive-by comments on the lib patch:
I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.
Indeed. Thank you for working on this! To summarise a couple of
ideas that Peter and I discussed off-list a while back: (1) While
building the hash table for a hash join we could build a Bloom filter
per future batch and keep it in memory, and then while reading from
the outer plan we could skip writing tuples out to future batches if
there is no chance they'll find a match when read back in later (works
only for inner joins and only pays off in inverse proportion to the
join's selectivity); (2) We could push a Bloom filter down to scans
(many other databases do this, and at least one person has tried this
with PostgreSQL and found it to pay off[1]).
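As a sketch of what (1) might look like while partitioning the outer side
(purely hypothetical, not real executor code):

/* Hypothetical: per-batch filters built while loading the inner side */
if (batchno > 0 &&
    bloom_lacks_element(inner_batch_filter[batchno],
                        (unsigned char *) &hashvalue, sizeof(hashvalue)))
    continue;           /* no inner tuple can match; skip the write-out */

/* otherwise write the outer tuple to its batch file, as today */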
To use this for anything involving parallelism where a Bloom filter
must be shared we'd probably finish up having to create a shared
version of bloom_init() that either uses caller-provided memory and
avoids the internal pointer, or allocates DSA memory. I suppose you
could consider splitting your bloom_init() function up into
bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
getting rid of the separate pointer to bitset (ie stick char
bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?
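Spelling that out a little (hypothetical and untested, just to illustrate
the two-phase shape I have in mind):

typedef struct bloom_filter
{
	int			bloom_power;
	int			k_hash_funcs;
	uint32		seed;
	/* bitset stored inline, so the whole struct can live in DSM/DSA memory */
	unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
} bloom_filter;

extern size_t bloom_estimate(int64 total_elems, int work_mem);
extern bloom_filter *bloom_init_in_place(void *space, int64 total_elems,
										 int work_mem, uint32 seed);

The caller would ask bloom_estimate() how much space to allocate (in DSA or
wherever), and then hand that space to bloom_init_in_place().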
+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
I think the plan is to use size_t for new stuff[1].
I'd forgotten.
This is another my_log2(), right?
It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc. We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).
Hmm. Right.
Ideally, we'd never use long integers, because its width is platform
dependent, and yet it is only ever used as an alternative to int
because it is wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.
Agreed, "long" is terrible.
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
I don't follow what you mean here.
I was just observing that there is an opportunity for code reuse here.
This function's definition would ideally be something like:
double
bloom_prop_bits_set(bloom_filter *filter)
{
return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}
This isn't an objection to the way you have it, since we don't have a
popcount function yet! We have several routines in the tree for
counting bits, though not yet the complete set from Hacker's Delight.
Your patch brings us one step closer to that goal. (The book says
that this approach is good for sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
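For completeness, the simplest portable fallback I can think of is a
per-nibble lookup table, along these lines (hypothetical and untested; a
real centralised version would presumably prefer __builtin_popcountl() or
a word-at-a-time loop where available):

static const int nibble_popcount[16] = {
	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

static int64
popcount(const unsigned char *buf, size_t len)
{
	int64		result = 0;
	size_t		i;

	/* count set bits in the low and high nibble of every byte */
	for (i = 0; i < len; i++)
		result += nibble_popcount[buf[i] & 0x0F] + nibble_popcount[buf[i] >> 4];

	return result;
}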
+ * Test if bloom filter definitely lacks element.
I think where "Bloom filter" appears in prose it should have a capital
letter (person's name).
[1]: http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Aug 29, 2017 at 7:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Indeed. Thank you for working on this! To summarise a couple of
ideas that Peter and I discussed off-list a while back: (1) While
building the hash table for a hash join we could build a Bloom filter
per future batch and keep it in memory, and then while reading from
the outer plan we could skip writing tuples out to future batches if
there is no chance they'll find a match when read back in later (works
only for inner joins and only pays off in inverse proportion to the
join's selectivity);
Right. Hash joins do tend to be very selective, though, so I think
that this could help rather a lot. With just 8 or 10 bits per element,
you can eliminate almost all the batch write-outs on the outer side.
No per-worker synchronization for BufFiles is needed when this
happens, either. It seems like that could be very important.
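To put a rough number on that, here is a standalone back-of-the-envelope
calculation using the textbook false positive formula (this is just
arithmetic, not part of the patch):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		bits_per_element = 10.0;
	/* optimal number of hash functions: round(ln(2) * m/n), which is 7 here */
	int			k = (int) round(log(2.0) * bits_per_element);
	/* false positive rate: (1 - e^(-k*n/m))^k */
	double		fpr = pow(1.0 - exp(-k / bits_per_element), k);

	printf("k = %d, false positive rate = %.4f\n", k, fpr);
	return 0;
}

That prints a false positive rate of about 0.008, so with 10 bits per
element fewer than 1% of outer tuples would be written out to a batch file
needlessly.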
To use this for anything involving parallelism where a Bloom filter
must be shared we'd probably finish up having to create a shared
version of bloom_init() that either uses caller-provided memory and
avoids the internal pointer, or allocates DSA memory. I suppose you
could consider splitting your bloom_init() function up into
bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
getting rid of the separate pointer to bitset (ie stick char
bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?
Makes sense. Not hard to add that.
I was just observing that there is an opportunity for code reuse here.
This function's definition would ideally be something like:
double
bloom_prop_bits_set(bloom_filter *filter)
{
return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}
This isn't an objection to the way you have it, since we don't have a
popcount function yet! We have several routines in the tree for
counting bits, though not yet the complete set from Hacker's Delight.
Right. I'm also reminded of the lookup tables for the visibility/freeze map.
Your patch brings us one step closer to that goal. (The book says
that this approach is good far sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
copy about 2 years ago!). pop()/popcount() does seem like a clever
algorithm, that we should probably think about adopting in some cases,
but I should point out that the current caller to my
bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
at all performance critical.
+ * Test if bloom filter definitely lacks element.
I think where "Bloom filter" appears in prose it should have a capital
letter (person's name).
Agreed.
--
Peter Geoghegan
Peter Geoghegan wrote:
Your patch brings us one step closer to that goal. (The book says
that this approach is good far sparse bitsets, but your comment says
that we expect something near 50%. That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
copy about 2 years ago!). pop()/popcount() does seem like a clever
algorithm, that we should probably think about adopting in some cases,
but I should point at that the current caller to my
bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
at all performance critical.
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
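For example (illustrative only, with "filter" standing in for the patch's
bloom_filter pointer, and a message along the lines of its DEBUG1 output):

/* elog() evaluates bloom_prop_bits_set() even when DEBUG1 output is off */
elog(DEBUG1, "proportion of bits set: %f", bloom_prop_bits_set(filter));

/* ereport() only evaluates the errmsg arguments if the message is emitted */
ereport(DEBUG1,
		(errmsg_internal("proportion of bits set: %f",
						 bloom_prop_bits_set(filter))));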
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
I should do that, but it's still not really noticeable.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
I should do that, but it's still not really noticeable.
Since this patch has now bit-rotted, I attach a new revision, V2.
Apart from fixing some Makefile bitrot, this revision also makes some
small tweaks as suggested by Thomas and Alvaro. The documentation is
also revised and expanded, and now discusses practical aspects of the
set membership being tested using a Bloom filter, how that relates to
maintenance_work_mem, and so on.
Note that this revision does not let the Bloom filter caller use their
own dynamic shared memory, which is something that Thomas asked about.
While that could easily be added, I think it should happen later. I
really just wanted to make sure that my Bloom filter was not in some
way fundamentally incompatible with Thomas' planned enhancements to
(parallel) hash join.
--
Peter Geoghegan
Attachments:
0001-Add-Bloom-filter-data-structure-implementation.patch (text/x-patch)
From d4dda95dd41204315dc12936fac83d2df8676992 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 297 ++++++++++++++++++++++++++++++++++++++++++
src/include/lib/bloomfilter.h | 26 ++++
4 files changed, 327 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..e93f9b0
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,297 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+#include "utils/memutils.h"
+
+#define MAX_HASH_FUNCS 10
+#define NBITS(filt) ((1 << (filt)->bloom_power))
+
+typedef struct bloom_filter
+{
+ /* 2 ^ bloom_power is the size of the bitset (in bits) */
+ int bloom_power;
+ unsigned char *bitset;
+
+ /* K hash functions are used, which are randomly seeded */
+ int k_hash_funcs;
+ uint32 seed;
+} bloom_filter;
+
+static int pow2_truncate(int64 target_bitset_size);
+static int optimal_k(int64 bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static uint32 sdbmhash(unsigned char *elem, size_t len);
+
+/*
+ * Initialize Bloom filter. This should get a false positive rate of between
+ * 1% and 2% when its bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * work_mem is sized in KB, in line with the general convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 2 ^ 30 bits, or 128MB. The implementation is therefore well
+ * suited to data synchronization problems between unordered sets, where
+ * predictable performance is more important than worst case guarantees around
+ * false positives. Another problem that the implementation is well suited for
+ * is cache filtering where good performance already relies upon having a
+ * relatively small and/or low cardinality set of things that are interesting
+ * (with perhaps many more uninteresting things that never populate the
+ * filter).
+ */
+bloom_filter *
+bloom_init(int64 total_elems, int work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int64 bitset_bytes;
+ int64 bitset_bits;
+
+ filter = palloc(sizeof(bloom_filter));
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(total_elems * 2, MaxAllocSize);
+ bitset_bytes = Min(work_mem * 1024L, bitset_bytes);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ filter->bloom_power = pow2_truncate(bitset_bytes * BITS_PER_BYTE);
+ bitset_bits = NBITS(filter);
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+ filter->bitset = palloc0(bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter->bitset);
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
+ * This is a useful, generic indicator of whether or not a Bloom filter has
+ * summarized the set optimally within the available memory budget. If return
+ * value exceeds 0.5 significantly, then that's either because there was a
+ * dramatic underestimation of set size by the caller, or because available
+ * work_mem is very low relative to the size of the set (less than 2 bits per
+ * element).
+ *
+ * Note that the value returned here should generally be close to 0.5, even
+ * when we have more than enough memory to ensure a false positive rate within
+ * our target 1% - 2% band, since more hash functions are used as more memory
+ * is available per element.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) NBITS(filter);
+}
+
+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB). This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+ int v = 0;
+
+ while (target_bitset_size > 0)
+ {
+ v++;
+ target_bitset_size = target_bitset_size >> 1;
+ }
+
+ return Min(v - 1, 30);
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(int64 bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; "enhanced double hashing"
+ * is used to make this work. See Dillinger & Manolios for details of why
+ * that's okay. "Building a Better Bloom Filter" by Kirsch & Mitzenmacher also
+ * has detailed analysis of the algorithm.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint32 hasha,
+ hashb;
+ int i;
+
+ hasha = DatumGetUInt32(hash_any(elem, len));
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
+
+ /* Mix seed value */
+ hasha += filter->seed;
+ /* Apply "MOD m" to avoid losing bits/out-of-bounds array access */
+ hasha = hasha % NBITS(filter);
+ hashb = hashb % NBITS(filter);
+
+ /* First hash */
+ hashes[0] = hasha;
+
+ /* Subsequent hashes */
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ hasha = (hasha + hashb) % NBITS(filter);
+ hashb = (hashb + i) % NBITS(filter);
+
+ /* Accumulate hash value for caller */
+ hashes[i] = hasha;
+ }
+}
+
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, size_t len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..09a5501
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Minimal Bloom filter
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_init(int64 total_elems, int work_mem, uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
--
2.7.4
0002-Add-amcheck-verification-of-indexes-against-heap.patch (text/x-patch)
From c68bcd52a3268e0bf09c04f0b794abbcf8474d32 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 ++++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 237 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 149 +++++++++++++++----
7 files changed, 385 insertions(+), 56 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9ae83dc..346d788 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
*
* Copyright (c) 2017, PostgreSQL Global Development Group
*
@@ -18,13 +23,16 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"
@@ -53,10 +61,15 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* verifying heap has no unindexed tuples? */
+ bool heapallindexed;
+ /* Oldest xmin before index examined (for !readonly + heapallindexed calls) */
+ TransactionId oldestxmin;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +85,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +114,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +143,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +192,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -205,7 +242,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
btree_index_checkable(indrel);
/* Check index */
- bt_check_every_level(indrel, parentcheck);
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +290,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,7 +312,8 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
@@ -291,7 +332,34 @@ bt_check_every_level(Relation rel, bool readonly)
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+ state->oldestxmin = InvalidTransactionId;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /*
+ * When only AccessShareLock held on heap, get oldestxmin before index
+ * is first accessed. Used for later visibility rechecks, within
+ * bt_tuple_present_callback().
+ */
+ if (!state->readonly)
+ state->oldestxmin = GetOldestXmin(state->heaprel,
+ PROCARRAY_FLAGS_VACUUM);
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_init(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -347,6 +415,41 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo;
+
+ elog(DEBUG1, "verifying presence of required tuples in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ indexinfo = BuildIndexInfo(state->rel);
+
+ /*
+ * Since we're not actually indexing, don't enforce uniqueness/wait for
+ * concurrent insertion to finish, even with unique indexes.
+ *
+ * Force use of MVCC snapshot (reuse CONCURRENTLY infrastructure) when
+ * only an AccessShareLock held. It seems like a good idea to not
+ * diverge from expected heap lock strength in all cases. This is
+ * needed to prevent unhelpful WARNINGs due to concurrent insertions
+ * that IndexBuildHeapScan() does not expect.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_Concurrent = !state->readonly;
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+ state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+ RelationGetRelationName(heaprel))));
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -499,7 +602,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -546,6 +649,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers build their Bloom filter for later
+ * verification that index had all heap tuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -589,6 +695,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* When verifying heap, record leaf items in Bloom filter */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -682,8 +793,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1062,6 +1175,96 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all needed entries using Bloom filter probes.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption in index, and especially in heap. The high
+ * level principle behind verification performed here is that any index tuples
+ * that should be in the index following a REINDEX should also have been there
+ * all along. This must be true because a REINDEX rebuilds the index in order
+ * to effectively remove bloat. There might be dead index tuple entries in the
+ * Bloom filter, because of the lack of reliable visibility information in
+ * index structures, but that hardly matters since we're concerned about the
+ * possible absence of needed tuples. In other words, a fresh REINDEX should
+ * never affect the representation of any IndexTuple, because these are
+ * immutable for as long as heap tuple is visible to any possible snapshot
+ * (while the LP_DEAD bit is mutable, that's ItemId metadata, which we don't
+ * directly fingerprint).
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for invariant not holding is a corrupt heap page (could
+ * be logical or physical corruption), which is why heap is blamed here. Heap
+ * corruption is not always the problem, though. Only readonly callers will
+ * have verified that left links and right links are in agreement, and so it's
+ * possible that a leaf page transposition within index is actually the source
+ * of corruption detected here (for !readonly callers), in which case the
+ * user-visible diagnostic message is misleading. The checks performed only
+ * for readonly callers might more accurately frame the problem as a bogus leaf
+ * page transposition, or a cross-page invariant not holding due to recovery
+ * not replaying all WAL records. That's why the !readonly ERROR message
+ * raised here includes a HINT about trying the other variant out.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /* Must recheck visibility when only AccessShareLock held */
+ if (!state->readonly)
+ {
+ TransactionId xmin;
+
+ /*
+ * Don't test for presence in index where xmin not at least old enough
+ * that we know for sure that absence of index tuple wasn't just due to
+ * some transaction performing insertion after our verifying index
+ * traversal began. (Actually, the cut-off is based on the point
+ * before which any possible inserting transaction must have
+ * committed/aborted.)
+ *
+ * You might think that the fact that an MVCC snapshot is used by the
+ * heap scan (due to indicating that this is the first scan of a CREATE
+ * INDEX CONCURRENTLY index build) would make this test redundant.
+ * That's not quite true, because with current IndexBuildHeapScan()
+ * interface caller cannot do the MVCC snapshot acquisition itself. In
+ * this way, heap tuple coverage is similar to the coverage we could
+ * get by acquiring the MVCC snapshot ourselves at the point where
+ * GetOldestXmin() is currently called. It's easier to do this than to
+ * adopt the IndexBuildHeapScan() interface to our narrow requirements.
+ */
+ xmin = HeapTupleHeaderGetXmin(htup->t_data);
+ if (!TransactionIdPrecedes(xmin, state->oldestxmin))
+ return;
+ }
+
+ /* Generate an index tuple */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("table \"%s\" lacks matching index tuple in index \"%s\" for tid (%u,%u)",
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel),
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid))),
+ !state->readonly ?
+ errhint("Calling bt_index_parent_check() against target \"%s\" may further isolate the inconsistency",
+ RelationGetRelationName(state->rel)) : 0 ));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index dd71dbd..c071ead 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,7 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary)
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,20 +83,23 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ primary key indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
<para>
- <function>bt_index_check</function> acquires an <literal>AccessShareLock</>
- on the target index and the heap relation it belongs to. This lock mode
- is the same lock mode acquired on relations by simple
- <literal>SELECT</> statements.
+ <function>bt_index_check</function> acquires an
+ <literal>AccessShareLock</> on the target index and the heap
+ relation it belongs to. This lock mode is the same lock mode
+ acquired on relations by simple <literal>SELECT</> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</> is <literal>true</>. When a
routine, lightweight test for corruption is required in a live
production environment, using
<function>bt_index_check</function> often provides the best
@@ -108,7 +111,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,19 +120,22 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
- <function>bt_index_parent_check</function> can be thought of as
- a more thorough variant of <function>bt_index_check</function>:
- unlike <function>bt_index_check</function>,
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</> argument is
+ <literal>true</>, the function verifies the presence of all heap
+ tuples that should be found within the index. The checks
+ performed by <function>bt_index_parent_check</function> are a
+ superset of the checks performed by
+ <function>bt_index_check</function> when called with the same
+ options. <function>bt_index_parent_check</function> can be
+ thought of as a more thorough variant of
+ <function>bt_index_check</function>: unlike
+ <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
A <literal>ShareLock</> is required on the target index by
@@ -159,6 +165,70 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</> verification</title>
+ <para>
+ When the <parameter>heapallindexed</> argument to verification
+ functions is <literal>true</>, an additional phase of verification
+ is performed against the table associated with the target index
+ relation. This consists of a <quote>dummy</> <command>CREATE
+ INDEX</> operation, which checks for the presence of all would-be
+ new index tuples against a temporary, in-memory summarizing
+ structure (this is built when needed during the first, standard
+ phase). The summarizing structure <quote>fingerprints</> every
+ tuple found within the target index. The high level principle
+ behind <parameter>heapallindexed</> verification is that a new index
+ that is equivalent to the existing, target index must only have
+ entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</> phase adds significant
+ overhead: verification will typically take several times longer than
+ it would with only the standard consistency checking of the target
+ index's structure. However, verification will still take
+ significantly less time than an actual <command>CREATE INDEX</>.
+ There is no change to the relation-level locks acquired when
+ <parameter>heapallindexed</> verification is performed. The
+ summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect the
+ absence of any particular index tuple, approximately 2 bytes of
+ memory are needed per index tuple. As less memory is made available
+ per index tuple, the probability of missing an inconsistency
+ increases. This is considered an acceptable trade-off, since it
+ limits the overhead of verification very significantly, while only
+ slightly reducing the probability of detecting a problem, especially
+ for installations where verification is treated as a routine
+ maintenance task.
+ </para>
+ <para>
+ In many applications, even the default
+ <varname>maintenance_work_mem</varname> setting of <literal>64MB</>
+ will be sufficient to have less than a 2% probability of overlooking
+ any single absent or corrupt tuple. This will be the case when
+ there are no indexes with more than about 30 million distinct index
+ tuples, regardless of the overall size of any index, the total
+ number of indexes, or anything else. False positive candidate tuple
+ membership tests within the summarizing structure occur at random,
+ and are very unlikely to be the same for repeat verification
+ operations. Furthermore, within a single verification operation,
+ each missing or malformed index tuple independently has the same
+ chance of being detected. If there is any inconsistency at all, it
+ isn't particularly likely to be limited to a single tuple. All of
+ these factors favor accepting a limited per operation per tuple
+ probability of missing corruption, in order to enable performing
+ more thorough index to heap verification more frequently (practical
+ concerns about the overhead of verification are likely to limit the
+ frequency of verification). In aggregate, the probability of
+ detecting a hardware fault or software defect actually
+ <emphasis>increases</> significantly with this strategy in most real
+ world cases. Moreover, frequent verification allows problems to be
+ caught earlier on average, which helps to limit the overall impact
+ of corruption, and often simplifies root cause analysis.
+ </para>
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</> effectively</title>
<para>
@@ -199,17 +269,31 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</> verification
+ is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be very
+ subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</> access method code or sort
- code.
+ underlying <productname>PostgreSQL</> access method code, sort
+ code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</> functions continuously
- when running the standard regression tests. See <xref
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</> functions continuously when
+ running the standard regression tests. See <xref
linkend="regress-run"> for details on running the tests.
</para>
</listitem>
@@ -242,6 +326,13 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</> is <literal>true</>, and heap
+ verification is performed, there is generally a greatly increased
+ chance of detecting single-bit errors, since strict binary
+ equality is tested, and the indexed attributes within the heap
+ are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</> can only prove the presence of
--
2.7.4
On Wed, Sep 6, 2017 at 7:26 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
I should do that, but it's still not really noticeable.
Since this patch has now bit-rotted, I attach a new revision, V2.
I should point out that I am relying on deterministic TOAST
compression within index_form_tuple() at present. This could, in
theory, become a problem later down the road, when
toast_compress_datum() compression becomes configurable via a storage
parameter or something (e.g., we use PGLZ_strategy_always, rather than
the hard coded PGLZ_strategy_default strategy).
While I should definitely have a comment above the new amcheck
index_form_tuple() call that points this out, it's not clear if that's
all that is required. Normalizing the representation of hashed index
tuples to make verification robust against unforeseen variation in
TOAST compression strategy seems like needless complexity to me, but
I'd like to hear a second opinion on that.
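For the record, the kind of normalization I'm talking about would be
something like the following, applied to each heap value before the
index_form_tuple() call that feeds the Bloom filter (a sketch only; the
helper and its placement are hypothetical, not something in the patch):

#include "postgres.h"
#include "access/tuptoaster.h"

/*
 * Hypothetical helper: force a varlena attribute into its plain,
 * uncompressed form before fingerprinting, so that the fingerprint cannot
 * depend on whichever TOAST compression strategy happened to be in effect
 * when the original index tuple was formed.
 */
static Datum
normalize_attribute(int16 attlen, Datum value, bool isnull)
{
    struct varlena *attr;

    if (isnull || attlen != -1)
        return value;           /* only varlena attributes can be toasted */

    attr = (struct varlena *) DatumGetPointer(value);
    if (VARATT_IS_EXTENDED(attr))
        value = PointerGetDatum(heap_tuple_untoast_attr(attr));

    return value;
}

I'm not proposing that we actually do this now; the point is only that it
would stay small if we ever had to.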
--
Peter Geoghegan
On Thu, Sep 7, 2017 at 11:26 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Eh, if you want to optimize it for the case where debug output is not
enabled, make sure to use ereport() not elog(). ereport()
short-circuits evaluation of arguments, whereas elog() does not.
I should do that, but it's still not really noticeable.
Since this patch has now bit-rotted, I attach a new revision, V2.
Apart from fixing some Makefile bitrot, this revision also makes some
small tweaks as suggested by Thomas and Alvaro. The documentation is
also revised and expanded, and now discusses practical aspects of the
set membership being tested using a Bloom filter, how that relates to
maintenance_work_mem, and so on.
Note that this revision does not let the Bloom filter caller use their
own dynamic shared memory, which is something that Thomas asked about.
While that could easily be added, I think it should happen later. I
really just wanted to make sure that my Bloom filter was not in some
way fundamentally incompatible with Thomas' planned enhancements to
(parallel) hash join.
I have signed up as a reviewer of this patch, and I have looked at the
bloom filter implementation for now. This is the kind of facility that
people have asked for on this list for many years.
One first thing striking me is that there is no test for this
implementation, which would be a base stone for other things, it would
be nice to validate that things are working properly before moving on
with 0002, and 0001 is a feature on its own. I don't think that it
would be complicated to have a small module in src/test/modules which
plugs in a couple of SQL functions on top of bloomfilter.h.
+#define MAX_HASH_FUNCS 10
Being able to define the number of hash functions used at
initialization would be nicer. Usually this number is kept way lower
than the number of elements to check as part of a set, but I see no
reason to not allow people to play with this API in a more extended
way. You can then let your module decide what it wants to use.
+ * work_mem is sized in KB, in line with the general convention.
In what way is that a general convention? Using bytes would be more
intuitive IMO. Still I think that this could be removed, see below
points.
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
Reference link?
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+ filter->bitset = palloc0(bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
I think that doing the allocation within the initialization phase is a
mistake. Being able to use a DSA would be nice, but I predict as well
cases where a module may want to have a bloom filter that persists as
well across multiple transactions, so the allocation should be able to
live across memory contexts. What I think you should do instead to
make this bloom implementation more modular is to let the caller give
a pointer to a memory area as well as its size. Then what bloom_init
should do is to just initialize this area of memory with zeros. This
approach would give a lot of freedom. Not linking a bloom definition
to work_mem would be nice as well.
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
I am wondering if it would make sense to be able to enforce the hash
function being used. The default does not look bad to me though, so we
could live with that.
+typedef struct bloom_filter bloom_filter;
Not allowing callers to have a look at the structure contents is
definitely the right approach.
So, in my opinion, this bloom facility still needs more work.
--
Michael
On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I have signed up as a reviewer of this patch, and I have looked at the
bloom filter implementation for now. This is the kind of facility that
people have asked for on this list for many years.
One first thing striking me is that there is no test for this
implementation, which would be a base stone for other things, it would
be nice to validate that things are working properly before moving on
with 0002, and 0001 is a feature on its own. I don't think that it
would be complicated to have a small module in src/test/modules which
plugs in a couple of SQL functions on top of bloomfilter.h.
0001 is just a library utility. None of the backend lib utilities
(like HyperLogLog, Discrete knapsack, etc) have dedicated test
frameworks. Coding up an SQL interface for these things is a
non-trivial project in itself.
I have tested the implementation myself, but that was something that
used C code.
Can you tell me what you have in mind here in detail? For example,
should there be a custom datatype that encapsulates the current state
of the bloom filter? Or, should there be an aggregate function, that
takes a containment argument that is tested at the end?
+#define MAX_HASH_FUNCS 10
Being able to define the number of hash functions used at
initialization would be nicer. Usually this number is kept way lower
than the number of elements to check as part of a set, but I see no
reason to not allow people to play with this API in a more extended
way. You can then let your module decide what it wants to use.
The number of hash functions used is itself a function of the amount
of memory available, and an estimate of the overall size of the set
made ahead of time (in the case of amcheck, this is
pg_class.reltuples). The typical interface is that the caller either
specifies the amount of memory, or the required false positive rate
(either way, they must also provide that estimate).
The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we
never actually use more than that many hash functions in practice,
given the limitations on the total amount of memory you can use
(128MB). The formula for determining the optimum number of hash
functions is pretty standard stuff.
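For reference, the sizing arithmetic is roughly the following (a sketch of
the standard formula only; the patch's own optimal_k() may differ in minor
details):

#include "postgres.h"
#include <math.h>

#define MAX_HASH_FUNCS 10

/*
 * With a bitset of bitset_bits bits and total_elems expected elements, the
 * false positive rate is minimized by using about (bits / elems) * ln(2)
 * hash functions.  Clamp to the range we can ever use in practice.
 */
static int
optimal_k(uint64 bitset_bits, int64 total_elems)
{
    int         k = (int) rint(log(2.0) * bitset_bits / total_elems);

    return Max(1, Min(k, MAX_HASH_FUNCS));
}

With a 128MB bitset, the clamp to 10 only matters when the caller's estimate
of the set size is very small relative to the memory made available.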
+ * work_mem is sized in KB, in line with the general convention.
In what way is that a general convention? Using bytes would be more
intuitive IMO. Still I think that this could be removed, see below
points.
Both tuplesort and tuplestore do this. These are infrastructure that
is passed work_mem or maintenance_work_mem by convention, where those
are sized in KB.
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
Reference link?
http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf
I'm probably going to end up using murmurhash32() instead of the sdbm
hash function anyway, now that Andres has exposed it in a header file.
This probably won't matter in the next version.
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+ filter->bitset = palloc0(bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
I think that doing the allocation within the initialization phase is a
mistake. Being able to use a DSA would be nice, but I predict as well cases
where a module may want to have a bloom filter that persists as well across
multiple transactions, so the allocation should be able to live across
memory contexts.
Why not just switch memory context before calling? Again, other
comparable utilities don't provide this in their interface.
As for DSM, I think that that can come later, and can be written by
somebody closer to that problem. There can be more than one
initialization function.
What I think you should do instead to
make this bloom implementation more modular is to let the caller give
a pointer to a memory area as well as its size. Then what bloom_init
should do is to just initialize this area of memory with zeros. This
approach would give a lot of freedom. Not linking a bloom definition
to work_mem would be nice as well.
The implementation is probably always going to be bound in size by
work_mem in practice, like tuplesort and tuplestore. I would say that
that's a natural fit.
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
I am wondering if it would make sense to be able to enforce the hash
function being used. The default does not look bad to me though, so we
could live with that.
I prefer to keep things simple. I'm not aware of any use case that
calls for the user to use a custom hash function. That said, I could
believe that someone would want to use their own hash value for each
bloom_add_element(), when they have one close at hand anyway -- much
like addHyperLogLog(). Again, that seems like work for the ultimate
consumer of that functionality. It's a trivial tweak that can happen
later.
--
Peter Geoghegan
On Thu, Sep 28, 2017 at 3:32 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I have signed up as a reviewer of this patch, and I have looked at the
bloom filter implementation for now. This is the kind of facility that
people have asked for on this list for many years.
One first thing striking me is that there is no test for this
implementation, which would be a base stone for other things, it would
be nice to validate that things are working properly before moving on
with 0002, and 0001 is a feature on its own. I don't think that it
would be complicated to have a small module in src/test/modules which
plugs in a couple of SQL functions on top of bloomfilter.h.
0001 is just a library utility. None of the backend lib utilities
(like HyperLogLog, Discrete knapsack, etc) have dedicated test
frameworks. Coding up an SQL interface for these things is a
non-trivial project in itself.
I have tested the implementation myself, but that was something that
used C code.
Can you tell me what you have in mind here in detail? For example,
should there be a custom datatype that encapsulates the current state
of the bloom filter? Or, should there be an aggregate function, that
takes a containment argument that is tested at the end?
That could be something very simple:
- A function to initiate a bloom filter in a session context, with a
number of elements in input, which uses for example integers.
- A function to add an integer element to it.
- A function to query if an integer may exist or not.
- A function to free it.
The point is just to stress this code, I don't think that it is a
project this "large".
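To show the scale I have in mind, a single SQL-callable C function along
these lines would already be enough to stress the code (a rough sketch only,
assuming the bloom_init()/bloom_add_element()/bloom_lacks_element()/
bloom_free() interface from 0001):

#include "postgres.h"
#include "fmgr.h"
#include "lib/bloomfilter.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(test_bloomfilter);

/* test_bloomfilter(nelements bigint, seed integer) returns void */
Datum
test_bloomfilter(PG_FUNCTION_ARGS)
{
    int64       nelements = PG_GETARG_INT64(0);
    uint32      seed = (uint32) PG_GETARG_INT32(1);
    bloom_filter *filter;
    int64       false_positives = 0;
    int64       i;

    filter = bloom_init(nelements, work_mem, seed);

    /* Add every even number, then probe every odd number (never added) */
    for (i = 0; i < nelements; i += 2)
        bloom_add_element(filter, (unsigned char *) &i, sizeof(i));
    for (i = 1; i < nelements; i += 2)
    {
        if (!bloom_lacks_element(filter, (unsigned char *) &i, sizeof(i)))
            false_positives++;
    }

    /* Every hit on an odd number is by definition a false positive */
    elog(NOTICE, "false positives: " INT64_FORMAT " out of " INT64_FORMAT " probes",
         false_positives, nelements / 2);

    bloom_free(filter);
    PG_RETURN_VOID();
}

A one-line CREATE FUNCTION in the module's SQL script, plus expected output,
and that's about it.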
+#define MAX_HASH_FUNCS 10
Being able to define the number of hash functions used at
initialization would be nicer. Usually this number is kept way lower
than the number of elements to check as part of a set, but I see no
reason to not allow people to play with this API in a more extended
way. You can then let your module decide what it wants to use.
The number of hash functions used is itself a function of the amount
of memory available, and an estimate of the overall size of the set
made ahead of time (in the case of amcheck, this is
pg_class.reltuples). The typical interface is that the caller either
specifies the amount of memory, or the required false positive rate
(either way, they must also provide that estimate).
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
The estimate is from wikipedia-sensei:
https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
Being able to enforce that would be nice, not mandatory perhaps.
The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we
never actually use more than that many hash functions in practice,
given the limitations on the total amount of memory you can use
(128MB). The formula for determining the optimum number of hash
functions is pretty standard stuff.
Hm... OK. That could be a default... Not really convinced though.
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
Reference link?
That would be nicer if added in the code :)
That would be nicer if added in the code :)
I'm probably going to end up using murmurhash32() instead of the sdbm
hash function anyway, now that Andres has exposed it in a header file.
This probably won't matter in the next version.
Yeah, that may be a good idea at the end.
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+ filter->bitset = palloc0(bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
I think that doing the allocation within the initialization phase is a
mistake. Being able to use a DSA would be nice, but I predict as well cases
where a module may want to have a bloom filter that persists as well across
multiple transactions, so the allocation should be able to live across
memory contexts.
Why not just switch memory context before calling? Again, other
comparable utilities don't provide this in their interface.
As for DSM, I think that that can come later, and can be written by
somebody closer to that problem. There can be more than one
initialization function.
I don't completely disagree with that, there could be multiple
initialization functions. Still, an advantage about designing things
right from the beginning with a set of correct APIs is that we don't
need extra things later and this will never bother module maintainers.
I would think that this utility interface should be minimal and
portable to maintain a long-term stance.
What I think you should do instead to
make this bloom implementation more modular is to let the caller give
a pointer to a memory area as well as its size. Then what bloom_init
should do is to just initialize this area of memory with zeros. This
approach would give a lot of freedom. Not linking a bloom definition
to work_mem would be nice as well.
The implementation is probably always going to be bound in size by
work_mem in practice, like tuplesort and tuplestore. I would say that
that's a natural fit.
Hm...
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
I am wondering if it would make sense to be able to enforce the hash
function being used. The default does not look bad to me though, so we
could live with that.
I prefer to keep things simple. I'm not aware of any use case that
calls for the user to use a custom hash function. That said, I could
believe that someone would want to use their own hash value for each
bloom_add_element(), when they have one close at hand anyway -- much
like addHyperLogLog(). Again, that seems like work for the ultimate
consumer of that functionality. It's a trivial tweak that can happen
later.
Yeah, perhaps most people would be satisfied with having only a default.
--
Michael
On Fri, Sep 29, 2017 at 4:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
As for DSM, I think that that can come later, and can be written by
somebody closer to that problem. There can be more than one
initialization function.
I don't completely disagree with that, there could be multiple
initialization functions. Still, an advantage about designing things
right from the beginning with a set of correct APIs is that we don't
need extra things later and this will never bother module maintainers.
I would think that this utility interface should be minimal and
portable to maintain a long-term stance.
FWIW I think if I were attacking that problem the first thing I'd
probably try would be getting rid of that internal pointer
filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making
the interface look something like this:
extern size_t bloom_estimate(int64 total_elems, int work_mem);
extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem);
Something that allocates new memory as the patch's bloom_init()
function does I'd tend to call 'make' or 'create' or 'new' or
something, rather than 'init'. 'init' has connotations of being the
second phase in an allocate-and-init pattern for me. Then
bloom_filt_make() would be trivially implemented on top of
bloom_estimate() and bloom_init(), and bloom_init() could be used
directly in DSM, DSA, traditional shmem without having to add any
special DSM support.
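In rough outline, something like this (just a sketch; the struct layout,
the bloom_sizing_policy() helper and the seed handling are placeholders for
whatever the real implementation does):

typedef struct bloom_filter
{
    int         k_hash_funcs;
    uint64      bitset_bits;
    unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
} bloom_filter;

size_t
bloom_estimate(int64 total_elems, int work_mem)
{
    uint64      bitset_bits = bloom_sizing_policy(total_elems, work_mem);

    return offsetof(bloom_filter, bitset) + bitset_bits / BITS_PER_BYTE;
}

void
bloom_init(bloom_filter *filter, int64 total_elems, int work_mem)
{
    uint64      bitset_bits = bloom_sizing_policy(total_elems, work_mem);

    filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
    filter->bitset_bits = bitset_bits;
    memset(filter->bitset, 0, bitset_bits / BITS_PER_BYTE);
}

/*
 * The caller owns the memory, wherever it happens to live, e.g.:
 *
 *     filter = (bloom_filter *) palloc(bloom_estimate(total_elems, work_mem));
 *     bloom_init(filter, total_elems, work_mem);
 */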
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Sep 28, 2017 at 11:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
FWIW I think if I were attacking that problem the first thing I'd
probably try would be getting rid of that internal pointer
filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making
the interface look something like this:
extern size_t bloom_estimate(int64 total_elems, int work_mem);
extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem);
Yes, that seems quite convenient and is by now an established coding pattern.
I am also wondering whether this patch should consider
81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining
multiple hash values. I suppose that's probably inferior to what is
already being done on performance grounds, but I'll just throw out a
mention of it here all the same in case it was overlooked or the
relevance not spotted...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 28, 2017 at 8:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
FWIW I think if I were attacking that problem the first thing I'd
probably try would be getting rid of that internal pointer
filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making
the interface look something like this:
extern size_t bloom_estimate(int64 total_elems, int work_mem);
extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem);
Something that allocates new memory as the patch's bloom_init()
function does I'd tend to call 'make' or 'create' or 'new' or
something, rather than 'init'.
I tend to agree. I'll adopt that style in the next version. I just
didn't want the caller to have to manage the memory themselves.
--
Peter Geoghegan
On Fri, Sep 29, 2017 at 10:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I am also wondering whether this patch should consider
81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining
multiple hash values. I suppose that's probably inferior to what is
already being done on performance grounds, but I'll just throw out a
mention of it here all the same in case it was overlooked or the
relevance not spotted...
Well, we sometimes only want one hash value. This happens when we're
very short on memory (especially relative to the estimated final size
of the set), so it's a fairly common requirement. And, we have a
convenient way to get a second independent uint32 hash function now
(murmurhash32()).
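Concretely, the k probe positions then come from combining those two 32-bit
hashes in the usual double-hashing way, roughly like this (a sketch only;
hash_any() stands in for whichever primary hash function ends up being used,
seed handling is elided, and the filter field names are assumptions):

#include "access/hash.h"

static void
k_hashes(bloom_filter *filter, uint64 *hashes, unsigned char *elem, size_t len)
{
    uint32      hasha = DatumGetUInt32(hash_any(elem, (int) len));
    uint32      hashb = filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0;
    int         i;

    /* probe position i is derived as (hasha + i * hashb) mod bitset size */
    for (i = 0; i < filter->k_hash_funcs; i++)
        hashes[i] = (hasha + (uint64) i * hashb) % filter->bitset_bits;
}

When k is 1, only the first hash function is ever computed, which is why the
second hash can be skipped entirely in that case.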
--
Peter Geoghegan
On Fri, Sep 29, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Sep 29, 2017 at 10:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I am also wondering whether this patch should consider
81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining
multiple hash values. I suppose that's probably inferior to what is
already being done on performance grounds, but I'll just throw out a
mention of it here all the same in case it was overlooked or the
relevance not spotted...
Well, we sometimes only want one hash value. This happens when we're
very short on memory (especially relative to the estimated final size
of the set), so it's a fairly common requirement. And, we have a
convenient way to get a second independent uint32 hash function now
(murmurhash32()).
Right, so if you wanted to use the extended hash function
infrastructure, you'd just call the extended hash function with as
many different seeds as the number of hash functions you need. If you
need 1, you call it with one seed, say 0. And if you need any larger
number, well, cool.
Like I say, I'm not at all sure this is better than what you've got
right now. But it's an option.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 29, 2017 at 10:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:
Something that allocates new memory as the patch's bloom_init()
function does I'd tend to call 'make' or 'create' or 'new' or
something, rather than 'init'.
I tend to agree. I'll adopt that style in the next version. I just
didn't want the caller to have to manage the memory themselves.
v3 of the patch series, attached, does it that way -- it adds a
bloom_create(). The new bloom_create() function still allocates its
own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A
separate bloom_init() function (that works with dynamic shared memory)
could easily be added later, for the benefit of parallel hash join.
Other notable changes:
* We now support bloom filters that have bitsets of up to 512MB in
size. The previous limit was 128MB.
* We now use TransactionXmin for our AccessShareLock xmin cutoff,
rather than calling GetOldestXmin(). This is the same cut-off used by
xacts that must avoid broken hot chains for their earliest snapshot.
This avoids a scan of the proc array, and allows more thorough
verification, since GetOldestXmin() was overly restrictive here.
* Expanded code comments describing the kinds of problems the new
verification capability is expected to be good at catching.
For example, there is now a passing reference to the CREATE INDEX
CONCURRENTLY bug that was fixed back in February (it's given as an
example of a more general problem -- faulty HOT safety assessment).
With the new heapallindexed enhancement added by this patch, amcheck
can be expected to catch that issue much of the time. We also go into
heap-only tuple handling within IndexBuildHeapScan(). The way that
CREATE INDEX tries to index the most recent tuple in a HOT chain
(while locating the root tuple in the chain, to get the right heap TID
for the index) has proven to be very useful as a smoke test while
investigating HOT/VACUUM FREEZE bugs in the past couple of weeks [1].
I believe it would have caught several historic MultiXact/recovery
bugs, too. This all seems worth noting explicitly.
[1]: /messages/by-id/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
--
Peter Geoghegan
Attachments:
0002-Add-amcheck-verification-of-indexes-against-heap.patch
From 3bed03a9e0506c0b81097b634c5f1b5534a2dcb3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 ++++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 275 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 157 ++++++++++++++----
src/include/utils/snapmgr.h | 2 +-
8 files changed, 423 insertions(+), 66 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 868c14e..f73ea31 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
*
* Copyright (c) 2017, PostgreSQL Global Development Group
*
@@ -18,11 +23,13 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -53,10 +60,13 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* Also verifying heap has no unindexed tuples? */
+ bool heapallindexed;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +82,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +111,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +140,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +189,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -191,9 +225,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/*
* Since we did the IndexGetRelation call above without any lock, it's
* barely possible that a race against an index drop/recreation could have
- * netted us the wrong table. Although the table itself won't actually be
- * examined during verification currently, a recheck still seems like a
- * good idea.
+ * netted us the wrong table.
*/
if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
ereport(ERROR,
@@ -204,8 +236,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/* Relation suitable for checking as B-Tree? */
btree_index_checkable(indrel);
- /* Check index */
- bt_check_every_level(indrel, parentcheck);
+ /* Check index, possibly against table it is an index on */
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +285,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,7 +307,8 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
@@ -283,15 +319,35 @@ bt_check_every_level(Relation rel, bool readonly)
/*
* RecentGlobalXmin assertion matches index_getnext_tid(). See note on
* RecentGlobalXmin/B-Tree page deletion.
+ *
+ * We also rely on TransactionXmin having been initialized by now.
*/
Assert(TransactionIdIsValid(RecentGlobalXmin));
+ Assert(TransactionIdIsNormal(TransactionXmin));
/*
* Initialize state for entire verification operation
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -347,6 +403,45 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo;
+
+ if (state->readonly)
+ elog(DEBUG1, "verifying presence of all required tuples in index \"%s\"",
+ RelationGetRelationName(rel));
+ else
+ elog(DEBUG1, "verifying presence of required tuples in index \"%s\" with xmin before %u",
+ RelationGetRelationName(rel), TransactionXmin);
+
+ indexinfo = BuildIndexInfo(state->rel);
+
+ /*
+ * Since we're not actually indexing, don't enforce uniqueness/wait for
+ * concurrent insertion to finish, even with unique indexes.
+ *
+ * Force use of MVCC snapshot (reuse CONCURRENTLY infrastructure) when
+ * only an AccessShareLock held. It seems like a good idea to not
+ * diverge from expected heap lock strength in all cases. This is
+ * needed to prevent unhelpful WARNINGs due to concurrent insertions
+ * that IndexBuildHeapScan() does not expect.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_Concurrent = !state->readonly;
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+ state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+ RelationGetRelationName(heaprel))));
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -499,7 +594,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -546,6 +641,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -589,6 +687,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* Fingerprint leaf page tuples (those that point to the heap) */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -682,8 +785,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1062,6 +1167,134 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
+ *
+ * Theory of operation:
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation
+ * (Index tuple formation is assumed to be deterministic, and IndexTuples are
+ * assumed immutable; while the LP_DEAD bit is mutable, that's ItemId metadata,
+ * which is not fingerprinted). There will often be some dead-to-everyone
+ * IndexTuples fingerprinted by the Bloom filter, but we only try to detect the
+ * *absence of needed tuples*, so that's okay.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption). Index corruption may still be detected here,
+ * though. Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers). The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records). The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here). The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located. This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed. When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit. (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Note also that heap-only tuple handling within IndexBuildHeapScan() detects
+ * index tuples that contain the wrong values. This can happen when there is
+ * no superseding index tuple due to a faulty assessment of HOT safety.
+ * Because the latest tuple's contents are used with the root TID, an error
+ * will be raised when a tuple with the same TID but different (correct)
+ * attribute values is passed back to us. (Faulty assessment of HOT-safety was
+ * behind the CREATE INDEX CONCURRENTLY bug that was fixed in February of
+ * 2017.)
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /* Must recheck visibility when only AccessShareLock held */
+ if (!state->readonly)
+ {
+ TransactionId xmin;
+
+ /*
+ * Don't test for presence in index where xmin not at least old enough
+ * that we know for sure that absence of index tuple wasn't just due to
+ * some transaction performing insertion after our verifying index
+ * traversal began. (Actually, the cut-off used is a point where
+ * preceding write transactions must have committed/aborted. We should
+ * have already fingerprinted all index tuples for all such preceding
+ * transactions, because the cut-off was established before our index
+ * traversal even began.)
+ *
+ * You might think that the fact that an MVCC snapshot is used by the
+ * heap scan (due to our indicating that this is the first scan of a
+ * CREATE INDEX CONCURRENTLY index build) would make this test
+ * redundant. That's not quite true, because with the current
+ * IndexBuildHeapScan() interface the caller cannot do the MVCC snapshot
+ * acquisition itself. In this way, heap tuple coverage is similar to
+ * the coverage we could get by using the existing transaction
+ * snapshot. It's easier to do this than to adapt the
+ * IndexBuildHeapScan() interface to our narrow requirements.
+ */
+ Assert(tupleIsAlive);
+ xmin = HeapTupleHeaderGetXmin(htup->t_data);
+ if (!TransactionIdPrecedes(xmin, TransactionXmin))
+ return;
+ }
+
+ /*
+ * Generate an index tuple.
+ *
+ * Note that we rely on deterministic index_form_tuple() TOAST compression.
+ * If index_form_tuple() was ever enhanced to compress datums out-of-line,
+ * or otherwise varied when or how compression was applied, our assumption
+ * would break, leading to false positive reports of corruption. For now,
+ * we don't decompress/normalize toasted values as part of fingerprinting.
+ */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid)),
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel)),
+ !state->readonly
+ ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+ : 0));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index dd71dbd..48bdb13 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,7 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary), c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,20 +83,23 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ primary key indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
<para>
- <function>bt_index_check</function> acquires an <literal>AccessShareLock</>
- on the target index and the heap relation it belongs to. This lock mode
- is the same lock mode acquired on relations by simple
- <literal>SELECT</> statements.
+ <function>bt_index_check</function> acquires an
+ <literal>AccessShareLock</> on the target index and the heap
+ relation it belongs to. This lock mode is the same lock mode
+ acquired on relations by simple <literal>SELECT</> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</> is <literal>true</>. When a
routine, lightweight test for corruption is required in a live
production environment, using
<function>bt_index_check</function> often provides the best
@@ -108,7 +111,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,19 +120,22 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
- <function>bt_index_parent_check</function> can be thought of as
- a more thorough variant of <function>bt_index_check</function>:
- unlike <function>bt_index_check</function>,
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</> argument is
+ <literal>true</>, the function verifies the presence of all heap
+ tuples that should be found within the index. The checks
+ performed by <function>bt_index_parent_check</function> are a
+ superset of the checks performed by
+ <function>bt_index_check</function> when called with the same
+ options. <function>bt_index_parent_check</function> can be
+ thought of as a more thorough variant of
+ <function>bt_index_check</function>: unlike
+ <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
A <literal>ShareLock</> is required on the target index by
@@ -159,6 +165,70 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</> verification</title>
+ <para>
+ When the <parameter>heapallindexed</> argument to verification
+ functions is <literal>true</>, an additional phase of verification
+ is performed against the table associated with the target index
+ relation. This consists of a <quote>dummy</> <command>CREATE
+ INDEX</> operation, which checks for the presence of all would-be
+ new index tuples against a temporary, in-memory summarizing
+ structure (this is built when needed during the first, standard
+ phase). The summarizing structure <quote>fingerprints</> every
+ tuple found within the target index. The high level principle
+ behind <parameter>heapallindexed</> verification is that a new index
+ that is equivalent to the existing, target index must only have
+ entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</> phase adds significant
+ overhead: verification will typically take several times longer than
+ it would with only the standard consistency checking of the target
+ index's structure. However, verification will still take
+ significantly less time than an actual <command>CREATE INDEX</>.
+ There is no change to the relation-level locks acquired when
+ <parameter>heapallindexed</> verification is performed. The
+ summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect the
+ absence of any particular index tuple, approximately 2 bytes of
+ memory are needed per index tuple. As less memory is made available
+ per index tuple, the probability of missing an inconsistency
+ increases. This is considered an acceptable trade-off, since it
+ limits the overhead of verification very significantly, while only
+ slightly reducing the probability of detecting a problem, especially
+ for installations where verification is treated as a routine
+ maintenance task.
+ </para>
+ <para>
+ In many applications, even the default
+ <varname>maintenance_work_mem</varname> setting of <literal>64MB</>
+ will be sufficient to have less than a 2% probability of overlooking
+ any single absent or corrupt tuple. This will be the case when
+ there are no indexes with more than about 30 million distinct index
+ tuples, regardless of the overall size of any index, the total
+ number of indexes, or anything else. False positive candidate tuple
+ membership tests within the summarizing structure occur at random,
+ and are very unlikely to be the same for repeat verification
+ operations. Furthermore, within a single verification operation,
+ each missing or malformed index tuple independently has the same
+ chance of being detected. If there is any inconsistency at all, it
+ isn't particularly likely to be limited to a single tuple. All of
+ these factors favor accepting a limited per-operation, per-tuple
+ probability of missing corruption, in order to enable performing
+ more thorough index-to-heap verification more frequently (practical
+ concerns about the overhead of verification are likely to limit the
+ frequency of verification). In aggregate, the probability of
+ detecting a hardware fault or software defect actually
+ <emphasis>increases</> significantly with this strategy in most real
+ world cases. Moreover, frequent verification allows problems to be
+ caught earlier on average, which helps to limit the overall impact
+ of corruption, and often simplifies root cause analysis.
+ </para>
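(As a back-of-the-envelope check on the figures above -- this is just the
textbook Bloom filter estimate, not something the patch computes at run time --
the probability of a false positive membership test with m bits, n index
tuples, and k hash functions is approximately

    p \approx \left(1 - e^{-kn/m}\right)^{k}

With the full budget of 16 bits (2 bytes) per tuple and the implementation's
maximum of k = 10 hash functions, p works out to roughly 0.05%. If rounding
the bitset down to a power of two leaves closer to 8 bits per tuple, with
k = 6, p rises to roughly 2%, which is where the worst-case figure quoted
above comes from. Likewise, the default maintenance_work_mem of 64MB gives
30 million index tuples a little over 2 bytes each.)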
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</> effectively</title>
<para>
@@ -199,17 +269,31 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</> verification
+ is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be very
+ subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</> access method code or sort
- code.
+ underlying <productname>PostgreSQL</> access method code, sort
+ code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</> functions continuously
- when running the standard regression tests. See <xref
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</> functions continuously when
+ running the standard regression tests. See <xref
linkend="regress-run"> for details on running the tests.
</para>
</listitem>
@@ -242,6 +326,13 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</> is <literal>true</>, and heap
+ verification is performed, there is generally a greatly increased
+ chance of detecting single-bit errors, since strict binary
+ equality is tested, and the indexed attributes within the heap
+ are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</> can only prove the presence of
@@ -253,11 +344,9 @@ ORDER BY c.relpages DESC LIMIT 10;
<title>Repairing corruption</title>
<para>
No error concerning corruption raised by <filename>amcheck</> should
- ever be a false positive. In practice, <filename>amcheck</> is more
- likely to find software bugs than problems with hardware.
- <filename>amcheck</> raises errors in the event of conditions that,
- by definition, should never happen, and so careful analysis of
- <filename>amcheck</> errors is often required.
+ ever be a false positive. <filename>amcheck</> raises errors in the
+ event of conditions that, by definition, should never happen, and so
+ careful analysis of <filename>amcheck</> errors is often required.
</para>
<para>
There is no general method of repairing problems that
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index fc64153..565260f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -56,7 +56,7 @@ extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
extern bool FirstSnapshotSet;
-extern TransactionId TransactionXmin;
+extern PGDLLIMPORT TransactionId TransactionXmin;
extern TransactionId RecentXmin;
extern PGDLLIMPORT TransactionId RecentGlobalXmin;
extern TransactionId RecentGlobalDataXmin;
--
2.7.4
0001-Add-Bloom-filter-data-structure-implementation.patch
From 38930ff5cefb2915ff4ce294f3dc275530b4872d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
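To make the intended calling convention concrete, here is a minimal usage
sketch. It is not part of the patch itself; it only uses the functions
declared in bloomfilter.h below, and follows the same sizing and seeding
conventions as the amcheck patch:

#include "postgres.h"

#include "lib/bloomfilter.h"

/*
 * Illustration only: fingerprint a caller-supplied array of C strings, then
 * report which candidate strings are definitely absent from that set.
 */
static void
fingerprint_and_probe(char **known, int nknown, char **candidates, int ncand)
{
    bloom_filter *filter;
    int         i;

    /* ~2 bytes per element, within a 1MB (1024kB) budget; random seed */
    filter = bloom_create((int64) nknown, 1024, random());

    for (i = 0; i < nknown; i++)
        bloom_add_element(filter, (unsigned char *) known[i], strlen(known[i]));

    for (i = 0; i < ncand; i++)
    {
        if (bloom_lacks_element(filter, (unsigned char *) candidates[i],
                                strlen(candidates[i])))
            elog(NOTICE, "\"%s\" is definitely not in the set", candidates[i]);
        else
            elog(NOTICE, "\"%s\" is probably in the set", candidates[i]);
    }

    elog(DEBUG1, "proportion of bits set: %f", bloom_prop_bits_set(filter));

    bloom_free(filter);
}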
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 303 ++++++++++++++++++++++++++++++++++++++++++
src/include/lib/bloomfilter.h | 27 ++++
4 files changed, 334 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..7ba44c7
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. The caller must
+ * provide an estimate of the total size of the set when the Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS 10
+
+typedef struct bloom_filter
+{
+ /* K hash functions are used, which are randomly seeded */
+ int k_hash_funcs;
+ uint32 seed;
+ /* Bitset is sized directly in bits. It must be a power-of-two <= 2^32. */
+ int64 bitset_bits;
+ unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+} bloom_filter;
+
+static int my_bloom_power(int64 target_bitset_bits);
+static int optimal_k(int64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static uint32 sdbmhash(unsigned char *elem, size_t len);
+
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 512MB. The implementation is therefore well suited to data
+ * synchronization problems between unordered sets, where predictable
+ * performance is more important than worst case guarantees around false
+ * positives. Another problem that the implementation is well suited for is
+ * cache filtering where good performance already relies upon having a
+ * relatively small and/or low cardinality set of things that are interesting
+ * (with perhaps many more uninteresting things that never populate the
+ * filter).
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int bloom_power;
+ int64 bitset_bytes;
+ int64 bitset_bits;
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+ /* bitset_bits is int64 because 2^32 is greater than UINT32_MAX */
+ bitset_bits = INT64CONST(1) << bloom_power;
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+ /* Allocate bloom filter as all-zeroes */
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
+ filter->bitset_bits = bitset_bits;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
+ * This is a useful, generic indicator of whether or not a Bloom filter has
+ * summarized the set optimally within the available memory budget. If return
+ * value exceeds 0.5 significantly, then that's either because there was a
+ * dramatic underestimation of set size by the caller, or because available
+ * work_mem is very low relative to the size of the set (less than 2 bits per
+ * element).
+ *
+ * The value returned here should generally be close to 0.5, even when we have
+ * more than enough memory to ensure a false positive rate within target 1% to
+ * 2% band, since more hash functions are used as more memory is available per
+ * element.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = filter->bitset_bits / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) filter->bitset_bits;
+}
+
+/*
+ * Which element in the sequence of powers-of-two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB). This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions. It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(int64 target_bitset_bits)
+{
+ int bloom_power = -1;
+
+ while (target_bitset_bits > 0 && bloom_power < 32)
+ {
+ bloom_power++;
+ target_bitset_bits >>= 1;
+ }
+
+ return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(int64 bitset_bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bitset_bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; "enhanced double hashing"
+ * is used to make this work. See Dillinger & Manolios for details of why
+ * that's okay. "Building a Better Bloom Filter" by Kirsch & Mitzenmacher also
+ * has detailed analysis of the algorithm.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint32 hasha,
+ hashb;
+ int i;
+
+ hasha = DatumGetUInt32(hash_any(elem, len));
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
+
+ /* Mix seed value */
+ hasha += filter->seed;
+ /* Apply "MOD m" to avoid losing bits/out-of-bounds array access */
+ hasha = hasha % filter->bitset_bits;
+ hashb = hashb % filter->bitset_bits;
+
+ /* First hash */
+ hashes[0] = hasha;
+
+ /* Subsequent hashes */
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ hasha = (hasha + hashb) % filter->bitset_bits;
+ hashb = (hashb + i) % filter->bitset_bits;
+
+ /* Accumulate hash value for caller */
+ hashes[i] = hasha;
+ }
+}
+
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, size_t len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..f46f233
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Minimal Bloom filter
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+ uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
--
2.7.4
On Thu, Oct 5, 2017 at 7:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
v3 of the patch series, attached, does it that way -- it adds a
bloom_create(). The new bloom_create() function still allocates its
own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A
separate bloom_init() function (that works with dynamic shared memory)
could easily be added later, for the benefit of parallel hash join.
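To sketch what I have in mind there (purely illustrative -- none of this is in
the attached patches, and the name and signature are hypothetical), a
bloom_init() living next to bloom_create() in bloomfilter.c could reuse the
existing sizing helpers and simply initialize caller-provided memory in place:

bloom_filter *
bloom_init(void *mem, int64 total_elems, int bloom_work_mem, uint32 seed)
{
    bloom_filter *filter = (bloom_filter *) mem;
    int64       bitset_bytes;
    int64       bitset_bits;

    /* same sizing rules as bloom_create() */
    bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
    bitset_bytes = Max(1024L * 1024L, bitset_bytes);
    bitset_bits = INT64CONST(1) << my_bloom_power(bitset_bytes * BITS_PER_BYTE);
    bitset_bytes = bitset_bits / BITS_PER_BYTE;

    /*
     * Caller is assumed to have allocated offsetof(bloom_filter, bitset) +
     * bitset_bytes bytes (a companion "estimate size" function would expose
     * that figure), e.g. within a DSM segment.
     */
    memset(filter->bitset, 0, bitset_bytes);
    filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
    filter->seed = seed;
    filter->bitset_bits = bitset_bits;

    return filter;
}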
Since Peter E's work on making the documentation sgml files more
XML-like has broken the v3 patch doc build, attached is v4, which
fixes this bit rot. It also has a few small tweaks here and there to
the docs. Nothing worth noting specifically, really -- I just don't
like to leave my patches with bit rot for long. (Hat-tip to Thomas
Munro for making this easy to detect with his new CF continuous
integration tooling.)
I should point out that I shipped virtually the same code yesterday,
as v1.1 of the Github version of amcheck (also known as amcheck_next).
Early adopters will be able to use this new "heapallindexed"
functionality in the next few days, once packages become available for
the apt and yum community repos. Just as before, the Github version
will work on versions of Postgres >= 9.4.
This seems like good timing on my part, because we know that this new
"heapallindexed" verification will detect the "freeze the dead" bugs
that the next point release is set to have fixes for -- that is
actually kind of how one of the bugs was found [1]. We may even want
to advertise the availability of this check within amcheck_next, in the
release notes for the next Postgres point release.
[1]: /messages/by-id/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
--
Peter Geoghegan
Attachments:
0002-Add-amcheck-verification-of-indexes-against-heap.patch
From 7906c7391a9f52d334c2cbc7d3e245ff014629f2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
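In outline, the flow is as follows (these fragments are condensed from the
verify_nbtree.c diff below, with declarations and most error-handling detail
omitted):

/* 1. While walking the index structure, fingerprint every leaf tuple: */
if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
    bloom_add_element(state->filter, (unsigned char *) itup,
                      IndexTupleSize(itup));

/* 2. Then scan the heap as CREATE INDEX would, probing the filter from
 *    the per-tuple callback: */
IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
                   bt_tuple_present_callback, (void *) state);

/* 3. Inside bt_tuple_present_callback(), a heap tuple whose would-be index
 *    tuple is definitely absent from the filter is reported as corruption: */
itup = index_form_tuple(RelationGetDescr(index), values, isnull);
itup->t_tid = htup->t_self;
if (bloom_lacks_element(state->filter, (unsigned char *) itup,
                        IndexTupleSize(itup)))
    ereport(ERROR,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg("heap tuple lacks matching index tuple")));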
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 +++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 298 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 173 ++++++++++++++----
src/include/utils/snapmgr.h | 2 +-
8 files changed, 454 insertions(+), 74 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 868c14e..8e57d2e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later probes the Bloom filter to verify
+ * that every heap tuple that ought to have an index tuple does in fact have one.
+ *
*
* Copyright (c) 2017, PostgreSQL Global Development Group
*
@@ -18,11 +23,13 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -43,9 +50,10 @@ PG_MODULE_MAGIC;
* target is the point of reference for a verification operation.
*
* Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages). Conceptually, problems are only
- * ever found in the current target page. Each page found by verification's
- * left/right, top/bottom scan becomes the target exactly once.
+ * they are current target's child pages). Conceptually, problems are only
+ * ever found in the current target page (or for a particular heap tuple during
+ * heapallindexed verification). Each page found by verification's left/right,
+ * top/bottom scan becomes the target exactly once.
*/
typedef struct BtreeCheckState
{
@@ -53,10 +61,13 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* Also verifying heap has no unindexed tuples? */
+ bool heapallindexed;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +83,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +112,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +141,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +190,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -191,9 +226,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/*
* Since we did the IndexGetRelation call above without any lock, it's
* barely possible that a race against an index drop/recreation could have
- * netted us the wrong table. Although the table itself won't actually be
- * examined during verification currently, a recheck still seems like a
- * good idea.
+ * netted us the wrong table.
*/
if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
ereport(ERROR,
@@ -204,8 +237,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/* Relation suitable for checking as B-Tree? */
btree_index_checkable(indrel);
- /* Check index */
- bt_check_every_level(indrel, parentcheck);
+ /* Check index, possibly against table it is an index on */
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +286,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,7 +308,8 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
@@ -283,15 +320,35 @@ bt_check_every_level(Relation rel, bool readonly)
/*
* RecentGlobalXmin assertion matches index_getnext_tid(). See note on
* RecentGlobalXmin/B-Tree page deletion.
+ *
+ * We also rely on TransactionXmin having been initialized by now.
*/
Assert(TransactionIdIsValid(RecentGlobalXmin));
+ Assert(TransactionIdIsNormal(TransactionXmin));
/*
* Initialize state for entire verification operation
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -347,6 +404,61 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo;
+
+ if (state->readonly)
+ elog(DEBUG1, "verifying presence of all required tuples in index \"%s\"",
+ RelationGetRelationName(rel));
+ else
+ elog(DEBUG1, "verifying presence of required tuples in index \"%s\" with xmin before %u",
+ RelationGetRelationName(rel), TransactionXmin);
+
+ indexinfo = BuildIndexInfo(state->rel);
+
+ /*
+ * Force use of MVCC snapshot (reuse CONCURRENTLY infrastructure) when
+ * only AccessShareLocks are held. It seems like a good idea not to
+ * diverge from the expected heap lock strength.
+ */
+ indexinfo->ii_Concurrent = !state->readonly;
+
+ /*
+ * Don't wait for uncommitted tuple xact commit/abort when index is a
+ * unique index (or an index used by an exclusion constraint). It is
+ * supposed to be impossible to get duplicates with the already-defined
+ * unique index in place. Our relation-level locks prevent races
+ * resulting in false positive corruption errors where an IndexTuple
+ * insertion was just missed, but we still test its heap tuple. (While
+ * this would not be true for !readonly verification, it doesn't matter
+ * because CREATE INDEX CONCURRENTLY index build heap scanning has no
+ * special treatment for unique indexes to avoid.)
+ *
+ * Not waiting can only affect verification of indexes on system
+ * catalogs, where heavyweights locks can be dropped before transaction
+ * commit. If anything, avoiding waiting slightly improves test
+ * coverage.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_ExclusionOps = NULL;
+ indexinfo->ii_ExclusionProcs = NULL;
+ indexinfo->ii_ExclusionStrats = NULL;
+
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+ state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+ RelationGetRelationName(heaprel))));
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -499,7 +611,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -546,6 +658,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -589,6 +704,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* Fingerprint leaf page tuples (those that point to the heap) */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -682,8 +802,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1062,6 +1184,134 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
+ *
+ * Theory of operation:
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation
+ * (Index tuple formation is assumed to be deterministic, and IndexTuples are
+ * assumed immutable; while the LP_DEAD bit is mutable, that's ItemId metadata,
+ * which is not fingerprinted). There will often be some dead-to-everyone
+ * IndexTuples fingerprinted by the Bloom filter, but we only try to detect the
+ * *absence of needed tuples*, so that's okay.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption). Index corruption may still be detected here,
+ * though. Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers). The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records). The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here). The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located. This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed. When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit. (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Note also that heap-only tuple handling within IndexBuildHeapScan() detects
+ * index tuples that contain the wrong values. This can happen when there is
+ * no superseding index tuple due to a faulty assessment of HOT safety.
+ * Because the latest tuple's contents are used with the root TID, an error
+ * will be raised when a tuple with the same TID but different (correct)
+ * attribute values is passed back to us. (Faulty assessment of HOT-safety was
+ * behind the CREATE INDEX CONCURRENTLY bug that was fixed in February of
+ * 2017.)
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /* Must recheck visibility when only AccessShareLock held */
+ if (!state->readonly)
+ {
+ TransactionId xmin;
+
+ /*
+ * Don't test for presence in index where xmin not at least old enough
+ * that we know for sure that absence of index tuple wasn't just due to
+ * some transaction performing insertion after our verifying index
+ * traversal began. (Actually, the cut-off used is a point where
+ * preceding write transactions must have committed/aborted. We should
+ * have already fingerprinted all index tuples for all such preceding
+ * transactions, because the cut-off was established before our index
+ * traversal even began.)
+ *
+ * You might think that the fact that an MVCC snapshot is used by the
+ * heap scan (due to our indicating that this is the first scan of a
+ * CREATE INDEX CONCURRENTLY index build) would make this test
+ * redundant. That's not quite true, because with current
+ * IndexBuildHeapScan() interface caller cannot do the MVCC snapshot
+ * acquisition itself. Heap tuple coverage is thereby similar to the
+ * coverage we could get by using earliest transaction snapshot
+ * directly. It's easier to do this than to adapt the
+ * IndexBuildHeapScan() interface to our narrow requirements.
+ */
+ Assert(tupleIsAlive);
+ xmin = HeapTupleHeaderGetXmin(htup->t_data);
+ if (!TransactionIdPrecedes(xmin, TransactionXmin))
+ return;
+ }
+
+ /*
+ * Generate an index tuple.
+ *
+ * Note that we rely on deterministic index_form_tuple() TOAST compression.
+ * If index_form_tuple() was ever enhanced to compress datums out-of-line,
+ * or otherwise varied when or how compression was applied, our assumption
+ * would break, leading to false positive reports of corruption. For now,
+ * we don't decompress/normalize toasted values as part of fingerprinting.
+ */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid)),
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel)),
+ !state->readonly
+ ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+ : 0));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 0dd68f0..bff1116 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,9 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
+ c.relname,
+ c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,9 +85,11 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</quote>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ unique indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
@@ -95,10 +99,11 @@ ORDER BY c.relpages DESC LIMIT 10;
is the same lock mode acquired on relations by simple
<literal>SELECT</literal> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
- routine, lightweight test for corruption is required in a live
- production environment, using
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</parameter> is
+ <literal>true</literal>. When a routine, lightweight test for
+ corruption is required in a live production environment, using
<function>bt_index_check</function> often provides the best
trade-off between thoroughness of verification and limiting the
impact on application performance and availability.
@@ -108,7 +113,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,30 +122,34 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</parameter>
+ argument is <literal>true</literal>, the function verifies the
+ presence of all heap tuples that should be found within the
+ index. The checks performed by
+ <function>bt_index_parent_check</function> are a superset of the
+ checks performed by <function>bt_index_check</function> when
+ called with the same options.
<function>bt_index_parent_check</function> can be thought of as
a more thorough variant of <function>bt_index_check</function>:
unlike <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
- A <literal>ShareLock</literal> is required on the target index by
- <function>bt_index_parent_check</function> (a
- <literal>ShareLock</literal> is also acquired on the heap relation).
- These locks prevent concurrent data modification from
- <command>INSERT</command>, <command>UPDATE</command>, and <command>DELETE</command>
- commands. The locks also prevent the underlying relation from
- being concurrently processed by <command>VACUUM</command>, as well as
- all other utility commands. Note that the function holds locks
- only while running, not for the entire transaction.
+ A <literal>ShareLock</literal> is required on the target index
+ by <function>bt_index_parent_check</function> (a
+ <literal>ShareLock</literal> is also acquired on the heap
+ relation). These locks prevent concurrent data modification
+ from <command>INSERT</command>, <command>UPDATE</command>, and
+ <command>DELETE</command> commands. The locks also prevent the
+ underlying relation from being concurrently processed by
+ <command>VACUUM</command>, as well as all other utility
+ commands. Note that the function holds locks only while
+ running, not for the entire transaction.
</para>
<para>
<function>bt_index_parent_check</function>'s additional
@@ -159,6 +168,72 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</parameter> verification</title>
+ <para>
+ When the <parameter>heapallindexed</parameter> argument to
+ verification functions is <literal>true</literal>, an additional
+ phase of verification is performed against the table associated with
+ the target index relation. This consists of a <quote>dummy</quote>
+ <command>CREATE INDEX</command> operation, which checks for the
+ presence of all would-be new index tuples against a temporary,
+ in-memory summarizing structure (this is built when needed during
+ the first, standard phase). The summarizing structure
+ <quote>fingerprints</quote> every tuple found within the target
+ index. The high level principle behind
+ <parameter>heapallindexed</parameter> verification is that a new
+ index that is equivalent to the existing, target index must only
+ have entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</parameter> phase adds
+ significant overhead: verification will typically take several times
+ longer than it would with only the standard consistency checking of
+ the target index's structure. However, verification will still take
+ significantly less time than an actual <command>CREATE
+ INDEX</command>. There is no change to the relation-level locks
+ acquired when <parameter>heapallindexed</parameter> verification is
+ performed. The summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect the
+ absence of any particular index tuple, approximately 2 bytes of
+ memory are needed per index tuple. As less memory is made available
+ per index tuple, the probability of missing an inconsistency
+ increases. This is considered an acceptable trade-off, since it
+ limits the overhead of verification very significantly, while only
+ slightly reducing the probability of detecting a problem, especially
+ for installations where verification is treated as a routine
+ maintenance task.
+ </para>
+ <para>
+ With many databases, even the default
+ <varname>maintenance_work_mem</varname> setting of
+ <literal>64MB</literal> is sufficient to have less than a 2%
+ probability of overlooking any single absent or corrupt tuple. This
+ will be the case when there are no indexes with more than about 30
+ million distinct index tuples, regardless of the overall size of any
+ index, the total number of indexes, or anything else. False
+ positive candidate tuple membership tests within the summarizing
+ structure occur at random, and are very unlikely to be the same for
+ repeat verification operations. Furthermore, within a single
+ verification operation, each missing or malformed index tuple
+ independently has the same chance of being detected. If there is
+ any inconsistency at all, it isn't particularly likely to be limited
+ to a single tuple. All of these factors favor accepting a limited
+ per operation per tuple probability of missing corruption, in order
+ to enable performing more thorough index to heap verification more
+ frequently (practical concerns about the overhead of verification
+ are likely to limit the frequency of verification). In aggregate,
+ the probability of detecting a hardware fault or software defect
+ actually <emphasis>increases</emphasis> significantly with this
+ strategy in most real world cases. Moreover, frequent verification
+ allows problems to be caught earlier on average, which helps to
+ limit the overall impact of corruption, and often simplifies root
+ cause analysis.
+ </para>
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</filename> effectively</title>
<para>
@@ -199,18 +274,33 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</parameter>
+ verification is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be very
+ subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</productname> access method code or sort
- code.
+ underlying <productname>PostgreSQL</productname> access method
+ code, sort code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</productname> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</filename> functions continuously
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</filename> functions continuously
when running the standard regression tests. See <xref
- linkend="regress-run"> for details on running the tests.
+ linkend="regress-run"> for details on running
+ the tests.
</para>
</listitem>
<listitem>
@@ -242,6 +332,12 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</parameter> verification is
+ performed, there is generally a greatly increased chance of
+ detecting single-bit errors, since strict binary equality is
+ tested, and the indexed attributes within the heap are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</filename> can only prove the presence of
@@ -252,12 +348,11 @@ ORDER BY c.relpages DESC LIMIT 10;
<sect2>
<title>Repairing corruption</title>
<para>
- No error concerning corruption raised by <filename>amcheck</filename> should
- ever be a false positive. In practice, <filename>amcheck</filename> is more
- likely to find software bugs than problems with hardware.
- <filename>amcheck</filename> raises errors in the event of conditions that,
- by definition, should never happen, and so careful analysis of
- <filename>amcheck</filename> errors is often required.
+ No error concerning corruption raised by <filename>amcheck</filename> should
+ ever be a false positive. <filename>amcheck</filename> raises
+ errors in the event of conditions that, by definition, should never
+ happen, and so careful analysis of <filename>amcheck</filename>
+ errors is often required.
</para>
<para>
There is no general method of repairing problems that
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index fc64153..565260f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -56,7 +56,7 @@ extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
extern bool FirstSnapshotSet;
-extern TransactionId TransactionXmin;
+extern PGDLLIMPORT TransactionId TransactionXmin;
extern TransactionId RecentXmin;
extern PGDLLIMPORT TransactionId RecentGlobalXmin;
extern TransactionId RecentGlobalDataXmin;
--
2.7.4
Attachment: 0001-Add-Bloom-filter-data-structure-implementation.patch
From df0f669dfc8479398499a2d01d2be8fc8ab6fd47 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 303 ++++++++++++++++++++++++++++++++++++++++++
src/include/lib/bloomfilter.h | 27 ++++
4 files changed, 334 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..6344030
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS 10
+
+typedef struct bloom_filter
+{
+ /* K hash functions are used, which are randomly seeded */
+ int k_hash_funcs;
+ uint32 seed;
+ /* Bitset is sized directly in bits. It must be a power-of-two <= 2^32. */
+ int64 bitset_bits;
+ unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+} bloom_filter;
+
+static int my_bloom_power(int64 target_bitset_bits);
+static int optimal_k(int64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static uint32 sdbmhash(unsigned char *elem, size_t len);
+
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 512MB. The implementation is therefore well suited to data
+ * synchronization problems between unordered sets, where predictable
+ * performance is more important than worst case guarantees around false
+ * positives. Another problem that the implementation is well suited for is
+ * cache filtering where good performance already relies upon having a
+ * relatively small and/or low cardinality set of things that are interesting
+ * (with perhaps many more uninteresting things that never populate the
+ * filter).
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int bloom_power;
+ int64 bitset_bytes;
+ int64 bitset_bits;
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+ /* bitset_bits is int64 because 2^32 is greater than UINT32_MAX */
+ bitset_bits = INT64CONST(1) << bloom_power;
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+ /* Allocate bloom filter as all-zeroes */
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
+ filter->bitset_bits = bitset_bits;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
+ * This is a useful, generic indicator of whether or not a Bloom filter has
+ * summarized the set optimally within the available memory budget. If return
+ * value exceeds 0.5 significantly, then that's either because there was a
+ * dramatic underestimation of set size by the caller, or because available
+ * work_mem is very low relative to the size of the set (less than 2 bits per
+ * element).
+ *
+ * The value returned here should generally be close to 0.5, even when we have
+ * more than enough memory to ensure a false positive rate within target 1% to
+ * 2% band, since more hash functions are used as more memory is available per
+ * element.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = filter->bitset_bits / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) filter->bitset_bits;
+}
+
+/*
+ * Which element in the sequence of powers-of-two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB). This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions. It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(int64 target_bitset_bits)
+{
+ int bloom_power = -1;
+
+ while (target_bitset_bits > 0 && bloom_power < 32)
+ {
+ bloom_power++;
+ target_bitset_bits >>= 1;
+ }
+
+ return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(int64 bitset_bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bitset_bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; "enhanced double hashing"
+ * is used to make this work. See Dillinger & Manolios for details of why
+ * that's okay. "Building a Better Bloom Filter" by Kirsch & Mitzenmacher also
+ * has detailed analysis of the algorithm.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint32 hasha,
+ hashb;
+ int i;
+
+ hasha = DatumGetUInt32(hash_any(elem, len));
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
+
+ /* Mix seed value */
+ hasha += filter->seed;
+ /* Apply "MOD m" to avoid losing bits/out-of-bounds array access */
+ hasha = hasha % filter->bitset_bits;
+ hashb = hashb % filter->bitset_bits;
+
+ /* First hash */
+ hashes[0] = hasha;
+
+ /* Subsequent hashes */
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ hasha = (hasha + hashb) % filter->bitset_bits;
+ hashb = (hashb + i) % filter->bitset_bits;
+
+ /* Accumulate hash value for caller */
+ hashes[i] = hasha;
+ }
+}
+
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, size_t len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..f46f233
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Minimal Bloom filter
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+ uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
--
2.7.4
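For anyone who wants to kick the tires on the bloomfilter.h API just posted, here is a minimal usage sketch (illustration only, not part of either patch; a real caller would of course hash its own elements rather than a local variable):

#include "postgres.h"

#include "lib/bloomfilter.h"

static void
bloom_usage_sketch(void)
{
    bloom_filter *filter;
    int64       val = 42;
    int64       other = -1;

    /* Expect ~1M elements; 1MB budget (in KB, like work_mem); fixed seed */
    filter = bloom_create(1000000, 1024, 0);

    /* Fingerprint an element by hashing its raw bytes */
    bloom_add_element(filter, (unsigned char *) &val, sizeof(val));

    /* No false negatives: an added element is never reported as absent */
    Assert(!bloom_lacks_element(filter, (unsigned char *) &val, sizeof(val)));

    /* False positives possible: an absent element is only *usually* absent */
    if (bloom_lacks_element(filter, (unsigned char *) &other, sizeof(other)))
        elog(DEBUG1, "element definitely not in set");

    elog(DEBUG1, "proportion of bits set: %f", bloom_prop_bits_set(filter));

    bloom_free(filter);
}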
On Sat, Oct 21, 2017 at 9:34 AM, Peter Geoghegan <pg@bowt.ie> wrote:
I should point out that I shipped virtually the same code yesterday,
as v1.1 of the Github version of amcheck (also known as amcheck_next).
Early adopters will be able to use this new "heapallindexed"
functionality in the next few days, once packages become available for
the apt and yum community repos. Just as before, the Github version
will work on versions of Postgres >= 9.4.
This seems like good timing on my part, because we know that this new
"heapallindexed" verification will detect the "freeze the dead" bugs
that the next point release is set to have fixes for -- that is
actually kind of how one of the bugs was found [1]. We may even want
to advertise the availability of this check within amcheck_next, in the
release notes for the next Postgres point release.
My apologies for slacking here. I would still welcome some regression
tests to stress the bloom API you are proposing! For now I am moving
this patch to next CF.
--
Michael
On Tue, Nov 28, 2017 at 9:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
My apologies for slacking here. I would still welcome some regression
tests to stress the bloom API you are proposing! For now I am moving
this patch to next CF.
I still don't think that regression tests as such make sense. However,
it seems like it might be a good idea to add a test harness for the
Bloom filter code. I actually wrote code like this for myself during
development, that could be cleaned up. The harness can live in
source/src/test/modules/test_bloom_filter. We already do this for the
red-black tree library code, for example, and it seems like good
practice.
Would that address your concern? There would be an SQL interface, but
it would be trivial.
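To make that concrete, the stub could look something like this (names and arguments are placeholders at this point; the interesting part -- exercising bloom_add_element() against bloom_lacks_element() and counting false positives -- all happens in C):

/* test_bloomfilter.c -- sketch only */
#include "postgres.h"

#include "fmgr.h"
#include "lib/bloomfilter.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(test_bloomfilter);

Datum
test_bloomfilter(PG_FUNCTION_ARGS)
{
    int64       nelements = PG_GETARG_INT64(0);
    uint32      seed = (uint32) PG_GETARG_INT32(1);
    bloom_filter *filter;
    int64       fpositives = 0;
    int64       i;

    filter = bloom_create(nelements, maintenance_work_mem, seed);

    /* Non-negative values are members of the set */
    for (i = 0; i < nelements; i++)
        bloom_add_element(filter, (unsigned char *) &i, sizeof(i));

    /* Negative values were never added, so any hit is a false positive */
    for (i = -1; i >= -nelements; i--)
        if (!bloom_lacks_element(filter, (unsigned char *) &i, sizeof(i)))
            fpositives++;

    elog(DEBUG1, "false positives: " INT64_FORMAT " (rate: %f, proportion bits set: %f)",
         fpositives, (double) fpositives / nelements,
         bloom_prop_bits_set(filter));

    bloom_free(filter);

    PG_RETURN_VOID();
}

The SQL-level function definition in the extension script would then be a one-liner wrapping that symbol.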
--
Peter Geoghegan
On Wed, Nov 29, 2017 at 2:48 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I still don't think that regression tests as such make sense. However,
it seems like it might be a good idea to add a test harness for the
Bloom filter code. I actually wrote code like this for myself during
development, that could be cleaned up. The harness can live in
source/src/test/modules/test_bloom_filter. We already do this for the
red-black tree library code, for example, and it seems like good
practice.
Would that address your concern? There would be an SQL interface, but
it would be trivial.
That's exactly what I think you should do, and mentioned so upthread.
A SQL interface can also show a good example of how developers can use
this API.
--
Michael
On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Would that address your concern? There would be an SQL interface, but
it would be trivial.
That's exactly what I think you should do, and mentioned so upthread.
A SQL interface can also show a good example of how developers can use
this API.
My understanding of your earlier remarks, rightly or wrongly, was that
you wanted me to adopt the Bloom filter to actually be usable from SQL
in some kind of general way. As opposed to what I just said -- adding
a stub SQL interface that simply invokes the test harness, with all
the heavy lifting taking place in C code.
Obviously these are two very different things. I'm quite happy to add
the test harness.
--
Peter Geoghegan
On Wed, Nov 29, 2017 at 2:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
My understanding of your earlier remarks, rightly or wrongly, was that
you wanted me to adopt the Bloom filter to actually be usable from SQL
in some kind of general way. As opposed to what I just said -- adding
a stub SQL interface that simply invokes the test harness, with all
the heavy lifting taking place in C code.
Obviously these are two very different things. I'm quite happy to add
the test harness.
Quote from this email:
/messages/by-id/CAB7nPqSUKppzvNSHY1OM_TdSj0UE18xNFCrOwPC3E8svq7Mb_Q@mail.gmail.com
One first thing striking me is that there is no test for this
implementation, which would be a base stone for other things. It would
be nice to validate that things are working properly before moving on
with 0002, and 0001 is a feature on its own. I don't think that it
would be complicated to have a small module in src/test/modules which
plugs in a couple of SQL functions on top of bloomfilter.h.
My apologies if this sounded like having a set of SQL functions in
core; I meant a test suite from the beginning with an extension
creating the interface or such.
--
Michael
On Tue, Nov 28, 2017 at 9:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Would that address your concern? There would be an SQL interface, but
it would be trivial.
That's exactly what I think you should do, and mentioned so upthread.
A SQL interface can also show a good example of how developers can use
this API.
Attached revision, v5, adds a new test harness -- test_bloomfilter.
This can be used to experimentally verify that the implementation meets the well
known "1% false positive rate with 9.6 bits per element" standard. It
manages to do exactly that:
postgres=# set client_min_messages = 'debug1';
SET
postgres=# SELECT test_bloomfilter(power => 23, nelements => 873813,
seed => -1, tests => 3);
DEBUG: beginning test #1...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8630 (rate: 0.009876, proportion bits set:
0.517625, seed: 1373191603)
DEBUG: beginning test #2...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8623 (rate: 0.009868, proportion bits set:
0.517623, seed: 406665822)
DEBUG: beginning test #3...
DEBUG: bloom_work_mem (KB): 1024
WARNING: false positives: 8840 (rate: 0.010117, proportion bits set:
0.517748, seed: 398116374)
test_bloomfilter
------------------
(1 row)
Here, we repeat the same test 3 times, varying only the seed value
used for each run.
The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).
I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.
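For anyone who wants to check that arithmetic against the textbook approximation, here is a quick standalone sketch (the numbers are mine, not from the patch; it borrows the rounding rule from optimal_k() in bloomfilter.c):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  bits_per_element = 9.6;
    /* optimal_k() rounds ln(2) * m/n to the nearest integer */
    int     k = (int) round(log(2.0) * bits_per_element);
    /* Proportion of bits set converges on 1 - e^(-k/b) */
    double  prop_set = 1.0 - exp(-k / bits_per_element);
    /* A false positive requires all k probes to land on set bits */
    double  fp_rate = pow(prop_set, k);

    /* Prints k = 7, prop_set = 0.518, fp_rate = 0.010 */
    printf("k = %d, prop_set = %.3f, fp_rate = %.3f\n", k, prop_set, fp_rate);
    return 0;
}

That lands on a proportion of bits set of about 0.518 and a false positive
rate of about 1%, which closely matches the test output shown above.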
There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.
--
Peter Geoghegan
Attachments:
0002-Add-amcheck-verification-of-indexes-against-heap.patch
From 6d55e1c68a3685be1f11aff0842d354f06880e82 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 +++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 298 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 168 +++++++++++++----
7 files changed, 450 insertions(+), 71 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 868c14e..8e57d2e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
*
* Copyright (c) 2017, PostgreSQL Global Development Group
*
@@ -18,11 +23,13 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -43,9 +50,10 @@ PG_MODULE_MAGIC;
* target is the point of reference for a verification operation.
*
* Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages). Conceptually, problems are only
- * ever found in the current target page. Each page found by verification's
- * left/right, top/bottom scan becomes the target exactly once.
+ * they are current target's child pages). Conceptually, problems are only
+ * ever found in the current target page (or for a particular heap tuple during
+ * heapallindexed verification). Each page found by verification's left/right,
+ * top/bottom scan becomes the target exactly once.
*/
typedef struct BtreeCheckState
{
@@ -53,10 +61,13 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* Also verifying heap has no unindexed tuples? */
+ bool heapallindexed;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +83,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +112,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +141,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +190,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -191,9 +226,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/*
* Since we did the IndexGetRelation call above without any lock, it's
* barely possible that a race against an index drop/recreation could have
- * netted us the wrong table. Although the table itself won't actually be
- * examined during verification currently, a recheck still seems like a
- * good idea.
+ * netted us the wrong table.
*/
if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
ereport(ERROR,
@@ -204,8 +237,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/* Relation suitable for checking as B-Tree? */
btree_index_checkable(indrel);
- /* Check index */
- bt_check_every_level(indrel, parentcheck);
+ /* Check index, possibly against table it is an index on */
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +286,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,7 +308,8 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
@@ -283,15 +320,35 @@ bt_check_every_level(Relation rel, bool readonly)
/*
* RecentGlobalXmin assertion matches index_getnext_tid(). See note on
* RecentGlobalXmin/B-Tree page deletion.
+ *
+ * We also rely on TransactionXmin having been initialized by now.
*/
Assert(TransactionIdIsValid(RecentGlobalXmin));
+ Assert(TransactionIdIsNormal(TransactionXmin));
/*
* Initialize state for entire verification operation
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -347,6 +404,61 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo;
+
+ if (state->readonly)
+ elog(DEBUG1, "verifying presence of all required tuples in index \"%s\"",
+ RelationGetRelationName(rel));
+ else
+ elog(DEBUG1, "verifying presence of required tuples in index \"%s\" with xmin before %u",
+ RelationGetRelationName(rel), TransactionXmin);
+
+ indexinfo = BuildIndexInfo(state->rel);
+
+ /*
+ * Force use of MVCC snapshot (reuse CONCURRENTLY infrastructure) when
+ * only AccessShareLocks held. It seems like a good idea to not
+ * diverge from expected heap lock strength.
+ */
+ indexinfo->ii_Concurrent = !state->readonly;
+
+ /*
+ * Don't wait for uncommitted tuple xact commit/abort when index is a
+ * unique index (or an index used by an exclusion constraint). It is
+ * supposed to be impossible to get duplicates with the already-defined
+ * unique index in place. Our relation-level locks prevent races
+ * resulting in false positive corruption errors where an IndexTuple
+ * insertion was just missed, but we still test its heap tuple. (While
+ * this would not be true for !readonly verification, it doesn't matter
+ * because CREATE INDEX CONCURRENTLY index build heap scanning has no
+ * special treatment for unique indexes to avoid.)
+ *
+ * Not waiting can only affect verification of indexes on system
+ * catalogs, where heavyweight locks can be dropped before transaction
+ * commit. If anything, avoiding waiting slightly improves test
+ * coverage.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_ExclusionOps = NULL;
+ indexinfo->ii_ExclusionProcs = NULL;
+ indexinfo->ii_ExclusionStrats = NULL;
+
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+ state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+ RelationGetRelationName(heaprel))));
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -499,7 +611,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -546,6 +658,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -589,6 +704,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* Fingerprint leaf page tuples (those that point to the heap) */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -682,8 +802,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1062,6 +1184,134 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
+ *
+ * Theory of operation:
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation
+ * (Index tuple formation is assumed to be deterministic, and IndexTuples are
+ * assumed immutable; while the LP_DEAD bit is mutable, that's ItemId metadata,
+ * which is not fingerprinted). There will often be some dead-to-everyone
+ * IndexTuples fingerprinted by the Bloom filter, but we only try to detect the
+ * *absence of needed tuples*, so that's okay.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption). Index corruption may still be detected here,
+ * though. Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers). The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records). The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here). The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located. This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed. When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit. (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Note also that heap-only tuple handling within IndexBuildHeapScan() detects
+ * index tuples that contain the wrong values. This can happen when there is
+ * no superseding index tuple due to a faulty assessment of HOT safety.
+ * Because the latest tuple's contents are used with the root TID, an error
+ * will be raised when a tuple with the same TID but different (correct)
+ * attribute values is passed back to us. (Faulty assessment of HOT-safety was
+ * behind the CREATE INDEX CONCURRENTLY bug that was fixed in February of
+ * 2017.)
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /* Must recheck visibility when only AccessShareLock held */
+ if (!state->readonly)
+ {
+ TransactionId xmin;
+
+ /*
+ * Don't test for presence in the index unless xmin is old enough that
+ * we know for sure that the absence of an index tuple wasn't just due
+ * to some transaction performing insertion after our verifying index
+ * traversal began.  (Actually, the cut-off used is a point where
+ * preceding write transactions must have committed/aborted.  We should
+ * have already fingerprinted all index tuples for all such preceding
+ * transactions, because the cut-off was established before our index
+ * traversal even began.)
+ *
+ * You might think that the fact that an MVCC snapshot is used by the
+ * heap scan (due to our indicating that this is the first scan of a
+ * CREATE INDEX CONCURRENTLY index build) would make this test
+ * redundant.  That's not quite true, because with the current
+ * IndexBuildHeapScan() interface the caller cannot perform the MVCC
+ * snapshot acquisition itself.  Heap tuple coverage is thereby similar
+ * to the coverage we could get by using the earliest transaction
+ * snapshot directly.  It's easier to do this than to adapt the
+ * IndexBuildHeapScan() interface to our narrow requirements.
+ */
+ Assert(tupleIsAlive);
+ xmin = HeapTupleHeaderGetXmin(htup->t_data);
+ if (!TransactionIdPrecedes(xmin, TransactionXmin))
+ return;
+ }
+
+ /*
+ * Generate an index tuple.
+ *
+ * Note that we rely on deterministic index_form_tuple() TOAST compression.
+ * If index_form_tuple() were ever enhanced to compress datums out-of-line,
+ * or otherwise varied when or how compression is applied, this assumption
+ * would break, leading to false positive reports of corruption.  For now,
+ * we don't decompress/normalize toasted values as part of fingerprinting.
+ */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid)),
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel)),
+ !state->readonly
+ ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+ : 0));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 852e260..ada9d77 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,9 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
+ c.relname,
+ c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,9 +85,11 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</quote>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ unique indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
@@ -95,10 +99,11 @@ ORDER BY c.relpages DESC LIMIT 10;
is the same lock mode acquired on relations by simple
<literal>SELECT</literal> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
- routine, lightweight test for corruption is required in a live
- production environment, using
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</parameter> is
+ <literal>true</literal>. When a routine, lightweight test for
+ corruption is required in a live production environment, using
<function>bt_index_check</function> often provides the best
trade-off between thoroughness of verification and limiting the
impact on application performance and availability.
@@ -108,7 +113,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,30 +122,34 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</parameter>
+ argument is <literal>true</literal>, the function verifies the
+ presence of all heap tuples that should be found within the
+ index. The checks performed by
+ <function>bt_index_parent_check</function> are a superset of the
+ checks performed by <function>bt_index_check</function> when
+ called with the same options.
<function>bt_index_parent_check</function> can be thought of as
a more thorough variant of <function>bt_index_check</function>:
unlike <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
- A <literal>ShareLock</literal> is required on the target index by
- <function>bt_index_parent_check</function> (a
- <literal>ShareLock</literal> is also acquired on the heap relation).
- These locks prevent concurrent data modification from
- <command>INSERT</command>, <command>UPDATE</command>, and <command>DELETE</command>
- commands. The locks also prevent the underlying relation from
- being concurrently processed by <command>VACUUM</command>, as well as
- all other utility commands. Note that the function holds locks
- only while running, not for the entire transaction.
+ A <literal>ShareLock</literal> is required on the target index
+ by <function>bt_index_parent_check</function> (a
+ <literal>ShareLock</literal> is also acquired on the heap
+ relation). These locks prevent concurrent data modification
+ from <command>INSERT</command>, <command>UPDATE</command>, and
+ <command>DELETE</command> commands. The locks also prevent the
+ underlying relation from being concurrently processed by
+ <command>VACUUM</command>, as well as all other utility
+ commands. Note that the function holds locks only while
+ running, not for the entire transaction.
</para>
<para>
<function>bt_index_parent_check</function>'s additional
@@ -159,6 +168,72 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</parameter> verification</title>
+ <para>
+ When the <parameter>heapallindexed</parameter> argument to
+ verification functions is <literal>true</literal>, an additional
+ phase of verification is performed against the table associated with
+ the target index relation. This consists of a <quote>dummy</quote>
+ <command>CREATE INDEX</command> operation, which checks for the
+ presence of all would-be new index tuples against a temporary,
+ in-memory summarizing structure (this is built when needed during
+ the first, standard phase). The summarizing structure
+ <quote>fingerprints</quote> every tuple found within the target
+ index. The high level principle behind
+ <parameter>heapallindexed</parameter> verification is that a new
+ index that is equivalent to the existing, target index must only
+ have entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</parameter> phase adds
+ significant overhead: verification will typically take several times
+ longer than it would with only the standard consistency checking of
+ the target index's structure. However, verification will still take
+ significantly less time than an actual <command>CREATE
+ INDEX</command>. There is no change to the relation-level locks
+ acquired when <parameter>heapallindexed</parameter> verification is
+ performed. The summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect the
+ absence of any particular index tuple, approximately 2 bytes of
+ memory are needed per index tuple. As less memory is made available
+ per index tuple, the probability of missing an inconsistency
+ increases. This is considered an acceptable trade-off, since it
+ limits the overhead of verification very significantly, while only
+ slightly reducing the probability of detecting a problem, especially
+ for installations where verification is treated as a routine
+ maintenance task.
+ </para>
+ <para>
+ With many databases, even the default
+ <varname>maintenance_work_mem</varname> setting of
+ <literal>64MB</literal> is sufficient to have less than a 2%
+ probability of overlooking any single absent or corrupt tuple. This
+ will be the case when there are no indexes with more than about 30
+ million distinct index tuples, regardless of the overall size of any
+ index, the total number of indexes, or anything else. False
+ positive candidate tuple membership tests within the summarizing
+ structure occur at random, and are very unlikely to be the same for
+ repeat verification operations. Furthermore, within a single
+ verification operation, each missing or malformed index tuple
+ independently has the same chance of being detected. If there is
+ any inconsistency at all, it isn't particularly likely to be limited
+ to a single tuple. All of these factors favor accepting a limited
+ per operation per tuple probability of missing corruption, in order
+ to enable performing more thorough index to heap verification more
+ frequently (practical concerns about the overhead of verification
+ are likely to limit the frequency of verification). In aggregate,
+ the probability of detecting a hardware fault or software defect
+ actually <emphasis>increases</emphasis> significantly with this
+ strategy in most real world cases. Moreover, frequent verification
+ allows problems to be caught earlier on average, which helps to
+ limit the overall impact of corruption, and often simplifies root
+ cause analysis.
+ </para>
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</filename> effectively</title>
<para>
@@ -199,16 +274,30 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</parameter>
+ verification is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be very
+ subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</productname> access method code or sort
- code.
+ underlying <productname>PostgreSQL</productname> access method
+ code, sort code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</productname> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</filename> functions continuously
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</filename> functions continuously
when running the standard regression tests. See <xref
linkend="regress-run"/> for details on running the tests.
</para>
@@ -242,6 +331,12 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</parameter> verification is
+ performed, there is generally a greatly increased chance of
+ detecting single-bit errors, since strict binary equality is
+ tested, and the indexed attributes within the heap are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</filename> can only prove the presence of
@@ -253,11 +348,10 @@ ORDER BY c.relpages DESC LIMIT 10;
<title>Repairing corruption</title>
<para>
No error concerning corruption raised by <filename>amcheck</filename> should
- ever be a false positive. In practice, <filename>amcheck</filename> is more
- likely to find software bugs than problems with hardware.
- <filename>amcheck</filename> raises errors in the event of conditions that,
- by definition, should never happen, and so careful analysis of
- <filename>amcheck</filename> errors is often required.
+ ever be a false positive. <filename>amcheck</filename> raises
+ errors in the event of conditions that, by definition, should never
+ happen, and so careful analysis of <filename>amcheck</filename>
+ errors is often required.
</para>
<para>
There is no general method of repairing problems that
--
2.7.4
Attachment: 0001-Add-Bloom-filter-data-structure-implementation.patch (text/x-patch)
From 47f2c6cd398244f11c3490b644cd225beac9ae31 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
This commit adds a test harness extension module, test_bloomfilter. It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 313 +++++++++++++++++++++
src/include/lib/bloomfilter.h | 27 ++
src/test/modules/Makefile | 1 +
src/test/modules/test_bloomfilter/.gitignore | 4 +
src/test/modules/test_bloomfilter/Makefile | 21 ++
src/test/modules/test_bloomfilter/README | 72 +++++
.../test_bloomfilter/expected/test_bloomfilter.out | 25 ++
.../test_bloomfilter/sql/test_bloomfilter.sql | 22 ++
.../test_bloomfilter/test_bloomfilter--1.0.sql | 10 +
.../modules/test_bloomfilter/test_bloomfilter.c | 138 +++++++++
.../test_bloomfilter/test_bloomfilter.control | 4 +
13 files changed, 641 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
create mode 100644 src/test/modules/test_bloomfilter/.gitignore
create mode 100644 src/test/modules/test_bloomfilter/Makefile
create mode 100644 src/test/modules/test_bloomfilter/README
create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..f8f7d45
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,313 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS 10
+
+struct bloom_filter
+{
+ /* K hash functions are used, which are randomly seeded */
+ int k_hash_funcs;
+ uint32 seed;
+ /* Bitset is sized directly in bits. It must be a power-of-two <= 2^32. */
+ int64 bitset_bits;
+ unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static int my_bloom_power(int64 target_bitset_bits);
+static int optimal_k(int64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static uint32 sdbmhash(unsigned char *elem, size_t len);
+
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 512MB. The implementation is therefore well suited to data
+ * synchronization problems between unordered sets, where predictable
+ * performance is more important than worst case guarantees around false
+ * positives. Another problem that the implementation is well suited for is
+ * cache filtering where good performance already relies upon having a
+ * relatively small and/or low cardinality set of things that are interesting
+ * (with perhaps many more uninteresting things that never populate the
+ * filter).
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int bloom_power;
+ int64 bitset_bytes;
+ int64 bitset_bits;
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+ /* bitset_bits is int64 because 2^32 is greater than PG_UINT32_MAX */
+ bitset_bits = INT64CONST(1) << bloom_power;
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+ /* Allocate bloom filter as all-zeroes */
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+
+ /*
+ * Hash caller's seed value. We don't trust caller to provide values
+ * uniformly distributed within the range of 0 - PG_UINT32_MAX.
+ */
+ filter->seed = DatumGetUInt32(hash_uint32(seed));
+ filter->bitset_bits = bitset_bits;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
+ * This is a useful, generic indicator of whether or not a Bloom filter has
+ * summarized the set optimally within the available memory budget. If return
+ * value exceeds 0.5 significantly, then that's either because there was a
+ * dramatic underestimation of set size by the caller, or because available
+ * work_mem is very low relative to the size of the set (less than 2 bits per
+ * element).
+ *
+ * The value returned here should generally be close to 0.5, even when we have
+ * more than enough memory to ensure a false positive rate within target 1% to
+ * 2% band, since more hash functions are used as more memory is available per
+ * element.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = filter->bitset_bits / BITS_PER_BYTE;
+ int64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) filter->bitset_bits;
+}
+
+/*
+ * Which element in the sequence of powers-of-two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB). This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions. It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(int64 target_bitset_bits)
+{
+ int bloom_power = -1;
+
+ while (target_bitset_bits > 0 && bloom_power < 32)
+ {
+ bloom_power++;
+ target_bitset_bits >>= 1;
+ }
+
+ return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(int64 bitset_bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bitset_bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; "enhanced double hashing"
+ * is used to make this work. See Dillinger & Manolios for details of why
+ * that's okay. "Building a Better Bloom Filter" by Kirsch & Mitzenmacher also
+ * has detailed analysis of the algorithm.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint32 hasha,
+ hashb;
+ int i;
+
+ hasha = DatumGetUInt32(hash_any(elem, len));
+ hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
+
+ /*
+ * Mix seed value using XOR. Mixing with addition instead would defeat the
+ * purpose of having a seed (false positives would never change for a given
+ * set of input elements).
+ */
+ hasha ^= filter->seed;
+
+ /* Apply "MOD m" to avoid losing bits/out-of-bounds array access */
+ hasha = hasha % filter->bitset_bits;
+ hashb = hashb % filter->bitset_bits;
+
+ /* First hash */
+ hashes[0] = hasha;
+
+ /* Subsequent hashes */
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ hasha = (hasha + hashb) % filter->bitset_bits;
+ hashb = (hashb + i) % filter->bitset_bits;
+
+ /* Accumulate hash value for caller */
+ hashes[i] = hasha;
+ }
+}
+
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, size_t len)
+{
+ uint32 hash = 0;
+ int i;
+
+ for (i = 0; i < len; elem++, i++)
+ {
+ hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+ }
+
+ return hash;
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..f46f233
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Minimal Bloom filter
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+ uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index b7ed0af..fb3aae1 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
commit_ts \
dummy_seclabel \
snapshot_too_old \
+ test_bloomfilter \
test_ddl_deparse \
test_extensions \
test_parser \
diff --git a/src/test/modules/test_bloomfilter/.gitignore b/src/test/modules/test_bloomfilter/.gitignore
new file mode 100644
index 0000000..5dcb3ff
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_bloomfilter/Makefile b/src/test/modules/test_bloomfilter/Makefile
new file mode 100644
index 0000000..808c931
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bloomfilter/Makefile
+
+MODULE_big = test_bloomfilter
+OBJS = test_bloomfilter.o $(WIN32RES)
+PGFILEDESC = "test_bloomfilter - test code for Bloom filter library"
+
+EXTENSION = test_bloomfilter
+DATA = test_bloomfilter--1.0.sql
+
+REGRESS = test_bloomfilter
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bloomfilter
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_bloomfilter/README b/src/test/modules/test_bloomfilter/README
new file mode 100644
index 0000000..2bf7538
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/README
@@ -0,0 +1,72 @@
+test_bloomfilter overview
+=========================
+
+test_bloomfilter is a test harness module for testing Bloom filter library set
+membership operations. It consists of a single SQL-callable function,
+test_bloomfilter(), and regression tests. Membership tests are performed using
+an artificial dataset that is programmatically generated.
+
+The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
+(WARNING when the false positive rate exceeds a 1% threshold). This can be
+used to get a sense of the performance characteristics of the Postgres Bloom
+filter implementation under varied conditions.
+
+Bitset size
+-----------
+
+The main bloomfilter.c criterion for sizing its bitset is that the false
+positive rate should not exceed 2% when sufficient bloom_work_mem is available
+(and the caller-supplied estimate of the number of elements turns out to have
+been accurate). A 2% rate is currently assumed to be good enough for all Bloom
+filter callers.
+
+The traditional guarantee Bloom filters offer is that with an optimal K, there
+will be only a 1% false positive rate with just 9.6 bits of memory per element.
+The 2% worst case guarantee exists because there is a need for some slop, to
+account for implementation inflexibility in bitset sizing. The bitset is kept
+to a power-of-two number of bits in size, to keep the implementation simple, so
+callers may have their bloom_work_mem argument truncated down by almost half --
+when that happens, the guarantee needs to hold up. In practice callers that
+always pass a bloom_work_mem that is aligned with a power-of-two bitset size
+will actually get the "9.6 bits per element" 1% false positive rate.
+(Under-promising in this manner is a fudge that allows the contract to be kept
+simple.)
+
+Strategy
+--------
+
+Our approach to regression testing is to test that bloomfilter.c has only a 1%
+false positive rate for a single bitset size (2 ^ 23, or 1MB). We test a
+dataset with 838,861 elements, which works out at 10 bits of memory per
+element. We round up from 9.6 bits to 10 bits to make sure that we reliably
+get under 1% for regression testing. Note that a random seed is used in the
+regression tests, because the exact false positive rate is inconsistent across
+platforms, which makes non-deterministic hashing something that the regression
+tests need to be tolerant of anyway.
+
+SQL-callable function
+=====================
+
+The SQL-callable function test_bloomfilter() provides the following arguments:
+
+* "power" is the power-of-two used to size the Bloom filter's bitset.
+
+The minimum valid argument value is 23 (2^23 bits), or 1MB of memory. The
+maximum valid argument value is 32, or 512MB of memory. These restrictions
+reflect restrictions in bloomfilter.c itself.
+
+* "nelements" is the number of elements to generate for testing purposes.
+
+Adjust argument value to observe changes in the false positive rate for a given
+Bloom filter bitset size.
+
+* "seed" is a seed value for hashing.
+
+A value < 0 is interpreted as "use random seed". Varying the seed value (or
+specifying -1) should result in small variations in the total number of false
+positives.
+
+* "tests" is the number of tests to run.
+
+This may be increased when it's useful to perform many tests without the
+overhead of setting up and tearing down a pg_regress database each time.
diff --git a/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
new file mode 100644
index 0000000..4d60eca
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
@@ -0,0 +1,25 @@
+CREATE EXTENSION test_bloomfilter;
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+ test_bloomfilter
+------------------
+
+(1 row)
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
new file mode 100644
index 0000000..cc9d19e
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
@@ -0,0 +1,22 @@
+CREATE EXTENSION test_bloomfilter;
+
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
new file mode 100644
index 0000000..bf1f1cd
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
@@ -0,0 +1,10 @@
+/* src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bloomfilter" to load this file. \quit
+
+-- See README for an explanation of each argument
+CREATE FUNCTION test_bloomfilter(power integer, nelements bigint,
+ seed integer DEFAULT -1, tests integer DEFAULT 1)
+ RETURNS pg_catalog.void STRICT
+ AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.c b/src/test/modules/test_bloomfilter/test_bloomfilter.c
new file mode 100644
index 0000000..502274b
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.c
@@ -0,0 +1,138 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_bloomfilter.c
+ * Test false positive rate of Bloom filter against test dataset.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_bloomfilter/test_bloomfilter.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+/* Must fit decimal representation of PG_INT64_MAX + 2 bytes: */
+#define MAX_ELEMENT_BYTES 20
+/* False positive rate WARNING threshold (1%): */
+#define FPOSITIVE_THRESHOLD 0.01
+
+
+/*
+ * Populate an empty Bloom filter with "nelements" dummy strings.
+ */
+static void
+populate_with_dummy_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "i" INT64_FORMAT, i);
+ bloom_add_element(filter, (unsigned char *) element, strlen(element));
+ }
+}
+
+/*
+ * Returns number of strings that are indicated as probably appearing in Bloom
+ * filter that were in fact never added by populate_with_dummy_strings().
+ * These are false positives.
+ */
+static int64
+nfalsepos_for_missing_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 nfalsepos = 0;
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "M" INT64_FORMAT, i);
+ if (!bloom_lacks_element(filter, (unsigned char *) element,
+ strlen(element)))
+ nfalsepos++;
+ }
+
+ return nfalsepos;
+}
+
+static void
+create_and_test_bloom(int power, int64 nelements, int callerseed)
+{
+ int bloom_work_mem;
+ uint32 seed;
+ int64 nfalsepos;
+ bloom_filter *filter;
+
+ bloom_work_mem = (INT64CONST(1) << power) / 8 / 1024;
+
+ elog(DEBUG1, "bloom_work_mem (KB): %d", bloom_work_mem);
+
+ /*
+ * Generate random seed, or use caller's. Seed should always be a positive
+ * value less than or equal to PG_INT32_MAX, to ensure that any random seed
+ * can be recreated through callerseed if the need arises. (Don't assume
+ * that RAND_MAX cannot exceed PG_INT32_MAX.)
+ */
+ seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
+
+ /* Create Bloom filter, populate it, and report on false positive rate */
+ filter = bloom_create(nelements, bloom_work_mem, seed);
+ populate_with_dummy_strings(filter, nelements);
+ nfalsepos = nfalsepos_for_missing_strings(filter, nelements);
+
+ ereport((nfalsepos > nelements * FPOSITIVE_THRESHOLD) ? WARNING : DEBUG1,
+ (errmsg_internal("false positives: " INT64_FORMAT " (rate: %.6f, proportion bits set: %.6f, seed: %u)",
+ nfalsepos, (double) nfalsepos / nelements,
+ bloom_prop_bits_set(filter), seed)));
+
+ bloom_free(filter);
+}
+
+PG_FUNCTION_INFO_V1(test_bloomfilter);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ *
+ * If a 1% false positive threshold is not met, emits WARNINGs.
+ *
+ * See README for details of arguments.
+ */
+Datum
+test_bloomfilter(PG_FUNCTION_ARGS)
+{
+ int power = PG_GETARG_INT32(0);
+ int64 nelements = PG_GETARG_INT64(1);
+ int seed = PG_GETARG_INT32(2);
+ int tests = PG_GETARG_INT32(3);
+ int i;
+
+ if (power < 23 || power > 32)
+ elog(ERROR, "power argument must be between 23 and 32 inclusive");
+
+ if (tests <= 0)
+ elog(ERROR, "invalid number of tests: %d", tests);
+
+ if (nelements < 0)
+ elog(ERROR, "invalid number of elements: " INT64_FORMAT, nelements);
+
+ for (i = 0; i < tests; i++)
+ {
+ elog(DEBUG1, "beginning test #%d...", i + 1);
+
+ create_and_test_bloom(power, nelements, seed);
+ }
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.control b/src/test/modules/test_bloomfilter/test_bloomfilter.control
new file mode 100644
index 0000000..99e56ee
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.control
@@ -0,0 +1,4 @@
+comment = 'Test code for Bloom filter library'
+default_version = '1.0'
+module_pathname = '$libdir/test_bloomfilter'
+relocatable = true
--
2.7.4
On Fri, Dec 8, 2017 at 4:37 AM, Peter Geoghegan <pg@bowt.ie> wrote:
Here, we repeat the same test 3 times, varying only the seed value
used for each run.
Thanks for the updated version! Looking at 0001, I have run a coverage
test and can see all code paths covered, which is nice. It is also
easier to check the correctness of the library. There are many ways to
shape up such tests, you chose one we could live with.
The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).
I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.
Nice. That's close to what I would expect with the sizing this is doing.
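For anyone who wants to double-check that sizing math, here is a small
standalone sketch (not part of the patch) that plugs the bits-per-element
budgets discussed here into the textbook false positive approximation
p = (1 - e^(-k/b))^k, where b is bits per element and k is the number of
hash functions:

/*
 * Standalone sketch (not part of the patch): evaluate the textbook Bloom
 * filter false positive approximation for the budgets discussed here.
 */
#include <math.h>
#include <stdio.h>

static double
false_positive_rate(double bits_per_element, int k)
{
	/* p = (1 - e^(-k * n/m))^k, with m/n = bits_per_element */
	return pow(1.0 - exp(-(double) k / bits_per_element), k);
}

int
main(void)
{
	/* 9.6 bits/element with optimal k = round(ln(2) * m/n) = 7 */
	printf("9.6 bits/element, k=7:  p = %.4f\n", false_positive_rate(9.6, 7));
	/* 16 bits/element (2 bytes), with k capped at MAX_HASH_FUNCS = 10 */
	printf("16 bits/element, k=10:  p = %.4f\n", false_positive_rate(16.0, 10));
	return 0;
}

With 9.6 bits per element and k = 7 this comes out just under 1%, and with
the 2-bytes-per-element budget it is well below that, which matches the
behavior described above.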
There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.
+ /*
+ * Generate random seed, or use caller's. Seed should always be a positive
+ * value less than or equal to PG_INT32_MAX, to ensure that any random seed
+ * can be recreated through callerseed if the need arises. (Don't assume
+ * that RAND_MAX cannot exceed PG_INT32_MAX.)
+ */
+ seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
This could use pg_backend_random() instead.
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
This could also test the reproducibility of the tests with a fixed
seed number and at least two rounds, a low number of elements could be
more appropriate to limit the run time.
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
I would vote for simplifying the initialization routine and just
directly let the caller decide it. Are there implementation
complications if this is not a power of two? I cannot guess one by
looking at the code. I think that the key point is just to document
that a false positive rate of 1% can be reached with just having
9.6bits per elements, and that this factor can be reduced by 10 if
adding 4.8 bits per elements. So I would expect people using this API
are smart enough to do proper sizing. Note that some callers may
accept even a larger false positive rate. In short, this basically
brings us back to Thomas' point upthread with a size estimation
routine, and an extra routine doing the initialization:
https://www.postgresql.org/message-id/CAEepm=0Dy53X1hG5DmYzmpv_KN99CrXzQBTo8gmiosXNyrx7+Q@mail.gmail.com
Did you consider it? Splitting the size estimation and the actual
initialization has a lot of benefits in my opinion.
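To make that concrete, here is a rough sketch of what such a split interface
might look like. Every name here is hypothetical -- nothing like this exists
in the posted patch:

/*
 * Hypothetical sketch only.  The caller first asks how much memory a filter
 * sized for total_elems needs, allocates that memory wherever it likes
 * (palloc'd memory, a DSM segment, ...), and then has bloomfilter.c
 * initialize the caller-provided buffer in place.
 */
extern Size bloom_estimate(int64 total_elems, int bloom_work_mem);
extern void bloom_init(bloom_filter *filter, Size len, int64 total_elems,
					   uint32 seed);

/* caller side, e.g. for a filter that is to live in shared memory */
static bloom_filter *
make_shared_filter(int64 total_elems, int bloom_work_mem, uint32 seed)
{
	Size		len = bloom_estimate(total_elems, bloom_work_mem);
	void	   *buf = allocate_shared_memory(len);	/* stand-in for DSM setup */
	bloom_filter *filter = (bloom_filter *) buf;

	bloom_init(filter, len, total_elems, seed);
	return filter;
}

The question is whether that extra interface surface is worth it for the
callers we expect to have.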
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
[...]
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
Instead of that, having a function that prints direct information
about the filter's state would be enough with the real number of
elements and the number of bits set in the filter. I am not sure that
using rates is a good idea, sometimes rounding can cause confusion.
--
Michael
On Mon, Dec 11, 2017 at 9:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
+ /*
+ * Generate random seed, or use caller's. Seed should always be a positive
+ * value less than or equal to PG_INT32_MAX, to ensure that any random seed
+ * can be recreated through callerseed if the need arises. (Don't assume
+ * that RAND_MAX cannot exceed PG_INT32_MAX.)
+ */
+ seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
This could use pg_backend_random() instead.
I don't see the need for a cryptographically secure source of
randomness for any current Bloom filter caller, including the test
harness. There are many uses of random() just like this throughout the
backend, and far more random() calls than pg_backend_random() calls.
srandom() seeds random per backend, ensuring non-deterministic
behavior across backends.
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
This could also test the reproducibility of the tests with a fixed seed
number and at least two rounds, a low number of elements could be more
appropriate to limit the run time.
The runtime is already dominated by pg_regress overhead. As it says in
the README, using a fixed seed in the test harness is pointless,
because it won't behave in a fixed way across platforms. As long as we
cannot ensure deterministic behavior, we may as well fully embrace
non-determinism.
I would vote for simplifying the initialization routine and just
directly let the caller decide it. Are there implementation
complications if this is not a power of two? I cannot guess one by
looking at the code.
Yes, there are. It's easier to reason about the implementation with
these restrictions.
So I would expect people using this API
are smart enough to do proper sizing. Note that some callers may
accept even a larger false positive rate. In short, this basically
brings us back to Thomas' point upthread with a size estimation
routine, and an extra routine doing the initialization:
https://www.postgresql.org/message-id/CAEepm=0Dy53X1hG5DmYzmpv_KN99CrXzQBTo8gmiosXNyrx7+Q@mail.gmail.com
Did you consider it? Splitting the size estimation and the actual
initialization has a lot of benefits in my opinion.
Callers don't actually need to worry about these details. I think it's
the wrong call to complicate the interface to simplify the
implementation.
Thomas was not arguing for the caller being able to specify a false
positive rate to work backwards from -- that's not an unreasonable
thing, but it seems unlikely to be of use to amcheck, or parallel hash
join. Thomas was arguing for making the Bloom filter suitable for use
with DSM. I ended up incorporating most of his feedback. The only
things that were not added were a DSM-orientated routine for getting the
amount of memory required, as well as a separate DSM-orientated
routine for initializing caller's shared memory buffer. That's likely
inconvenient for most callers, and could be restrictive if and when
Bloom filters use resources other than memory, such as temp files.
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
[...]
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
Instead of that, having a function that prints direct information
about the filter's state would be enough with the real number of
elements and the number of bits set in the filter. I am not sure that
using rates is a good idea, sometimes rounding can cause confusion.
The bloom filter doesn't track the number of elements itself. Callers
that care can do it themselves. bloom_prop_bits_set() isn't very
important, even compared to other types of instrumentation. It's
moderately useful as a generic indicator of whether or not the Bloom
filter was totally overwhelmed. That's about it.
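For what it's worth, here is the kind of caller-side use I have in mind -- a
sketch rather than code from the patch, with "filter" assumed to be the
caller's bloom_filter pointer:

/*
 * Sketch of a caller-side sanity report (not code from the patch).  A value
 * well above 0.5 suggests that the filter was undersized relative to the
 * number of elements actually added, so its false positive rate -- and with
 * it the chance of missing an absent element -- is higher than intended.
 */
double		bits_set = bloom_prop_bits_set(filter);

ereport(DEBUG1,
		(errmsg_internal("bloom filter proportion of bits set: %.6f",
						 bits_set)));
if (bits_set > 0.9)
	ereport(WARNING,
			(errmsg("bloom filter was overwhelmed; its summary of the set is unreliable")));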
--
Peter Geoghegan
Hello!
I like the heapam verification functionality and am using it right now. So I'm planning to provide a review for this patch, probably this week.
From my current use I have some thoughts on the interface. Here's what I get.
# select bt_index_check('messagefiltervalue_group_id_59490523e6ee451f',true);
ERROR: XX001: heap tuple (45,21) from table "messagefiltervalue" lacks matching index tuple within index "messagefiltervalue_group_id_59490523e6ee451f"
HINT: Retrying verification using the function bt_index_parent_check() might provide a more specific error.
LOCATION: bt_tuple_present_callback, verify_nbtree.c:1316
Time: 45.668 ms
# select bt_index_check('messagefiltervalue_group_id_59490523e6ee451f');
bt_index_check
----------------
(1 row)
Time: 32.873 ms
# select bt_index_parent_check('messagefiltervalue_group_id_59490523e6ee451f');
ERROR: XX002: down-link lower bound invariant violated for index "messagefiltervalue_group_id_59490523e6ee451f"
DETAIL: Parent block=6259 child index tid=(1747,2) parent page lsn=4A0/728F5DA8.
LOCATION: bt_downlink_check, verify_nbtree.c:1188
Time: 391194.113 ms
Seems like the new check is working 4 orders of magnitude faster than bt_index_parent_check() and still finds my specific error that bt_index_check() missed.
From this output I see that there is corruption, but I cannot tell:
1. What is the scale of the corruption
2. Are these corruptions related or not
I think an interface to list all, or the top N, errors could be useful.
On Dec 14, 2017, at 0:02, Peter Geoghegan <pg@bowt.ie> wrote:
This could also test the reproducibility of the tests with a fixed
seed number and at least two rounds, a low number of elements could be
more appropriate to limit the run time.
The runtime is already dominated by pg_regress overhead. As it says in
the README, using a fixed seed in the test harness is pointless,
because it won't behave in a fixed way across platforms. As long as we
cannot ensure deterministic behavior, we may as well fully embrace
non-determinism.
I think that determinism across platforms is not as important as determinism across runs.
Thanks for the amcheck! It is very useful.
Best regards, Andrey Borodin.
Hi, Peter!
On Jan 11, 2018, at 15:14, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I like the heapam verification functionality and am using it right now. So I'm planning to provide a review for this patch, probably this week.
I've looked into the code and here's my review.
The new functionality works and is useful right now. I believe it should be shipped in the Postgres box (currently, I install it with package managers).
The code is nice and well documented.
I'd be happy if I could pass an argument like ErrorLevel, which would help me to estimate the scale of corruption. Or any other way to list more than one error. But, clearly, this functionality is not necessary for this patch.
Also, I couldn't find the check that ensures that there is a heap tuple for every B-tree leaf tuple. Is it there?
I've found a few nitpicks in the Bloom filter:
1. The choice of sdbmhash does not seem like the best option from my point of view. I'd stick with MurmurX, with any available X, or anything doing 32-bit aligned computations. Hacks like (hash << 6) + (hash << 16) - hash are cool, but nowadays there is no point in not using hash * 65599 (see the sketch after this list).
2.
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
Wasn't bloom_work_mem supposed to be the limit? Why do we not honor this limit?
3.
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
sizeof(unsigned char) == 1 by the C standard.
4. The function my_bloom_power() returns a bit number, and then its result is raised back with INT64CONST(1) << bloom_power. I'd stick with removing bits in a loop, e.g. while (target_bitset_bits & (target_bitset_bits - 1)) { target_bitset_bits &= target_bitset_bits - 1; }, or something like that. Or, if we use code like the sdbm hash, fall back to bithacks :) https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
5. I would implement k_hashes with the following code:
static void
k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
{
uint64 hasha,
hashb;
int i;
hasha = DatumGetUInt32(hash_any(elem, len));
hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
/*
* Mix seed value using XOR. Mixing with addition instead would defeat the
* purpose of having a seed (false positives would never change for a given
* set of input elements).
*/
hasha ^= filter->seed;
for (i = 0; i < filter->k_hash_funcs; i++)
{
/* Accumulate hash value for caller */
hashes[i] = (hasha + i * hashb + i) % filter->bitset_bits;
}
}
It produces almost the same result (hashes 1..k-1 are +1'd), but uses far fewer % operations. Potential overflow is handled by the uint64 type.
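To spell out the arithmetic behind item 1 (a standalone sketch, not part of the patch): the shift-and-subtract expression in sdbmhash() is exactly the multiplicative form, because (hash << 6) + (hash << 16) - hash equals hash * (64 + 65536 - 1), i.e. hash * 65599. So the per-byte step can simply be written as a multiplication and left to the compiler:

#include <stddef.h>
#include <stdint.h>

/* Equivalent formulation of the sdbm hash step used in bloomfilter.c */
static uint32_t
sdbmhash_mult(const unsigned char *elem, size_t len)
{
	uint32_t	hash = 0;
	size_t		i;

	for (i = 0; i < len; i++)
		hash = hash * 65599 + elem[i];	/* same value as the shift/subtract hack */

	return hash;
}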
That's all I've found. I do not know whether the patch has had all the necessary reviewer attention. Please feel free to change the status if you think that the patch is ready. From my point of view, the patch is in good shape.
It was a pleasure to read amcheck code, thanks for writing it.
Best regards, Andrey Borodin.
On Thu, Jan 11, 2018 at 2:14 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I like the heapam verification functionality and am using it right now. So I'm planning to provide a review for this patch, probably this week.
Great!
Seems like the new check is working 4 orders of magnitude faster than bt_index_parent_check() and still finds my specific error that bt_index_check() missed.
From this output I see that there is corruption, but I cannot tell:
1. What is the scale of the corruption
2. Are these corruptions related or not
I don't know the answer to either question, and I don't think that
anyone else could provide much more certainty than that, at least when
it comes to the general case. I think it's important to remember why
that is.
When amcheck raises an error, that really should be a rare,
exceptional event. When I ran amcheck on Heroku's platform, that was
what we found - it tended to be some specific software bug in all
cases (turns out that Amazon's EBS is very reliable in the last few
years, at least when it comes to avoiding silent data corruption). In
general, the nature of those problems was very difficult to predict.
The PostgreSQL project strives to provide a database system that never
loses data, and I think that we generally do very well there. It's
probably also true that (for example) Yandex have some very good DBAs,
that take every reasonable step to prevent data loss (validating
hardware, providing substantial redundancy at the storage level, and
so on). We trust the system, and you trust your own operational
procedures, and for the most part everything runs well, because you
(almost) think of everything.
I think that running amcheck at scale is interesting because its very
general approach to validation gives us an opportunity to learn *what
we were wrong about*. Sometimes the reasons will be simple, and
sometimes they'll be complicated, but they'll always be something that we
tried to account for in some way, and just didn't think of, despite
our best efforts. I know that torn pages can happen, which is a kind
of corruption -- that's why crash recovery replays FPIs. If I knew
what problems amcheck might find, then I probably would have already
found a way to prevent them from happening in the first place - there
are limits to what we can predict. (Google "Ludic fallacy" for more
information on this general idea.)
I try to be humble about these things. Very complicated systems can
have very complicated problems that stay around for a long time
without being discovered. Just ask Intel. While it might be true that
some people will use amcheck as the first line of defense, I think
that it makes much more sense as the last line of defense. So, to
repeat myself -- I just don't know.
I think an interface to list all errors, or the top N, could be useful.
I think that it might be useful if you could specify a limit on how
many errors you'll accept before giving up. I think that it's likely
less useful than you think, though. Once amcheck detects even a single
problem, all bets are off. Or at least any prediction that I might try
to give you now isn't worth much. Theoretically, amcheck should
*never* find any problem, which is actually what happens in the vast
majority of real world cases. When it does find a problem, there
should be some new lesson to be learned. If there isn't some new
insight, then somebody somewhere is doing a bad job.
--
Peter Geoghegan
Hi,
On Fri, Jan 12, 2018 at 1:41 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I've looked into the code and here's my review.
The new functionality works and is useful right now. I believe it should be shipped in the Postgres box (currently, I install it with package managers).
The code is nice and well documented.
Thanks.
I'd be happy if I could pass an argument like ErrorLevel, which would help me to estimate the scale of corruption, or any other way to list more than one error. But, clearly, this functionality is not necessary for this patch.
My previous remarks apply here, too. I don't know how to rate severity
among error messages.
Also, I couldn't find the check that ensures that there is a heap tuple for every B-tree leaf tuple. Is it there?
No, there isn't, because that's inherently race-prone. This is not so
bad. If you don't end up discovering a problem the other way around
(going from the heap to the index/Bloom filter), then the only way
that an index tuple pointing to the wrong place can go undetected is
because there is a duplicate heap TID in the index, or the heap TID
doesn't exist in any form.
I actually prototyped a patch that makes bt_index_parent_check()
detect duplicate heap TIDs in an index, by collecting heap TIDs from
the index, sorting them, and scanning the sorted array. That seems
like material for another patch, though.
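For anyone curious, the core of that prototype is simple. Here is a standalone sketch of the idea (the struct and names are illustrative, not the prototype's actual code): collect every heap TID seen in the index, sort them, and look for adjacent equal entries.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within the block */
} HeapTidSketch;

static int
tid_cmp(const void *a, const void *b)
{
    const HeapTidSketch *ta = (const HeapTidSketch *) a;
    const HeapTidSketch *tb = (const HeapTidSketch *) b;

    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    if (ta->offset != tb->offset)
        return (ta->offset < tb->offset) ? -1 : 1;
    return 0;
}

/* Report any heap TID that appears more than once in the collected array */
static void
report_duplicate_tids(HeapTidSketch *tids, size_t ntids)
{
    size_t      i;

    qsort(tids, ntids, sizeof(HeapTidSketch), tid_cmp);
    for (i = 1; i < ntids; i++)
    {
        if (tid_cmp(&tids[i - 1], &tids[i]) == 0)
            printf("duplicate heap TID (%u,%u) in index\n",
                   (unsigned) tids[i].block, (unsigned) tids[i].offset);
    }
}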
I've found a few nitpicks in the Bloom filter:
1. The choice of sdbmhash does not seem like the best option from my point of view.
I don't mind changing the second hash function, but I want to
emphasize that this shouldn't be expected to make anything more than a
very small difference. The bloom filter probes are slow because the
memory accesses are random.
There are many hash functions that would be suitable here, and not too
many reasons to prefer one over the other. My choice was based on a
desire for simplicity, and for something that seemed to be in
widespread usage for a long time.
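To make the comparison concrete, here is a small standalone sketch (not code from the patch) of the sdbm-style hash written both ways. The shift expression is just hash * 65599 spelled out (64 + 65536 - 1 = 65599), so a reasonable compiler treats the two forms identically:

#include <stddef.h>
#include <stdint.h>

/* sdbm-style hash, classic "shift hack" formulation */
static uint32_t
sdbm_shifts(const unsigned char *elem, size_t len)
{
    uint32_t    hash = 0;
    size_t      i;

    for (i = 0; i < len; i++)
        hash = elem[i] + (hash << 6) + (hash << 16) - hash;

    return hash;
}

/* same hash, written as a plain multiply by 65599 */
static uint32_t
sdbm_multiply(const unsigned char *elem, size_t len)
{
    uint32_t    hash = 0;
    size_t      i;

    for (i = 0; i < len; i++)
        hash = elem[i] + hash * 65599;

    return hash;
}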
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
bloom_work_mem was supposed to be the limit? Why do we not honor this limit?
The minimum maintenance_work_mem setting is 1MB anyway.
3.
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
sizeof(unsigned char) == 1 by the C standard.
I know. That's just what I prefer to do as a matter of style.
4. The function my_bloom_power() returns a bit number, and then its result is raised back with INT64CONST(1) << bloom_power. I'd stick with removing bits in a loop, e.g. while (target_bitset_bits & (target_bitset_bits - 1)) { target_bitset_bits &= target_bitset_bits - 1; } or something like that. Or, if we use code like the sdbm hash, fall back to bit hacks :) https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
my_bloom_power() is only called once per Bloom filter. So again, I
think that this is not very important. Thomas Munro talked about using
popcount() in another function, which is from the book Hacker's
Delight. While I'm sure that these techniques have their use, they
just don't seem to make sense when the alternative is simpler code
that is only going to be executed at most once per verification
operation anyway.
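For reference, the loop that was suggested amounts to the following standalone sketch (not the patch code). It clears the lowest set bit until a single bit remains, i.e. it rounds a bit count down to the nearest power of two:

#include <stdint.h>

static uint64_t
round_down_pow2(uint64_t target_bitset_bits)
{
    /* clear the lowest set bit until only the highest one remains */
    while (target_bitset_bits & (target_bitset_bits - 1))
        target_bitset_bits &= target_bitset_bits - 1;

    return target_bitset_bits;
}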
for (i = 0; i < filter->k_hash_funcs; i++)
{
    /* Accumulate hash value for caller */
    hashes[i] = (hasha + i * hashb + i) % filter->bitset_bits;
}
}
It produces almost the same result (hashes 1..k-1 are +1'd), but uses far fewer % operations. Potential overflow is governed by the uint64 type.
I prefer to be conservative, and to stick to what is described by
Dillinger & Manolios as much as possible.
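For anyone comparing the two schemes, the enhanced double hashing recurrence from Dillinger & Manolios looks roughly like this standalone sketch (not the patch code). The stride y is itself perturbed on each round, which is what avoids the collision issue that classic double hashing has with power-of-two table sizes:

#include <stdint.h>

static void
enhanced_double_hashing(uint64_t x, uint64_t y, uint64_t m, int k,
                        uint32_t *hashes)
{
    int         i;

    /* x and y are two independent hashes of the element, reduced mod m */
    x %= m;
    y %= m;

    hashes[0] = (uint32_t) x;
    for (i = 1; i < k; i++)
    {
        x = (x + y) % m;                /* classic double hashing step */
        y = (y + (uint64_t) i) % m;     /* the "enhanced" perturbation */
        hashes[i] = (uint32_t) x;
    }
}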
That's all I've found. I do not know whether the patch has had all the necessary reviewer attention. Please feel free to change the status if you think the patch is ready. From my point of view, the patch is in good shape.
Michael said he'd do more review. I generally feel this is close, though.
Thanks for the review
--
Peter Geoghegan
On Mon, Jan 22, 2018 at 03:22:05PM -0800, Peter Geoghegan wrote:
Michael said he'd do more review. I generally feel this is close, though.
Yep. I have provided the feedback I wanted for 0001 (no API change in
the bloom facility by the way :( ), but I still wanted to look at 0002
in depth.
--
Michael
On Mon, Jan 22, 2018 at 6:07 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Yep. I have provided the feedback I wanted for 0001 (no API change in
the bloom facility by the way :( ), but I still wanted to look at 0002
in depth.
I don't see a point in adding complexity that no caller will actually
use. It *might* become useful in the future, in which case it's no
trouble at all to come up with an alternative initialization routine.
Anyway, parallel CREATE INDEX added a new "scan" argument to
IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
an additional NULL argument should be passed by amcheck. However, I
have a better idea.
ISTM that verify_nbtree.c should manage the heap scan itself, in the
style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
for bt_index_check() (which pretends to be a CREATE INDEX
CONCURRENTLY). There can be an MVCC snapshot acquired per index
verified, a snapshot that is under the direct control of amcheck. The
snapshot would be acquired at the start of verification on an index
(when "heapallindex = true"), before the verification of the index
structure even begins, and released at the very end of verification.
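Roughly sketched (this is only an outline of the intended control flow, not actual code), per-index verification would look like:

    /* register our own MVCC snapshot before any index fingerprinting */
    snapshot = RegisterSnapshot(GetTransactionSnapshot());

    /* existing structural checks run here; leaf tuples are fingerprinted
     * into the Bloom filter as the index is walked */

    /* then drive the heap scan ourselves, as parallel CREATE INDEX does,
     * so that the scan uses the snapshot registered above */
    scan = heap_beginscan_strat(heaprel, snapshot, 0, NULL, true, true);
    IndexBuildHeapScan(heaprel, indrel, indexinfo, true,
                       bt_tuple_present_callback, (void *) state, scan);

    /* release the snapshot only once this one index is fully verified */
    UnregisterSnapshot(snapshot);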
Advantages of this include:
1. It simplifies the code, and in particular lets us remove the use of
TransactionXmin. Comments already say that this TransactionXmin
business is a way of approximating our own MVCC snapshot acquisition
-- why not *actually do* the MVCC snapshot acquisition, now that
that's possible?
2. It makes the code for bt_index_check() and bt_index_parent_check()
essentially identical, except for varying
IndexBuildHeapScan()/indexInfo parameters to match what CREATE INDEX
itself does. The more we can outsource to IndexBuildHeapScan(), the
better.
3. It ensures that transactions that heapallindexed verify many
indexes in one go are at no real disadvantage to transactions that
heapallindexed verify a single index, since TransactionXmin going
stale won't impact verification (we won't have to skip Bloom filter
probes due to the uncertainty about what should be in the Bloom
filter).
4. It will make parallel verification easier in the future, which is
something that we ought to make happen. Parallel verification would
target a table with multiple indexes, and use a parallel heap scan. It
actually looks like making this work would be fairly easy. We'd only
need to copy code from nbtsort.c, and arrange for parallel workers to
verify an index each ahead of the heap scan. (There would be multiple
Bloom filters in shared memory, all of which parallel workers end up
probing.)
Thoughts?
--
Peter Geoghegan
On Mon, Feb 5, 2018 at 12:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
Anyway, parallel CREATE INDEX added a new "scan" argument to
IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
an additional NULL argument should be passed by amcheck. However, I
have a better idea. ISTM that verify_nbtree.c should manage the heap scan itself, in the
style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
for bt_index_check() (which pretends to be a CREATE INDEX
CONCURRENTLY). There can be an MVCC snapshot acquired per index
verified, a snapshot that is under the direct control of amcheck. The
snapshot would be acquired at the start of verification on an index
(when "heapallindex = true"), before the verification of the index
structure even begins, and released at the very end of verification.
Attached patch fixes the parallel index build bitrot in this way. This
is version 6 of the patch.
This approach resulted in a nice reduction in complexity:
bt_index_check() and bt_index_parent_check() heapallindexed
verification operations both work in exactly the same way now, except
that bt_index_check() imitates a CREATE INDEX CONCURRENTLY (to match
the heavyweight relation locks acquired). This doesn't really need to
be explained as a special case anymore; bt_index_parent_check() is
like an ordinary CREATE INDEX, without any additional "TransactionXmin
heap tuple xmin recheck" complication.
A further benefit is that this makes running bt_index_check() checks
against many indexes more thorough, and easier to reason about. Users
won't have to worry about TransactionXmin becoming very stale when
many indexes are verified within a single command.
I made the following additional, unrelated changes based on various feedback:
* Faster modulo operations.
Andrey Borodin suggested that I make k_hashes() do fewer modulo
operations. While I don't want to change the algorithm to make this
happen, the overhead has been reduced. Modulo operations are now
performed through bitwise AND operations, which is possible only
because the bitset size is always a power of two. Apparently this is a
fairly common optimization for Bloom filters that use (enhanced)
double-hashing; we might as well do it this way.
I've really just transcribed the enhanced double hashing pseudo-code
from the Georgia Tech/Dillinger & Manolios paper into C code, so no
real change there; bloomfilter.c's k_hashes() is still closely based
on "5.2 Enhanced Double Hashing" from that same paper. Experience
suggests that we ought to be very conservative about developing novel
hashing techniques. Paranoid, even.
* New reference to the modulo bias effect.
Michael Paquier wondered why the Bloom filter was always a
power-of-two, which this addresses. (Of course, the "modulo bitwise
AND" optimization I just mentioned is another reason to limit
ourselves to power-of-two bitset sizes, albeit a new one.)
* Removed sdbmhash().
Michael also wanted to know more about sdbmhash(), due to some general
concern about its quality. I realized that it is best to avoid adding
a new general-purpose hash function, whose sole purpose is to be
different to hash_any(), when I could instead use
hash_uint32_extended() to get two 32-bit values all at once. Robert
suggested this approach at one point, actually, but for some reason I
didn't follow up until now.
--
Peter Geoghegan
Attachments:
0001-Add-Bloom-filter-data-structure-implementation.patch (text/x-patch)
From 2ff9dcace49ea590762701717235d87e13b03c6b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
This commit adds a test harness extension module, test_bloomfilter. It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 303 +++++++++++++++++++++
src/include/lib/bloomfilter.h | 27 ++
src/test/modules/Makefile | 1 +
src/test/modules/test_bloomfilter/.gitignore | 4 +
src/test/modules/test_bloomfilter/Makefile | 21 ++
src/test/modules/test_bloomfilter/README | 71 +++++
.../test_bloomfilter/expected/test_bloomfilter.out | 25 ++
.../test_bloomfilter/sql/test_bloomfilter.sql | 22 ++
.../test_bloomfilter/test_bloomfilter--1.0.sql | 10 +
.../modules/test_bloomfilter/test_bloomfilter.c | 138 ++++++++++
.../test_bloomfilter/test_bloomfilter.control | 4 +
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 631 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
create mode 100644 src/test/modules/test_bloomfilter/.gitignore
create mode 100644 src/test/modules/test_bloomfilter/Makefile
create mode 100644 src/test/modules/test_bloomfilter/README
create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..a4ca18d
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS 10
+
+struct bloom_filter
+{
+ /* K hash functions are used, seeded by caller's seed */
+ int k_hash_funcs;
+ uint64 seed;
+ /* m is bitset size, in bits. Must be a power-of-two <= 2^32. */
+ uint64 m;
+ unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static int my_bloom_power(uint64 target_bitset_bits);
+static int optimal_k(uint64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static inline uint32 mod_m(uint32 a, uint64 m);
+
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 512MB. The implementation rounds down as needed.
+ *
+ * The implementation is well suited to data synchronization problems between
+ * unordered sets, where predictable performance is more important than worst
+ * case guarantees around false positives. Another problem that the
+ * implementation is well suited for is cache filtering where good performance
+ * already relies upon having a relatively small and/or low cardinality set of
+ * things that are interesting (with perhaps many more uninteresting things
+ * that never populate the filter).
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int bloom_power;
+ uint64 bitset_bytes;
+ uint64 bitset_bits;
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+ /* Use uint64 to size bitset, since PG_UINT32_MAX is 2^32 - 1, not 2^32 */
+ bitset_bits = UINT64CONST(1) << bloom_power;
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+ /* Allocate bloom filter as all-zeroes */
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+
+ /*
+ * Caller will probably use signed 32-bit pseudo-random number, so hash
+ * caller's value to get 64-bit seed value
+ */
+ filter->seed = DatumGetUInt64(hash_uint32_extended(seed, 0));
+ filter->m = bitset_bits;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size. That should
+ * generally be close to 0.5, even when we have more than enough memory to
+ * ensure a false positive rate within target 1% to 2% band, since more hash
+ * functions are used as more memory is available per element.
+ *
+ * This is the only instrumentation that is low overhead enough to appear in
+ * debug traces. When debugging Bloom filter code, it's likely to be far more
+ * interesting to directly test the false positive rate.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = filter->m / BITS_PER_BYTE;
+ uint64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) filter->m;
+}
+
+/*
+ * Which element in the sequence of powers-of-two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB). This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions. It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(uint64 target_bitset_bits)
+{
+ int bloom_power = -1;
+
+ while (target_bitset_bits > 0 && bloom_power < 32)
+ {
+ bloom_power++;
+ target_bitset_bits >>= 1;
+ }
+
+ return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(uint64 bitset_bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bitset_bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
+ * used to make this work. The main reason we prefer enhanced double hashing
+ * to classic double hashing is that the latter has an issue with collisions
+ * when using power-of-two sized bitsets. See Dillinger & Manolios for full
+ * details.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint64 hash;
+ uint32 x, y;
+ uint64 m;
+ int i;
+
+ /* Use 64-bit hashing to get two independent 32-bit hashes */
+ hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
+ x = (uint32) hash;
+ y = (uint32) (hash >> 32);
+ m = filter->m;
+
+ x = mod_m(x, m);
+ y = mod_m(y, m);
+
+ /* Accumulate hashes */
+ hashes[0] = x;
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ x = mod_m(x + y, m);
+ y = mod_m(y + i, m);
+
+ hashes[i] = x;
+ }
+}
+
+/*
+ * Calculate "val MOD m" inexpensively.
+ *
+ * Assumes that m (which is bitset size) is a power-of-two.
+ *
+ * Using a power-of-two number of bits for bitset size allows us to use bitwise
+ * AND operations to calculate the modulo of a hash value. It's also a simple
+ * way of avoiding the modulo bias effect.
+ */
+static inline uint32
+mod_m(uint32 val, uint64 m)
+{
+ Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
+ Assert(((m - 1) & m) == 0);
+
+ return val & (m - 1);
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..5bc99c3
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Minimal Bloom filter
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+ uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index b7ed0af..fb3aae1 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
commit_ts \
dummy_seclabel \
snapshot_too_old \
+ test_bloomfilter \
test_ddl_deparse \
test_extensions \
test_parser \
diff --git a/src/test/modules/test_bloomfilter/.gitignore b/src/test/modules/test_bloomfilter/.gitignore
new file mode 100644
index 0000000..5dcb3ff
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_bloomfilter/Makefile b/src/test/modules/test_bloomfilter/Makefile
new file mode 100644
index 0000000..808c931
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bloomfilter/Makefile
+
+MODULE_big = test_bloomfilter
+OBJS = test_bloomfilter.o $(WIN32RES)
+PGFILEDESC = "test_bloomfilter - test code for Bloom filter library"
+
+EXTENSION = test_bloomfilter
+DATA = test_bloomfilter--1.0.sql
+
+REGRESS = test_bloomfilter
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bloomfilter
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_bloomfilter/README b/src/test/modules/test_bloomfilter/README
new file mode 100644
index 0000000..e54ed13
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/README
@@ -0,0 +1,71 @@
+test_bloomfilter overview
+=========================
+
+test_bloomfilter is a test harness module for testing Bloom filter library set
+membership operations. It consists of a single SQL-callable function,
+test_bloomfilter(), and regression tests. Membership tests are performed using
+an artificial dataset that is programmatically generated.
+
+The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
+(WARNING when the false positive rate exceeds a 1% threshold). This can be
+used to get a sense of the performance characteristics of the Postgres Bloom
+filter implementation under varied conditions.
+
+Bitset size
+-----------
+
+The main bloomfilter.c criteria for sizing its bitset is that the false
+positive rate should not exceed 2% when sufficient bloom_work_mem is available
+(and the caller-supplied estimate of the number of elements turns out to have
+been accurate). A 2% rate is currently assumed to be good enough for all Bloom
+filter callers.
+
+The traditional guarantee Bloom filters offer is that with an optimal K, there
+will be only a 1% false positive rate with just 9.6 bits of memory per element.
+The 2% worst case guarantee exists because there is a need for some slop, to
+account for implementation inflexibility in bitset sizing. The bitset is kept
+to a power-of-two number of bits in size, so callers may have their
+bloom_work_mem argument truncated down by almost half -- when that happens, the
+guarantee needs to hold up. In practice callers that always pass a
+bloom_work_mem that is aligned with a power-of-two bitset size will actually
+get the "9.6 bits per element" 1% false positive rate. (Under-promising in
+this manner is a fudge that allows the contract to be kept simple.)
+
+Strategy
+--------
+
+Our approach to regression testing is to test that bloomfilter.c has only a 1%
+false positive rate for a single bitset size (2 ^ 23, or 1MB). We test a
+dataset with 838,861 elements, which works out at 10 bits of memory per
+element. We round up from 9.6 bits to 10 bits to make sure that we reliably
+get under 1% for regression testing. Note that a random seed is used in the
+regression tests, because the exact false positive rate is inconsistent across
+platforms, which makes non-deterministic hashing something that the regression
+tests need to be tolerant of anyway.
+
+SQL-callable function
+=====================
+
+The SQL-callable function test_bloomfilter() provides the following arguments:
+
+* "power" is the power-of-two used to size the Bloom filter's bitset.
+
+The minimum valid argument value is 23 (2^23 bits), or 1MB of memory. The
+maximum valid argument value is 32, or 512MB of memory. These restrictions
+reflect restrictions in bloomfilter.c itself.
+
+* "nelements" is the number of elements to generate for testing purposes.
+
+Adjust argument value to observe changes in the false positive rate for a given
+Bloom filter bitset size.
+
+* "seed" is a seed value for hashing.
+
+A value < 0 is interpreted as "use random seed". Varying the seed value (or
+specifying -1) should result in small variations in the total number of false
+positives.
+
+* "tests" is the number of tests to run.
+
+This may be increased when it's useful to perform many tests without the
+overhead of setting up and tearing down a pg_regress database each time.
diff --git a/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
new file mode 100644
index 0000000..4d60eca
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
@@ -0,0 +1,25 @@
+CREATE EXTENSION test_bloomfilter;
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+ test_bloomfilter
+------------------
+
+(1 row)
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
new file mode 100644
index 0000000..cc9d19e
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
@@ -0,0 +1,22 @@
+CREATE EXTENSION test_bloomfilter;
+
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
new file mode 100644
index 0000000..bf1f1cd
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
@@ -0,0 +1,10 @@
+/* src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bloomfilter" to load this file. \quit
+
+-- See README for an explanation of each argument
+CREATE FUNCTION test_bloomfilter(power integer, nelements bigint,
+ seed integer DEFAULT -1, tests integer DEFAULT 1)
+ RETURNS pg_catalog.void STRICT
+ AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.c b/src/test/modules/test_bloomfilter/test_bloomfilter.c
new file mode 100644
index 0000000..74afd36
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.c
@@ -0,0 +1,138 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_bloomfilter.c
+ * Test false positive rate of Bloom filter against test dataset.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_bloomfilter/test_bloomfilter.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+/* Must fit decimal representation of PG_INT64_MAX + 2 bytes: */
+#define MAX_ELEMENT_BYTES 20
+/* False positive rate WARNING threshold (1%): */
+#define FPOSITIVE_THRESHOLD 0.01
+
+
+/*
+ * Populate an empty Bloom filter with "nelements" dummy strings.
+ */
+static void
+populate_with_dummy_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "i" INT64_FORMAT, i);
+ bloom_add_element(filter, (unsigned char *) element, strlen(element));
+ }
+}
+
+/*
+ * Returns number of strings that are indicated as probably appearing in Bloom
+ * filter that were in fact never added by populate_with_dummy_strings().
+ * These are false positives.
+ */
+static int64
+nfalsepos_for_missing_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 nfalsepos = 0;
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "M" INT64_FORMAT, i);
+ if (!bloom_lacks_element(filter, (unsigned char *) element,
+ strlen(element)))
+ nfalsepos++;
+ }
+
+ return nfalsepos;
+}
+
+static void
+create_and_test_bloom(int power, int64 nelements, int callerseed)
+{
+ int bloom_work_mem;
+ uint32 seed;
+ int64 nfalsepos;
+ bloom_filter *filter;
+
+ bloom_work_mem = (1L << power) / 8L / 1024L;
+
+ elog(DEBUG1, "bloom_work_mem (KB): %d", bloom_work_mem);
+
+ /*
+ * Generate random seed, or use caller's. Seed should always be a
+ * positive value less than or equal to PG_INT32_MAX, to ensure that any
+ * random seed can be recreated through callerseed if the need arises.
+ * (Don't assume that RAND_MAX cannot exceed PG_INT32_MAX.)
+ */
+ seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
+
+ /* Create Bloom filter, populate it, and report on false positive rate */
+ filter = bloom_create(nelements, bloom_work_mem, seed);
+ populate_with_dummy_strings(filter, nelements);
+ nfalsepos = nfalsepos_for_missing_strings(filter, nelements);
+
+ ereport((nfalsepos > nelements * FPOSITIVE_THRESHOLD) ? WARNING : DEBUG1,
+ (errmsg_internal("false positives: " INT64_FORMAT " (rate: %.6f, proportion bits set: %.6f, seed: %u)",
+ nfalsepos, (double) nfalsepos / nelements,
+ bloom_prop_bits_set(filter), seed)));
+
+ bloom_free(filter);
+}
+
+PG_FUNCTION_INFO_V1(test_bloomfilter);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ *
+ * If a 1% false positive threshold is not met, emits WARNINGs.
+ *
+ * See README for details of arguments.
+ */
+Datum
+test_bloomfilter(PG_FUNCTION_ARGS)
+{
+ int power = PG_GETARG_INT32(0);
+ int64 nelements = PG_GETARG_INT64(1);
+ int seed = PG_GETARG_INT32(2);
+ int tests = PG_GETARG_INT32(3);
+ int i;
+
+ if (power < 23 || power > 32)
+ elog(ERROR, "power argument must be between 23 and 32 inclusive");
+
+ if (tests <= 0)
+ elog(ERROR, "invalid number of tests: %d", tests);
+
+ if (nelements < 0)
+ elog(ERROR, "invalid number of elements: %d", tests);
+
+ for (i = 0; i < tests; i++)
+ {
+ elog(DEBUG1, "beginning test #%d...", i + 1);
+
+ create_and_test_bloom(power, nelements, seed);
+ }
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.control b/src/test/modules/test_bloomfilter/test_bloomfilter.control
new file mode 100644
index 0000000..99e56ee
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.control
@@ -0,0 +1,4 @@
+comment = 'Test code for Bloom filter library'
+default_version = '1.0'
+module_pathname = '$libdir/test_bloomfilter'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d4765ce..1b1a996 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2580,6 +2580,7 @@ bitmapword
bits16
bits32
bits8
+bloom_filter
bool
brin_column_state
bytea
--
2.7.4
0002-Add-amcheck-verification-of-indexes-against-heap.patch (text/x-patch)
From e492d4a7553c8e736ca03b2013fa6a8ec9302bd5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 +++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 286 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 122 ++++++++++---
7 files changed, 401 insertions(+), 62 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index da518da..7e20d52 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
*
* Copyright (c) 2017-2018, PostgreSQL Global Development Group
*
@@ -23,6 +28,7 @@
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -43,9 +49,10 @@ PG_MODULE_MAGIC;
* target is the point of reference for a verification operation.
*
* Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages). Conceptually, problems are only
- * ever found in the current target page. Each page found by verification's
- * left/right, top/bottom scan becomes the target exactly once.
+ * they are current target's child pages). Conceptually, problems are only
+ * ever found in the current target page (or for a particular heap tuple during
+ * heapallindexed verification). Each page found by verification's left/right,
+ * top/bottom scan becomes the target exactly once.
*/
typedef struct BtreeCheckState
{
@@ -53,10 +60,13 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* Also verifying heap has no unindexed tuples? */
+ bool heapallindexed;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +82,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +111,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +140,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +189,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -185,15 +219,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
* Open the target index relations separately (like relation_openrv(), but
* with heap relation locked first to prevent deadlocking). In hot
* standby mode this will raise an error when parentcheck is true.
+ *
+ * There is no need for the usual indcheckxmin usability horizon test here,
+ * even in the heapallindexed case, because index undergoing verification
+ * only needs to have entries for the snapshot that may be registered
+ * later. (If this is a parentcheck verification, there is no question
+ * about committed or recently dead heap tuples lacking index entries due
+ * to concurrent activity.)
*/
indrel = index_open(indrelid, lockmode);
/*
* Since we did the IndexGetRelation call above without any lock, it's
* barely possible that a race against an index drop/recreation could have
- * netted us the wrong table. Although the table itself won't actually be
- * examined during verification currently, a recheck still seems like a
- * good idea.
+ * netted us the wrong table.
*/
if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
ereport(ERROR,
@@ -204,8 +243,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/* Relation suitable for checking as B-Tree? */
btree_index_checkable(indrel);
- /* Check index */
- bt_check_every_level(indrel, parentcheck);
+ /* Check index, possibly against table it is an index on */
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +292,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,13 +314,15 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
+ Snapshot snapshot = SnapshotAny;
/*
* RecentGlobalXmin assertion matches index_getnext_tid(). See note on
@@ -291,7 +335,34 @@ bt_check_every_level(Relation rel, bool readonly)
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+
+ /*
+ * Register our own snapshot in !readonly case, rather than asking
+ * IndexBuildHeapScan() to do this for us later. This needs to happen
+ * before index fingerprinting begins, so we can later be certain that
+ * index fingerprinting should have reached all tuples returned by
+ * IndexBuildHeapScan().
+ */
+ if (!state->readonly)
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -345,6 +416,63 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo = BuildIndexInfo(state->rel);
+ HeapScanDesc scan;
+
+ /*
+ * Create our own scan for IndexBuildHeapScan(), like a parallel index
+ * build. We do things this way because it lets us use the MVCC
+ * snapshot we acquired before index fingerprinting began (in the
+ * !readonly case).
+ */
+ scan = heap_beginscan_strat(state->heaprel, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ true, /* buffer access strategy OK */
+ true); /* syncscan OK? */
+
+ /*
+ * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
+ * behaves when only AccessShareLock held. This is really only needed
+ * to prevent confusion within IndexBuildHeapScan() about how to
+ * interpret the state we pass.
+ */
+ indexinfo->ii_Concurrent = !state->readonly;
+
+ /*
+ * Don't wait for uncommitted tuple xact commit/abort when index is a
+ * unique index on a catalog (or an index used by an exclusion
+ * constraint). This could otherwise happen in the readonly case.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_ExclusionOps = NULL;
+ indexinfo->ii_ExclusionProcs = NULL;
+ indexinfo->ii_ExclusionStrats = NULL;
+
+ elog(DEBUG1, "verifying that tuples from index \"%s\" are present in \"%s\"",
+ RelationGetRelationName(state->rel),
+ RelationGetRelationName(state->heaprel));
+
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state, scan);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+ state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+ RelationGetRelationName(heaprel))));
+
+ if (snapshot != SnapshotAny)
+ UnregisterSnapshot(snapshot);
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -497,7 +625,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -544,6 +672,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -587,6 +718,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* Fingerprint leaf page tuples (those that point to the heap) */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -680,8 +816,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1060,6 +1198,106 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption). Index corruption may still be detected here,
+ * though. Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers). The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records). The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here). The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located. This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed. When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit. (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Heap-only tuple handling within IndexBuildHeapScan() works in a way that
+ * helps us to detect index tuples that contain the wrong values (values that
+ * don't match the latest tuple in the HOT chain). This can happen when there
+ * is no superseding index tuple due to a faulty assessment of HOT safety,
+ * perhaps during the original CREATE INDEX. Because the latest tuple's
+ * contents are used with the root TID, an error will be raised when a tuple
+ * with the same TID but non-matching attribute values is passed back to us.
+ * Faulty assessment of HOT-safety was behind at least two distinct CREATE
+ * INDEX CONCURRENTLY bugs that made it into stable releases, one of which was
+ * undetected for many years. In short, the same principle that allows a
+ * REINDEX to repair corruption when there was an (undetected) broken HOT chain
+ * also allows us to detect the corruption in many cases.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /*
+ * Generate an index tuple for fingerprinting.
+ *
+ * Index tuple formation is assumed to be deterministic, and IndexTuples
+ * are assumed immutable. While the LP_DEAD bit is mutable in leaf pages,
+ * that's ItemId metadata, which was not fingerprinted. (There will often
+ * be some dead-to-everyone IndexTuples fingerprinted by the Bloom filter,
+ * but we only try to detect the absence of needed tuples, so that's okay.)
+ *
+ * Note that we rely on deterministic index_form_tuple() TOAST compression.
+ * If index_form_tuple() was ever enhanced to compress datums out-of-line,
+ * or otherwise varied when or how compression was applied, our assumption
+ * would break, leading to false positive reports of corruption. For now,
+ * we don't decompress/normalize toasted values as part of fingerprinting.
+ */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid)),
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel)),
+ !state->readonly
+ ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+ : 0));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 852e260..f6be1b3 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,9 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
+ c.relname,
+ c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,9 +85,11 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</quote>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ unique indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
@@ -95,10 +99,11 @@ ORDER BY c.relpages DESC LIMIT 10;
is the same lock mode acquired on relations by simple
<literal>SELECT</literal> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
- routine, lightweight test for corruption is required in a live
- production environment, using
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</parameter> is
+ <literal>true</literal>. When a routine, lightweight test for
+ corruption is required in a live production environment, using
<function>bt_index_check</function> often provides the best
trade-off between thoroughness of verification and limiting the
impact on application performance and availability.
@@ -108,7 +113,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,19 +122,21 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</parameter>
+ argument is <literal>true</literal>, the function verifies the
+ presence of all heap tuples that should be found within the
+ index. The checks that can be performed by
+ <function>bt_index_parent_check</function> are a superset of the
+ checks that can be performed by <function>bt_index_check</function>.
<function>bt_index_parent_check</function> can be thought of as
a more thorough variant of <function>bt_index_check</function>:
unlike <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
A <literal>ShareLock</literal> is required on the target index by
@@ -159,6 +166,47 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</parameter> verification</title>
+ <para>
+ When the <parameter>heapallindexed</parameter> argument to
+ verification functions is <literal>true</literal>, an additional
+ phase of verification is performed against the table associated with
+ the target index relation. This consists of a <quote>dummy</quote>
+ <command>CREATE INDEX</command> operation, which checks for the
+ presence of all hypothetical new index tuples against a temporary,
+ in-memory summarizing structure (this is built when needed during
+ the basic first phase of verification). The summarizing structure
+ <quote>fingerprints</quote> every tuple found within the target
+ index. The high level principle behind
+ <parameter>heapallindexed</parameter> verification is that a new
+ index that is equivalent to the existing, target index must only
+ have entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</parameter> phase adds
+ significant overhead: verification will typically take several times
+ longer. However, there is no change to the relation-level locks
+ acquired when <parameter>heapallindexed</parameter> verification is
+ performed.
+ </para>
+ <para>
+ The summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect an
+ inconsistency for each heap tuple that should be represented in the
+ index, approximately 2 bytes of memory are needed per tuple. As
+ less memory is made available per tuple, the probability of missing
+ an inconsistency slowly increases. This approach limits the
+ overhead of verification significantly, while only slightly reducing
+ the probability of detecting a problem, especially for installations
+ where verification is treated as a routine maintenance task. Any
+ single absent or malformed tuple has a new opportunity to be
+ detected with each new verification attempt.
+ </para>
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</filename> effectively</title>
<para>
@@ -199,16 +247,29 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</parameter>
+ verification is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</productname> access method code or sort
- code.
+ underlying <productname>PostgreSQL</productname> access method
+ code, sort code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</productname> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</filename> functions continuously
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</filename> functions continuously
when running the standard regression tests. See <xref
linkend="regress-run"/> for details on running the tests.
</para>
@@ -242,6 +303,12 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</parameter> verification is
+ performed, there is generally a greatly increased chance of
+ detecting single-bit errors, since strict binary equality is
+ tested, and the indexed attributes within the heap are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</filename> can only prove the presence of
@@ -253,11 +320,10 @@ ORDER BY c.relpages DESC LIMIT 10;
<title>Repairing corruption</title>
<para>
No error concerning corruption raised by <filename>amcheck</filename> should
- ever be a false positive. In practice, <filename>amcheck</filename> is more
- likely to find software bugs than problems with hardware.
- <filename>amcheck</filename> raises errors in the event of conditions that,
- by definition, should never happen, and so careful analysis of
- <filename>amcheck</filename> errors is often required.
+ ever be a false positive. <filename>amcheck</filename> raises
+ errors in the event of conditions that, by definition, should never
+ happen, and so careful analysis of <filename>amcheck</filename>
+ errors is often required.
</para>
<para>
There is no general method of repairing problems that
--
2.7.4
Hi, Peter!
On 8 Feb 2018, at 4:56, Peter Geoghegan <pg@bowt.ie> wrote:
* Faster modulo operations.
....
* Removed sdbmhash().
Thanks! I definitely like how the Bloom filter is implemented now.
I could not understand the meaning of this, but apparently it will do no harm:
+ /*
+ * Caller will probably use signed 32-bit pseudo-random number, so hash
+ * caller's value to get 64-bit seed value
+ */
+ filter->seed = DatumGetUInt64(hash_uint32_extended(seed, 0));
I do not see a reason behind hashing the seed.
Also, I'd like to reformulate this paragraph. I understand what you want to say, but the sentence is incorrect.
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
The Bloom filter behaves deterministically, just differently. This does not ensure anything; it only makes the desired outcome highly probable.
Thanks!
Best regards, Andrey Borodin.
On Thu, Feb 8, 2018 at 6:05 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I do not see a reason behind hashing the seed.
It made some sense when I was XOR'ing it to mix. A uniform
distribution of bits seemed desirable then, since random() won't use
the most significant bit -- it generates random numbers in the range
of 0 to 2^31-1. It does seem unnecessary now.
Also, I'd like to reformulate this paragraph. I understand what you want to say, but the sentence is incorrect.
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
The Bloom filter behaves deterministically, just differently. This does not ensure anything; it only makes the desired outcome highly probable.
I agree that that's unclear. I should probably cut it down, and say
something like "caller can pass a random seed to make it unlikely that
the same false positives will occur from one run to the next".
--
Peter Geoghegan
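To make the seeding point concrete, here is a minimal sketch (not taken from either patch; the function name is illustrative only) of a per-run seeded filter, using the bloom_create() signature from the 0001 patch and the random()/maintenance_work_mem pattern that the 0002 patch follows in verify_nbtree.c:

#include "postgres.h"

#include "lib/bloomfilter.h"
#include "miscadmin.h"		/* for maintenance_work_mem */

/*
 * Illustrative sketch only (not part of either patch): create a Bloom filter
 * that is seeded afresh for each run.  random() yields a value in the range
 * 0..2^31-1, which bloom_create() accepts directly, so no additional mixing
 * step such as hash_uint32_extended() is needed.
 */
static bloom_filter *
create_run_seeded_filter(int64 total_elems)
{
	uint32		seed = random();

	return bloom_create(total_elems, maintenance_work_mem, seed);
}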
Hi!
On 8 Feb 2018, at 22:45, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Feb 8, 2018 at 6:05 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I do not see a reason behind hashing the seed.
It made some sense when I was XOR'ing it to mix. A uniform
distribution of bits seemed desirable then, since random() won't use
the most significant bit -- it generates random numbers in the range
of 0 to 2^31-1. It does seem unnecessary now.

Also, I'd like to reformulate this paragraph. I understand what you want to say, but the sentence is incorrect.
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value. This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
The Bloom filter behaves deterministically, just differently. This does not ensure anything; it only makes the desired outcome highly probable.

I agree that that's unclear. I should probably cut it down, and say
something like "caller can pass a random seed to make it unlikely that
the same false positives will occur from one run to the next".
I've just flipped the patch to WoA. But once the above issues are fixed, I think the patch is ready for committer.
Best regards, Andrey Borodin.
On Fri, Mar 23, 2018 at 7:13 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I've just flipped the patch to WoA. But once the above issues are fixed, I think the patch is ready for committer.
Attached is v7, which has the small tweaks that you suggested.
Thank you for the review. I hope that this can be committed shortly.
--
Peter Geoghegan
Attachments:
v7-0001-Add-Bloom-filter-data-structure-implementation.patch
From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
This commit adds a test harness extension module, test_bloomfilter. It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 304 +++++++++++++++++++++
src/include/lib/bloomfilter.h | 27 ++
src/test/modules/Makefile | 1 +
src/test/modules/test_bloomfilter/.gitignore | 4 +
src/test/modules/test_bloomfilter/Makefile | 21 ++
src/test/modules/test_bloomfilter/README | 71 +++++
.../test_bloomfilter/expected/test_bloomfilter.out | 25 ++
.../test_bloomfilter/sql/test_bloomfilter.sql | 22 ++
.../test_bloomfilter/test_bloomfilter--1.0.sql | 10 +
.../modules/test_bloomfilter/test_bloomfilter.c | 138 ++++++++++
.../test_bloomfilter/test_bloomfilter.control | 4 +
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 632 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
create mode 100644 src/test/modules/test_bloomfilter/.gitignore
create mode 100644 src/test/modules/test_bloomfilter/Makefile
create mode 100644 src/test/modules/test_bloomfilter/README
create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe43f2..191ea9bca2 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e437..376ae273a9 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000000..dcf32c015c
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Space-efficient set membership testing
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * The implementation is well suited to data synchronization problems between
+ * unordered sets, especially where predictable performance is important and
+ * some false positives are acceptable. It's also well suited to cache
+ * filtering problems where a relatively small and/or low cardinality set is
+ * fingerprinted, especially when many subsequent membership tests end up
+ * indicating that values of interest are not present. That should save the
+ * caller many authoritative lookups, such as expensive probes of a much larger
+ * on-disk structure.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS 10
+
+struct bloom_filter
+{
+ /* K hash functions are used, seeded by caller's seed */
+ int k_hash_funcs;
+ uint64 seed;
+ /* m is bitset size, in bits. Must be a power-of-two <= 2^32. */
+ uint64 m;
+ unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static int my_bloom_power(uint64 target_bitset_bits);
+static int optimal_k(uint64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static inline uint32 mod_m(uint32 a, uint64 m);
+
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ * This determines the size of the underlying bitset (trivial bookkeeping space
+ * isn't counted). The bitset is always sized as a power-of-two number of
+ * bits, and the largest possible bitset is 512MB. The implementation rounds
+ * down as needed.
+ *
+ * The Bloom filter is seeded using a value provided by the caller. Using a
+ * distinct seed value on every call makes it unlikely that the same false
+ * positives will reoccur when the same set is fingerprinted a second time.
+ * Callers that don't care about this pass a constant as their seed, typically
+ * 0.
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+ bloom_filter *filter;
+ int bloom_power;
+ uint64 bitset_bytes;
+ uint64 bitset_bits;
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+ /* Minimum allowable size is 1MB */
+ bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+ /* Size in bits should be the highest power of two within budget */
+ bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+ /* Use uint64 to size bitset, since PG_UINT32_MAX is 2^32 - 1, not 2^32 */
+ bitset_bits = UINT64CONST(1) << bloom_power;
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+ /* Allocate bloom filter as all-zeroes */
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ /*
+ * Callers often use a pseudo-random integer in the range of 0 - INT_MAX as
+ * their seed, which is fine. A 64-bit unsigned integer is used as our
+ * seed during hashing, since hash_any_extended() expects that.
+ */
+ filter->seed = seed;
+ filter->m = bitset_bits;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size. That should
+ * generally be close to 0.5, even when we have more than enough memory to
+ * ensure a false positive rate within target 1% to 2% band, since more hash
+ * functions are used as more memory is available per element.
+ *
+ * This is the only instrumentation that is low overhead enough to appear in
+ * debug traces. When debugging Bloom filter code, it's likely to be far more
+ * interesting to directly test the false positive rate.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = filter->m / BITS_PER_BYTE;
+ uint64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) filter->m;
+}
+
+/*
+ * Which element in the sequence of powers-of-two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB). This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions. It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(uint64 target_bitset_bits)
+{
+ int bloom_power = -1;
+
+ while (target_bitset_bits > 0 && bloom_power < 32)
+ {
+ bloom_power++;
+ target_bitset_bits >>= 1;
+ }
+
+ return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(uint64 bitset_bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bitset_bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
+ * used to make this work. The main reason we prefer enhanced double hashing
+ * to classic double hashing is that the latter has an issue with collisions
+ * when using power-of-two sized bitsets. See Dillinger & Manolios for full
+ * details.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint64 hash;
+ uint32 x, y;
+ uint64 m;
+ int i;
+
+ /* Use 64-bit hashing to get two independent 32-bit hashes */
+ hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
+ x = (uint32) hash;
+ y = (uint32) (hash >> 32);
+ m = filter->m;
+
+ x = mod_m(x, m);
+ y = mod_m(y, m);
+
+ /* Accumulate hashes */
+ hashes[0] = x;
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ x = mod_m(x + y, m);
+ y = mod_m(y + i, m);
+
+ hashes[i] = x;
+ }
+}
+
+/*
+ * Calculate "val MOD m" inexpensively.
+ *
+ * Assumes that m (which is bitset size) is a power-of-two.
+ *
+ * Using a power-of-two number of bits for bitset size allows us to use bitwise
+ * AND operations to calculate the modulo of a hash value. It's also a simple
+ * way of avoiding the modulo bias effect.
+ */
+static inline uint32
+mod_m(uint32 val, uint64 m)
+{
+ Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
+ Assert(((m - 1) & m) == 0);
+
+ return val & (m - 1);
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000000..ee337474c6
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Space-efficient set membership testing
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+ uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7294b6958b..a9b8377acf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
commit_ts \
dummy_seclabel \
snapshot_too_old \
+ test_bloomfilter \
test_ddl_deparse \
test_extensions \
test_parser \
diff --git a/src/test/modules/test_bloomfilter/.gitignore b/src/test/modules/test_bloomfilter/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_bloomfilter/Makefile b/src/test/modules/test_bloomfilter/Makefile
new file mode 100644
index 0000000000..808c9314d4
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bloomfilter/Makefile
+
+MODULE_big = test_bloomfilter
+OBJS = test_bloomfilter.o $(WIN32RES)
+PGFILEDESC = "test_bloomfilter - test code for Bloom filter library"
+
+EXTENSION = test_bloomfilter
+DATA = test_bloomfilter--1.0.sql
+
+REGRESS = test_bloomfilter
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bloomfilter
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_bloomfilter/README b/src/test/modules/test_bloomfilter/README
new file mode 100644
index 0000000000..e54ed13adf
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/README
@@ -0,0 +1,71 @@
+test_bloomfilter overview
+=========================
+
+test_bloomfilter is a test harness module for testing Bloom filter library set
+membership operations. It consists of a single SQL-callable function,
+test_bloomfilter(), and regression tests. Membership tests are performed using
+an artificial dataset that is programmatically generated.
+
+The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
+(WARNING when the false positive rate exceeds a 1% threshold). This can be
+used to get a sense of the performance characteristics of the Postgres Bloom
+filter implementation under varied conditions.
+
+Bitset size
+-----------
+
+The main bloomfilter.c criterion for sizing its bitset is that the false
+positive rate should not exceed 2% when sufficient bloom_work_mem is available
+(and the caller-supplied estimate of the number of elements turns out to have
+been accurate). A 2% rate is currently assumed to be good enough for all Bloom
+filter callers.
+
+The traditional guarantee Bloom filters offer is that with an optimal K, there
+will be only a 1% false positive rate with just 9.6 bits of memory per element.
+The 2% worst case guarantee exists because there is a need for some slop, to
+account for implementation inflexibility in bitset sizing. The bitset is kept
+to a power-of-two number of bits in size, so callers may have their
+bloom_work_mem argument truncated down by almost half -- when that happens, the
+guarantee needs to hold up. In practice callers that always pass a
+bloom_work_mem that is aligned with a power-of-two bitset size will actually
+get the "9.6 bits per element" 1% false positive rate. (Under-promising in
+this manner is a fudge that allows the contract to be kept simple.)
+
+Strategy
+--------
+
+Our approach to regression testing is to test that bloomfilter.c has only a 1%
+false positive rate for a single bitset size (2 ^ 23, or 1MB). We test a
+dataset with 838,861 elements, which works out at 10 bits of memory per
+element. We round up from 9.6 bits to 10 bits to make sure that we reliably
+get under 1% for regression testing. Note that a random seed is used in the
+regression tests, because the exact false positive rate is inconsistent across
+platforms, which makes non-deterministic hashing something that the regression
+tests need to be tolerant of anyway.
+
+SQL-callable function
+=====================
+
+The SQL-callable function test_bloomfilter() provides the following arguments:
+
+* "power" is the power-of-two used to size the Bloom filter's bitset.
+
+The minimum valid argument value is 23 (2^23 bits), or 1MB of memory. The
+maximum valid argument value is 32, or 512MB of memory. These restrictions
+reflect restrictions in bloomfilter.c itself.
+
+* "nelements" is the number of elements to generate for testing purposes.
+
+Adjust argument value to observe changes in the false positive rate for a given
+Bloom filter bitset size.
+
+* "seed" is a seed value for hashing.
+
+A value < 0 is interpreted as "use random seed". Varying the seed value (or
+specifying -1) should result in small variations in the total number of false
+positives.
+
+* "tests" is the number of tests to run.
+
+This may be increased when it's useful to perform many tests without the
+overhead of setting up and tearing down a pg_regress database each time.
diff --git a/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
new file mode 100644
index 0000000000..4d60ecaa39
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
@@ -0,0 +1,25 @@
+CREATE EXTENSION test_bloomfilter;
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+ test_bloomfilter
+------------------
+
+(1 row)
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
new file mode 100644
index 0000000000..cc9d19edcd
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
@@ -0,0 +1,22 @@
+CREATE EXTENSION test_bloomfilter;
+
+--
+-- These tests don't produce any interesting output, unless they fail. For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
new file mode 100644
index 0000000000..bf1f1cd607
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
@@ -0,0 +1,10 @@
+/* src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bloomfilter" to load this file. \quit
+
+-- See README for an explanation of each argument
+CREATE FUNCTION test_bloomfilter(power integer, nelements bigint,
+ seed integer DEFAULT -1, tests integer DEFAULT 1)
+ RETURNS pg_catalog.void STRICT
+ AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.c b/src/test/modules/test_bloomfilter/test_bloomfilter.c
new file mode 100644
index 0000000000..74afd36952
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.c
@@ -0,0 +1,138 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_bloomfilter.c
+ * Test false positive rate of Bloom filter against test dataset.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_bloomfilter/test_bloomfilter.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+/* Must fit decimal representation of PG_INT64_MAX + 2 bytes: */
+#define MAX_ELEMENT_BYTES 20
+/* False positive rate WARNING threshold (1%): */
+#define FPOSITIVE_THRESHOLD 0.01
+
+
+/*
+ * Populate an empty Bloom filter with "nelements" dummy strings.
+ */
+static void
+populate_with_dummy_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "i" INT64_FORMAT, i);
+ bloom_add_element(filter, (unsigned char *) element, strlen(element));
+ }
+}
+
+/*
+ * Returns number of strings that are indicated as probably appearing in Bloom
+ * filter that were in fact never added by populate_with_dummy_strings().
+ * These are false positives.
+ */
+static int64
+nfalsepos_for_missing_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 nfalsepos = 0;
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "M" INT64_FORMAT, i);
+ if (!bloom_lacks_element(filter, (unsigned char *) element,
+ strlen(element)))
+ nfalsepos++;
+ }
+
+ return nfalsepos;
+}
+
+static void
+create_and_test_bloom(int power, int64 nelements, int callerseed)
+{
+ int bloom_work_mem;
+ uint32 seed;
+ int64 nfalsepos;
+ bloom_filter *filter;
+
+ bloom_work_mem = (1L << power) / 8L / 1024L;
+
+ elog(DEBUG1, "bloom_work_mem (KB): %d", bloom_work_mem);
+
+ /*
+ * Generate random seed, or use caller's. Seed should always be a
+ * positive value less than or equal to PG_INT32_MAX, to ensure that any
+ * random seed can be recreated through callerseed if the need arises.
+ * (Don't assume that RAND_MAX cannot exceed PG_INT32_MAX.)
+ */
+ seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
+
+ /* Create Bloom filter, populate it, and report on false positive rate */
+ filter = bloom_create(nelements, bloom_work_mem, seed);
+ populate_with_dummy_strings(filter, nelements);
+ nfalsepos = nfalsepos_for_missing_strings(filter, nelements);
+
+ ereport((nfalsepos > nelements * FPOSITIVE_THRESHOLD) ? WARNING : DEBUG1,
+ (errmsg_internal("false positives: " INT64_FORMAT " (rate: %.6f, proportion bits set: %.6f, seed: %u)",
+ nfalsepos, (double) nfalsepos / nelements,
+ bloom_prop_bits_set(filter), seed)));
+
+ bloom_free(filter);
+}
+
+PG_FUNCTION_INFO_V1(test_bloomfilter);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ *
+ * If a 1% false positive threshold is not met, emits WARNINGs.
+ *
+ * See README for details of arguments.
+ */
+Datum
+test_bloomfilter(PG_FUNCTION_ARGS)
+{
+ int power = PG_GETARG_INT32(0);
+ int64 nelements = PG_GETARG_INT64(1);
+ int seed = PG_GETARG_INT32(2);
+ int tests = PG_GETARG_INT32(3);
+ int i;
+
+ if (power < 23 || power > 32)
+ elog(ERROR, "power argument must be between 23 and 32 inclusive");
+
+ if (tests <= 0)
+ elog(ERROR, "invalid number of tests: %d", tests);
+
+ if (nelements < 0)
+ elog(ERROR, "invalid number of elements: %d", tests);
+
+ for (i = 0; i < tests; i++)
+ {
+ elog(DEBUG1, "beginning test #%d...", i + 1);
+
+ create_and_test_bloom(power, nelements, seed);
+ }
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.control b/src/test/modules/test_bloomfilter/test_bloomfilter.control
new file mode 100644
index 0000000000..99e56eebdf
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.control
@@ -0,0 +1,4 @@
+comment = 'Test code for Bloom filter library'
+default_version = '1.0'
+module_pathname = '$libdir/test_bloomfilter'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 17bf55c1f5..abc10a8ffd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2590,6 +2590,7 @@ bitmapword
bits16
bits32
bits8
+bloom_filter
bool
brin_column_state
bytea
--
2.14.1
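As a quick orientation for reviewers, a minimal, hypothetical caller of the API added above might look as follows (the element strings, the 1024kB budget, the constant seed, and the function name are arbitrary illustration choices, not anything from the patch):

#include "postgres.h"

#include "lib/bloomfilter.h"

/*
 * Illustrative sketch only: fingerprint one string, then probe for another.
 * bloom_lacks_element() returning true is authoritative ("definitely never
 * added"); returning false only means "probably added", since false positives
 * are possible but false negatives are not.
 */
static bool
bloom_probe_demo(void)
{
	bloom_filter *filter;
	char		known[] = "apple";
	char		probe[] = "orange";
	bool		maybe_present;

	/* ~1000 expected elements, 1024kB budget (the 1MB minimum), constant seed */
	filter = bloom_create(1000, 1024, 0);

	bloom_add_element(filter, (unsigned char *) known, strlen(known));

	maybe_present = !bloom_lacks_element(filter, (unsigned char *) probe,
										 strlen(probe));

	bloom_free(filter);

	return maybe_present;
}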
v7-0002-Add-amcheck-verification-of-indexes-against-heap.patch
From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 28 +++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 14 +-
contrib/amcheck/sql/check_btree.sql | 9 +-
contrib/amcheck/verify_nbtree.c | 286 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 122 ++++++++++---
7 files changed, 401 insertions(+), 62 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed919ae..c5764b544f 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000000..e6cca0ac4b
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861d7a..469048403d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e2c9..42872b89ca 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531027..5d2796990f 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
-- we, intentionally, don't check relation permissions - it's useful
-- to run this cluster-wide with a restricted account, and as tested
-- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index da518daea3..c3380895a9 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
*
* Copyright (c) 2017-2018, PostgreSQL Global Development Group
*
@@ -23,6 +28,7 @@
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -43,9 +49,10 @@ PG_MODULE_MAGIC;
* target is the point of reference for a verification operation.
*
* Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages). Conceptually, problems are only
- * ever found in the current target page. Each page found by verification's
- * left/right, top/bottom scan becomes the target exactly once.
+ * they are current target's child pages). Conceptually, problems are only
+ * ever found in the current target page (or for a particular heap tuple during
+ * heapallindexed verification). Each page found by verification's left/right,
+ * top/bottom scan becomes the target exactly once.
*/
typedef struct BtreeCheckState
{
@@ -53,10 +60,13 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* Also verifying heap has no unindexed tuples? */
+ bool heapallindexed;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +82,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +111,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +140,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +189,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -185,15 +219,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
* Open the target index relations separately (like relation_openrv(), but
* with heap relation locked first to prevent deadlocking). In hot
* standby mode this will raise an error when parentcheck is true.
+ *
+ * There is no need for the usual indcheckxmin usability horizon test here,
+ * even in the heapallindexed case, because index undergoing verification
+ * only needs to have entries for the snapshot that may be registered
+ * later. (If this is a parentcheck verification, there is no question
+ * about committed or recently dead heap tuples lacking index entries due
+ * to concurrent activity.)
*/
indrel = index_open(indrelid, lockmode);
/*
* Since we did the IndexGetRelation call above without any lock, it's
* barely possible that a race against an index drop/recreation could have
- * netted us the wrong table. Although the table itself won't actually be
- * examined during verification currently, a recheck still seems like a
- * good idea.
+ * netted us the wrong table.
*/
if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
ereport(ERROR,
@@ -204,8 +243,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/* Relation suitable for checking as B-Tree? */
btree_index_checkable(indrel);
- /* Check index */
- bt_check_every_level(indrel, parentcheck);
+ /* Check index, possibly against table it is an index on */
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +292,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,13 +314,15 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
+ Snapshot snapshot = SnapshotAny;
/*
* RecentGlobalXmin assertion matches index_getnext_tid(). See note on
@@ -291,7 +335,34 @@ bt_check_every_level(Relation rel, bool readonly)
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint32 seed;
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+
+ /*
+ * Register our own snapshot in !readonly case, rather than asking
+ * IndexBuildHeapScan() to do this for us later. This needs to happen
+ * before index fingerprinting begins, so we can later be certain that
+ * index fingerprinting should have reached all tuples returned by
+ * IndexBuildHeapScan().
+ */
+ if (!state->readonly)
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -345,6 +416,63 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Heap contains unindexed/malformed tuples check *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo = BuildIndexInfo(state->rel);
+ HeapScanDesc scan;
+
+ /*
+ * Create our own scan for IndexBuildHeapScan(), like a parallel index
+ * build. We do things this way because it lets us use the MVCC
+ * snapshot we acquired before index fingerprinting began in the
+ * !readonly case.
+ */
+ scan = heap_beginscan_strat(state->heaprel, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ true, /* buffer access strategy OK */
+ true); /* syncscan OK? */
+
+ /*
+ * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
+ * behaves when only AccessShareLock held. This is really only needed
+ * to prevent confusion within IndexBuildHeapScan() about how to
+ * interpret the state we pass.
+ */
+ indexinfo->ii_Concurrent = !state->readonly;
+
+ /*
+ * Don't wait for uncommitted tuple xact commit/abort when index is a
+ * unique index on a catalog (or an index used by an exclusion
+ * constraint). This could otherwise happen in the readonly case.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_ExclusionOps = NULL;
+ indexinfo->ii_ExclusionProcs = NULL;
+ indexinfo->ii_ExclusionStrats = NULL;
+
+ elog(DEBUG1, "verifying that tuples from index \"%s\" are present in \"%s\"",
+ RelationGetRelationName(state->rel),
+ RelationGetRelationName(state->heaprel));
+
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state, scan);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples from table \"%s\" with bitset %.2f%% set",
+ state->heaptuplespresent, RelationGetRelationName(heaprel),
+ 100.0 * bloom_prop_bits_set(state->filter))));
+
+ if (snapshot != SnapshotAny)
+ UnregisterSnapshot(snapshot);
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -497,7 +625,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -544,6 +672,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -587,6 +718,11 @@ bt_target_page_check(BtreeCheckState *state)
itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* Fingerprint leaf page tuples (those that point to the heap) */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup));
+
/*
* * High key check *
*
@@ -680,8 +816,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1059,6 +1197,106 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
pfree(child);
}
+/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption). Index corruption may still be detected here,
+ * though. Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers). The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records). The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here). The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located. This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed. When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit. (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Heap-only tuple handling within IndexBuildHeapScan() works in a way that
+ * helps us to detect index tuples that contain the wrong values (values that
+ * don't match the latest tuple in the HOT chain). This can happen when there
+ * is no superseding index tuple due to a faulty assessment of HOT safety,
+ * perhaps during the original CREATE INDEX. Because the latest tuple's
+ * contents are used with the root TID, an error will be raised when a tuple
+ * with the same TID but non-matching attribute values is passed back to us.
+ * Faulty assessment of HOT-safety was behind at least two distinct CREATE
+ * INDEX CONCURRENTLY bugs that made it into stable releases, one of which was
+ * undetected for many years. In short, the same principle that allows a
+ * REINDEX to repair corruption when there was an (undetected) broken HOT chain
+ * also allows us to detect the corruption in many cases.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /*
+ * Generate an index tuple for fingerprinting.
+ *
+ * Index tuple formation is assumed to be deterministic, and IndexTuples
+ * are assumed immutable. While the LP_DEAD bit is mutable in leaf pages,
+ * that's ItemId metadata, which was not fingerprinted. (There will often
+ * be some dead-to-everyone IndexTuples fingerprinted by the Bloom filter,
+ * but we only try to detect the absence of needed tuples, so that's okay.)
+ *
+ * Note that we rely on deterministic index_form_tuple() TOAST compression.
+ * If index_form_tuple() was ever enhanced to compress datums out-of-line,
+ * or otherwise varied when or how compression was applied, our assumption
+ * would break, leading to false positive reports of corruption. For now,
+ * we don't decompress/normalize toasted values as part of fingerprinting.
+ */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid)),
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel)),
+ !state->readonly
+ ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+ : 0));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 852e260c09..f6be1b3e05 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,9 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
+ c.relname,
+ c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,9 +85,11 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</quote>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ unique indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
@@ -95,10 +99,11 @@ ORDER BY c.relpages DESC LIMIT 10;
is the same lock mode acquired on relations by simple
<literal>SELECT</literal> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
- routine, lightweight test for corruption is required in a live
- production environment, using
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</parameter> is
+ <literal>true</literal>. When a routine, lightweight test for
+ corruption is required in a live production environment, using
<function>bt_index_check</function> often provides the best
trade-off between thoroughness of verification and limiting the
impact on application performance and availability.
@@ -108,7 +113,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,19 +122,21 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</parameter>
+ argument is <literal>true</literal>, the function verifies the
+ presence of all heap tuples that should be found within the
+ index. The checks that can be performed by
+ <function>bt_index_parent_check</function> are a superset of the
+ checks that can be performed by <function>bt_index_check</function>.
<function>bt_index_parent_check</function> can be thought of as
a more thorough variant of <function>bt_index_check</function>:
unlike <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
A <literal>ShareLock</literal> is required on the target index by
@@ -158,6 +165,47 @@ ORDER BY c.relpages DESC LIMIT 10;
</variablelist>
</sect2>
+ <sect2>
+ <title>Optional <parameter>heapallindexed</parameter> verification</title>
+ <para>
+ When the <parameter>heapallindexed</parameter> argument to
+ verification functions is <literal>true</literal>, an additional
+ phase of verification is performed against the table associated with
+ the target index relation. This consists of a <quote>dummy</quote>
+ <command>CREATE INDEX</command> operation, which checks for the
+ presence of all hypothetical new index tuples against a temporary,
+ in-memory summarizing structure (this is built when needed during
+ the basic first phase of verification). The summarizing structure
+ <quote>fingerprints</quote> every tuple found within the target
+ index. The high level principle behind
+ <parameter>heapallindexed</parameter> verification is that a new
+ index that is equivalent to the existing, target index must only
+ have entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</parameter> phase adds
+ significant overhead: verification will typically take several times
+ longer. However, there is no change to the relation-level locks
+ acquired when <parameter>heapallindexed</parameter> verification is
+ performed.
+ </para>
+ <para>
+ The summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect an
+ inconsistency for each heap tuple that should be represented in the
+ index, approximately 2 bytes of memory are needed per tuple. As
+ less memory is made available per tuple, the probability of missing
+ an inconsistency slowly increases. This approach limits the
+ overhead of verification significantly, while only slightly reducing
+ the probability of detecting a problem, especially for installations
+ where verification is treated as a routine maintenance task. Any
+ single absent or malformed tuple has a new opportunity to be
+ detected with each new verification attempt.
+ </para>
+
+ </sect2>
+
<sect2>
<title>Using <filename>amcheck</filename> effectively</title>
@@ -197,18 +245,31 @@ ORDER BY c.relpages DESC LIMIT 10;
operating system locales and collations.
</para>
</listitem>
+ <listitem>
+ <para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</parameter>
+ verification is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be subtle.
+ </para>
+ </listitem>
<listitem>
<para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</productname> access method code or sort
- code.
+ underlying <productname>PostgreSQL</productname> access method
+ code, sort code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</productname> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</filename> functions continuously
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</filename> functions continuously
when running the standard regression tests. See <xref
linkend="regress-run"/> for details on running the tests.
</para>
@@ -242,6 +303,12 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</parameter> verification is
+ performed, there is generally a greatly increased chance of
+ detecting single-bit errors, since strict binary equality is
+ tested, and the indexed attributes within the heap are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</filename> can only prove the presence of
@@ -253,11 +320,10 @@ ORDER BY c.relpages DESC LIMIT 10;
<title>Repairing corruption</title>
<para>
No error concerning corruption raised by <filename>amcheck</filename> should
- ever be a false positive. In practice, <filename>amcheck</filename> is more
- likely to find software bugs than problems with hardware.
- <filename>amcheck</filename> raises errors in the event of conditions that,
- by definition, should never happen, and so careful analysis of
- <filename>amcheck</filename> errors is often required.
+ ever be a false positive. <filename>amcheck</filename> raises
+ errors in the event of conditions that, by definition, should never
+ happen, and so careful analysis of <filename>amcheck</filename>
+ errors is often required.
</para>
<para>
There is no general method of repairing problems that
--
2.14.1
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed
Hi!
This patch adds a handy data structure (a Bloom filter) and an important database corruption smoke test (a check that the index agrees with the heap).
I've been using the latter functionality for a while and have found it very useful.
I've checked the recent version. The patch works as expected, and as far as I can see my small notes have been addressed.
I think that this patch is ready for committer.
Best regards, Andrey Borodin.
The new status of this patch is: Ready for Committer
On Tue, Mar 27, 2018 at 8:50 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Mar 23, 2018 at 7:13 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I've just flipped patch to WoA. But if above issues will be fixed I
think that patch is ready for committer.
Attached is v7, which has the small tweaks that you suggested.
Thank you for the review. I hope that this can be committed shortly.
Sorry for coming a bit too late on this thread, but I started looking at
0002 patch.
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
Is that correct? Aren't we actually verifying the presence in the index of
all heap tuples?
@@ -116,37 +140,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
Can we come up with a better name for heapallindexed? Maybe
"check_heapindexed"?
+
+ /*
+ * Register our own snapshot in !readonly case, rather than asking
+ * IndexBuildHeapScan() to do this for us later. This needs to happen
+ * before index fingerprinting begins, so we can later be certain that
+ * index fingerprinting should have reached all tuples returned by
+ * IndexBuildHeapScan().
+ */
+ if (!state->readonly)
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ }
+
So this looks safe. In !readonly mode, we take the snapshot *before*
fingerprinting the index. Since we're using MVCC snapshot, any tuple which is
visible to heapscan must be reachable via indexscan too. So we must find the
index entry for every heap tuple returned by the scan.
What I am not sure about is whether we can really examine an index which is
valid but whose indcheckxmin hasn't crossed our xmin horizon? Notice that this
amcheck might be running in a transaction block, probably in a repeatable-read
isolation level and hence GetTransactionSnapshot() might actually return a
snapshot which can't yet read the index consistently. In practice, this is
quite unlikely, but I think we should check for that case if we agree that it
could be a problem.
The case with readonly mode is also interesting. Since we're scanning heap with
SnapshotAny, heapscan may return us tuples which are RECENTLY_DEAD. So the
question is: can this happen?
- some concurrent index scan sees a heaptuple as DEAD and marks the index
entry as LP_DEAD
- our index fingerprinting sees index tuple as LP_DEAD
- our heap scan still sees the heaptuple as RECENTLY_DEAD
Now that looks improbable given that we compute OldestXmin after our index
fingerprinting was done i.e between step 2 and 3 and hence if a tuple looked
DEAD to some OldestXmin/RecentGlobalXmin computed before we computed our
OldestXmin, then surely our OldestXmin should also see the tuple DEAD. Or is
there a corner case that we're missing?
Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
tuples, especially if those tuples were inserted/deleted by our own
transaction? It probably worth thinking.
Apart from that, comments in IndexBuildHeapRangeScan() claim that the function
is called only with ShareLock and above, which is no longer true. We should
check if that has any side-effects. I can't think of any, but better to verify
and update the comments to reflect the new reality.
The partial indexes look fine since the non-interesting tuples never get
called back.
One thing worth documenting/noting is the fact that a !readonly check will
run with a long-duration registered snapshot, thus holding OldestXmin back.
Is there anything we can do to lessen that burden, like telling other
backends to ignore our xmin while computing OldestXmin (like vacuum does)?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
Is that correct? Aren't we actually verifying the presence in the index of
all heap tuples?
I think that you could describe it either way. We're verifying the
presence of heap tuples in the heap that ought to have been in the
index (that is, most of those that were fingerprinted).
As I say above the callback routine bt_tuple_present_callback(), we
blame any corruption on the heap, since that's more likely. It could
actually be the index which is corrupt, though, in rare cases where it
manages to pass the index structure tests despite being corrupt. This
general orientation comes through in the comment that you asked about.
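To make the direction of the check concrete, here is a toy, self-contained
sketch of the same shape of check (none of this is amcheck code; the hash
function, constants, and sizes are made up purely for illustration). The
"index" side is fingerprinted into a Bloom filter first, and the "heap" side
is then probed against it, so what gets reported is a heap row whose expected
index entry was never fingerprinted:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBITS 1024u				/* power-of-two bitset size (toy value) */

static uint8_t bitset[NBITS / 8];

/* Cheap 32-bit mixer; any reasonable hash function would do here */
static uint32_t
mix32(uint32_t key, uint32_t seed)
{
	uint32_t h = key ^ (seed * 0x9e3779b9u);

	h ^= h >> 16;
	h *= 0x7feb352du;
	h ^= h >> 15;
	h *= 0x846ca68bu;
	h ^= h >> 16;
	return h & (NBITS - 1);
}

/* Phase 1 helper: fingerprint one "index entry" */
static void
bloom_add(uint32_t key)
{
	for (uint32_t k = 0; k < 3; k++)
	{
		uint32_t bit = mix32(key, k);

		bitset[bit / 8] |= (uint8_t) (1u << (bit % 8));
	}
}

/* Phase 2 helper: true means the key was definitely never fingerprinted */
static bool
bloom_lacks(uint32_t key)
{
	for (uint32_t k = 0; k < 3; k++)
	{
		uint32_t bit = mix32(key, k);

		if ((bitset[bit / 8] & (1u << (bit % 8))) == 0)
			return true;
	}
	return false;				/* probably fingerprinted */
}

int
main(void)
{
	uint32_t index_entries[] = {1, 2, 3, 5};	/* the entry for row 4 is "missing" */
	uint32_t heap_rows[] = {1, 2, 3, 4, 5};

	/* Phase 1: walk the "index", fingerprinting every entry */
	for (size_t i = 0; i < sizeof(index_entries) / sizeof(index_entries[0]); i++)
		bloom_add(index_entries[i]);

	/* Phase 2: scan the "heap", probing the filter for each row's entry */
	for (size_t i = 0; i < sizeof(heap_rows) / sizeof(heap_rows[0]); i++)
		if (bloom_lacks(heap_rows[i]))
			printf("row %u has no matching index entry\n", (unsigned) heap_rows[i]);

	/* Almost certainly prints only row 4; a false negative is possible but rare */
	return 0;
}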
Can we come up with a better name for heapallindexed? Maybe
"check_heapindexed"?
I don't think that that name is any better, and you're the first to
mention it in almost a year. I'd rather leave it. The fact that it's a
check is pretty clear from context.
What I am not sure about is whether we can really examine an index which is
valid but whose indcheckxmin hasn't crossed our xmin horizon? Notice that this
amcheck might be running in a transaction block, probably in a repeatable-read
isolation level and hence GetTransactionSnapshot() might actually return a
snapshot which can't yet read the index consistently. In practice, this is
quite unlikely, but I think we should check for that case if we agree that it
could be a problem.
You're right - there is a narrow window for REPEATABLE READ and
SERIALIZABLE transactions. This is a regression in v6, the version that
removed the TransactionXmin test.
I am tempted to fix this by calling GetLatestSnapshot() instead of
GetTransactionSnapshot(). However, that has a problem of its own -- it
won't work in parallel mode, and we're dealing with parallel
restricted functions, not parallel unsafe functions. I don't want to
change that to fix such a narrow issue. IMV, a better fix is to treat
this as a serialization failure. Attached patch, which applies on top
of v7, shows what I mean.
I think that this bug is practically impossible to hit, because we use
the xmin from the pg_index tuple during "is index safe to use?"
indcheckxmin/TransactionXmin checks (the patch that I attached adds a
similar though distinct check), which raises another question for a
REPEATABLE READ xact. That question is: How is a REPEATABLE READ
transaction supposed to see the pg_index entry to get the index
relation's oid to call a verification function in the first place? My
point is that there is no need for a more complicated solution than
what I propose.
Note that the attached patch doesn't update the existing comments on
TransactionXmin, since I see this as a serialization error, which is a
distinct thing -- I'm not actually doing anything with
TransactionXmin.
The case with readonly mode is also interesting. Since we're scanning heap with
SnapshotAny, heapscan may return us tuples which are RECENTLY_DEAD. So the
question is: can this happen?
- some concurrent index scan sees a heaptuple as DEAD and marks the index
entry as LP_DEAD
- our index fingerprinting sees index tuple as LP_DEAD
- our heap scan still sees the heaptuple as RECENTLY_DEAD
Now that looks improbable given that we compute OldestXmin after our index
fingerprinting was done i.e between step 2 and 3 and hence if a tuple looked
DEAD to some OldestXmin/RecentGlobalXmin computed before we computed our
OldestXmin, then surely our OldestXmin should also see the tuple DEAD. Or is
there a corner case that we're missing?
I don't think so. The way we compute OldestXmin for
IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
GetOldestXmin() locks the proc array in shared mode. As you pointed
out, the fact that it comes after everything else (fingerprinting)
means that it must be equal to or later than what index scans saw,
that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD
bits).
The one exception is Hot Standby mode, where it's possible for the
result of GetOldestXmin() (OldestXmin) to go backwards across
successive calls. However, that's not a problem for us because
readonly heapallindexed verification does not work in Hot Standby
mode.
Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
tuples, especially if those tuples were inserted/deleted by our own
transaction? It probably worth thinking.
Apart from that, comments in IndexBuildHeapRangeScan() claim that the function
is called only with ShareLock and above, which is no longer true. We should
check if that has any side-effects. I can't think of any, but better to verify
and update the comments to reflect the new reality.
Those comments only seem to apply in the SnapshotAny/ShareLock case,
which is what amcheck calls the readonly case. When amcheck does not
have a ShareLock, it has an AccessShareLock on the heap relation
instead, and we imitate CREATE INDEX CONCURRENTLY.
IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
ShareUpdateExclusiveLock (it just says SharedLock), because that lock
strength doesn't have anything to do with IndexBuildHeapRangeScan()
when it operates with an MVCC snapshot. I think that this means that
this patch doesn't need to update comments within
IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
independent.
One thing worth documenting/noting is the fact that a !readonly check will
run with a long-duration registered snapshot, thus holding OldestXmin back.
The simple fact that you have a long-running statement already implies
that that'll happen, since that must have a snapshot that is at least
as old as the first snapshot that the first call to bt_check_index()
acquires. It's not a special case; it's exactly as bad as any
statement that takes the same amount of time to execute.
Is there anything we can do to lessen that burden, like telling other
backends to ignore our xmin while computing OldestXmin (like vacuum does)?
I don't think so. The transaction involved is still an ordinary user
transaction.
--
Peter Geoghegan
Attachments:
0003-Defend-against-heapallindexed-using-transaction-snap.patch (text/x-patch)
From 080e3b512a0ad80147cd8d6aaba02e9df5b92daf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Mar 2018 13:33:33 -0700
Subject: [PATCH 3/3] Defend against heapallindexed using transaction snapshot.
---
contrib/amcheck/verify_nbtree.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index c3380895a9..105945ee3b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -23,8 +23,10 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
+#include "access/xact.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
@@ -360,7 +362,30 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
* IndexBuildHeapScan().
*/
if (!state->readonly)
+ {
snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all
+ * the entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact snapshot
+ * was returned at higher isolation levels when that snapshot is
+ * not safe for index scans of the target index. This is possible
+ * when the snapshot sees tuples that are before the index's
+ * indcheckxmin horizon. Throwing an error here should be very
+ * rare. It doesn't seem worth using a secondary snapshot to avoid
+ * this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ snapshot->xmin))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel))));
+ }
}
/* Create context for page */
--
2.14.1
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
Is that correct? Aren't we actually verifying the presence in the index of
all heap tuples?
I think that you could describe it either way. We're verifying the
presence of heap tuples in the heap that ought to have been in the
index (that is, most of those that were fingerprinted).
Hmm Ok. It still sounds backwards to me, but then English is not my first
or even second language. "A heap scan later verifies the presence in the
heap of all index tuples fingerprinted" sounds as if we expect to find all
fingerprinted index tuples in the heap. But in reality, we check if all
heap tuples have a fingerprinted index tuple.
You're right - there is a narrow window for REPEATABLE READ and
SERIALIZABLE transactions. This is a regression in v6, the version that
removed the TransactionXmin test.
I am tempted to fix this by calling GetLatestSnapshot() instead of
GetTransactionSnapshot(). However, that has a problem of its own -- it
won't work in parallel mode, and we're dealing with parallel
restricted functions, not parallel unsafe functions. I don't want to
change that to fix such a narrow issue. IMV, a better fix is to treat
this as a serialization failure. Attached patch, which applies on top
of v7, shows what I mean.
Ok. I am happy with that.
I think that this bug is practically impossible to hit, because we use
the xmin from the pg_index tuple during "is index safe to use?"
indcheckxmin/TransactionXmin checks (the patch that I attached adds a
similar though distinct check), which raises another question for a
REPEATABLE READ xact. That question is: How is a REPEATABLE READ
transaction supposed to see the pg_index entry to get the index
relation's oid to call a verification function in the first place?
Well pg_index entry will be visible and should be visible. Otherwise how do
you even maintain a newly created index. I am not sure, but I guess we take
fresh MVCC snapshots for catalog searches, even in RR transactions.
My point is that there is no need for a more complicated solution than
what I propose.
I agree on that.
I don't think so. The way we compute OldestXmin for
IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
GetOldestXmin() locks the proc array in shared mode. As you pointed
out, the fact that it comes after everything else (fingerprinting)
means that it must be equal to or later than what index scans saw,
that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD
bits).
That's probably true.
Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
tuples, especially if those tuples were inserted/deleted by our own
transaction? It probably worth thinking.
Anything here that you would like to check? I understand that you may see
such tuples only for catalog tables.
IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
ShareUpdateExclusiveLock (it just says SharedLock), because that lock
strength doesn't have anything to do with IndexBuildHeapRangeScan()
when it operates with an MVCC snapshot. I think that this means that
this patch doesn't need to update comments within
IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
independent.
Ok, I agree. But note that we are now invoking that code
with AccessShareLock() whereas the existing callers either use ShareLock or
ShareUpdateExclusiveLock. That probably does not matter, but it's a
change worth noting.
Is there anything we can do to lessen that burden, like telling other
backends to ignore our xmin while computing OldestXmin (like vacuum does)?
I don't think so. The transaction involved is still an ordinary user
transaction.
While that's true, I am wondering if there is anything we can do to stop
this consistency-checking utility from creating other non-trivial side
effects on the state of the database, even if that means we cannot check
all heap tuples. For example, can there be a way by which we allow
concurrent vacuum or HOT prune to continue to prune away dead tuples, even
if that means running bt_check_index() in some specialised way (such as not
allowing it in a transaction block) and the heap scan might miss out some
tuples? I don't know the answer to that, but given that bt_check_index()
may sometimes take hours if not days to finish, it seems worth thinking
about, or at least documenting the side-effects somewhere.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
I don't think so. The transaction involved is still an ordinary user
transaction.
Mostly a nitpick, but I guess we should leave a comment
after IndexBuildHeapScan() saying heap_endscan() is not necessary
since IndexBuildHeapScan()
does that internally. I stumbled upon that while looking for any potential
leaks. I know at least one other caller of IndexBuildHeapScan() doesn't
bother to say anything either, but it's helpful.
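Something along these lines at the call site would do it (a sketch only,
mirroring the IndexBuildHeapScan() call quoted from the patch earlier in the
thread):

	IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
					   bt_tuple_present_callback, (void *) state, scan);

	/*
	 * Note: no heap_endscan() is needed here, since IndexBuildHeapScan()
	 * ends the scan that was passed to it before returning.
	 */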
FWIW I also looked at the 0001 patch and it looks fine to me.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 28, 2018 at 5:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Hmm Ok. It still sounds backwards to me, but then English is not my first or
even second language. "A heap scan later verifies the presence in the heap
of all index tuples fingerprinted" sounds as if we expect to find all
fingerprinted index tuples in the heap. But in reality, we check if all heap
tuples have a fingerprinted index tuple.
Why don't we leave this to the committer that picks the patch up? I
don't actually mind very much. I suspect that I am too close to the
problem to be sure that I've explained it in the clearest possible
way.
You're right - there is a narrow window for REPEATABLE READ and
SERIALIZABLE transactions. This is a regression in v6, the version that
removed the TransactionXmin test.
I am tempted to fix this by calling GetLatestSnapshot() instead of
GetTransactionSnapshot(). However, that has a problem of its own -- it
won't work in parallel mode, and we're dealing with parallel
restricted functions, not parallel unsafe functions. I don't want to
change that to fix such a narrow issue. IMV, a better fix is to treat
this as a serialization failure. Attached patch, which applies on top
of v7, shows what I mean.
Ok. I am happy with that.
Cool.
Well pg_index entry will be visible and should be visible. Otherwise how do
you even maintain a newly created index. I am not sure, but I guess we take
fresh MVCC snapshots for catalog searches, even in RR transactions.
True, but that isn't what would happen with an SQL query that queries
the system catalogs. That only applies to how the system catalogs are
accessed internally, not how they'd almost certainly be accessed when
using amcheck.
Are there any interesting cases around
INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
tuples, especially if those tuples were inserted/deleted by our own
transaction? It probably worth thinking.
Anything here that you would like to check? I understand that you may see
such tuples only for catalog tables.
I think that the WARNING ought to be fine. We shouldn't ever see it,
but if we do it's probably due to a bug in an extension or something.
It doesn't seem particularly likely that someone could ever insert
into the table concurrently despite our having a ShareLock on the
relation. Even with corruption.
IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
ShareUpdateExclusiveLock (it just says SharedLock), because that lock
strength doesn't have anything to do with IndexBuildHeapRangeScan()
when it operates with an MVCC snapshot. I think that this means that
this patch doesn't need to update comments within
IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
independent.
Ok, I agree. But note that we are now invoking that code with
AccessShareLock() whereas the existing callers either use ShareLock or
ShareUpdateExclusiveLock. That probably does not matter, but it's a change
worth noting.
Fair point, even though the ShareUpdateExclusiveLock case isn't
actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
one up to the committer, too? I find it very hard to argue either for
or against this, and I want to avoid "analysis paralysis" at this
important time.
While that's true, I am wondering if there is anything we can do to stop this
consistency-checking utility from creating other non-trivial side effects on
the state of the database, even if that means we cannot check all heap
tuples. For example, can there be a way by which we allow concurrent vacuum
or HOT prune to continue to prune away dead tuples, even if that means
running bt_check_index() in some specialised way (such as not allowing it in a
transaction block) and the heap scan might miss out some tuples? I don't
know the answer to that, but given that bt_check_index() may sometimes take
hours if not days to finish, it seems worth thinking about, or at least
documenting the side-effects somewhere.
That seems like a worthwhile goal for a heap checker that only checks
the structure of the heap, rather than something that checks the
consistency of an index against its table. Especially one that can run
without a connection to the database, for things like backup tools,
where performance is really important. There is certainly room for
that. For this particular enhancement, the similarity to CREATE INDEX
seems like an asset.
--
Peter Geoghegan
On Wed, Mar 28, 2018 at 5:47 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Mostly a nitpick, but I guess we should leave a comment after
IndexBuildHeapScan() saying heap_endscan() is not necessary since
IndexBuildHeapScan() does that internally. I stumbled upon that while
looking for any potential leaks. I know at least one other caller of
IndexBuildHeapScan() doesn't bother to say anything either, but it's
helpful.
Fair point. Again, I'm going to suggest deferring to the committer. I
seem to have decision fatigue this week.
FWIW I also looked at the 0001 patch and it looks fine to me.
I'm grateful that you didn't feel any need to encourage me to use
whatever the novel/variant filter du jour is! :-)
--
Peter Geoghegan
On 29 March 2018 at 01:49, Peter Geoghegan <pg@bowt.ie> wrote:
IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
ShareUpdateExclusiveLock (it just says SharedLock), because that lock
strength doesn't have anything to do with IndexBuildHeapRangeScan()
when it operates with an MVCC snapshot. I think that this means that
this patch doesn't need to update comments within
IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
independent.
Ok, I agree. But note that we are now invoking that code with
AccessShareLock() whereas the existing callers either use ShareLock or
ShareUpdateExclusiveLock. That probably does not matter, but it's a change
worth noting.
Fair point, even though the ShareUpdateExclusiveLock case isn't
actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
one up to the committer, too? I find it very hard to argue either for
or against this, and I want to avoid "analysis paralysis" at this
important time.
The above discussion doesn't make sense to me, hope someone will explain.
I understand we are adding a check to verify heap against index and
this will take longer than before. When it runs does it report
progress of the run via pg_stat_activity, so we can monitor how long
it will take?
Locking is also an important concern.
If we need a ShareLock to run the extended check and the check runs
for a long time, when would we decide to run that? This sounds like it
will cause a production outage, so what are the pre-conditions that
would lead us to say "we'd better run this". For example, if running
this is known to be significantly faster than running CREATE INDEX,
that might be an argument for someone to run this first if index
corruption is suspected.
If it detects an issue, can it fix the issue for the index by
injecting correct entries? If not then we will have to run CREATE
INDEX afterwards anyway, which makes it more likely that people would
just run CREATE INDEX and not bother with the check.
So my initial questions are about when we would run this and making
sure that is documented.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 29, 2018 at 2:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I understand we are adding a check to verify heap against index and
this will take longer than before. When it runs does it report
progress of the run via pg_stat_activity, so we can monitor how long
it will take?
My point to Pavan was that the checker function that uses a weaker
lock, bt_index_check(), imitates a CREATE INDEX CONCURRENTLY. However,
unlike CIC, it acquires an ASL, not a ShareUpdateExclusiveLock. I
believe that this is safe, because we only actually perform a process
that's similar to the first heap scan of a CIC, without the other CIC
steps. (The variant that uses a ShareLock, bt_index_parent_check(),
imitates a CREATE INDEX, so no difference in lock strength there.)
amcheck is still just a couple of SQL-callable functions, so the
closest thing that there is to progress monitoring is DEBUG traces,
which I've found to be informative in real-world situations. Of
course, only a well informed user can interpret that. No change there.
Locking is also an important concern.
If we need a ShareLock to run the extended check and the check runs
for a long time, when would we decide to run that? This sounds like it
will cause a production outage, so what are the pre-conditions that
would lead us to say "we'd better run this". For example, if running
this is known to be significantly faster than running CREATE INDEX,
that might be an argument for someone to run this first if index
corruption is suspected.
I think that most people will choose to use bt_index_check(), since,
as I said, it only requires an ASL, as opposed to
bt_index_parent_check(), which requires a ShareLock. This is
especially likely because there isn't much difference between how
thorough verification is. The fact that bt_index_check() is likely the
best thing to use for routine verification is documented in the
v10/master version [1], which hasn't changed. That said, there are
definitely environments where nobody cares a jot that a ShareLock is
needed, which is why we have bt_index_parent_check(). Often, amcheck
is used when the damage is already done, so that it can be fully
assessed.
This patch does not really change the interface; it just adds a new
heapallindexed argument, which has a default of 'false'.
bt_index_check() is unlikely to cause too many problems in production.
Heroku ran it on all databases at one point, and it wasn't too much
trouble. Of course, that was a version that lacked this heapallindexed
enhancement, which slows down verification by rather a lot when
actually used. My rough estimate is that heapallindexed verification
makes the process take 5x longer.
If it detects an issue, can it fix the issue for the index by
injecting correct entries? If not then we will have to run CREATE
INDEX afterwards anyway, which makes it more likely that people would
just run CREATE INDEX and not bother with the check.
It does not fix anything. I think that the new check is far more
likely to find problems in the heap than in the index, which is the
main reason for this.
The new check can only begin to run at the point where the index
structure has been found to be self-consistent, which is the main
reason why it seems like more of a heap checker than an index checker.
Also, that's what it actually found in the couple of interesting cases
that we've seen. It detects "freeze the dead" corruption, at least
with the test cases we have available. It also detects corruption
caused by failing to detect broken HOT chains during an initial CREATE
INDEX; there were two such bugs in CREATE INDEX CONCURRENTLY, one in
2012, and another in 2017, as you'll recall. The 2017 one (the CIC bug
that Pavan found through mechanical testing) went undetected for a
very long time; I think that a tool like this greatly increases our
odds of early detection of that kind of thing.
These are both issues that kind of seem like index corruption, that
are actually better understood as heap corruption. It's subtle.
So my initial questions are about when we would run this and making
sure that is documented.
Good question. I find it hard to offer general advice about this, to
be honest. In theory, you should never need to run it, because
everything should work, and in practice that's generally almost true.
I've certainly used it when investigating problems after the fact, and
as a general smoke-test, where it works well. I would perhaps
recommend running it once a week, although that's a fairly arbitrary
choice. The docs in v10 don't take a position on this, so while I tend
to agree we could do better, it is a preexisting issue.
[1]: https://www.postgresql.org/docs/10/static/amcheck.html#id-1.11.7.11.7
--
Peter Geoghegan
Hi,
On 2018-03-26 20:20:57 -0700, Peter Geoghegan wrote:
From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
This commit adds a test harness extension module, test_bloomfilter. It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
Maybe add a short paragraph explaining what this'll be used for soon.
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per-line. Would make
things like this easier to view and conflicts easier to resolve.
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ *	  Space-efficient set membership testing
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set.
s/of/in/?
False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
The second half of this paragraph isn't easy to understand.
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
s/its Bloom/the Bloom/?
+ * The implementation is well suited to data synchronization problems between
+ * unordered sets, especially where predictable performance is important and
+ * some false positives are acceptable.
I'm not finding "data synchronization" very descriptive. Makes me think
of multi-threaded races and such.
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
s/should/aims at/?
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
I'd simplify the language here. I'd replace ought with should at the
very least. Replace we with "the bloom filter" or similar?
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ * This determines the size of the underlying bitset (trivial bookkeeping space
+ * isn't counted). The bitset is always sized as a power-of-two number of
+ * bits, and the largest possible bitset is 512MB. The implementation rounds
+ * down as needed.
"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+	bloom_filter *filter;
+	int			bloom_power;
+	uint64		bitset_bytes;
+	uint64		bitset_bits;
+
+	/*
+	 * Aim for two bytes per element; this is sufficient to get a false
+	 * positive rate below 1%, independent of the size of the bitset or total
+	 * number of elements. Also, if rounding down the size of the bitset to
+	 * the next lowest power of two turns out to be a significant drop, the
+	 * false positive rate still won't exceed 2% in almost all cases.
+	 */
+	bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+	/* Minimum allowable size is 1MB */
+	bitset_bytes = Max(1024L * 1024L, bitset_bytes);
Some upcasting might be advisable, to avoid dangers of overflows?
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
+ * used to make this work. The main reason we prefer enhanced double hashing
+ * to classic double hashing is that the latter has an issue with collisions
+ * when using power-of-two sized bitsets. See Dillinger & Manolios for full
+ * details.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+	uint64		hash;
+	uint32		x, y;
+	uint64		m;
+	int			i;
+
+	/* Use 64-bit hashing to get two independent 32-bit hashes */
+	hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
Hm. Is that smart given how some hash functions are defined? E.g. for
int8 the higher bits aren't really that independent for small values:
Datum
hashint8(PG_FUNCTION_ARGS)
{
	/*
	 * The idea here is to produce a hash value compatible with the values
	 * produced by hashint4 and hashint2 for logically equal inputs; this is
	 * necessary to support cross-type hash joins across these input types.
	 * Since all three types are signed, we can xor the high half of the int8
	 * value if the sign is positive, or the complement of the high half when
	 * the sign is negative.
	 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32(lohalf);
}

Datum
hashint8extended(PG_FUNCTION_ARGS)
{
	/* Same approach as hashint8 */
	int64		val = PG_GETARG_INT64(0);
	uint32		lohalf = (uint32) val;
	uint32		hihalf = (uint32) (val >> 32);

	lohalf ^= (val >= 0) ? hihalf : ~hihalf;

	return hash_uint32_extended(lohalf, PG_GETARG_INT64(1));
}
+/*
+ * Calculate "val MOD m" inexpensively.
+ *
+ * Assumes that m (which is bitset size) is a power-of-two.
+ *
+ * Using a power-of-two number of bits for bitset size allows us to use bitwise
+ * AND operations to calculate the modulo of a hash value. It's also a simple
+ * way of avoiding the modulo bias effect.
+ */
+static inline uint32
+mod_m(uint32 val, uint64 m)
+{
+	Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
+	Assert(((m - 1) & m) == 0);
+
+	return val & (m - 1);
+}
What's up with the two different widths here?
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ *	  Space-efficient set membership testing
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
Names starting with an underscore followed by an uppercase letter are
reserved. Yes, we have some already. No, we shouldn't introduce further ones.
From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
And here we get back to why I thought last year that this interface is
bad. Now this really can't properly be described as a pure index check
anymore, and we need to add the same functionality to multiple
functions.
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
This breaks functions et al referencing the existing function. Also, I
don't quite recall the rules, don't we have to drop the function from
the extension first?
 /*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
  *
  * Verify integrity of B-Tree index.
  *
  * Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
  */
 Datum
 bt_index_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
+	bool		heapallindexed = false;

-	bt_index_check_internal(indrelid, false);
+	if (PG_NARGS() == 2)
+		heapallindexed = PG_GETARG_BOOL(1);
+
+	bt_index_check_internal(indrelid, false, heapallindexed);

 	PG_RETURN_VOID();
 }
Given the PG_NARGS() checks I don't understand why you don't just create a
separate two-argument SQL function above? If you rely on the default
value you don't need it anyway?
+	/*
+	 * * Heap contains unindexed/malformed tuples check *
+	 */
I'd reorder this to "Check whether heap contains ...".
+	if (state->heapallindexed)
+	{
+		IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
+		HeapScanDesc scan;
+
+		/*
+		 * Create our own scan for IndexBuildHeapScan(), like a parallel index
+		 * build. We do things this way because it lets us use the MVCC
+		 * snapshot we acquired before index fingerprinting began in the
+		 * !readonly case.
+		 */
I'd shorten the middle part out, so it's "IndexBuildHeapScan(), so we
can register an MVCC snapshot acquired before..."
+		scan = heap_beginscan_strat(state->heaprel, /* relation */
+									snapshot,	/* snapshot */
+									0,	/* number of keys */
+									NULL,	/* scan key */
+									true,	/* buffer access strategy OK */
+									true);	/* syncscan OK? */
+
+		/*
+		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
+		 * behaves when only AccessShareLock held. This is really only needed
+		 * to prevent confusion within IndexBuildHeapScan() about how to
+		 * interpret the state we pass.
+		 */
+		indexinfo->ii_Concurrent = !state->readonly;
That's not very descriptive.
+	/* Fingerprint leaf page tuples (those that point to the heap) */
+	if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+		bloom_add_element(state->filter, (unsigned char *) itup,
+						  IndexTupleSize(itup));
So, if we didn't use IndexBuildHeapScan(), we could also check whether
dead entries at least have a corresponding item on the page, right? I'm
not asking to change, mostly curious.
+/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
The last sentence isn't entirely fully true ;), given that we check for
the bloom insertion above. s/checks/verification/?
We should be able to get this into v11...
Greetings,
Andres Freund
On Thu, Mar 29, 2018 at 6:18 PM, Andres Freund <andres@anarazel.de> wrote:
This commit adds a test harness extension module, test_bloomfilter. It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
Maybe add a short paragraph explaining what this'll be used for soon.
Sure.
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global

-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
-	knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+	ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per-line. Would make
things like this easier to view and conflicts easier to resolve.
That does seem like it would be a minor improvement.
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ *	  Space-efficient set membership testing
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set.
s/of/in/?
Wikipedia says "A Bloom filter is a space-efficient probabilistic data
structure, conceived by Burton Howard Bloom in 1970, that is used to
test whether an element is a member of a set". I think that either
works just as well.
False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
The second half of this paragraph isn't easy to understand.
I'll tweak it.
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
s/its Bloom/the Bloom/?
Okay. I'll give you that one.
+ * The implementation is well suited to data synchronization problems between
+ * unordered sets, especially where predictable performance is important and
+ * some false positives are acceptable.
I'm not finding "data synchronization" very descriptive. Makes me think
of multi-threaded races and such.
Again, this is from Wikipedia:
https://en.wikipedia.org/wiki/Bloom_filter#Data_synchronization
https://en.wikipedia.org/wiki/Data_synchronization
+/*
+ * Create Bloom filter in caller's memory context. This should get a false
+ * positive rate of between 1% and 2% when bitset is not constrained by memory.
s/should/aims at/?
I think that "should" is accurate, and no less informative. I'm not
going to argue with you, though -- I'll change it.
+ * total_elems is an estimate of the final size of the set. It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more. See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
I'd simplify the language here. I'd replace ought with should at the
very least. Replace we with "the bloom filter" or similar?
I don't see what's wrong with "ought", but okay. I don't see what's
wrong with "we", but okay.
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ * This determines the size of the underlying bitset (trivial bookkeeping space
+ * isn't counted). The bitset is always sized as a power-of-two number of
+ * bits, and the largest possible bitset is 512MB. The implementation rounds
+ * down as needed.
"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?
Okay.
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
+{
+	bloom_filter *filter;
+	int			bloom_power;
+	uint64		bitset_bytes;
+	uint64		bitset_bits;
+
+	/*
+	 * Aim for two bytes per element; this is sufficient to get a false
+	 * positive rate below 1%, independent of the size of the bitset or total
+	 * number of elements. Also, if rounding down the size of the bitset to
+	 * the next lowest power of two turns out to be a significant drop, the
+	 * false positive rate still won't exceed 2% in almost all cases.
+	 */
+	bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+	/* Minimum allowable size is 1MB */
+	bitset_bytes = Max(1024L * 1024L, bitset_bytes);
Some upcasting might be advisable, to avoid dangers of overflows?
When it comes to sizing work_mem, using long literals to go from KB to
bytes is How It's Done™. I actually think that's silly myself, because
it's based on the assumption that long is wider than int, even though
it isn't on Windows. But that's okay because we have the old work_mem
size limits on Windows.
What would the upcasting you have in mind look like?
+	/* Use 64-bit hashing to get two independent 32-bit hashes */
+	hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
Hm. Is that smart given how some hash functions are defined? E.g. for
int8 the higher bits aren't really that independent for small values:
Robert suggested that I do this. I don't think that we need to make it
about the quality of the hash function that we have available. That
really seems like a distinct question to me. It seems clear that this
ought to be fine (or should be fine, if you prefer). I understand why
you're asking about this, but it's not scalable to ask every user of a
hash function to care that it might be a bit crap. Hash functions
aren't supposed to be a bit crap.
You may be nervous about the overall quality of the Bloom filter,
which is understandable. Check out the test harness, and how it can be
used [1]. This shows that the theoretical/expected false positive rate [2]
of the Bloom filter is varied, right up until the largest supported
size (512MB). The margin of error is tiny - certainly much less than a
practical use-case could ever care about.
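To make that concrete, here is a minimal sketch of how a caller drives the
API (signatures as in the patch; the helper name and constants below are
made up):

#include "postgres.h"

#include "lib/bloomfilter.h"

/* Sketch only: fingerprint a set of int64 keys, then probe the filter */
static bool
bloom_usage_sketch(void)
{
	bloom_filter *filter;
	int64		nelems = 1000000;	/* estimate of final set size */
	int			work_mem_kb = 64 * 1024;	/* 64MB, expressed in KB */
	int64		i;
	bool		maybe_present;

	filter = bloom_create(nelems, work_mem_kb, 0);

	/* First pass: add every element */
	for (i = 0; i < nelems; i++)
		bloom_add_element(filter, (unsigned char *) &i, sizeof(i));

	/*
	 * Second pass: "definitely not present" is authoritative, while
	 * "possibly present" may be a false positive (target rate is 1% - 2%)
	 */
	i = 42;
	maybe_present = !bloom_lacks_element(filter, (unsigned char *) &i,
										 sizeof(i));

	bloom_free(filter);

	return maybe_present;
}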
+/*
+ * Calculate "val MOD m" inexpensively.
+ *
+ * Assumes that m (which is bitset size) is a power-of-two.
+ *
+ * Using a power-of-two number of bits for bitset size allows us to use bitwise
+ * AND operations to calculate the modulo of a hash value. It's also a simple
+ * way of avoiding the modulo bias effect.
+ */
+static inline uint32
+mod_m(uint32 val, uint64 m)
+{
+	Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
+	Assert(((m - 1) & m) == 0);
+
+	return val & (m - 1);
+}
What's up with the two different widths here?
For a 512MB Bloom filter, m can be UINT_MAX + 1. This is referenced earlier on.
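Spelled out as a toy (not the patch code, just the arithmetic at the maximum
size):

/*
 * At the 512MB maximum, m is 2^32 bits: representable as uint64, but one
 * past what uint32 can hold.  The mask (m - 1) is 0xFFFFFFFF, so the result
 * of "val & (m - 1)" always fits back into uint32.
 */
static inline uint32
mod_m_at_max_size(uint32 val)
{
	uint64		m = UINT64CONST(1) << 32;	/* 2^32 bits == 512MB bitset */

	return (uint32) (val & (m - 1));
}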
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
Names starting with an underscore followed by an uppercase letter are
reserved. Yes, we have some already. No, we shouldn't introduce further ones.
Okay. Will fix.
From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): callers can check that each heap tuple that
ought to have an index entry does in fact have one. This happens at the
end of the existing verification checks.
And here we get back to why I thought last year that this interface is
bad. Now this really can't properly be described as a pure index check
anymore, and we need to add the same functionality to multiple
functions.
If it makes you feel any better, I'm pretty much done with
bt_index_check() and bt_index_parent_check(). This patch will more
than likely be the last to revise their interface; there will be new
functions next year. That's why I thought it was okay last year.
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
This breaks functions et al referencing the existing function.
This sounds like a general argument against ever changing a function's
signature. It's not like we haven't done that a number of times in
extensions like pageinspect. Does it really matter?
Also, I
don't quite recall the rules, don't we have to drop the function from
the extension first?
But...I did drop the function?
Given the PG_NARGS() checks I don't understand why you don't just create a
separate two-argument SQL function above? If you rely on the default
value you don't need it anyway?
I don't need it. But a user of amcheck might appreciate it.
+	/*
+	 * * Heap contains unindexed/malformed tuples check *
+	 */
I'd reorder this to "Check whether heap contains ...".
Okay.
+	if (state->heapallindexed)
+	{
+		IndexInfo  *indexinfo = BuildIndexInfo(state->rel);
+		HeapScanDesc scan;
+
+		/*
+		 * Create our own scan for IndexBuildHeapScan(), like a parallel index
+		 * build. We do things this way because it lets us use the MVCC
+		 * snapshot we acquired before index fingerprinting began in the
+		 * !readonly case.
+		 */
I'd shorten the middle part out, so it's "IndexBuildHeapScan(), so we
can register an MVCC snapshot acquired before..."
Okay.
+		scan = heap_beginscan_strat(state->heaprel, /* relation */
+									snapshot,	/* snapshot */
+									0,	/* number of keys */
+									NULL,	/* scan key */
+									true,	/* buffer access strategy OK */
+									true);	/* syncscan OK? */
+
+		/*
+		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
+		 * behaves when only AccessShareLock held. This is really only needed
+		 * to prevent confusion within IndexBuildHeapScan() about how to
+		 * interpret the state we pass.
+		 */
+		indexinfo->ii_Concurrent = !state->readonly;
That's not very descriptive.
The point is that it doesn't expect an MVCC snapshot when we don't say
we're CIC. The assertions that I added to IndexBuildHeapScan() go
nuts, for example.
I'll change this to "This is needed so that IndexBuildHeapScan() knows
to expect an MVCC snapshot".
+	/* Fingerprint leaf page tuples (those that point to the heap) */
+	if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+		bloom_add_element(state->filter, (unsigned char *) itup,
+						  IndexTupleSize(itup));
So, if we didn't use IndexBuildHeapScan(), we could also check whether
dead entries at least have a corresponding item on the page, right? I'm
not asking to change, mostly curious.
It's true that the dead entries ought to point to a valid heap
tuple/root heap-only tuple, since the LP_DEAD bit is just a hint.
However, you'd have to hold a buffer lock on the leaf page throughout,
since a write that would otherwise result in a page split is going to
recycle LP_DEAD items.
FWIW, the !ItemIdIsDead() thing isn't very important, because LP_DEAD
bits tend to get set quickly when the heap happens to be corrupt in a
way that makes corresponding heap tuples look dead when they shouldn't
be. But it's an opportunity to add less stuff to the Bloom filter,
which might make a small difference. Also, it might have some minor
educational value, for hackers that want to learn more about nbtree,
which remains a secondary goal.
+/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
The last sentence isn't entirely fully true ;), given that we check for
the bloom insertion above. s/checks/verification/?
Not sure what you mean. Is your point that we could have an error from
within IndexBuildHeapScan() and friends, as opposed to from this
callback, such as the error that we saw during the "freeze the dead"
business [3]?
The bloom_add_element() calls cannot raise errors; they're only there to
build the summarizing structure that the check in the callback
bt_tuple_present_callback() works off of.
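For anyone following along without 0002 in front of them, the callback has
roughly this shape -- a sketch under the assumption that it re-forms the
would-be index tuple and probes the filter, not the actual patch code:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/itup.h"
#include "lib/bloomfilter.h"
#include "utils/rel.h"

/* Sketch of an IndexBuildHeapScan() per-tuple callback (not patch code) */
static void
tuple_present_sketch(Relation index, HeapTuple htup, Datum *values,
					 bool *isnull, bool tupleIsAlive, void *state)
{
	bloom_filter *filter = (bloom_filter *) state;
	IndexTuple	itup;

	/* Form the tuple the index AM would have stored, with its heap TID */
	itup = index_form_tuple(RelationGetDescr(index), values, isnull);
	itup->t_tid = htup->t_self;

	/* A "definitely not in the index" answer means corruption */
	if (bloom_lacks_element(filter, (unsigned char *) itup,
							IndexTupleSize(itup)))
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("heap tuple (%u,%u) lacks matching index tuple",
						ItemPointerGetBlockNumber(&htup->t_self),
						ItemPointerGetOffsetNumber(&htup->t_self))));

	pfree(itup);
}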
We should be able to get this into v11...
That's a relief. Thanks.
[1]: https://postgr.es/m/CAH2-Wznm5ZOjS0_DJoWrcm9Us19gzbkm0aTKt5hHprvjHFVHpQ@mail.gmail.com
[2]: https://hur.st/bloomfilter/
[3]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8ecdc2ffe3da3a84d01e51c784ec3510157c893b
--
Peter Geoghegan
On Thu, Mar 29, 2018 at 7:42 PM, Peter Geoghegan <pg@bowt.ie> wrote:
We should be able to get this into v11...
That's a relief. Thanks.
I have a new revision lined up. I won't send anything to the list
until you clear up what you meant in those few cases where it seemed
unclear.
I also acted on some of the feedback from Pavan, which I'd previously
put off/deferred. It seemed like there was no reason not to do it his
way when it came to minor stylistic points that were non-issues to me.
Finally, I added something myself:
	/*
	 * lp_len should match the IndexTuple reported length exactly, since
	 * lp_len is completely redundant in indexes, and both sources of tuple
	 * length are MAXALIGN()'d. nbtree does not use lp_len all that
	 * frequently, and is surprisingly tolerant of corrupt lp_len fields.
	 */
	if (tupsize != ItemIdGetLength(itemid))
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("index tuple size does not equal lp_len in index \"%s\"",
						RelationGetRelationName(state->rel)),
				 errdetail_internal("Index tid=(%u,%u) tuple size=%zu lp_len=%u page lsn=%X/%X.",
									state->targetblock, offset,
									tupsize, ItemIdGetLength(itemid),
									(uint32) (state->targetlsn >> 32),
									(uint32) state->targetlsn),
				 errhint("This could be a torn page problem")));
It seems to me that we should take the opportunity to verify each
tuple's IndexTupleSize() value, now that we'll be using it directly.
There happens to be an easy way to do that, so why not just do it?
This is unlikely to find an error that we wouldn't have detected
anyway, even without using the new heapallindexed option. However, it
seems likely that this error message is more accurate in the real
world cases where it will be seen. A torn page can leave us with a
page image that looks surprisingly not-so-corrupt.
--
Peter Geoghegan
On 2018-03-29 19:42:42 -0700, Peter Geoghegan wrote:
+	/*
+	 * Aim for two bytes per element; this is sufficient to get a false
+	 * positive rate below 1%, independent of the size of the bitset or total
+	 * number of elements. Also, if rounding down the size of the bitset to
+	 * the next lowest power of two turns out to be a significant drop, the
+	 * false positive rate still won't exceed 2% in almost all cases.
+	 */
+	bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+	/* Minimum allowable size is 1MB */
+	bitset_bytes = Max(1024L * 1024L, bitset_bytes);
Some upcasting might be advisable, to avoid dangers of overflows?
When it comes to sizing work_mem, using long literals to go from KB to
bytes is How It's Done™. I actually think that's silly myself, because
it's based on the assumption that long is wider than int, even though
it isn't on Windows. But that's okay because we have the old work_mem
size limits on Windows.
What would the upcasting you have in mind look like?
Just use UINT64CONST()? Let's try not to introduce further code that'll
need to get painfully fixed.
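Concretely, that would mean doing the KB-to-bytes conversion in 64 bits on
every platform -- a sketch using the existing variable names, not proposed
patch code:

/*
 * Existing spelling: a 32-bit multiply on LLP64 platforms (where long is 32
 * bits, e.g. Windows); it only stays safe there because work_mem is capped
 * lower on such platforms.
 */
bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);

/*
 * Widened spelling: the uint64 constant promotes bloom_work_mem before the
 * multiply, with no reliance on the platform-specific cap.
 */
bitset_bytes = Min(bloom_work_mem * UINT64CONST(1024), total_elems * 2);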
+	/* Use 64-bit hashing to get two independent 32-bit hashes */
+	hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
Hm. Is that smart given how some hash functions are defined? E.g. for
int8 the higher bits aren't really that independent for small values:
Robert suggested that I do this. I don't think that we need to make it
about the quality of the hash function that we have available. That
really seems like a distinct question to me. It seems clear that this
ought to be fine (or should be fine, if you prefer). I understand why
you're asking about this, but it's not scalable to ask every user of a
hash function to care that it might be a bit crap. Hash functions
aren't supposed to be a bit crap.
Well, they're not supposed to be, but they are. Practical realities
matter... But I think it's moot because we don't use any of the bad
ones, I only got to that later...
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
This breaks functions et al referencing the existing function.
This sounds like a general argument against ever changing a function's
signature. It's not like we haven't done that a number of times in
extensions like pageinspect. Does it really matter?
Yes, it does. And imo we shouldn't.
Also, I
don't quite recall the rules, don't we have to drop the function from
the extension first?
But...I did drop the function?
I mean in the ALTER EXTENSION ... DROP FUNCTION sense.
Greetings,
Andres Freund
On Fri, Mar 30, 2018 at 6:20 PM, Andres Freund <andres@anarazel.de> wrote:
What would the upcasting you have in mind look like?
Just use UINT64CONST()? Let's try not to introduce further code that'll
need to get painfully fixed.
What I have right now is based on imitating the style that Tom uses.
I'm pretty sure that I did something like that in the patch I posted
that became 9563d5b5, which Tom editorialized to be in
"maintenance_work_mem * 1024L" style. That was only about 2 years ago.
I'll go ahead and use UINT64CONST(), as requested, but I wish that the
messaging on the right approach to such a standard question of style
was not contradictory.
Well, they're not supposed to be, but they are. Practical realities
matter... But I think it's moot because we don't use any of the bad
ones, I only got to that later...
Okay. My point was that that's not really the right level to solve the
problem (if that was a problem we had, which it turns out it isn't).
Not that practical realities don't matter.
This sounds like a general argument against ever changing a function's
signature. It's not like we haven't done that a number of times in
extensions like pageinspect. Does it really matter?
Yes, it does. And imo we shouldn't.
My recollection is that you didn't like the original bt_index_check()
function signature in the final v10 CF because it didn't allow you to
add arbitrary new arguments in the future; the name was too
prescriptive to support that. You wanted to only have one function
signature, envisioning a time when there'd be many arguments, that
you'd typically invoke using the named function argument (=>) syntax,
since many arguments may not actually be interesting, depending on the
exact details. I pushed back specifically because I thought there
should be simple rules for the heavyweight lock strength --
bt_index_check() should always acquire an ASL and only an ASL, and so
on. I also thought that we were unlikely to need many more options
that specifically deal with B-Tree indexes.
You brought this up again recently, recalling that my original
preferred signature style (the one that we went with) was bad because
it now necessitates altering a function signature to add a new
argument. I must admit that I am rather confused. Weren't *you* the
one that wanted to add lots of new arguments in the future? As I said,
I'm sure that I'm done adding new arguments to bt_index_check() +
bt_index_parent_check(). It's possible that there'll be another way to
get essentially the same functionality at a coarser granularity (e.g.
database, table), certainly, but I don't see that there is much more
that we can do while starting with a B-Tree index as our target.
Please propose an alternative user interface for the new check. If you
prefer, I can invent new bt_index_check_heap() +
bt_index_parent_check_heap() variants. That would be okay with me.
--
Peter Geoghegan
On March 30, 2018 6:55:50 PM PDT, Peter Geoghegan <pg@bowt.ie> wrote:
Please propose an alternative user interface for the new check. If you
prefer, I can invent new bt_index_check_heap() +
bt_index_parent_check_heap() variants. That would be okay with me.
I'm just saying that there should be two functions here, rather than dropping the old definition, and creating a new one with a default argument.
(Phone, more another time)
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Mar 30, 2018 at 6:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Mar 30, 2018 at 6:20 PM, Andres Freund <andres@anarazel.de> wrote:
What would the upcasting you have in mind look like?
Just use UINT64CONST()? Let's try not to introduce further code that'll
need to get painfully fixed.
What I have right now is based on imitating the style that Tom uses.
I'm pretty sure that I did something like that in the patch I posted
that became 9563d5b5, which Tom editorialized to be in
"maintenance_work_mem * 1024L" style. That was only about 2 years ago.I'll go ahead and use UINT64CONST(), as requested, but I wish that the
messaging on the right approach to such a standard question of style
was not contradictory.
BTW, it's not obvious that using UINT64CONST() is going to become the
standard in the future. You may recall that commit 79e0f87a156 fixed a
bug that came from using Size instead of long in tuplesort.c;
tuplesort expects a signed type, since availMem must occasionally go
negative. Noah was not aware of using long for work_mem calculations,
imagining quite reasonably (but incorrectly) that that would break on
Windows, in the process missing this subtle negative availMem
requirement.
The 79e0f87a156 fix changed tuplesort to use int64 (even though tuplesort
could have been changed back to using long without consequence), which I
thought might spread further and
eventually become a coding standard to follow. The point of things
like coding standards around expanding work_mem KB arguments to bytes,
or the MaxAllocSize thing, is that they cover a wide variety of cases
quite well, without much danger. Now, as it happens the Bloom filter
doesn't need to think about something like a negative availMem, so I
could use uint64 (or UINT64CONST()) for the size of the allocation.
But let's not pretend that that doesn't have its own problems. Am I
expected to learn everyone's individual preferences and prejudices
here?
--
Peter Geoghegan
On Fri, Mar 30, 2018 at 7:04 PM, Andres Freund <andres@anarazel.de> wrote:
I'm just saying that there should be two functions here, rather than dropping the old definition, and creating a new one with a default argument.
So you're asking for something like bt_index_check_heap() +
bt_index_parent_check_heap()? Or, are you talking about function
overloading?
--
Peter Geoghegan
On 2018-03-31 11:27:14 -0700, Peter Geoghegan wrote:
On Fri, Mar 30, 2018 at 7:04 PM, Andres Freund <andres@anarazel.de> wrote:
I'm just saying that there should be two functions here, rather than dropping the old definition, and creating a new one with a default argument.
So you're asking for something like bt_index_check_heap() +
bt_index_parent_check_heap()? Or, are you talking about function
overloading?
The latter. That addresses my concerns about dropping the function and
causing issues due to dependencies.
Greetings,
Andres Freund
On Sat, Mar 31, 2018 at 2:56 PM, Andres Freund <andres@anarazel.de> wrote:
So you're asking for something like bt_index_check_heap() +
bt_index_parent_check_heap()? Or, are you talking about function
overloading?
The latter. That addresses my concerns about dropping the function and
causing issues due to dependencies.
WFM. I have all the information I need to produce the next revision now.
--
Peter Geoghegan
On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
WFM. I have all the information I need to produce the next revision now.
I might as well post this one first. I'll have 0002 for you in a short while.
--
Peter Geoghegan
Attachments:
v8-0001-Add-Bloom-filter-data-structure-implementation.patch
From 0757f36fbb7c56af8882cf77ca00aa4f2f9d976c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH v8 1/2] Add Bloom filter data structure implementation.
A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership. Callers will sometimes incur false
positives, but never false negatives. The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.
Two classic applications of Bloom filters are cache filtering, and data
synchronization testing. Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
This commit adds a test harness extension module, test_bloomfilter. It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
This is infrastructure for the upcoming "heapallindexed" amcheck patch,
which verifies the consistency of a heap relation against one of its
indexes.
Author: Peter Geoghegan
Reviewed-By: Andrey Borodin, Michael Paquier, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g@mail.gmail.com
---
src/backend/lib/Makefile | 4 +-
src/backend/lib/README | 2 +
src/backend/lib/bloomfilter.c | 305 +++++++++++++++++++++
src/include/lib/bloomfilter.h | 27 ++
src/test/modules/Makefile | 1 +
src/test/modules/test_bloomfilter/.gitignore | 4 +
src/test/modules/test_bloomfilter/Makefile | 21 ++
src/test/modules/test_bloomfilter/README | 68 +++++
.../test_bloomfilter/expected/test_bloomfilter.out | 22 ++
.../test_bloomfilter/sql/test_bloomfilter.sql | 19 ++
.../test_bloomfilter/test_bloomfilter--1.0.sql | 11 +
.../modules/test_bloomfilter/test_bloomfilter.c | 138 ++++++++++
.../test_bloomfilter/test_bloomfilter.control | 4 +
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 625 insertions(+), 2 deletions(-)
create mode 100644 src/backend/lib/bloomfilter.c
create mode 100644 src/include/lib/bloomfilter.h
create mode 100644 src/test/modules/test_bloomfilter/.gitignore
create mode 100644 src/test/modules/test_bloomfilter/Makefile
create mode 100644 src/test/modules/test_bloomfilter/README
create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe43f2..191ea9bca2 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
- knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+ ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e437..376ae273a9 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
binaryheap.c - a binary heap
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
hyperloglog.c - a streaming cardinality estimator
pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000000..eb08f4a7b8
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,305 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ * Space-efficient set membership testing
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set. False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set". This is typically very space efficient,
+ * which can be a decisive advantage.
+ *
+ * Elements can be added to the set, but not removed. The more elements that
+ * are added, the larger the probability of false positives. Caller must hint
+ * an estimated total size of the set when the Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * The implementation is well suited to data synchronization problems between
+ * unordered sets, especially where predictable performance is important and
+ * some false positives are acceptable. It's also well suited to cache
+ * filtering problems where a relatively small and/or low cardinality set is
+ * fingerprinted, especially when many subsequent membership tests end up
+ * indicating that values of interest are not present. That should save the
+ * caller many authoritative lookups, such as expensive probes of a much larger
+ * on-disk structure.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+
+#define MAX_HASH_FUNCS 10
+
+struct bloom_filter
+{
+ /* K hash functions are used, seeded by caller's seed */
+ int k_hash_funcs;
+ uint64 seed;
+ /* m is bitset size, in bits. Must be a power of two <= 2^32. */
+ uint64 m;
+ unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static int my_bloom_power(uint64 target_bitset_bits);
+static int optimal_k(uint64 bitset_bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+ size_t len);
+static inline uint32 mod_m(uint32 a, uint64 m);
+
+/*
+ * Create Bloom filter in caller's memory context. We aim for a false positive
+ * rate of between 1% and 2% when bitset size is not constrained by memory
+ * availability.
+ *
+ * total_elems is an estimate of the final size of the set. It should be
+ * approximately correct, but the implementation can cope well with it being
+ * off by perhaps a factor of five or more. See "Bloom Filters in
+ * Probabilistic Verification" (Dillinger & Manolios, 2004) for details of why
+ * this is the case.
+ *
+ * bloom_work_mem is sized in KB, in line with the general work_mem convention.
+ * This determines the size of the underlying bitset (trivial bookkeeping space
+ * isn't counted). The bitset is always sized as a power of two number of
+ * bits, and the largest possible bitset is 512MB (2^32 bits). The
+ * implementation allocates only enough memory to target its standard false
+ * positive rate, using a simple formula with caller's total_elems estimate as
+ * an input. The bitset might be as small as 1MB, even when bloom_work_mem is
+ * much higher.
+ *
+ * The Bloom filter is seeded using a value provided by the caller. Using a
+ * distinct seed value on every call makes it unlikely that the same false
+ * positives will reoccur when the same set is fingerprinted a second time.
+ * Callers that don't care about this pass a constant as their seed, typically
+ * 0. Callers can use a pseudo-random seed in the range of 0 - INT_MAX by
+ * calling random().
+ */
+bloom_filter *
+bloom_create(int64 total_elems, int bloom_work_mem, uint64 seed)
+{
+ bloom_filter *filter;
+ int bloom_power;
+ uint64 bitset_bytes;
+ uint64 bitset_bits;
+
+ /*
+ * Aim for two bytes per element; this is sufficient to get a false
+ * positive rate below 1%, independent of the size of the bitset or total
+ * number of elements. Also, if rounding down the size of the bitset to
+ * the next lowest power of two turns out to be a significant drop, the
+ * false positive rate still won't exceed 2% in almost all cases.
+ */
+ bitset_bytes = Min(bloom_work_mem * UINT64CONST(1024), total_elems * 2);
+ bitset_bytes = Max(1024 * 1024, bitset_bytes);
+
+ /*
+ * Size in bits should be the highest power of two <= target. bitset_bits
+ * is uint64 because PG_UINT32_MAX is 2^32 - 1, not 2^32
+ */
+ bloom_power = my_bloom_power(bitset_bytes * BITS_PER_BYTE);
+ bitset_bits = UINT64CONST(1) << bloom_power;
+ bitset_bytes = bitset_bits / BITS_PER_BYTE;
+
+ /* Allocate bloom filter with unset bitset */
+ filter = palloc0(offsetof(bloom_filter, bitset) +
+ sizeof(unsigned char) * bitset_bytes);
+ filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+ filter->seed = seed;
+ filter->m = bitset_bits;
+
+ return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+ pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+ }
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element(). Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+ uint32 hashes[MAX_HASH_FUNCS];
+ int i;
+
+ k_hashes(filter, hashes, elem, len);
+
+ /* Map a bit-wise address to a byte-wise address + bit offset */
+ for (i = 0; i < filter->k_hash_funcs; i++)
+ {
+ if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size. That should
+ * generally be close to 0.5, even when we have more than enough memory to
+ * ensure a false positive rate within target 1% to 2% band, since more hash
+ * functions are used as more memory is available per element.
+ *
+ * This is the only instrumentation that is low overhead enough to appear in
+ * debug traces. When debugging Bloom filter code, it's likely to be far more
+ * interesting to directly test the false positive rate.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+ int bitset_bytes = filter->m / BITS_PER_BYTE;
+ uint64 bits_set = 0;
+ int i;
+
+ for (i = 0; i < bitset_bytes; i++)
+ {
+ unsigned char byte = filter->bitset[i];
+
+ while (byte)
+ {
+ bits_set++;
+ byte &= (byte - 1);
+ }
+ }
+
+ return bits_set / (double) filter->m;
+}
+
+/*
+ * Which element in the sequence of powers of two is less than or equal to
+ * target_bitset_bits?
+ *
+ * Value returned here must be generally safe as the basis for actual bitset
+ * size.
+ *
+ * Bitset is never allowed to exceed 2 ^ 32 bits (512MB). This is sufficient
+ * for the needs of all current callers, and allows us to use 32-bit hash
+ * functions. It also makes it easy to stay under the MaxAllocSize restriction
+ * (caller needs to leave room for non-bitset fields that appear before
+ * flexible array member, so a 1GB bitset would use an allocation that just
+ * exceeds MaxAllocSize).
+ */
+static int
+my_bloom_power(uint64 target_bitset_bits)
+{
+ int bloom_power = -1;
+
+ while (target_bitset_bits > 0 && bloom_power < 32)
+ {
+ bloom_power++;
+ target_bitset_bits >>= 1;
+ }
+
+ return bloom_power;
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements. The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(uint64 bitset_bits, int64 total_elems)
+{
+ int k = round(log(2.0) * bitset_bits / total_elems);
+
+ return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
+ * used to make this work. The main reason we prefer enhanced double hashing
+ * to classic double hashing is that the latter has an issue with collisions
+ * when using power of two sized bitsets. See Dillinger & Manolios for full
+ * details.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+ uint64 hash;
+ uint32 x, y;
+ uint64 m;
+ int i;
+
+ /* Use 64-bit hashing to get two independent 32-bit hashes */
+ hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
+ x = (uint32) hash;
+ y = (uint32) (hash >> 32);
+ m = filter->m;
+
+ x = mod_m(x, m);
+ y = mod_m(y, m);
+
+ /* Accumulate hashes */
+ hashes[0] = x;
+ for (i = 1; i < filter->k_hash_funcs; i++)
+ {
+ x = mod_m(x + y, m);
+ y = mod_m(y + i, m);
+
+ hashes[i] = x;
+ }
+}
+
+/*
+ * Calculate "val MOD m" inexpensively.
+ *
+ * Assumes that m (which is bitset size) is a power of two.
+ *
+ * Using a power of two number of bits for bitset size allows us to use bitwise
+ * AND operations to calculate the modulo of a hash value. It's also a simple
+ * way of avoiding the modulo bias effect.
+ */
+static inline uint32
+mod_m(uint32 val, uint64 m)
+{
+ Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
+ Assert(((m - 1) & m) == 0);
+
+ return val & (m - 1);
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000000..6cbdd9bfd9
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ * Space-efficient set membership testing
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLOOMFILTER_H
+#define BLOOMFILTER_H
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_create(int64 total_elems, int bloom_work_mem,
+ uint64 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+ size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif /* BLOOMFILTER_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7294b6958b..a9b8377acf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
commit_ts \
dummy_seclabel \
snapshot_too_old \
+ test_bloomfilter \
test_ddl_deparse \
test_extensions \
test_parser \
diff --git a/src/test/modules/test_bloomfilter/.gitignore b/src/test/modules/test_bloomfilter/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_bloomfilter/Makefile b/src/test/modules/test_bloomfilter/Makefile
new file mode 100644
index 0000000000..808c9314d4
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bloomfilter/Makefile
+
+MODULE_big = test_bloomfilter
+OBJS = test_bloomfilter.o $(WIN32RES)
+PGFILEDESC = "test_bloomfilter - test code for Bloom filter library"
+
+EXTENSION = test_bloomfilter
+DATA = test_bloomfilter--1.0.sql
+
+REGRESS = test_bloomfilter
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bloomfilter
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_bloomfilter/README b/src/test/modules/test_bloomfilter/README
new file mode 100644
index 0000000000..4c05efe5a8
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/README
@@ -0,0 +1,68 @@
+test_bloomfilter overview
+=========================
+
+test_bloomfilter is a test harness module for testing Bloom filter library set
+membership operations. It consists of a single SQL-callable function,
+test_bloomfilter(), plus a regression test that calls test_bloomfilter().
+Membership tests are performed against a dataset that the test harness module
+generates.
+
+The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
+(WARNING when the false positive rate exceeds a 1% threshold). This can be
+used to get a sense of the performance characteristics of the Postgres Bloom
+filter implementation under varied conditions.
+
+Bitset size
+-----------
+
+The main bloomfilter.c criteria for sizing its bitset is that the false
+positive rate should not exceed 2% when sufficient bloom_work_mem is available
+(and the caller-supplied estimate of the number of elements turns out to have
+been accurate). A 1% - 2% rate is currently assumed to be suitable for all
+Bloom filter callers.
+
+With an optimal K (number of hash functions), Bloom filters should only have a
+1% false positive rate with just 9.6 bits of memory per element. The Postgres
+implementation's 2% worst case guarantee exists because there is a need for
+some slop due to implementation inflexibility in bitset sizing. Since the
+bitset size is always actually kept to a power of two number of bits, callers
+can have their bloom_work_mem argument truncated down by almost half.
+In practice, callers that make a point of passing a bloom_work_mem that is an
+exact power of two bitset size (such as test_bloomfilter.c) will actually get
+the "9.6 bits per element" 1% false positive rate.
+
+Testing strategy
+----------------
+
+Our approach to regression testing is to test that a Bloom filter has only a 1%
+false positive rate for a single bitset size (2 ^ 23, or 1MB). We test a
+dataset with 838,861 elements, which works out at 10 bits of memory per
+element. We round up from 9.6 bits to 10 bits to make sure that we reliably
+get under 1% for regression testing. Note that a random seed is used in the
+regression tests because the exact false positive rate is inconsistent across
+platforms. Inconsistent hash function behavior is something that the
+regression tests need to be tolerant of anyway.
+
+test_bloomfilter() SQL-callable function
+========================================
+
+The SQL-callable function test_bloomfilter() provides the following arguments:
+
+* "power" is the power of two used to size the Bloom filter's bitset.
+
+The minimum valid argument value is 23 (2^23 bits), or 1MB of memory. The
+maximum valid argument value is 32, or 512MB of memory.
+
+* "nelements" is the number of elements to generate for testing purposes.
+
+* "seed" is a seed value for hashing.
+
+A value < 0 is interpreted as "use random seed". Varying the seed value (or
+specifying -1) should result in small variations in the total number of false
+positives.
+
+* "tests" is the number of tests to run.
+
+This may be increased when it's useful to perform many tests in an interactive
+session. It only makes sense to perform multiple tests when a random seed is
+used.
diff --git a/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
new file mode 100644
index 0000000000..21c068867d
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
@@ -0,0 +1,22 @@
+CREATE EXTENSION test_bloomfilter;
+-- See README for explanation of arguments:
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+ test_bloomfilter
+------------------
+
+(1 row)
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
new file mode 100644
index 0000000000..9ec159ce40
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
@@ -0,0 +1,19 @@
+CREATE EXTENSION test_bloomfilter;
+
+-- See README for explanation of arguments:
+SELECT test_bloomfilter(power => 23,
+ nelements => 838861,
+ seed => -1,
+ tests => 1);
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
new file mode 100644
index 0000000000..7682318fe3
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
@@ -0,0 +1,11 @@
+/* src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bloomfilter" to load this file. \quit
+
+CREATE FUNCTION test_bloomfilter(power integer,
+ nelements bigint,
+ seed integer DEFAULT -1,
+ tests integer DEFAULT 1)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.c b/src/test/modules/test_bloomfilter/test_bloomfilter.c
new file mode 100644
index 0000000000..1691b0fb30
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.c
@@ -0,0 +1,138 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_bloomfilter.c
+ * Test false positive rate of Bloom filter.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_bloomfilter/test_bloomfilter.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+/* Must fit decimal representation of PG_INT64_MAX + 2 bytes: */
+#define MAX_ELEMENT_BYTES 20
+/* False positive rate WARNING threshold (1%): */
+#define FPOSITIVE_THRESHOLD 0.01
+
+
+/*
+ * Populate an empty Bloom filter with "nelements" dummy strings.
+ */
+static void
+populate_with_dummy_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "i" INT64_FORMAT, i);
+ bloom_add_element(filter, (unsigned char *) element, strlen(element));
+ }
+}
+
+/*
+ * Returns number of strings that are indicated as probably appearing in Bloom
+ * filter that were in fact never added by populate_with_dummy_strings().
+ * These are false positives.
+ */
+static int64
+nfalsepos_for_missing_strings(bloom_filter *filter, int64 nelements)
+{
+ char element[MAX_ELEMENT_BYTES];
+ int64 nfalsepos = 0;
+ int64 i;
+
+ for (i = 0; i < nelements; i++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ snprintf(element, sizeof(element), "M" INT64_FORMAT, i);
+ if (!bloom_lacks_element(filter, (unsigned char *) element,
+ strlen(element)))
+ nfalsepos++;
+ }
+
+ return nfalsepos;
+}
+
+static void
+create_and_test_bloom(int power, int64 nelements, int callerseed)
+{
+ int bloom_work_mem;
+ uint64 seed;
+ int64 nfalsepos;
+ bloom_filter *filter;
+
+ bloom_work_mem = (UINT64CONST(1) << power) / 8 / 1024;
+
+ elog(DEBUG1, "bloom_work_mem (KB): %d", bloom_work_mem);
+
+ /*
+ * Generate random seed, or use caller's. Seed should always be a
+ * positive value less than or equal to PG_INT32_MAX, to ensure that any
+ * random seed can be recreated through callerseed if the need arises.
+ * (Don't assume that RAND_MAX cannot exceed PG_INT32_MAX.)
+ */
+ seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
+
+ /* Create Bloom filter, populate it, and report on false positive rate */
+ filter = bloom_create(nelements, bloom_work_mem, seed);
+ populate_with_dummy_strings(filter, nelements);
+ nfalsepos = nfalsepos_for_missing_strings(filter, nelements);
+
+ ereport((nfalsepos > nelements * FPOSITIVE_THRESHOLD) ? WARNING : DEBUG1,
+ (errmsg_internal("seed: " UINT64_FORMAT " false positives: " INT64_FORMAT " (%.6f%%) bitset %.2f%% set",
+ seed, nfalsepos, 100.0 * nfalsepos / nelements,
+ 100.0 * bloom_prop_bits_set(filter))));
+
+ bloom_free(filter);
+}
+
+PG_FUNCTION_INFO_V1(test_bloomfilter);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ *
+ * If a 1% false positive threshold is not met, emits WARNINGs.
+ *
+ * See README for details of arguments.
+ */
+Datum
+test_bloomfilter(PG_FUNCTION_ARGS)
+{
+ int power = PG_GETARG_INT32(0);
+ int64 nelements = PG_GETARG_INT64(1);
+ int seed = PG_GETARG_INT32(2);
+ int tests = PG_GETARG_INT32(3);
+ int i;
+
+ if (power < 23 || power > 32)
+ elog(ERROR, "power argument must be between 23 and 32 inclusive");
+
+ if (tests <= 0)
+ elog(ERROR, "invalid number of tests: %d", tests);
+
+ if (nelements < 0)
+ elog(ERROR, "invalid number of elements: %d", tests);
+
+ for (i = 0; i < tests; i++)
+ {
+ elog(DEBUG1, "beginning test #%d...", i + 1);
+
+ create_and_test_bloom(power, nelements, seed);
+ }
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_bloomfilter/test_bloomfilter.control b/src/test/modules/test_bloomfilter/test_bloomfilter.control
new file mode 100644
index 0000000000..99e56eebdf
--- /dev/null
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.control
@@ -0,0 +1,4 @@
+comment = 'Test code for Bloom filter library'
+default_version = '1.0'
+module_pathname = '$libdir/test_bloomfilter'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 17bf55c1f5..abc10a8ffd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2590,6 +2590,7 @@ bitmapword
bits16
bits32
bits8
+bloom_filter
bool
brin_column_state
bytea
--
2.14.1
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
WFM. I have all the information I need to produce the next revision now.
I might as well post this one first. I'll have 0002 for you in a short while.
Attached is 0002 -- the amcheck enhancement itself. As requested by
Andres, this adds a new overloaded set of functions, rather than
dropping and recreating functions to change their signature.
I'm pretty sure that using this approach to avoid dependency issues is
unprecedented, so I had to use my own judgement on how to deal with a
couple of things. I decided not to create a new C symbol for the new
function versions, preferring to leave it to the existing PG_NARGS()
tests. I guess this was probably what you intended me to do, based on
your "Given the PG_NARGS() checks..." remark. I also chose not to
document the single-argument functions in the docs. I
suppose that we should consider these to be implementation details of
a work-around for dependency breakage, something that doesn't need to
be documented. That's a bit like how we don't document functions
within certain extensions that are designed just to get called within
a view definition. I don't feel strongly about it, though.
No other changes to report. I did mention that this would have a few
small changes yesterday; no need to repeat the details now.
Thanks
--
Peter Geoghegan
Attachments:
v8-0002-Add-amcheck-verification-of-heap-relations.patch (text/x-patch)
From 9e2c1469013374117209c9c0c00e12ecf10ab7b9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH v8 2/2] Add amcheck verification of heap relations.
Add a new, optional capability to bt_index_check() and
bt_index_parent_check(): check that each heap tuple that should have an
index entry does in fact have one. The extra checking is performed at
the end of the existing nbtree checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
Reusing the CREATE INDEX infrastructure allows the new verification
option to automatically benefit from the heap consistency checks that
CREATE INDEX already performs. CREATE INDEX does thorough sanity
checking of HOT chains, so the new check actually manages to detect
problems in heap-only tuples.
Author: Peter Geoghegan
Reviewed-By: Pavan Deolasee, Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g@mail.gmail.com
---
contrib/amcheck/Makefile | 2 +-
contrib/amcheck/amcheck--1.0--1.1.sql | 29 +++
contrib/amcheck/amcheck.control | 2 +-
contrib/amcheck/expected/check_btree.out | 12 +-
contrib/amcheck/sql/check_btree.sql | 7 +-
contrib/amcheck/verify_nbtree.c | 343 ++++++++++++++++++++++++++++---
doc/src/sgml/amcheck.sgml | 126 +++++++++---
7 files changed, 458 insertions(+), 63 deletions(-)
create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
OBJS = verify_nbtree.o $(WIN32RES)
EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..088416e
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,29 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.1,
+-- create new, overloaded versions of the 1.0 functions
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass,
+ heapallindexed boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+ heapallindexed boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..6f5b917 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -18,6 +18,8 @@ RESET ROLE;
-- above explicit permission has to be granted for that.
GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
bt_index_check
@@ -56,8 +58,14 @@ SELECT bt_index_check('bttest_a_idx');
(1 row)
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check
+----------------
+
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
bt_index_parent_check
-----------------------
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..03f4c96 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -21,6 +21,8 @@ RESET ROLE;
-- above explicit permission has to be granted for that.
GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
SET ROLE bttest_role;
SELECT bt_index_check('bttest_a_idx');
SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +44,9 @@ ROLLBACK;
-- normal check outside of xact
SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
BEGIN;
SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index da518da..a15fe21 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
* (the insertion scankey sort-wise NULL semantics are needed for
* verification).
*
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later uses Bloom filter probes to verify
+ * that every visible heap tuple has a matching index tuple.
+ *
*
* Copyright (c) 2017-2018, PostgreSQL Global Development Group
*
@@ -18,11 +23,14 @@
*/
#include "postgres.h"
+#include "access/htup_details.h"
#include "access/nbtree.h"
#include "access/transam.h"
+#include "access/xact.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
#include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/memutils.h"
@@ -43,9 +51,10 @@ PG_MODULE_MAGIC;
* target is the point of reference for a verification operation.
*
* Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages). Conceptually, problems are only
- * ever found in the current target page. Each page found by verification's
- * left/right, top/bottom scan becomes the target exactly once.
+ * they are current target's child pages). Conceptually, problems are only
+ * ever found in the current target page (or for a particular heap tuple during
+ * heapallindexed verification). Each page found by verification's left/right,
+ * top/bottom scan becomes the target exactly once.
*/
typedef struct BtreeCheckState
{
@@ -53,10 +62,13 @@ typedef struct BtreeCheckState
* Unchanging state, established at start of verification:
*/
- /* B-Tree Index Relation */
+ /* B-Tree Index Relation and associated heap relation */
Relation rel;
+ Relation heaprel;
/* ShareLock held on heap/index, rather than AccessShareLock? */
bool readonly;
+ /* Also verifying heap has no unindexed tuples? */
+ bool heapallindexed;
/* Per-page context */
MemoryContext targetcontext;
/* Buffer access strategy */
@@ -72,6 +84,15 @@ typedef struct BtreeCheckState
BlockNumber targetblock;
/* Target page's LSN */
XLogRecPtr targetlsn;
+
+ /*
+ * Mutable state, for optional heapallindexed verification:
+ */
+
+ /* Bloom filter fingerprints B-Tree index */
+ bloom_filter *filter;
+ /* Debug counter */
+ int64 heaptuplespresent;
} BtreeCheckState;
/*
@@ -92,15 +113,20 @@ typedef struct BtreeLevel
PG_FUNCTION_INFO_V1(bt_index_check);
PG_FUNCTION_INFO_V1(bt_index_parent_check);
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+ bool heapallindexed);
static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+ bool readonly, bool heapallindexed);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
static void bt_target_page_check(BtreeCheckState *state);
static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+ Datum *values, bool *isnull,
+ bool tupleIsAlive, void *checkstate);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +142,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
/*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires AccessShareLock on heap & index relations. Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, false);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, false, heapallindexed);
PG_RETURN_VOID();
}
/*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
*
* Verify integrity of B-Tree index.
*
* Acquires ShareLock on heap & index relations. Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages. Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
*/
Datum
bt_index_parent_check(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
+ bool heapallindexed = false;
- bt_index_check_internal(indrelid, true);
+ if (PG_NARGS() == 2)
+ heapallindexed = PG_GETARG_BOOL(1);
+
+ bt_index_check_internal(indrelid, true, heapallindexed);
PG_RETURN_VOID();
}
@@ -155,7 +191,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
* Helper for bt_index_[parent_]check, coordinating the bulk of the work.
*/
static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
{
Oid heapid;
Relation indrel;
@@ -185,15 +221,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
* Open the target index relations separately (like relation_openrv(), but
* with heap relation locked first to prevent deadlocking). In hot
* standby mode this will raise an error when parentcheck is true.
+ *
+ * There is no need for the usual indcheckxmin usability horizon test here,
+ * even in the heapallindexed case, because index undergoing verification
+ * only needs to have entries for a new transaction snapshot. (If this is
+ * a parentcheck verification, there is no question about committed or
+ * recently dead heap tuples lacking index entries due to concurrent
+ * activity.)
*/
indrel = index_open(indrelid, lockmode);
/*
* Since we did the IndexGetRelation call above without any lock, it's
* barely possible that a race against an index drop/recreation could have
- * netted us the wrong table. Although the table itself won't actually be
- * examined during verification currently, a recheck still seems like a
- * good idea.
+ * netted us the wrong table.
*/
if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
ereport(ERROR,
@@ -204,8 +245,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
/* Relation suitable for checking as B-Tree? */
btree_index_checkable(indrel);
- /* Check index */
- bt_check_every_level(indrel, parentcheck);
+ /* Check index, possibly against table it is an index on */
+ bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
/*
* Release locks early. That's ok here because nothing in the called
@@ -253,11 +294,14 @@ btree_index_checkable(Relation rel)
/*
* Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes. Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
*
* It is the caller's responsibility to acquire appropriate heavyweight lock on
* the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held. (A lock of the same type must also have been acquired on the heap
+ * relation.)
*
* A ShareLock is generally assumed to prevent any kind of physical
* modification to the index structure, including modifications that VACUUM may
@@ -272,13 +316,15 @@ btree_index_checkable(Relation rel)
* parent/child check cannot be affected.)
*/
static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+ bool heapallindexed)
{
BtreeCheckState *state;
Page metapage;
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
+ Snapshot snapshot = SnapshotAny;
/*
* RecentGlobalXmin assertion matches index_getnext_tid(). See note on
@@ -291,7 +337,57 @@ bt_check_every_level(Relation rel, bool readonly)
*/
state = palloc(sizeof(BtreeCheckState));
state->rel = rel;
+ state->heaprel = heaprel;
state->readonly = readonly;
+ state->heapallindexed = heapallindexed;
+
+ if (state->heapallindexed)
+ {
+ int64 total_elems;
+ uint64 seed;
+
+ /* Size Bloom filter based on estimated number of tuples in index */
+ total_elems = (int64) state->rel->rd_rel->reltuples;
+ /* Random seed relies on backend srandom() call to avoid repetition */
+ seed = random();
+ /* Create Bloom filter to fingerprint index */
+ state->filter = bloom_create(total_elems, maintenance_work_mem, seed);
+ state->heaptuplespresent = 0;
+
+ /*
+ * Register our own snapshot in !readonly case, rather than asking
+ * IndexBuildHeapScan() to do this for us later. This needs to happen
+ * before index fingerprinting begins, so we can later be certain that
+ * index fingerprinting should have reached all tuples returned by
+ * IndexBuildHeapScan().
+ */
+ if (!state->readonly)
+ {
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all
+ * the entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact snapshot
+ * was returned at higher isolation levels when that snapshot is
+ * not safe for index scans of the target index. This is possible
+ * when the snapshot sees tuples that are before the index's
+ * indcheckxmin horizon. Throwing an error here should be very
+ * rare. It doesn't seem worth using a secondary snapshot to avoid
+ * this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ snapshot->xmin))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel))));
+ }
+ }
+
/* Create context for page */
state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
"amcheck context",
@@ -345,6 +441,69 @@ bt_check_every_level(Relation rel, bool readonly)
previouslevel = current.level;
}
+ /*
+ * * Check whether heap contains unindexed/malformed tuples *
+ */
+ if (state->heapallindexed)
+ {
+ IndexInfo *indexinfo = BuildIndexInfo(state->rel);
+ HeapScanDesc scan;
+
+ /*
+ * Create our own scan for IndexBuildHeapScan(), rather than getting it
+ * to do so for us. This is required so that we can actually use the
+ * MVCC snapshot registered earlier in !readonly case.
+ *
+ * Note that IndexBuildHeapScan() calls heap_endscan() for us.
+ */
+ scan = heap_beginscan_strat(state->heaprel, /* relation */
+ snapshot, /* snapshot */
+ 0, /* number of keys */
+ NULL, /* scan key */
+ true, /* buffer access strategy OK */
+ true); /* syncscan OK? */
+
+ /*
+ * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
+ * behaves in !readonly case.
+ *
+ * It's okay that we don't actually use the same lock strength for the
+ * heap relation as any other ii_Concurrent caller would in !readonly
+ * case. We have no reason to care about a concurrent VACUUM
+ * operation, since there isn't going to be a second scan of the heap
+ * that needs to be sure that there was no concurrent recycling of
+ * TIDs.
+ */
+ indexinfo->ii_Concurrent = !state->readonly;
+
+ /*
+ * Don't wait for uncommitted tuple xact commit/abort when index is a
+ * unique index on a catalog (or an index used by an exclusion
+ * constraint). This could otherwise happen in the readonly case.
+ */
+ indexinfo->ii_Unique = false;
+ indexinfo->ii_ExclusionOps = NULL;
+ indexinfo->ii_ExclusionProcs = NULL;
+ indexinfo->ii_ExclusionStrats = NULL;
+
+ elog(DEBUG1, "verifying that tuples from index \"%s\" are present in \"%s\"",
+ RelationGetRelationName(state->rel),
+ RelationGetRelationName(state->heaprel));
+
+ IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+ bt_tuple_present_callback, (void *) state, scan);
+
+ ereport(DEBUG1,
+ (errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples from table \"%s\" with bitset %.2f%% set",
+ state->heaptuplespresent, RelationGetRelationName(heaprel),
+ 100.0 * bloom_prop_bits_set(state->filter))));
+
+ if (snapshot != SnapshotAny)
+ UnregisterSnapshot(snapshot);
+
+ bloom_free(state->filter);
+ }
+
/* Be tidy: */
MemoryContextDelete(state->targetcontext);
}
@@ -497,7 +656,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
- /* Verify invariants for page -- all important checks occur here */
+ /* Verify invariants for page */
bt_target_page_check(state);
nextpage:
@@ -544,6 +703,9 @@ nextpage:
*
* - That all child pages respect downlinks lower bound.
*
+ * This is also where heapallindexed callers use their Bloom filter to
+ * fingerprint IndexTuples.
+ *
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
@@ -572,21 +734,46 @@ bt_target_page_check(BtreeCheckState *state)
ItemId itemid;
IndexTuple itup;
ScanKey skey;
+ size_t tupsize;
CHECK_FOR_INTERRUPTS();
+ itemid = PageGetItemId(state->target, offset);
+ itup = (IndexTuple) PageGetItem(state->target, itemid);
+ tupsize = IndexTupleSize(itup);
+
+ /*
+ * lp_len should match the IndexTuple reported length exactly, since
+ * lp_len is completely redundant in indexes, and both sources of tuple
+ * length are MAXALIGN()'d. nbtree does not use lp_len all that
+ * frequently, and is surprisingly tolerant of corrupt lp_len fields.
+ */
+ if (tupsize != ItemIdGetLength(itemid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index tuple size does not equal lp_len in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=(%u,%u) tuple size=%zu lp_len=%u page lsn=%X/%X.",
+ state->targetblock, offset,
+ tupsize, ItemIdGetLength(itemid),
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn),
+ errhint("This could be a torn page problem")));
+
/*
* Don't try to generate scankey using "negative infinity" garbage
- * data
+ * data on internal pages
*/
if (offset_is_negative_infinity(topaque, offset))
continue;
/* Build insertion scankey for current page offset */
- itemid = PageGetItemId(state->target, offset);
- itup = (IndexTuple) PageGetItem(state->target, itemid);
skey = _bt_mkscankey(state->rel, itup);
+ /* Fingerprint leaf page tuples (those that point to the heap) */
+ if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+ bloom_add_element(state->filter, (unsigned char *) itup, tupsize);
+
/*
* * High key check *
*
@@ -680,8 +867,10 @@ bt_target_page_check(BtreeCheckState *state)
* * Last item check *
*
* Check last item against next/right page's first data item's when
- * last item on page is reached. This additional check can detect
- * transposed pages.
+ * last item on page is reached. This additional check will detect
+ * transposed pages iff the supposed right sibling page happens to
+ * belong before target in the key space. (Otherwise, a subsequent
+ * heap verification will probably detect the problem.)
*
* This check is similar to the item order check that will have
* already been performed for every other "real" item on target page
@@ -1060,6 +1249,106 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
}
/*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all the entries that definitely should have been observed in leaf pages of
+ * the target index (that is, all IndexTuples that were fingerprinted by our
+ * Bloom filter). All heapallindexed checks occur here.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption, especially corruption within the table.
+ * The high level principle behind the verification performed here is that any
+ * IndexTuple that should be in an index following a fresh CREATE INDEX (based
+ * on the same index definition) should also have been in the original,
+ * existing index, which should have used exactly the same representation.
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for error here is a corrupt heap page (could be logical
+ * or physical corruption). Index corruption may still be detected here,
+ * though. Only readonly callers will have verified that left links and right
+ * links are in agreement, and so it's possible that a leaf page transposition
+ * within index is actually the source of corruption detected here (for
+ * !readonly callers). The checks performed only for readonly callers might
+ * more accurately frame the problem as a cross-page invariant issue (this
+ * could even be due to recovery not replaying all WAL records). The !readonly
+ * ERROR message raised here includes a HINT about retrying with readonly
+ * verification, just in case it's a cross-page invariant issue, though that
+ * isn't particularly likely.
+ *
+ * IndexBuildHeapScan() expects to be able to find the root tuple when a
+ * heap-only tuple (the live tuple at the end of some HOT chain) needs to be
+ * indexed, in order to replace the actual tuple's TID with the root tuple's
+ * TID (which is what we're actually passed back here). The index build heap
+ * scan code will raise an error when a tuple that claims to be the root of the
+ * heap-only tuple's HOT chain cannot be located. This catches cases where the
+ * original root item offset/root tuple for a HOT chain indicates (for whatever
+ * reason) that the entire HOT chain is dead, despite the fact that the latest
+ * heap-only tuple should be indexed. When this happens, sequential scans may
+ * always give correct answers, and all indexes may be considered structurally
+ * consistent (i.e. the nbtree structural checks would not detect corruption).
+ * It may be the case that only index scans give wrong answers, and yet heap or
+ * SLRU corruption is the real culprit. (While it's true that LP_DEAD bit
+ * setting will probably also leave the index in a corrupt state before too
+ * long, the problem is nonetheless that there is heap corruption.)
+ *
+ * Heap-only tuple handling within IndexBuildHeapScan() works in a way that
+ * helps us to detect index tuples that contain the wrong values (values that
+ * don't match the latest tuple in the HOT chain). This can happen when there
+ * is no superseding index tuple due to a faulty assessment of HOT safety,
+ * perhaps during the original CREATE INDEX. Because the latest tuple's
+ * contents are used with the root TID, an error will be raised when a tuple
+ * with the same TID but non-matching attribute values is passed back to us.
+ * Faulty assessment of HOT-safety was behind at least two distinct CREATE
+ * INDEX CONCURRENTLY bugs that made it into stable releases, one of which was
+ * undetected for many years. In short, the same principle that allows a
+ * REINDEX to repair corruption when there was an (undetected) broken HOT chain
+ * also allows us to detect the corruption in many cases.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+ bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+ BtreeCheckState *state = (BtreeCheckState *) checkstate;
+ IndexTuple itup;
+
+ Assert(state->heapallindexed);
+
+ /*
+ * Generate an index tuple for fingerprinting.
+ *
+ * Index tuple formation is assumed to be deterministic, and IndexTuples
+ * are assumed immutable. While the LP_DEAD bit is mutable in leaf pages,
+ * that's ItemId metadata, which was not fingerprinted. (There will often
+ * be some dead-to-everyone IndexTuples fingerprinted by the Bloom filter,
+ * but we only try to detect the absence of needed tuples, so that's okay.)
+ *
+ * Note that we rely on deterministic index_form_tuple() TOAST compression.
+ * If index_form_tuple() was ever enhanced to compress datums out-of-line,
+ * or otherwise varied when or how compression was applied, our assumption
+ * would break, leading to false positive reports of corruption. For now,
+ * we don't decompress/normalize toasted values as part of fingerprinting.
+ */
+ itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+ itup->t_tid = htup->t_self;
+
+ /* Probe Bloom filter -- tuple should be present */
+ if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+ IndexTupleSize(itup)))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("heap tuple (%u,%u) from table \"%s\" lacks matching index tuple within index \"%s\"",
+ ItemPointerGetBlockNumber(&(itup->t_tid)),
+ ItemPointerGetOffsetNumber(&(itup->t_tid)),
+ RelationGetRelationName(state->heaprel),
+ RelationGetRelationName(state->rel)),
+ !state->readonly
+ ? errhint("Retrying verification using the function bt_index_parent_check() might provide a more specific error.")
+ : 0));
+
+ state->heaptuplespresent++;
+ pfree(itup);
+}
+
+/*
* Is particular offset within page (whose special state is passed by caller)
* the page negative-infinity item?
*
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 852e260..a712c86 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -9,13 +9,13 @@
<para>
The <filename>amcheck</filename> module provides functions that allow you to
- verify the logical consistency of the structure of indexes. If the
+ verify the logical consistency of the structure of relations. If the
structure appears to be valid, no error is raised.
</para>
<para>
The functions verify various <emphasis>invariants</emphasis> in the
- structure of the representation of particular indexes. The
+ structure of the representation of particular relations. The
correctness of the access method functions behind index scans and
other important operations relies on these invariants always
holding. For example, certain functions verify, among other things,
@@ -44,7 +44,7 @@
<variablelist>
<varlistentry>
<term>
- <function>bt_index_check(index regclass) returns void</function>
+ <function>bt_index_check(index regclass, heapallindexed boolean) returns void</function>
<indexterm>
<primary>bt_index_check</primary>
</indexterm>
@@ -55,7 +55,9 @@
<function>bt_index_check</function> tests that its target, a
B-Tree index, respects a variety of invariants. Example usage:
<screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
+ c.relname,
+ c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,9 +85,11 @@ ORDER BY c.relpages DESC LIMIT 10;
</screen>
This example shows a session that performs verification of every
catalog index in the database <quote>test</quote>. Details of just
- the 10 largest indexes verified are displayed. Since no error
- is raised, all indexes tested appear to be logically consistent.
- Naturally, this query could easily be changed to call
+ the 10 largest indexes verified are displayed. Verification of
+ the presence of heap tuples as index tuples is requested for
+ unique indexes only. Since no error is raised, all indexes
+ tested appear to be logically consistent. Naturally, this query
+ could easily be changed to call
<function>bt_index_check</function> for every index in the
database where verification is supported.
</para>
@@ -95,10 +99,11 @@ ORDER BY c.relpages DESC LIMIT 10;
is the same lock mode acquired on relations by simple
<literal>SELECT</literal> statements.
<function>bt_index_check</function> does not verify invariants
- that span child/parent relationships, nor does it verify that
- the target index is consistent with its heap relation. When a
- routine, lightweight test for corruption is required in a live
- production environment, using
+ that span child/parent relationships, but will verify the
+ presence of all heap tuples as index tuples within the index
+ when <parameter>heapallindexed</parameter> is
+ <literal>true</literal>. When a routine, lightweight test for
+ corruption is required in a live production environment, using
<function>bt_index_check</function> often provides the best
trade-off between thoroughness of verification and limiting the
impact on application performance and availability.
@@ -108,7 +113,7 @@ ORDER BY c.relpages DESC LIMIT 10;
<varlistentry>
<term>
- <function>bt_index_parent_check(index regclass) returns void</function>
+ <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
<indexterm>
<primary>bt_index_parent_check</primary>
</indexterm>
@@ -117,19 +122,21 @@ ORDER BY c.relpages DESC LIMIT 10;
<listitem>
<para>
<function>bt_index_parent_check</function> tests that its
- target, a B-Tree index, respects a variety of invariants. The
- checks performed by <function>bt_index_parent_check</function>
- are a superset of the checks performed by
- <function>bt_index_check</function>.
+ target, a B-Tree index, respects a variety of invariants.
+ Optionally, when the <parameter>heapallindexed</parameter>
+ argument is <literal>true</literal>, the function verifies the
+ presence of all heap tuples that should be found within the
+ index. The checks that can be performed by
+ <function>bt_index_parent_check</function> are a superset of the
+ checks that can be performed by <function>bt_index_check</function>.
<function>bt_index_parent_check</function> can be thought of as
a more thorough variant of <function>bt_index_check</function>:
unlike <function>bt_index_check</function>,
<function>bt_index_parent_check</function> also checks
- invariants that span parent/child relationships. However, it
- does not verify that the target index is consistent with its
- heap relation. <function>bt_index_parent_check</function>
- follows the general convention of raising an error if it finds a
- logical inconsistency or other problem.
+ invariants that span parent/child relationships.
+ <function>bt_index_parent_check</function> follows the general
+ convention of raising an error if it finds a logical
+ inconsistency or other problem.
</para>
<para>
A <literal>ShareLock</literal> is required on the target index by
@@ -159,6 +166,47 @@ ORDER BY c.relpages DESC LIMIT 10;
</sect2>
<sect2>
+ <title>Optional <parameter>heapallindexed</parameter> verification</title>
+ <para>
+ When the <parameter>heapallindexed</parameter> argument to
+ verification functions is <literal>true</literal>, an additional
+ phase of verification is performed against the table associated with
+ the target index relation. This consists of a <quote>dummy</quote>
+ <command>CREATE INDEX</command> operation, which checks for the
+ presence of all hypothetical new index tuples against a temporary,
+ in-memory summarizing structure (this is built when needed during
+ the basic first phase of verification). The summarizing structure
+ <quote>fingerprints</quote> every tuple found within the target
+ index. The high level principle behind
+ <parameter>heapallindexed</parameter> verification is that a new
+ index that is equivalent to the existing, target index must only
+ have entries that can be found in the existing structure.
+ </para>
+ <para>
+ The additional <parameter>heapallindexed</parameter> phase adds
+ significant overhead: verification will typically take several times
+ longer. However, there is no change to the relation-level locks
+ acquired when <parameter>heapallindexed</parameter> verification is
+ performed.
+ </para>
+ <para>
+ The summarizing structure is bound in size by
+ <varname>maintenance_work_mem</varname>. In order to ensure that
+ there is no more than a 2% probability of failure to detect an
+ inconsistency for each heap tuple that should be represented in the
+ index, approximately 2 bytes of memory are needed per tuple. As
+ less memory is made available per tuple, the probability of missing
+ an inconsistency slowly increases. This approach limits the
+ overhead of verification significantly, while only slightly reducing
+ the probability of detecting a problem, especially for installations
+ where verification is treated as a routine maintenance task. Any
+ single absent or malformed tuple has a new opportunity to be
+ detected with each new verification attempt.
+ </para>
+
+ </sect2>
+
+ <sect2>
<title>Using <filename>amcheck</filename> effectively</title>
<para>
@@ -199,16 +247,29 @@ ORDER BY c.relpages DESC LIMIT 10;
</listitem>
<listitem>
<para>
+ Structural inconsistencies between indexes and the heap relations
+ that are indexed (when <parameter>heapallindexed</parameter>
+ verification is performed).
+ </para>
+ <para>
+ There is no cross-checking of indexes against their heap relation
+ during normal operation. Symptoms of heap corruption can be subtle.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Corruption caused by hypothetical undiscovered bugs in the
- underlying <productname>PostgreSQL</productname> access method code or sort
- code.
+ underlying <productname>PostgreSQL</productname> access method
+ code, sort code, or transaction management code.
</para>
<para>
Automatic verification of the structural integrity of indexes
plays a role in the general testing of new or proposed
<productname>PostgreSQL</productname> features that could plausibly allow a
- logical inconsistency to be introduced. One obvious testing
- strategy is to call <filename>amcheck</filename> functions continuously
+ logical inconsistency to be introduced. Verification of table
+ structure and associated visibility and transaction status
+ information plays a similar role. One obvious testing strategy
+ is to call <filename>amcheck</filename> functions continuously
when running the standard regression tests. See <xref
linkend="regress-run"/> for details on running the tests.
</para>
@@ -242,6 +303,12 @@ ORDER BY c.relpages DESC LIMIT 10;
<emphasis>absolute</emphasis> protection against failures that
result in memory corruption.
</para>
+ <para>
+ When <parameter>heapallindexed</parameter> verification is
+ performed, there is generally a greatly increased chance of
+ detecting single-bit errors, since strict binary equality is
+ tested, and the indexed attributes within the heap are tested.
+ </para>
</listitem>
</itemizedlist>
In general, <filename>amcheck</filename> can only prove the presence of
@@ -253,11 +320,10 @@ ORDER BY c.relpages DESC LIMIT 10;
<title>Repairing corruption</title>
<para>
No error concerning corruption raised by <filename>amcheck</filename> should
- ever be a false positive. In practice, <filename>amcheck</filename> is more
- likely to find software bugs than problems with hardware.
- <filename>amcheck</filename> raises errors in the event of conditions that,
- by definition, should never happen, and so careful analysis of
- <filename>amcheck</filename> errors is often required.
+ ever be a false positive. <filename>amcheck</filename> raises
+ errors in the event of conditions that, by definition, should never
+ happen, and so careful analysis of <filename>amcheck</filename>
+ errors is often required.
</para>
<para>
There is no general method of repairing problems that
--
2.7.4
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
WFM. I have all the information I need to produce the next revision now.
I might as well post this one first. I'll have 0002 for you in a short while.
Looks like thrips doesn't like this, though other Windows buildfarm
animals are okay with it.
round() is from C99, apparently. I'll investigate a fix.
--
Peter Geoghegan
On 2018-03-31 19:43:45 -0700, Peter Geoghegan wrote:
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
WFM. I have all the information I need to produce the next revision now.
I might as well post this one first. I'll have 0002 for you in a short while.
Looks like thrips doesn't like this, though other Windows buildfarm
animals are okay with it.
round() is from C99, apparently. I'll investigate a fix.
Just replacing it with a floor(val + 0.5) ought to do the trick, right?
Greetings,
Andres Freund
On Sat, Mar 31, 2018 at 8:08 PM, Andres Freund <andres@anarazel.de> wrote:
round() is from C99, apparently. I'll investigate a fix.
Just replacing it with a floor(val + 0.5) ought to do the trick, right?
I was thinking of using rint(), which is what you get if you call
round(float8) from SQL.
--
Peter Geoghegan
On Sat, Mar 31, 2018 at 8:09 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I was thinking of using rint(), which is what you get if you call
round(float8) from SQL.
Attached patch does it that way. Note that there are float/int cast
regression tests that ensure that rint() behaves consistently on
supported platforms -- see commit 06bf0dd6.
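For what it's worth, under the default round-to-nearest mode, rint() and the
floor(val + 0.5) idiom agree for the positive, non-halfway values that
optimal_k() actually sees. A standalone sketch, purely illustrative and not
part of the patch:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* the same quantity optimal_k() rounds: log(2.0) * bitset_bits / total_elems */
    double      bits_per_element = 9.6;
    double      k = log(2.0) * bits_per_element;

    /* both print 7 here */
    printf("rint: %.0f  floor(x + 0.5): %.0f\n", rint(k), floor(k + 0.5));
    return 0;
}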
--
Peter Geoghegan
Attachments:
0001-Fix-non-portable-call-to-round.patch (text/x-patch)
From ff0bd32d33ceb6b5650e28d76ee794961862a4fc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 31 Mar 2018 19:54:48 -0700
Subject: [PATCH] Fix non-portable call to round().
round() is from C99. Apparently, not all supported versions of Visual
Studio make it available. Use rint() instead. There are behavioral
differences between round() and rint(), but they should not matter to
the Bloom filter optimal_k() function. We already assume POSIX behavior
for rint(), so there is no question of rint() not using "rounds towards
nearest" as its rounding mode.
Cleanup from commit 51bc271790eb234a1ba4d14d3e6530f70de92ab5.
Per buildfarm member thrips.
Author: Peter Geoghegan
---
src/backend/lib/bloomfilter.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
index eb08f4a..3565480 100644
--- a/src/backend/lib/bloomfilter.c
+++ b/src/backend/lib/bloomfilter.c
@@ -240,7 +240,7 @@ my_bloom_power(uint64 target_bitset_bits)
static int
optimal_k(uint64 bitset_bits, int64 total_elems)
{
- int k = round(log(2.0) * bitset_bits / total_elems);
+ int k = rint(log(2.0) * bitset_bits / total_elems);
return Max(1, Min(k, MAX_HASH_FUNCS));
}
--
2.7.4
On 2018-03-31 20:25:24 -0700, Peter Geoghegan wrote:
On Sat, Mar 31, 2018 at 8:09 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I was thinking of using rint(), which is what you get if you call
round(float8) from SQL.
Attached patch does it that way. Note that there are float/int cast
regression tests that ensure that rint() behaves consistently on
supported platforms -- see commit 06bf0dd6.
LGTM, pushed. Closing CF entry. Yay! Only 110 to go.
- Andres