pgsql: Bloom index contrib module

Started by Teodor Sigaev, about 10 years ago, 14 messages (pgsql-hackers)
#1 Teodor Sigaev
teodor@sigaev.ru

Bloom index contrib module

The module provides a new access method: a simple Bloom filter
implemented as a PostgreSQL index. It can give some benefit for searches
over a large number of columns.

The module is also, so far, the only way to test the generic WAL interface committed earlier.
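As background for readers new to the data structure, here is a small illustrative Python sketch (purely hypothetical, not the module's C implementation) of the property that makes a Bloom index lossy: lookups can report false positives, so matching heap tuples must be rechecked, but a value that was inserted is never reported absent.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, value):
        # Derive k bit positions from k independent hashes of the value.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # True means "possibly present" (a recheck is needed);
        # False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

bf = BloomFilter()
for word in ("apple", "banana", "cherry"):
    bf.add(word)

assert bf.might_contain("apple")   # no false negatives, ever
# A value never added is usually reported absent, but a false
# positive is possible; that is why an index scan must recheck.
print(bf.might_contain("durian"))
```

With 1024 bits and only three entries, the false-positive rate here is negligible; a real index has to tune the signature size against the expected number of distinct values.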

Author: Teodor Sigaev, Alexander Korotkov
Reviewers: Aleksander Alekseev, Michael Paquier, Jim Nasby

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/9ee014fc899a28a198492b074e32b60ed8915ea9

Modified Files
--------------
contrib/Makefile | 1 +
contrib/bloom/.gitignore | 4 +
contrib/bloom/Makefile | 24 ++
contrib/bloom/blcost.c | 48 ++++
contrib/bloom/blinsert.c | 313 ++++++++++++++++++++++++++
contrib/bloom/bloom--1.0.sql | 19 ++
contrib/bloom/bloom.control | 5 +
contrib/bloom/bloom.h | 178 +++++++++++++++
contrib/bloom/blscan.c | 175 +++++++++++++++
contrib/bloom/blutils.c | 463 +++++++++++++++++++++++++++++++++++++++
contrib/bloom/blvacuum.c | 212 ++++++++++++++++++
contrib/bloom/blvalidate.c | 220 +++++++++++++++++++
contrib/bloom/expected/bloom.out | 122 +++++++++++
contrib/bloom/sql/bloom.sql | 47 ++++
contrib/bloom/t/001_wal.pl | 75 +++++++
doc/src/sgml/bloom.sgml | 218 ++++++++++++++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
18 files changed, 2126 insertions(+)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2 Erik Rijkers
er@xs4all.nl
In reply to: Teodor Sigaev (#1)
Re: pgsql: Bloom index contrib module

On 2016-04-01 15:49, Teodor Sigaev wrote:

Bloom index contrib module

The module provides a new access method: a simple Bloom filter
implemented as a PostgreSQL index. It can give some benefit for searches
over a large number of columns.

doc/src/sgml/bloom.sgml | 218 ++++++++++++++++++

I edited the bloom.sgml text a bit.

Great stuff, thanks!

Erik Rijkers

Attachments:

bloom.sgml.diff (text/x-diff, +29 -29)
#3 Teodor Sigaev
teodor@sigaev.ru
In reply to: Teodor Sigaev (#1)
Re: pgsql: Bloom index contrib module

Several non-x86 members of the pgbuildfarm aren't happy with it; we are
investigating the problem.

Teodor Sigaev wrote:

Bloom index contrib module

The module provides a new access method: a simple Bloom filter
implemented as a PostgreSQL index. It can give some benefit for searches
over a large number of columns.

The module is also, so far, the only way to test the generic WAL interface committed earlier.

Author: Teodor Sigaev, Alexander Korotkov
Reviewers: Aleksander Alekseev, Michael Paquier, Jim Nasby

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/9ee014fc899a28a198492b074e32b60ed8915ea9

Modified Files
--------------
contrib/Makefile | 1 +
contrib/bloom/.gitignore | 4 +
contrib/bloom/Makefile | 24 ++
contrib/bloom/blcost.c | 48 ++++
contrib/bloom/blinsert.c | 313 ++++++++++++++++++++++++++
contrib/bloom/bloom--1.0.sql | 19 ++
contrib/bloom/bloom.control | 5 +
contrib/bloom/bloom.h | 178 +++++++++++++++
contrib/bloom/blscan.c | 175 +++++++++++++++
contrib/bloom/blutils.c | 463 +++++++++++++++++++++++++++++++++++++++
contrib/bloom/blvacuum.c | 212 ++++++++++++++++++
contrib/bloom/blvalidate.c | 220 +++++++++++++++++++
contrib/bloom/expected/bloom.out | 122 +++++++++++
contrib/bloom/sql/bloom.sql | 47 ++++
contrib/bloom/t/001_wal.pl | 75 +++++++
doc/src/sgml/bloom.sgml | 218 ++++++++++++++++++
doc/src/sgml/contrib.sgml | 1 +
doc/src/sgml/filelist.sgml | 1 +
18 files changed, 2126 insertions(+)

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#1)
Re: pgsql: Bloom index contrib module

Teodor Sigaev <teodor@sigaev.ru> writes:

Bloom index contrib module

skink provided some pretty suggestive evidence about why this
is unstable:

==32446== VALGRINDERROR-BEGIN
==32446== Conditional jump or move depends on uninitialised value(s)
==32446== at 0x4E2E71: writeDelta (generic_xlog.c:137)
==32446== by 0x4E341E: GenericXLogFinish (generic_xlog.c:313)
==32446== by 0x14E83324: blbulkdelete (blvacuum.c:149)
==32446== by 0x4BCEE7: index_bulk_delete (indexam.c:627)
==32446== by 0x5DE577: lazy_vacuum_index (vacuumlazy.c:1581)
==32446== by 0x5DFB52: lazy_scan_heap (vacuumlazy.c:1273)
==32446== by 0x5E03AA: lazy_vacuum_rel (vacuumlazy.c:249)
==32446== by 0x5DC7B7: vacuum_rel (vacuum.c:1375)
==32446== by 0x5DD5F7: vacuum (vacuum.c:296)
==32446== by 0x693B71: autovacuum_do_vac_analyze (autovacuum.c:2807)
==32446== by 0x695B2A: do_autovacuum (autovacuum.c:2328)
==32446== by 0x696055: AutoVacWorkerMain (autovacuum.c:1647)
==32446== Uninitialised value was created by a stack allocation
==32446== at 0x14E82CAB: blbulkdelete (blvacuum.c:36)
==32446==
==32446== VALGRINDERROR-END

regards, tom lane


#5 Erik Rijkers
er@xs4all.nl
In reply to: Erik Rijkers (#2)
Re: pgsql: Bloom index contrib module

On 2016-04-01 14:36, Erik Rijkers wrote:

On 2016-04-01 15:49, Teodor Sigaev wrote:

Bloom index contrib module

doc/src/sgml/bloom.sgml | 218 ++++++++++++++++++

The size of the example table (in bloom.sgml):

CREATE TABLE tbloom AS
SELECT
random()::int as i1,
random()::int as i2,
[...]
random()::int as i12,
random()::int as i13
FROM
generate_series(1,1000);

seems too small to demonstrate use of the index.

For me, both on $BigServer at work and on $ModestDesktop at home, 1000
rows are not enough.

I suggest making the row count in that example larger, for instance
10000, so: generate_series(1,10000).

Does that make sense? I realize the behavior is probably somewhat
dependent on hardware and settings...

thanks,

Erik Rijkers


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#1)
Re: [COMMITTERS] pgsql: Bloom index contrib module

Teodor Sigaev <teodor@sigaev.ru> writes:

Bloom index contrib module

Would it be possible to dial down the amount of runtime consumed by
the regression tests for this module?

On my primary dev machine, "make installcheck" in contrib/bloom takes
4.5 seconds, which seems a bit excessive when make installcheck across
all the rest of contrib takes 22 seconds.

On prairiedog, which admittedly is one of the slower buildfarm animals
(though far from the slowest), the contrib-install-check-C step went
from 2:54 immediately before this patch went in to 4:28 immediately
after.

That seems out of line for a single contrib module, especially one of
unproven usefulness.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: [COMMITTERS] pgsql: Bloom index contrib module

I wrote:

Would it be possible to dial down the amount of runtime consumed by
the regression tests for this module?

Hmm ... "perf" says that a full 50% of the runtime of contrib/bloom's
regression test is consumed by GenericXLogFinish, and of that, the vast
majority is consumed by the extremely inefficient one-byte-at-a-time
matching in writeDelta() --- even dwarfing the "exchange two page images"
memcpy's. So there seems to be good reason to expend some
micro-optimization effort in that area. But aside from the writeDelta
problems, what in the world is the reason for the page exchange logic?
I could see copying a working image into place in the shared buffer
arena, but not saving the previous contents.
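To illustrate the kind of micro-optimization being suggested, here is a hedged Python sketch (not the actual generic_xlog.c code) of delta computation that skips the matching prefix and suffix of two equal-size page images wholesale instead of comparing and bookkeeping fragments byte by byte. It assumes a single contiguous changed region is enough; the real code must cope with multiple scattered fragments.

```python
def page_delta(old: bytes, new: bytes):
    """Compute an (offset, replacement) delta between two equal-size
    page images by skipping the shared prefix and suffix."""
    assert len(old) == len(new)
    lo = 0
    while lo < len(old) and old[lo] == new[lo]:
        lo += 1
    if lo == len(old):
        return None  # images identical, nothing to log
    hi = len(old)
    while hi > lo and old[hi - 1] == new[hi - 1]:
        hi -= 1
    return (lo, new[lo:hi])

def apply_delta(old: bytes, delta):
    """Reconstruct the new image from the old one plus the delta."""
    if delta is None:
        return old
    off, repl = delta
    return old[:off] + repl + old[off + len(repl):]

old_page = bytes(8192)                     # an 8 kB page of zeros
new_page = bytearray(old_page)
new_page[100:104] = b"\x01\x02\x03\x04"    # simulate a small in-page change
new_page = bytes(new_page)

delta = page_delta(old_page, new_page)
assert apply_delta(old_page, delta) == new_page
assert len(delta[1]) == 4                  # far smaller than a full-page image
```

The prefix/suffix skip is a common first step in binary diffing; finer-grained fragment tracking only pays off once changes are scattered across the page.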

Even if these costs went to zero, though, the bloom regression tests
would still be consuming a disproportionate amount of time.

regards, tom lane


#8 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#6)
Re: [COMMITTERS] pgsql: Bloom index contrib module

On Sat, Apr 09, 2016 at 11:50:08AM -0400, Tom Lane wrote:

Teodor Sigaev <teodor@sigaev.ru> writes:

Bloom index contrib module

Would it be possible to dial down the amount of runtime consumed by
the regression tests for this module?

On my primary dev machine, "make installcheck" in contrib/bloom takes
4.5 seconds, which seems a bit excessive when make installcheck across
all the rest of contrib takes 22 seconds.

On prairiedog, which admittedly is one of the slower buildfarm animals
(though far from the slowest), the contrib-install-check-C step went
from 2:54 immediately before this patch went in to 4:28 immediately
after.

That seems out of line for a single contrib module, especially one of
unproven usefulness.

I find this added test duration reasonable. If someone identifies a way to
realize similar coverage with lower duration, I'd value that contribution. -1
for meeting some runtime target at the expense of coverage. Older modules
have rather little test coverage, so they're poor as benchmarks.

A feature's lack of usefulness can be a good reason to exclude it from the
tree entirely, but it's a bad reason to slash the feature's test cases. (I
have not evaluated the usefulness of this access method.)


#9 Peter Geoghegan
In reply to: Noah Misch (#8)
Re: [COMMITTERS] pgsql: Bloom index contrib module

On Sat, Apr 9, 2016 at 4:43 PM, Noah Misch <noah@leadboat.com> wrote:

I find this added test duration reasonable. If someone identifies a way to
realize similar coverage with lower duration, I'd value that contribution. -1
for meeting some runtime target at the expense of coverage. Older modules
have rather little test coverage, so they're poor as benchmarks.

+1.

--
Peter Geoghegan


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#8)
Re: [COMMITTERS] pgsql: Bloom index contrib module

Noah Misch <noah@leadboat.com> writes:

On Sat, Apr 09, 2016 at 11:50:08AM -0400, Tom Lane wrote:

Would it be possible to dial down the amount of runtime consumed by
the regression tests for this module?

I find this added test duration reasonable. If someone identifies a way to
realize similar coverage with lower duration, I'd value that contribution. -1
for meeting some runtime target at the expense of coverage. Older modules
have rather little test coverage, so they're poor as benchmarks.

That argument is fine in principle, but the thing about applications such
as SQL databases is that shoving more rows through them doesn't in itself
increase test coverage; it may just iterate the loops more times.

The contrib/bloom regression test currently creates a 100000-row table.
I wondered how many rows were really required to achieve the same level
of code coverage. I experimented with gcov, and what I find is that
as of HEAD that test provides these levels of coverage:

Line Coverage Functions

contrib/bloom/: 75.4 % 374 / 496 93.1 % 27 / 29
generic_xlog.c: 68.5 % 98 / 143 77.8 % 7 / 9

(I looked specifically at generic_xlog.c because the main point of this
contrib module, so far as the core code is concerned, is to exercise
that file.)

I was depressed, though not entirely surprised, to find that you get
exactly that same line-count coverage if the table size is cut back
to ONE row. Only when you remove the INSERT command entirely is there
any change in these gcov statistics. Therefore, the last 99999 rows
are wasting my time, and yours, and that of every other developer who
will ever run this test suite in future. I do not like having the
community's time wasted.

I'm all for improving our test coverage, but just shoving lots of rows
through a not-very-well-designed-in-the-first-place regression test
isn't a reliable way to do that.

regards, tom lane


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#10)
Re: [COMMITTERS] pgsql: Bloom index contrib module

I wrote:

I was depressed, though not entirely surprised, to find that you get
exactly that same line-count coverage if the table size is cut back
to ONE row.

Oh, I found the flaw in my testing: there are two INSERTs in the test
script and I was changing only one of them. After correcting that,
the results behave a little more sanely:

Line Coverage Functions
1 row: 70.4 % 349 / 496 93.1 % 27 / 29
10 rows: 73.6 % 365 / 496 93.1 % 27 / 29
100 rows: 73.6 % 365 / 496 93.1 % 27 / 29
1000 rows: 75.4 % 374 / 496 93.1 % 27 / 29

Still, we've reached the most coverage this test can give us at 1000
rows, which still means it's wasting the last 99% of its runtime.

regards, tom lane


#12 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#11)
Re: [COMMITTERS] pgsql: Bloom index contrib module

On Sat, Apr 09, 2016 at 10:08:01PM -0400, Tom Lane wrote:

I wrote:

I was depressed, though not entirely surprised, to find that you get
exactly that same line-count coverage if the table size is cut back
to ONE row.

Oh, I found the flaw in my testing: there are two INSERTs in the test
script and I was changing only one of them. After correcting that,
the results behave a little more sanely:

Line Coverage Functions
1 row: 70.4 % 349 / 496 93.1 % 27 / 29
10 rows: 73.6 % 365 / 496 93.1 % 27 / 29
100 rows: 73.6 % 365 / 496 93.1 % 27 / 29
1000 rows: 75.4 % 374 / 496 93.1 % 27 / 29

Still, we've reached the most coverage this test can give us at 1000
rows, which still means it's wasting the last 99% of its runtime.

If dropping the row count to 1000 shaves >500ms on your primary machine, +1
for committing such a row count change. This is exactly what I meant by
"someone identifies a way to realize similar coverage with lower duration."
Thanks for contributing this study.


#13 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#12)
Re: [COMMITTERS] pgsql: Bloom index contrib module

On Sun, Apr 10, 2016 at 9:01 AM, Noah Misch <noah@leadboat.com> wrote:

On Sat, Apr 09, 2016 at 10:08:01PM -0400, Tom Lane wrote:

I wrote:

I was depressed, though not entirely surprised, to find that you get
exactly that same line-count coverage if the table size is cut back
to ONE row.

Oh, I found the flaw in my testing: there are two INSERTs in the test
script and I was changing only one of them. After correcting that,
the results behave a little more sanely:

Line Coverage Functions
1 row: 70.4 % 349 / 496 93.1 % 27 / 29
10 rows: 73.6 % 365 / 496 93.1 % 27 / 29
100 rows: 73.6 % 365 / 496 93.1 % 27 / 29
1000 rows: 75.4 % 374 / 496 93.1 % 27 / 29

Still, we've reached the most coverage this test can give us at 1000
rows, which still means it's wasting the last 99% of its runtime.

If dropping the row count to 1000 shaves >500ms on your primary machine, +1
for committing such a row count change. This is exactly what I meant by
"someone identifies a way to realize similar coverage with lower duration."
Thanks for contributing this study.

+1, row count reduction is a good way to reduce regression test time.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#12)
Re: [COMMITTERS] pgsql: Bloom index contrib module

Noah Misch <noah@leadboat.com> writes:

On Sat, Apr 09, 2016 at 10:08:01PM -0400, Tom Lane wrote:

Still, we've reached the most coverage this test can give us at 1000
rows, which still means it's wasting the last 99% of its runtime.

If dropping the row count to 1000 shaves >500ms on your primary machine, +1
for committing such a row count change. This is exactly what I meant by
"someone identifies a way to realize similar coverage with lower duration."
Thanks for contributing this study.

Further testing with gcov showed that the max coverage point was reached
somewhere between 500 and 750 rows (I didn't bother to pin it down
exactly). So I thought setting the table size to 1000 rows was probably
shaving things a bit too close; I made it 2000 rows instead.

I also added a few more test cases to improve the coverage beyond what
it was to start with, using a white-box approach of modifying the test
script until gcov said that it was hitting the areas it wasn't before.

regards, tom lane
