PATCH: Extending the HyperLogLog API a bit

Started by Tomas Vondraover 10 years ago12 messageshackers
Jump to latest
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

while working on the bloom filters for hashjoins, I've started using the
HLL library committed as part of the sorting improvements for 9.5. I
propose adding two more functions to the API, which I think are quite
useful:

1) initHyperLogLogError(hyperLogLogState *cState, double error)

Instead of specifying bwidth (essentially the number of bits used
for addressing in the counter), this allows specifying the expected
error rate for the counter, which is

error_rate = 1.04 / sqrt(2^bwidth)

So for 5% we get bwidth=5, and so on. This makes the API a bit easier
the use, because there are pretty much no comments about the meaning
of bwidth, and the existing callers simply use 10 without details.

2) freeHyperLogLog(hyperLogLogState *cState)

I think it's a good idea to provide function "undoing" what init
does, i.e. freeing the internal memory etc. Currently that's trivial
to do, but perhaps we'll make the structure more complicated in the
future (albeit that might be unlikely).

FWIW I've already used this in the patch marrying hash joins and bloom
filters.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: PATCH: Extending the HyperLogLog API a bit

Meh, of course I forgot to actually attach the patch.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-extend-the-HyperLogLog-API-a-bit-by-adding-two-more-.patchtext/x-diff; name=0001-extend-the-HyperLogLog-API-a-bit-by-adding-two-more-.patchDownload+31-3
In reply to: Tomas Vondra (#1)
Re: PATCH: Extending the HyperLogLog API a bit

On Thu, Dec 31, 2015 at 12:48 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

1) initHyperLogLogError(hyperLogLogState *cState, double error)

Instead of specifying bwidth (essentially the number of bits used
for addressing in the counter), this allows specifying the expected
error rate for the counter, which is

error_rate = 1.04 / sqrt(2^bwidth)

So for 5% we get bwidth=5, and so on. This makes the API a bit easier
the use, because there are pretty much no comments about the meaning
of bwidth, and the existing callers simply use 10 without details.

Fair, but you didn't add any better comments!

2) freeHyperLogLog(hyperLogLogState *cState)

I think it's a good idea to provide function "undoing" what init
does, i.e. freeing the internal memory etc. Currently that's trivial
to do, but perhaps we'll make the structure more complicated in the
future (albeit that might be unlikely).

Seems reasonable.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#3)
Re: PATCH: Extending the HyperLogLog API a bit

On 01/01/2016 12:08 AM, Peter Geoghegan wrote:

On Thu, Dec 31, 2015 at 12:48 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

1) initHyperLogLogError(hyperLogLogState *cState, double error)

Instead of specifying bwidth (essentially the number of bits used
for addressing in the counter), this allows specifying the expected
error rate for the counter, which is

error_rate = 1.04 / sqrt(2^bwidth)

So for 5% we get bwidth=5, and so on. This makes the API a bit easier
the use, because there are pretty much no comments about the meaning
of bwidth, and the existing callers simply use 10 without details.

Fair, but you didn't add any better comments!

2) freeHyperLogLog(hyperLogLogState *cState)

I think it's a good idea to provide function "undoing" what init
does, i.e. freeing the internal memory etc. Currently that's trivial
to do, but perhaps we'll make the structure more complicated in the
future (albeit that might be unlikely).

Seems reasonable.

Attached is v2 of the patch, adding the comments.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-extend-HyperLogLog-API-v2.patchtext/x-diff; name=0001-extend-HyperLogLog-API-v2.patchDownload+47-1
#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#4)
Re: PATCH: Extending the HyperLogLog API a bit

Tomas Vondra wrote:

Attached is v2 of the patch, adding the comments.

Looks pretty reasonable to me. I'm not sure we want to push this ahead
of the bloom filter stuff, but it looks ready to commit otherwise.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#5)
Re: PATCH: Extending the HyperLogLog API a bit

On Mon, Jan 11, 2016 at 2:22 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Tomas Vondra wrote:

Attached is v2 of the patch, adding the comments.

Looks pretty reasonable to me. I'm not sure we want to push this ahead
of the bloom filter stuff, but it looks ready to commit otherwise.

I'd say go ahead and push it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#4)
Re: PATCH: Extending the HyperLogLog API a bit

While going over this patch I noticed this commit in HyperLogLog's
upstream:

https://github.com/hideo55/cpp-HyperLogLog/commit/b8cb5e7b856928af15e9535b4b1650f493ba453f

In the first hunk which is what we care about, the author is doing
bit-or of both hash values instead of taking the max value, which is
what we were doing.

Our transcript seems to predate that bugfix commit, so I assume we need
to apply this to our copy too. Sadly, Hideaki-san commit message isn't
very descriptive.

I don't really know how HyperLogLog works, so maybe we can't or
shouldn't apply the patch because of how the hash stuff is used.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#6)
Re: PATCH: Extending the HyperLogLog API a bit

Robert Haas wrote:

On Mon, Jan 11, 2016 at 2:22 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Tomas Vondra wrote:

Attached is v2 of the patch, adding the comments.

Looks pretty reasonable to me. I'm not sure we want to push this ahead
of the bloom filter stuff, but it looks ready to commit otherwise.

I'd say go ahead and push it.

Done.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Alvaro Herrera (#7)
Re: PATCH: Extending the HyperLogLog API a bit

On Tue, Jan 19, 2016 at 9:37 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Our transcript seems to predate that bugfix commit, so I assume we need
to apply this to our copy too. Sadly, Hideaki-san commit message isn't
very descriptive.

Fortunately, the function mergeHyperLogLog() in our hyperloglog.c
currently has no callers.

I don't really know how HyperLogLog works, so maybe we can't or
shouldn't apply the patch because of how the hash stuff is used.

I think that Hideaki's confusion comes from whether or not this HLL
state is a sparse or dense/full representation. The distinction is
explained in the README for postgresql-hll:

https://github.com/aggregateknowledge/postgresql-hll

postgresql-hll has no support for merging HLLs that are sparse:

https://github.com/aggregateknowledge/postgresql-hll/blob/master/hll.c#L1888

Can't we just tear mergeHyperLogLog() out?

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#9)
Re: PATCH: Extending the HyperLogLog API a bit

On 01/19/2016 10:54 PM, Peter Geoghegan wrote:

On Tue, Jan 19, 2016 at 9:37 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Our transcript seems to predate that bugfix commit, so I assume we need
to apply this to our copy too. Sadly, Hideaki-san commit message isn't
very descriptive.

Fortunately, the function mergeHyperLogLog() in our hyperloglog.c
currently has no callers.

I don't really know how HyperLogLog works, so maybe we can't or
shouldn't apply the patch because of how the hash stuff is used.

I think that Hideaki's confusion comes from whether or not this HLL
state is a sparse or dense/full representation. The distinction is
explained in the README for postgresql-hll:

https://github.com/aggregateknowledge/postgresql-hll

postgresql-hll has no support for merging HLLs that are sparse:

https://github.com/aggregateknowledge/postgresql-hll/blob/master/hll.c#L1888

Can't we just tear mergeHyperLogLog() out?

FWIW I've been considering adding APPROX_COUNT_DISTINCT() aggregate,
similarly to what other databases (e.g. Vertica) have built-in. Now,
that would not require the merge too, but we're currently baking support
for 'combine' functions, and that's exactly what merge does.

So why not just fix the bug?

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Tomas Vondra (#10)
Re: PATCH: Extending the HyperLogLog API a bit

On Tue, Jan 19, 2016 at 2:03 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

FWIW I've been considering adding APPROX_COUNT_DISTINCT() aggregate,
similarly to what other databases (e.g. Vertica) have built-in. Now, that
would not require the merge too, but we're currently baking support for
'combine' functions, and that's exactly what merge does.

So why not just fix the bug?

You can go from the sparse representation to the dense representation,
so in principle you can merge two of our HLL states, if you are then
satisfied with having a new representation for the merged state. I
don't have time right now to do a full analysis of whether or not it's
possible to just fix the bug without doing all that, but I think it
might not be.

I think we benefit from the simplicity of the sparse representation.
So, in the absence of a clear justification for retaining
mergeHyperLogLog(), ripping it out seems best.

I also think that an expanded set of functionality would be required
for your APPROX_COUNT_DISTINCT() patch anyway, including support for
multiple representations (perhaps this isn't documented in your
APPROX_COUNT_DISTINCT(), but complete implementations seem to switch
from sparse to full at a certain point). So, ripping out
mergeHyperLogLog() doesn't really make that effort any more difficult.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Peter Geoghegan (#11)
Re: PATCH: Extending the HyperLogLog API a bit

On Tue, Jan 19, 2016 at 2:22 PM, Peter Geoghegan <pg@heroku.com> wrote:

I don't have time right now to do a full analysis of whether or not it's
possible to just fix the bug without doing all that, but I think it
might not be.

IOW: I think that Hideaki's bug fix might itself be wrong (although I
might be wrong about that; no time to make sure right now). Note that
Hideaki's implementation was not the only one I looked at when working
on this code.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers