popcount

Started by David Fetterover 5 years ago19 messageshackers
Jump to latest
#1David Fetter
david@fetter.org

Hi,

Per request, I'd like to see about surfacing $Subject to SQL.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

v1-0001-popcount.patchtext/x-diff; charset=us-asciiDownload+107-0
#2Daniel Verite
daniel@manitou-mail.org
In reply to: David Fetter (#1)
Re: popcount

David Fetter wrote:

+Datum
+byteapopcount(PG_FUNCTION_ARGS)
+{
+	bytea	*t1 = PG_GETARG_BYTEA_PP(0);
+	int		len, result;
+
+	len = VARSIZE_ANY_EXHDR(t1);
+	result = pg_popcount(VARDATA_ANY(t1), len);
+
+	PG_RETURN_INT32(result);
+}

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

There is a pre-existing function in core: bit_length(bytea) returning int4,
but it errors out for large values (in fact it's a SQL function that returns
octet_length($1)*8).

For the varbit case, it doesn't seem like it can hold so many bits, although
I don't see the limit documented in
https://www.postgresql.org/docs/current/datatype-bit.html

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

#3David Fetter
david@fetter.org
In reply to: Daniel Verite (#2)
Re: popcount

On Wed, Dec 30, 2020 at 05:27:04PM +0100, Daniel Verite wrote:

David Fetter wrote:

+Datum
+byteapopcount(PG_FUNCTION_ARGS)
+{
+	bytea	*t1 = PG_GETARG_BYTEA_PP(0);
+	int		len, result;
+
+	len = VARSIZE_ANY_EXHDR(t1);
+	result = pg_popcount(VARDATA_ANY(t1), len);
+
+	PG_RETURN_INT32(result);
+}

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

It does now, and thanks for looking at this.

There is a pre-existing function in core: bit_length(bytea) returning int4,
but it errors out for large values (in fact it's a SQL function that returns
octet_length($1)*8).

For the varbit case, it doesn't seem like it can hold so many bits, although
I don't see the limit documented in
https://www.postgresql.org/docs/current/datatype-bit.html

Out of an abundance of caution, I changed that one, too.

I'd noticed earlier that ceil() doesn't need to be quite as wasteful
as the way I had it earlier, and fixed that with help from Andrew
Gierth.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

v2-0001-popcount.patchtext/x-diff; charset=us-asciiDownload+111-0
#4Peter Eisentraut
peter_e@gmx.net
In reply to: David Fetter (#3)
Re: popcount

On 2020-12-30 17:41, David Fetter wrote:

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

It does now, and thanks for looking at this.

The documentation still reflects the previous int4 return type (in two
different spellings, too).

#5David Fetter
david@fetter.org
In reply to: Peter Eisentraut (#4)
Re: popcount

On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote:

On 2020-12-30 17:41, David Fetter wrote:

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

It does now, and thanks for looking at this.

The documentation still reflects the previous int4 return type (in two
different spellings, too).

Thanks for looking this over!

Please find attached the next version with corrected documentation.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

v3-0001-popcount.patchtext/x-diff; charset=us-asciiDownload+111-0
#6Peter Eisentraut
peter_e@gmx.net
In reply to: David Fetter (#5)
Re: popcount

On 2021-01-11 17:13, David Fetter wrote:

On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote:

On 2020-12-30 17:41, David Fetter wrote:

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

It does now, and thanks for looking at this.

The documentation still reflects the previous int4 return type (in two
different spellings, too).

Thanks for looking this over!

Please find attached the next version with corrected documentation.

The documentation entries should be placed in alphabetical order in
their respective tables.

For the bytea function, maybe choose a simpler example that a reader can
compute in their head. Also for the test cases.

In varbit.c:

The comment says

+ * Returns the number of bits set in a bit array.

but it should be "bit string".

+ int8 popcount;

should be int64.

Also, pg_popcount() returns uint64, not int64. Perhaps overflow is not
possible here, but perhaps a small comment about why or an assertion
could be appropriate.

+ p = VARBITS(arg1);

Why not assign that in the declaration block as well?

This comment:

+   /*
+    * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+    * done to minimize branches and instructions.
+    */

I don't know what that is supposed to mean or why that kind of tuning
would be necessary for a user-callable function.

+ popcount = pg_popcount((const char *)p, len);

The "const" is probably not necessary.

byteapopcount() looks better, but also needs int8 -> int64.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#6)
Re: popcount

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

[ assorted nits ]

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions. I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

Perhaps something along the lines of count_ones() or count_set_bits()
would be more apropos.

regards, tom lane

#8Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#6)
Re: popcount

Peter Eisentraut wrote:

+   /*
+    * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+    * done to minimize branches and instructions.
+    */

I don't know what that is supposed to mean or why that kind of tuning
would be necessary for a user-callable function.

Also, the formula just below looks wrong in the current patch (v3):

+ /*
+  * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+  * done to minimize branches and instructions.
+  */
+ len = (VARBITLEN(arg1) + BITS_PER_BYTE + 1) / BITS_PER_BYTE;
+ p = VARBITS(arg1);
+
+ popcount = pg_popcount((const char *)p, len);

It should be
len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE

If we add 1 to BITS_PER_BYTE instead of subtracting 1, when
VARBITLEN(arg1) is a multiple of BITS_PER_BYTE, "len" is one byte too
large.

The correct formula is already used in include/utils/varbit.h near the
definition of VARBITLEN:

#define VARBITTOTALLEN(BITLEN) (((BITLEN) + BITS_PER_BYTE-1)/BITS_PER_BYTE +
\
VARHDRSZ +
VARBITHDRSZ)

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

#9David Fetter
david@fetter.org
In reply to: Tom Lane (#7)
Re: popcount

On Mon, Jan 18, 2021 at 10:34:10AM -0500, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

[ assorted nits ]

fixed, I think.

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions. I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

Perhaps something along the lines of count_ones() or count_set_bits()
would be more apropos.

Done that way.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

v4-0001-popcount.patchtext/x-diff; charset=us-asciiDownload+112-0
#10Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#7)
Re: popcount

On 2021-01-18 16:34, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

[ assorted nits ]

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions. I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

I was thinking about that too, but according to
<https://en.wikipedia.org/wiki/Hamming_weight&gt;, popcount is an accepted
high-level term, with "pop" also standing for "population".

#11Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#10)
Re: popcount

On Tue, Jan 19, 2021 at 3:06 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 2021-01-18 16:34, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

[ assorted nits ]

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions. I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

I was thinking about that too, but according to
<https://en.wikipedia.org/wiki/Hamming_weight&gt;, popcount is an accepted
high-level term, with "pop" also standing for "population".

Yeah, I am not sure that it's going to be good to invent our own name
for this, although maybe. But at least I think we should make sure
there are some good comments in an easily discoverable place. Some
people seem to think every programmer in the universe should know what
things like popcount() and fls() and ffs() and stuff like that are,
but it's far from obvious and I often have to refresh my memory. Let's
make it easy for someone to figure out, if they don't know already.
Like just a comment that says "this returns the number of 1 bits in
the integer supplied as an argument" or something can save somebody a
lot of trouble.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12David Fetter
david@fetter.org
In reply to: Robert Haas (#11)
Re: popcount

On Tue, Jan 19, 2021 at 07:58:12AM -0500, Robert Haas wrote:

On Tue, Jan 19, 2021 at 3:06 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 2021-01-18 16:34, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

[ assorted nits ]

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions. I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

I was thinking about that too, but according to
<https://en.wikipedia.org/wiki/Hamming_weight&gt;, popcount is an accepted
high-level term, with "pop" also standing for "population".

Yeah, I am not sure that it's going to be good to invent our own
name for this, although maybe. But at least I think we should make
sure there are some good comments in an easily discoverable place.
Some people seem to think every programmer in the universe should
know what things like popcount() and fls() and ffs() and stuff like
that are, but it's far from obvious and I often have to refresh my
memory. Let's make it easy for someone to figure out, if they don't
know already.

I went with count_set_bits() for the next version, and I hope that
helps clarify what it does.

Like just a comment that says "this returns the number of 1 bits in
the integer supplied as an argument" or something can save somebody
a lot of trouble.

You bring up an excellent point, which is that our builtin functions
could use a lot more documentation directly to hand than they now
have. For example, there's a lot of needless ambiguity created by
function comments which leave it up in the air as to which positional
argument does what in functions like string_to_array, which take
multiple arguments. I'll try to get a patch in for the next CF with a
fix for that, and a separate one that doesn't put it on people to use
\df+ to find the comments we do provide. There have been proposals for
including an optional space for COMMENT ON in DDL, but I suspect that
those won't fly unless and until they make their way into the
standard.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#13Isaac Morland
isaac.morland@gmail.com
In reply to: David Fetter (#12)
Re: popcount

On Tue, 19 Jan 2021 at 11:38, David Fetter <david@fetter.org> wrote:

You bring up an excellent point, which is that our builtin functions
could use a lot more documentation directly to hand than they now
have. For example, there's a lot of needless ambiguity created by
function comments which leave it up in the air as to which positional
argument does what in functions like string_to_array, which take
multiple arguments. I'll try to get a patch in for the next CF with a
fix for that, and a separate one that doesn't put it on people to use
\df+ to find the comments we do provide. There have been proposals for
including an optional space for COMMENT ON in DDL, but I suspect that
those won't fly unless and until they make their way into the
standard.

Since you mention \df+, I wonder if this is the time to consider removing
the source code from \df+ (since we have \sf) and adding in the proacl
instead?

#14Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Isaac Morland (#13)
Re: popcount

On Tue, Jan 19, 2021 at 9:42 PM Isaac Morland <isaac.morland@gmail.com>
wrote:

On Tue, 19 Jan 2021 at 11:38, David Fetter <david@fetter.org> wrote:

You bring up an excellent point, which is that our builtin functions
could use a lot more documentation directly to hand than they now
have. For example, there's a lot of needless ambiguity created by
function comments which leave it up in the air as to which positional
argument does what in functions like string_to_array, which take
multiple arguments. I'll try to get a patch in for the next CF with a
fix for that, and a separate one that doesn't put it on people to use
\df+ to find the comments we do provide. There have been proposals for
including an optional space for COMMENT ON in DDL, but I suspect that
those won't fly unless and until they make their way into the
standard.

Since you mention \df+, I wonder if this is the time to consider removing
the source code from \df+ (since we have \sf) and adding in the proacl
instead?

The cf bot failed to apply the patch (v4-0001-popcount.patch) because of

the wrong "-p"
I have regenerated the patch, can you please take a look.

http://cfbot.cputube.org/patch_32_2917.log

=== applying patch ./v4-0001-popcount.patch
can't find file to patch at input line 21
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

--
Ibrar Ahmed

Attachments:

v5-0001-popcount.patchapplication/octet-stream; name=v5-0001-popcount.patchDownload+112-0
#15John Naylor
john.naylor@enterprisedb.com
In reply to: Ibrar Ahmed (#14)
Re: popcount

Hi David,

Just a nitpick:

+SET bytea_output TO hex;

Since we don't see the string in the output, I don't immediately see the
reason to change the output format here?

Aside from that, this patch works as expected, and is ready for committer.
--
John Naylor
EDB: http://www.enterprisedb.com

#16Peter Eisentraut
peter_e@gmx.net
In reply to: John Naylor (#15)
Re: popcount

On 18.03.21 13:51, John Naylor wrote:

Hi David,

Just a nitpick:

+SET bytea_output TO hex;

Since we don't see the string in the output, I don't immediately see the
reason to change the output format here?

Aside from that, this patch works as expected, and is ready for committer.

I have now read the entire internet on what a suitable name for this
function could be. I think the emerging winner is BIT_COUNT(), which
already exists in MySQL, and also in Python (int.bit_count()) and Java
(Integer.bitCount()).

#17David Fetter
david@fetter.org
In reply to: Peter Eisentraut (#16)
Re: popcount

On Sat, Mar 20, 2021 at 09:01:25PM +0100, Peter Eisentraut wrote:

On 18.03.21 13:51, John Naylor wrote:

Hi David,

Just a nitpick:

+SET bytea_output TO hex;

Since we don't see the string in the output, I don't immediately see the
reason to change the output format here?

That's how I got it to work. If there's a way to make it go without
that, I'd be delighted to learn what it is :)

Aside from that, this patch works as expected, and is ready for committer.

I have now read the entire internet on what a suitable name for this
function could be. I think the emerging winner is BIT_COUNT(), which
already exists in MySQL, and also in Python (int.bit_count()) and Java
(Integer.bitCount()).

Thanks for doing this tedious work. Please find attached the next
version of the patch.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

v5-0001-popcount.patchtext/x-diff; charset=us-asciiDownload+112-0
#18Peter Eisentraut
peter_e@gmx.net
In reply to: David Fetter (#17)
Re: popcount

On 21.03.21 02:31, David Fetter wrote:

I have now read the entire internet on what a suitable name for this
function could be. I think the emerging winner is BIT_COUNT(), which
already exists in MySQL, and also in Python (int.bit_count()) and Java
(Integer.bitCount()).

Thanks for doing this tedious work. Please find attached the next
version of the patch.

committing, with some polishing

#19David Fetter
david@fetter.org
In reply to: Peter Eisentraut (#18)
Re: popcount

On Tue, Mar 23, 2021 at 10:51:08AM +0100, Peter Eisentraut wrote:

On 21.03.21 02:31, David Fetter wrote:

I have now read the entire internet on what a suitable name for this
function could be. I think the emerging winner is BIT_COUNT(), which
already exists in MySQL, and also in Python (int.bit_count()) and Java
(Integer.bitCount()).

Thanks for doing this tedious work. Please find attached the next
version of the patch.

committing, with some polishing

Thanks!

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate