Patch: propose to include 3 new functions into intarray and intagg

Started by Dmitry Koterovover 17 years ago9 messages

dmitry@koterov.ru

over 17 years ago

Hello.

Here are these functions with detailed documentation:
http://en.dklab.ru/lib/dklab_postgresql_patch/

intagg.int_array_append_aggregate(int[]): fast merge arrays into one large
list
intarray._int_group_count_sort(int[], bool): frequency-based sorting
intarray.bidx(int[], int): binary search in a sorted array

Tested for about a year on a real PostgreSQL cluster (10 machines, Slony
replication) under a heavy load (millions of requests).
No crash nor memory problem detected during a year, so I suppose these
functions are well-tested.

What do you think about that?

Zdenek Kotala

Zdenek.Kotala@Sun.COM

over 17 years ago

In reply to: Dmitry Koterov (#1)

Re: Patch: propose to include 3 new functions into intarray and intagg

Dmitry Koterov napsal(a):

Hello.

Here are these functions with detailed documentation:
http://en.dklab.ru/lib/dklab_postgresql_patch/

Added to next commit fest patch list.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

Markus Wanner

markus@bluegap.ch

over 17 years ago

In reply to: Dmitry Koterov (#1)

Review Report: propose to include 3 new functions into intarray and intagg

Hi,

this is my first "official" review. I've tried to follow the "Review a
patch" guidelines from the wiki - thanks Simon, that was pretty helpful.

This review covers only the intagg additions.

Dmitry Koterov wrote:

Here are these functions with detailed documentation:
http://en.dklab.ru/lib/dklab_postgresql_patch/

Submission review: we generally prefer having patches archived on our
mailing lists, so please just send future patches or revisions of this
patch to our lists (I prefer -hackers, but probably -patches is still
the official one). The documentation should go into a README file of the
contrib module or some such. Tests are missing, but that's the case for
the intagg contrib module anyway. The patch applies and is in context
diff format, good. It adds an unrelated newline, which should generally
be avoided.

Usability review: functional additions look good and usable, certainly
within a contrib module. I don't think it's compliant to any spec, but
that's not required for contrib, IMO.

Functional test: works as advertised on the accompanying website. I've
tested the int_array_append_aggregate somewhat, see [1]functional and performance testing session:.

Performance testing: I've tested with 10 mio rows of arrays of size 1
and compared against core's int_array_aggregate function, see [1]functional and performance testing session: again.
In this simple test it took roughly 50% longer, which seems okay. Memory
consumption looks sane as well.

Coding review: style seems fine for contrib, though lines longer than 80
cols should be broken up. Comments in the code are sparse , some even
counter-productive (marking something as "additional things" certainly
doesn't help).

Code and architecture review: the new int_array_append_aggregate()
functions itself seems fine to me.

Summary: My general feeling is, that this patch should be applied after
minor code style corrections. As a longer term goal I think intagg
should be integrated into core, since it's very basic functionality.
TODO entries for things like an array_accum() aggregate already exist.
Adding this patch to contrib now might be a step into the right direction.

Dmitry, can you please apply these small corrections and re-submit the
patch?

Regards

Markus Wanner

P.S.: I dislike the intagg's use of PGARRAY, but that's nothing to do
with this patch. Shouldn't this better use a real composite type as the
aggregate's state type? I'd propose to clean up the intagg contrib
module and prepare it for inclusion into core.

[1]: functional and performance testing session:

On a database with (patched) intagg and intarr contrib modules:

markus=# CREATE TABLE test (id INT NOT NULL, arr INT[] NOT NULL);
CREATE TABLE
markus=# INSERT INTO test VALUES (1, ARRAY[1,2,3]), (2, ARRAY[4,5]), (3,
ARRAY[3,2,1]);
INSERT 0 3
markus=# SELECT * FROM test;
id | arr
----+---------
1 | {1,2,3}
2 | {4,5}
3 | {3,2,1}
(3 rows)

markus=# SELECT int_array_append_aggregate(arr) FROM test;
int_array_append_aggregate
----------------------------
{1,2,3,4,5,3,2,1}
(1 row)

markus=# SELECT * FROM test;
id | arr
----+---------
1 | {1,2,3}
2 | {4,5}
3 | {3,2,1}
(3 rows)

markus=# SELECT int_array_aggregate(id) AS ids,
int_array_append_aggregate(arr) FROM test GROUP BY (id / 2);
ids | int_array_append_aggregate
-------+----------------------------
{1} | {1,2,3}
{2,3} | {4,5,3,2,1}
(2 rows)

markus=# SELECT int_array_aggregate(id) AS ids,
int_array_append_aggregate(arr) FROM test GROUP BY (id % 2);
ids | int_array_append_aggregate
-------+----------------------------
{2} | {4,5}
{1,3} | {1,2,3,3,2,1}
(2 rows)

markus=# INSERT INTO test VALUES (4, NULL);
INSERT 0 1

markus=# SELECT int_array_append_aggregate(arr) FROM test;
int_array_append_aggregate
----------------------------
{1,2,3,4,5,3,2,1}
(1 row)

markus=# SELECT id, int_array_append_aggregate(arr) FROM test GROUP BY id;
id | int_array_append_aggregate
----+----------------------------
4 | {}
2 | {4,5}
3 | {3,2,1}
1 | {1,2,3}
(4 rows)

-- switching to performance testing

markus=# \timing
Timing is on.

markus=# DELETE FROM test;
DELETE 4
Time: 9.037 ms

markus=# INSERT INTO test SELECT generate_series(1, 10000000),
array[round(random() * 100)]::int[];
INSERT 0 10000000
Time: 53321.186 ms

markus=# SELECT icount(int_array_aggregate(id)) AS count FROM test;
count
----------
10000000
(1 row)

Time: 2493.184 ms

markus=# SELECT icount(int_array_append_aggregate(arr)) AS count FROM test;
count
----------
10000000
(1 row)

Time: 4152.478 ms

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Markus Wanner (#3)

Re: Review Report: propose to include 3 new functions into intarray and intagg

Markus Wanner <markus@bluegap.ch> writes:

Submission review: we generally prefer having patches archived on our
mailing lists, so please just send future patches or revisions of this
patch to our lists (I prefer -hackers, but probably -patches is still
the official one).

Please. But -patches is dead, please use -hackers.

The documentation should go into a README file of the
contrib module or some such.

No, it should go into the SGML docs, specifically doc/src/sgml/intarray.sgml
and intagg.sgml. We got rid of flat-text README documentation for
contrib in 8.3, and are not about to permit any backsliding.

Summary: My general feeling is, that this patch should be applied after
minor code style corrections. As a longer term goal I think intagg
should be integrated into core, since it's very basic functionality.

Well, what should be integrated is generic (non-datatype-specific)
versions of this functionality, ie functions working on
anyarray/anyelement not just int4[]/int4. There was some discussion
about that a few days ago. I'm not really sure how painful it would be
to do; probably the generalization to non-fixed-width types would be the
trickiest part.

regards, tom lane

Dmitry Koterov

dmitry@koterov.ru

over 17 years ago

In reply to: Markus Wanner (#3)

Re: Review Report: propose to include 3 new functions into intarray and intagg

OK, thank you for your review.

I'll correct everything and send a patch in a couple of days. Are you
completely sure that this patch will be included? If not, seems the work of
the patch standartization has much lower priority, and I will not hurry so
much.

But, what about intarray patch? Does somebody plan to review it? I'd prefer
to include it too. If you approve, I'll correct the code style in intarray
contrib patch too.

On Sat, Sep 6, 2008 at 3:34 PM, Markus Wanner <markus@bluegap.ch> wrote:

Show quoted text

Hi,

this is my first "official" review. I've tried to follow the "Review a
patch" guidelines from the wiki - thanks Simon, that was pretty helpful.

This review covers only the intagg additions.

Dmitry Koterov wrote:

Here are these functions with detailed documentation:
http://en.dklab.ru/lib/dklab_postgresql_patch/

Submission review: we generally prefer having patches archived on our
mailing lists, so please just send future patches or revisions of this
patch to our lists (I prefer -hackers, but probably -patches is still
the official one). The documentation should go into a README file of the
contrib module or some such. Tests are missing, but that's the case for
the intagg contrib module anyway. The patch applies and is in context
diff format, good. It adds an unrelated newline, which should generally be
avoided.

Usability review: functional additions look good and usable, certainly
within a contrib module. I don't think it's compliant to any spec, but
that's not required for contrib, IMO.

Functional test: works as advertised on the accompanying website. I've
tested the int_array_append_aggregate somewhat, see [1].

Performance testing: I've tested with 10 mio rows of arrays of size 1
and compared against core's int_array_aggregate function, see [1] again.
In this simple test it took roughly 50% longer, which seems okay. Memory
consumption looks sane as well.

Coding review: style seems fine for contrib, though lines longer than 80
cols should be broken up. Comments in the code are sparse , some even
counter-productive (marking something as "additional things" certainly
doesn't help).

Code and architecture review: the new int_array_append_aggregate()
functions itself seems fine to me.

Summary: My general feeling is, that this patch should be applied after
minor code style corrections. As a longer term goal I think intagg should be
integrated into core, since it's very basic functionality. TODO entries for
things like an array_accum() aggregate already exist. Adding this patch to
contrib now might be a step into the right direction.

Dmitry, can you please apply these small corrections and re-submit the
patch?

Regards

Markus Wanner

P.S.: I dislike the intagg's use of PGARRAY, but that's nothing to do with
this patch. Shouldn't this better use a real composite type as the
aggregate's state type? I'd propose to clean up the intagg contrib module
and prepare it for inclusion into core.

[1]: functional and performance testing session:

On a database with (patched) intagg and intarr contrib modules:

markus=# CREATE TABLE test (id INT NOT NULL, arr INT[] NOT NULL);
CREATE TABLE
markus=# INSERT INTO test VALUES (1, ARRAY[1,2,3]), (2, ARRAY[4,5]), (3,
ARRAY[3,2,1]);
INSERT 0 3
markus=# SELECT * FROM test;
id | arr
----+---------
1 | {1,2,3}
2 | {4,5}
3 | {3,2,1}
(3 rows)

markus=# SELECT int_array_append_aggregate(arr) FROM test;
int_array_append_aggregate
----------------------------
{1,2,3,4,5,3,2,1}
(1 row)

markus=# SELECT * FROM test;
id | arr
----+---------
1 | {1,2,3}
2 | {4,5}
3 | {3,2,1}
(3 rows)

markus=# SELECT int_array_aggregate(id) AS ids,
int_array_append_aggregate(arr) FROM test GROUP BY (id / 2);
ids | int_array_append_aggregate
-------+----------------------------
{1} | {1,2,3}
{2,3} | {4,5,3,2,1}
(2 rows)

markus=# SELECT int_array_aggregate(id) AS ids,
int_array_append_aggregate(arr) FROM test GROUP BY (id % 2);
ids | int_array_append_aggregate
-------+----------------------------
{2} | {4,5}
{1,3} | {1,2,3,3,2,1}
(2 rows)

markus=# INSERT INTO test VALUES (4, NULL);
INSERT 0 1

markus=# SELECT int_array_append_aggregate(arr) FROM test;
int_array_append_aggregate
----------------------------
{1,2,3,4,5,3,2,1}
(1 row)

markus=# SELECT id, int_array_append_aggregate(arr) FROM test GROUP BY id;
id | int_array_append_aggregate
----+----------------------------
4 | {}
2 | {4,5}
3 | {3,2,1}
1 | {1,2,3}
(4 rows)

-- switching to performance testing

markus=# \timing
Timing is on.

markus=# DELETE FROM test;
DELETE 4
Time: 9.037 ms

markus=# INSERT INTO test SELECT generate_series(1, 10000000),
array[round(random() * 100)]::int[];
INSERT 0 10000000
Time: 53321.186 ms

markus=# SELECT icount(int_array_aggregate(id)) AS count FROM test;
count
----------
10000000
(1 row)

Time: 2493.184 ms

markus=# SELECT icount(int_array_append_aggregate(arr)) AS count FROM test;
count
----------
10000000
(1 row)

Time: 4152.478 ms

Markus Wanner

markus@bluegap.ch

over 17 years ago

In reply to: Dmitry Koterov (#5)

Re: Review Report: propose to include 3 new functions into intarray and intagg

Hi,

Dmitry Koterov wrote:

I'll correct everything and send a patch in a couple of days.

Cool, thank you.

Are you
completely sure that this patch will be included?

Uh.. I'm not a committer, but I'm pretty sure your patch has good chances.

I can help with SGML documentation, if you want.

But, what about intarray patch? Does somebody plan to review it? I'd
prefer to include it too. If you approve, I'll correct the code style in
intarray contrib patch too.

I've already volunteered for reviewing it as well. I just felt like
splitting things up...

Regards

Markus Wanner

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: Markus Wanner (#6)

Re: Review Report: propose to include 3 new functions into intarray and intagg

Markus Wanner wrote:

Dmitry Koterov wrote:

But, what about intarray patch? Does somebody plan to review it? I'd
prefer to include it too. If you approve, I'll correct the code style
in intarray contrib patch too.

I've already volunteered for reviewing it as well. I just felt like
splitting things up...

Looking at the intarray patch:

I find it a bit unfriendly to have a function that depends on sorted
input, but doesn't check it. But that's probably not a good enough
reason to reject an otherwise simple and useful function. Also, we
already have uniq, which doesn't strictly speaking require sorted input,
but it does if you want to eliminate all duplicates from the array.

_int_group_count_sort seems a bit special purpose. Why does it bother to
sort the output? That's wasted time if you don't need sorted output, or
if you want the array sorted by the integer value instead of frequency.
If you want sorted output, you can just sort it afterwards.

Also, it's requiring sorted input for a small performance gain, but
there's a lot more precedence in the existing intarray functions to not
require sorted input, but to sort the input instead (union, intersect,
same, overlap).

I realize that the current implementation is faster for the use case
where the input is sorted, and output needs to be sorted, but if we go
down that path we'll soon have dozens of different variants of various
functions, with different ordering requirements of inputs and outputs.

So, I'd suggest changing _int_group_count_sort so that it doesn't
require sorted input, and doesn't sort the output. The binary search
function looks good to me (I think I'd prefer naming it bsearch(),
though, though I can see that it was named bidx in reference to the
existing idx function). Also, as Markus pointed out, the SGML docs need
to be updated.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Markus Wanner

markus@bluegap.ch

over 17 years ago

In reply to: Heikki Linnakangas (#7)

Re: Review Report: propose to include 3 new functions into intarray and intagg

Hi,

sorry for not having completed this review, yet. As you are obviously
looking at the patch as well, I'll try to quickly write down my points
so far.

Trying to compile the intarray module, I now receive an error:

error: ï¿½INT4OIDï¿½ undeclared (first use in this function)

That can be solved by including "catalog/pg_type.h" from
contrib/intarr/_int_op.c.

The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the
top of the file, where all others are.

Some lines are longer than 80 columns and again comments are a bit
sparse or even useless (no "additional things", please).

Heikki Linnakangas wrote:

I find it a bit unfriendly to have a function that depends on sorted
input, but doesn't check it. But that's probably not a good enough
reason to reject an otherwise simple and useful function. Also, we
already have uniq, which doesn't strictly speaking require sorted input,
but it does if you want to eliminate all duplicates from the array.

I think it's a performance optimization which is absolutely required in
some cases. Some time ago I've also had to rip out the sorting step from
certain intarray module functions to save processing time.

One option already mentioned somewhere would be saving a 'sorted'
property for the array. Then again, I think such a thing would certainly
have to be done globally, for all kinds of arrays.

_int_group_count_sort seems a bit special purpose. Why does it bother to
sort the output? That's wasted time if you don't need sorted output, or
if you want the array sorted by the integer value instead of frequency.
If you want sorted output, you can just sort it afterwards.

Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database
itself should be used for such a thing. However, that means turning an
array into a set of rows...

Also, it's requiring sorted input for a small performance gain, but
there's a lot more precedence in the existing intarray functions to not
require sorted input, but to sort the input instead (union, intersect,
same, overlap).

..and exactly these are the functions I had to wrap again to strip the
sorting step, due to poor performance for known-sorted arrays.

I realize that the current implementation is faster for the use case
where the input is sorted, and output needs to be sorted, but if we go
down that path we'll soon have dozens of different variants of various
functions, with different ordering requirements of inputs and outputs.

Agreed.

However, given the OP is using that in production, there seems to be a
use case for the optimization, where we have none for the same function
without it.

So, I'd suggest changing _int_group_count_sort so that it doesn't
require sorted input, and doesn't sort the output. The binary search
function looks good to me (I think I'd prefer naming it bsearch(),
though, though I can see that it was named bidx in reference to the
existing idx function). Also, as Markus pointed out, the SGML docs need
to be updated.

As is, it should probably also carry the '_int' prefix, because it's not
a general purpose array function. So propose to name it '_int_bsearch'.

Overall I think these functions are overly specialized and should be
replaced be more general counterparts in core. However, until we have
that, it's hard to refuse such a thing for contrib.

By that reasoning, however, the intarray would have to provide methods
for sorted as well as asorted input arrays as well.

I'm closing my part of reviewing of this patch now. Dmitry, how do you
want to proceed with these patches? Do you have time for some cleaning
up and writing documentation?

Regards

Markus Wanner

Dmitry Koterov

dmitry@koterov.ru

over 17 years ago

In reply to: Markus Wanner (#8)

Re: Review Report: propose to include 3 new functions into intarray and intagg

No problem, I have time for clearing. But are these functions
guaranteed to be included in the contrib? If there is no guarantee,
seems the time of clearing will be wasted. (5 years ago I have already
cleaned one open-source library on demand and after that it was not
approved for PEAR repository, so now I am more circumspect, sorry :-).

So - is the approval procedure finished? If not, who could make the
final decision ("be or not to be")? Sorry, I don't yet know well the
patch proposal procedure...

P.S.
1. Unfortunately GROUP BY + ORDER BY is sometimes 1000 times (!)
slower than _int_group_count_sort. Because of that I have created this
function.
2. I have to assume that the input is sorted in functions, because
else the performance is lowered very much. Sort operation is quite
expensive; checking if an array is sorted or not is also quite
expensive... So I think it is not a good idea to add
sorting-independency to functions.
3. Seems the conversion of functions to anyarray/anyelement is much
more time-expensive than simply including it to specialized
intarray/intagg. Is it absolutely necessery?

Show quoted text

On Mon, Sep 15, 2008 at 4:38 PM, Markus Wanner <markus@bluegap.ch> wrote:

Hi,

sorry for not having completed this review, yet. As you are obviously
looking at the patch as well, I'll try to quickly write down my points so
far.

Trying to compile the intarray module, I now receive an error:

error: 'INT4OID' undeclared (first use in this function)

That can be solved by including "catalog/pg_type.h" from
contrib/intarr/_int_op.c.

The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the top
of the file, where all others are.

Some lines are longer than 80 columns and again comments are a bit sparse or
even useless (no "additional things", please).

Heikki Linnakangas wrote:

I find it a bit unfriendly to have a function that depends on sorted
input, but doesn't check it. But that's probably not a good enough reason to
reject an otherwise simple and useful function. Also, we already have uniq,
which doesn't strictly speaking require sorted input, but it does if you
want to eliminate all duplicates from the array.

I think it's a performance optimization which is absolutely required in some
cases. Some time ago I've also had to rip out the sorting step from certain
intarray module functions to save processing time.

One option already mentioned somewhere would be saving a 'sorted' property
for the array. Then again, I think such a thing would certainly have to be
done globally, for all kinds of arrays.

_int_group_count_sort seems a bit special purpose. Why does it bother to
sort the output? That's wasted time if you don't need sorted output, or if
you want the array sorted by the integer value instead of frequency. If you
want sorted output, you can just sort it afterwards.

Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database itself
should be used for such a thing. However, that means turning an array into a
set of rows...

Also, it's requiring sorted input for a small performance gain, but
there's a lot more precedence in the existing intarray functions to not
require sorted input, but to sort the input instead (union, intersect, same,
overlap).

..and exactly these are the functions I had to wrap again to strip the
sorting step, due to poor performance for known-sorted arrays.

I realize that the current implementation is faster for the use case where
the input is sorted, and output needs to be sorted, but if we go down that
path we'll soon have dozens of different variants of various functions, with
different ordering requirements of inputs and outputs.

Agreed.

However, given the OP is using that in production, there seems to be a use
case for the optimization, where we have none for the same function without
it.

So, I'd suggest changing _int_group_count_sort so that it doesn't require
sorted input, and doesn't sort the output. The binary search function looks
good to me (I think I'd prefer naming it bsearch(), though, though I can see
that it was named bidx in reference to the existing idx function). Also, as
Markus pointed out, the SGML docs need to be updated.

As is, it should probably also carry the '_int' prefix, because it's not a
general purpose array function. So propose to name it '_int_bsearch'.

Overall I think these functions are overly specialized and should be
replaced be more general counterparts in core. However, until we have that,
it's hard to refuse such a thing for contrib.

By that reasoning, however, the intarray would have to provide methods for
sorted as well as asorted input arrays as well.

I'm closing my part of reviewing of this patch now. Dmitry, how do you want
to proceed with these patches? Do you have time for some cleaning up and
writing documentation?

Regards

Markus Wanner