Proposal to introduce a shuffle function to intarray extension
Dear list,
i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot be
addressed by the extension and are rather slow with plain SQL. Both can
be met with shuffling:
- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk
Shuffling is currently implemented by unnesting the array, ordering the
members by random() and aggregating them again.
create table numbers (arr int[]);
insert into numbers (arr)
select array_agg(i)
from generate_series(1, 4000000) i;
select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select array_agg(n order by random()) arr
from (
select unnest(arr) n from numbers
) plain
) shuffled;
---------------------------------------------------------
{2717290,3093757,2426384} ... {3011871,1402540,1613647}
Time: 2348.961 ms (00:02.349)
I wrote a small extension (see source code below) to see how much we can
gain, when the shuffling is implemented in C and the results speak for
themselves:
select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select shuffle(arr) arr from numbers
) shuffled;
----------------------------------------------------
{1313971,3593627,86630} ... {50764,430410,3901128}
Time: 132.151 ms
I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?
Best regards,
Martin Kalcher
Source code of extension mentioned above:
#include "postgres.h"
#include "port.h"
#include "utils/array.h"
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(shuffle);
void _shuffle(int32 *a, int len);
Datum
shuffle(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P_COPY(0);
int len;
if (array_contains_nulls(a))
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("array must not contain nulls")));
len = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
if (len > 1)
_shuffle((int32 *) ARR_DATA_PTR(a), len);
PG_RETURN_POINTER(a);
}
void
_shuffle(int32 *a, int len) {
int i, j;
int32 tmp;
for (i = len - 1; i > 0; i--) {
j = random() % (i + 1);
tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
}
On 7/15/22 04:36, Martin Kalcher wrote:
Dear list,
i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot
be addressed by the extension and are rather slow with plain SQL. Both
can be met with shuffling:- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunkShuffling is currently implemented by unnesting the array, ordering
the members by random() and aggregating them again.
Martin, have you considered PL/Python and NumPy module?
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
Am 16.07.22 um 18:53 schrieb Mladen Gogala:
On 7/15/22 04:36, Martin Kalcher wrote:
Dear list,
i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot
be addressed by the extension and are rather slow with plain SQL. Both
can be met with shuffling:- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunkShuffling is currently implemented by unnesting the array, ordering
the members by random() and aggregating them again.Martin, have you considered PL/Python and NumPy module?
Hey Mladen,
thank you for your advice. Unfortunately the performance of shuffling
with NumPy is about the same as with SQL.
create function numpy_shuffle(arr int[])
returns int[]
as $$
import numpy
numpy.random.shuffle(arr)
return arr
$$ language 'plpython3u';
select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select numpy_shuffle(arr) arr from numbers
) shuffled;
-------------------------------------------------------
{674026,3306457,1727170} ... {343875,3825484,1235246}
Time: 2315.431 ms (00:02.315)
Am i doing something wrong?
Martin
On 7/16/22 16:21, Martin Kalcher wrote:
Hey Mladen,
thank you for your advice. Unfortunately the performance of shuffling
with NumPy is about the same as with SQL.create function numpy_shuffle(arr int[])
returns int[]
as $$
import numpy
numpy.random.shuffle(arr)
return arr
$$ language 'plpython3u';select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select numpy_shuffle(arr) arr from numbers
) shuffled;-------------------------------------------------------
{674026,3306457,1727170} ... {343875,3825484,1235246}Time: 2315.431 ms (00:02.315)
Am i doing something wrong?
Martin
Hi Martin,
No, you're doing everything right. I have no solution for you. You may
need to do some C programming or throw a stronger hardware at the
problem. The performance of your processors may be the problem. Good luck!
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?
Hi Martin,
I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code. For the archives, Martin's initial message to -general
was:
/messages/by-id/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net
The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module. The latter seems like a
reasonable thing to want to me. The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.
https://wiki.postgresql.org/wiki/Submitting_a_Patch
Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function. You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least). Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml. I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.
Am 16.07.22 um 23:56 schrieb Thomas Munro:
On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?Hi Martin,
I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code. For the archives, Martin's initial message to -general
was:/messages/by-id/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net
The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module. The latter seems like a
reasonable thing to want to me. The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.https://wiki.postgresql.org/wiki/Submitting_a_Patch
Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function. You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least). Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml. I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.
Hello Thomas,
Thank you for pointing me towards the correct list.
I do not feel qualified to answer the question wether this belongs in an
external extension or in contrib/intarray. That reason i would like to
see it in contrib/intarray is, that it is lot easier for me to get our
operations team to upgrade our database system, because of new features
we need, than to get them to install a self-maintained extension on our
database servers.
Thank you for your feedback. I tried to address all your points and made
a first patch. Some points are still open:
- Documentation is postponed until further feedback.
- I am not shure about the naming. intarray has generic names like
sort() and uniq() and specialised names like icount(). I guess in case
someone wants to have a general array shuffle it could be accomplished
with function overloading. Or am i wrong here?
- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().
Martin
Attachments:
0001-Introduce-shuffle-and-sample-to-intarray-extension.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-shuffle-and-sample-to-intarray-extension.patchDownload+155-4
Martin Kalcher <martin.kalcher@aboutsource.net> writes:
Am 16.07.22 um 23:56 schrieb Thomas Munro:
I understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.
- I am not shure about the naming. intarray has generic names like
sort() and uniq() and specialised names like icount(). I guess in case
someone wants to have a general array shuffle it could be accomplished
with function overloading. Or am i wrong here?
I suppose this is exactly the point Thomas was wondering about: if we
use a generic function name for this, will it cause problems for someone
trying to add a generic function later?
We can investigate that question with a couple of toy functions:
regression=# create function foo(int[]) returns text as 'select ''int[] version''' language sql;
CREATE FUNCTION
regression=# create function foo(anyarray) returns text as 'select ''anyarray version''' language sql;
CREATE FUNCTION
regression=# select foo('{1,2,3}');
ERROR: function foo(unknown) is not unique
LINE 1: select foo('{1,2,3}');
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.
OK, that's not too surprising: with an unknown input there's just not
anything that the parser can use to disambiguate. But this happens
with just about any overloaded name, so I don't think it's a showstopper.
regression=# select foo('{1,2,3}'::int[]);
foo
---------------
int[] version
(1 row)
regression=# select foo('{1,2,3}'::int8[]);
foo
------------------
anyarray version
(1 row)
Good, that's more or less the minimum functionality we should expect.
regression=# select foo('{1,2,3}'::int2[]);
ERROR: function foo(smallint[]) is not unique
LINE 1: select foo('{1,2,3}'::int2[]);
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.
Oh, that's slightly annoying ...
regression=# select foo('{1,2,3}'::oid[]);
foo
------------------
anyarray version
(1 row)
regression=# select foo('{1,2,3}'::text[]);
foo
------------------
anyarray version
(1 row)
regression=# select foo('{1,2,3}'::float8[]);
foo
------------------
anyarray version
(1 row)
I couldn't readily find any case that misbehaves except for int2[].
You can force that to work, at least in one direction:
regression=# select foo('{1,2,3}'::int2[]::int[]);
foo
---------------
int[] version
(1 row)
On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version. That's
clearly the easiest-to-remember definition, and it doesn't seem
like the gotchas are bad enough to justify using separate names.
- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted.
I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...
regards, tom lane
On Sat, Jul 16, 2022 at 7:25 PM Martin Kalcher <
martin.kalcher@aboutsource.net> wrote:
- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column?
+----------
+ t
+(1 row)
+
While small, there is a non-zero chance for both samples to be equal. This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.
Same goes for:
+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') !=
shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+ ?column?
+----------
+ t
+(1 row)
+
David J.
On Sat, Jul 16, 2022 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Martin Kalcher <martin.kalcher@aboutsource.net> writes:
- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted.I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...
choose(input anyarray, size integer, with_replacement boolean default
false, algorithm text default 'default')?
David J.
I wrote:
On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.
Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.
regards, tom lane
On Sun, Jul 17, 2022 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.
Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.
I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch. He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1]https://lemire.me/blog/2016/06/30/fast-random-shuffling/. I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias. Anyway, file that under go-faster ideas for later.
[1]: https://lemire.me/blog/2016/06/30/fast-random-shuffling/
Am 17.07.22 um 05:37 schrieb Tom Lane:
Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.regards, tom lane
Hi Tom,
thank you for your thoughts. There are two reasons for choosing an
int4-only version. I am not familiar with postgres development (yet) and
i was not sure how open you are about such changes to core and if the
proposed feature is considered valuable enough to go into core. The
second reason was ease of implementation. The intarray extension does
not allow any NULL elements in arrays and treats multidimensional arrays
as though they were linear. Which makes the implementation straight
forward, because there are fewer cases to consider.
However, i will take a look at an implementation for anyarray in core.
Martin
Am 17.07.22 um 08:00 schrieb Thomas Munro:
I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch. He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1]. I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias. Anyway, file that under go-faster ideas for later.[1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/
Hi Thomas,
the small bias of random() % n is not a problem for my use case, but
might be for others. Its easily replaceable with
(int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1)
Unfortunately it is a bit slower (on my machine), but thats negligible.
Martin
Am 17.07.22 um 05:32 schrieb David G. Johnston:
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6); + ?column? +---------- + t +(1 row) +While small, there is a non-zero chance for both samples to be equal. This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.
Hey David,
you are right. There is a small chance for this test to fail. I wanted
to test, that two invocations produce different results (after all the
main feature of the function). But it can probably go.
Martin
Am 17.07.22 um 08:00 schrieb Thomas Munro:
Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.
I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept. Unfortunatly it is already about 5 times slower than the
specialised version and i am not sure if it is worth going down that road.
Martin
Attachments:
0001-introduce-array_shuffle.patchtext/x-patch; charset=UTF-8; name=0001-introduce-array_shuffle.patchDownload+76-1
On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:
Am 17.07.22 um 08:00 schrieb Thomas Munro:
Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept. Unfortunatly it is already about 5 times slower than the
specialised version and i am not sure if it is worth going down that road.
Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4. I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'. Or perhaps you can work out the padded size and
use that, to catch a few more types. Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?
About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.
Martin Kalcher <martin.kalcher@aboutsource.net> writes:
Am 17.07.22 um 08:00 schrieb Thomas Munro:
Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept.
This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.
(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)
regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes:
Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4.
I find it impossible to believe that it's worth micro-optimizing
shuffle() to that extent. Now, maybe doing something in that line
in deconstruct_array and construct_array would be worth our time,
as that'd benefit a pretty wide group of functions.
regards, tom lane
Am 18.07.22 um 00:46 schrieb Tom Lane:
This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)regards, tom lane
deconstruct_array() would destroy the arrays dimensions. I would expect
that shuffle() only shuffles the first dimension and keeps the inner
arrays intact.
Martin
Am 18.07.22 um 00:37 schrieb Thomas Munro:
Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4. I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'. Or perhaps you can work out the padded size and
use that, to catch a few more types. Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.
I made another experimental patch with fast tracks for typelen4 and
typelen8. alignments are not yet considered.