Proposal to introduce a shuffle function to intarray extension

Started by Martin Kalcheralmost 4 years ago76 messageshackersgeneral
Jump to latest
#1Martin Kalcher
martin.kalcher@aboutsource.net
hackersgeneral

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot be
addressed by the extension and are rather slow with plain SQL. Both can
be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering the
members by random() and aggregating them again.

create table numbers (arr int[]);

insert into numbers (arr)
select array_agg(i)
from generate_series(1, 4000000) i;

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select array_agg(n order by random()) arr
from (
select unnest(arr) n from numbers
) plain
) shuffled;

---------------------------------------------------------
{2717290,3093757,2426384} ... {3011871,1402540,1613647}

Time: 2348.961 ms (00:02.349)

I wrote a small extension (see source code below) to see how much we can
gain, when the shuffling is implemented in C and the results speak for
themselves:

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select shuffle(arr) arr from numbers
) shuffled;

----------------------------------------------------
{1313971,3593627,86630} ... {50764,430410,3901128}

Time: 132.151 ms

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Best regards,
Martin Kalcher

Source code of extension mentioned above:

#include "postgres.h"
#include "port.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(shuffle);

void _shuffle(int32 *a, int len);

Datum
shuffle(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P_COPY(0);

int len;

if (array_contains_nulls(a))
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("array must not contain nulls")));

len = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));

if (len > 1)
_shuffle((int32 *) ARR_DATA_PTR(a), len);

PG_RETURN_POINTER(a);
}

void
_shuffle(int32 *a, int len) {
int i, j;
int32 tmp;

for (i = len - 1; i > 0; i--) {
j = random() % (i + 1);
tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
}

#2Mladen Gogala
gogala.mladen@gmail.com
In reply to: Martin Kalcher (#1)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On 7/15/22 04:36, Martin Kalcher wrote:

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot
be addressed by the extension and are rather slow with plain SQL. Both
can be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering
the members by random() and aggregating them again.

Martin, have you considered PL/Python and NumPy module?

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

#3Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Mladen Gogala (#2)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 16.07.22 um 18:53 schrieb Mladen Gogala:

On 7/15/22 04:36, Martin Kalcher wrote:

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot
be addressed by the extension and are rather slow with plain SQL. Both
can be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering
the members by random() and aggregating them again.

Martin, have you considered PL/Python and NumPy module?

Hey Mladen,

thank you for your advice. Unfortunately the performance of shuffling
with NumPy is about the same as with SQL.

create function numpy_shuffle(arr int[])
returns int[]
as $$
import numpy
numpy.random.shuffle(arr)
return arr
$$ language 'plpython3u';

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select numpy_shuffle(arr) arr from numbers
) shuffled;

-------------------------------------------------------
{674026,3306457,1727170} ... {343875,3825484,1235246}

Time: 2315.431 ms (00:02.315)

Am i doing something wrong?

Martin

#4Mladen Gogala
gogala.mladen@gmail.com
In reply to: Martin Kalcher (#3)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On 7/16/22 16:21, Martin Kalcher wrote:

Hey Mladen,

thank you for your advice. Unfortunately the performance of shuffling
with NumPy is about the same as with SQL.

  create function numpy_shuffle(arr int[])
  returns int[]
  as $$
    import numpy
    numpy.random.shuffle(arr)
    return arr
  $$ language 'plpython3u';

  select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
  from (
    select numpy_shuffle(arr) arr from numbers
  ) shuffled;

  -------------------------------------------------------
   {674026,3306457,1727170} ... {343875,3825484,1235246}

  Time: 2315.431 ms (00:02.315)

Am i doing something wrong?

Martin

Hi Martin,

No, you're doing everything right. I have no solution for you. You may
need to do some C programming or throw a stronger hardware at the
problem. The performance of your processors may be the problem. Good luck!

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Martin Kalcher (#1)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Hi Martin,

I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code. For the archives, Martin's initial message to -general
was:

/messages/by-id/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net

The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module. The latter seems like a
reasonable thing to want to me. The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.

https://wiki.postgresql.org/wiki/Submitting_a_Patch

Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function. You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least). Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml. I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

#6Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Thomas Munro (#5)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 16.07.22 um 23:56 schrieb Thomas Munro:

On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Hi Martin,

I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code. For the archives, Martin's initial message to -general
was:

/messages/by-id/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net

The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module. The latter seems like a
reasonable thing to want to me. The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.

https://wiki.postgresql.org/wiki/Submitting_a_Patch

Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function. You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least). Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml. I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

Hello Thomas,

Thank you for pointing me towards the correct list.

I do not feel qualified to answer the question wether this belongs in an
external extension or in contrib/intarray. That reason i would like to
see it in contrib/intarray is, that it is lot easier for me to get our
operations team to upgrade our database system, because of new features
we need, than to get them to install a self-maintained extension on our
database servers.

Thank you for your feedback. I tried to address all your points and made
a first patch. Some points are still open:

- Documentation is postponed until further feedback.

- I am not shure about the naming. intarray has generic names like
sort() and uniq() and specialised names like icount(). I guess in case
someone wants to have a general array shuffle it could be accomplished
with function overloading. Or am i wrong here?

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().

Martin

Attachments:

0001-Introduce-shuffle-and-sample-to-intarray-extension.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-shuffle-and-sample-to-intarray-extension.patchDownload+155-4
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#6)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

Am 16.07.22 um 23:56 schrieb Thomas Munro:

I understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

- I am not shure about the naming. intarray has generic names like
sort() and uniq() and specialised names like icount(). I guess in case
someone wants to have a general array shuffle it could be accomplished
with function overloading. Or am i wrong here?

I suppose this is exactly the point Thomas was wondering about: if we
use a generic function name for this, will it cause problems for someone
trying to add a generic function later?

We can investigate that question with a couple of toy functions:

regression=# create function foo(int[]) returns text as 'select ''int[] version''' language sql;
CREATE FUNCTION
regression=# create function foo(anyarray) returns text as 'select ''anyarray version''' language sql;
CREATE FUNCTION
regression=# select foo('{1,2,3}');
ERROR: function foo(unknown) is not unique
LINE 1: select foo('{1,2,3}');
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.

OK, that's not too surprising: with an unknown input there's just not
anything that the parser can use to disambiguate. But this happens
with just about any overloaded name, so I don't think it's a showstopper.

regression=# select foo('{1,2,3}'::int[]);
foo
---------------
int[] version
(1 row)

regression=# select foo('{1,2,3}'::int8[]);
foo
------------------
anyarray version
(1 row)

Good, that's more or less the minimum functionality we should expect.

regression=# select foo('{1,2,3}'::int2[]);
ERROR: function foo(smallint[]) is not unique
LINE 1: select foo('{1,2,3}'::int2[]);
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.

Oh, that's slightly annoying ...

regression=# select foo('{1,2,3}'::oid[]);
foo
------------------
anyarray version
(1 row)

regression=# select foo('{1,2,3}'::text[]);
foo
------------------
anyarray version
(1 row)

regression=# select foo('{1,2,3}'::float8[]);
foo
------------------
anyarray version
(1 row)

I couldn't readily find any case that misbehaves except for int2[].
You can force that to work, at least in one direction:

regression=# select foo('{1,2,3}'::int2[]::int[]);
foo
---------------
int[] version
(1 row)

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version. That's
clearly the easiest-to-remember definition, and it doesn't seem
like the gotchas are bad enough to justify using separate names.

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted.

I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...

regards, tom lane

#8David G. Johnston
david.g.johnston@gmail.com
In reply to: Martin Kalcher (#6)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On Sat, Jul 16, 2022 at 7:25 PM Martin Kalcher <
martin.kalcher@aboutsource.net> wrote:

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().

 +SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column?
+----------
+ t
+(1 row)
+

While small, there is a non-zero chance for both samples to be equal. This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.

Same goes for:

+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') !=
shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+ ?column?
+----------
+ t
+(1 row)
+

David J.

#9David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#7)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On Sat, Jul 16, 2022 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted.

I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...

choose(input anyarray, size integer, with_replacement boolean default
false, algorithm text default 'default')?

David J.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

I wrote:

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

regards, tom lane

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#10)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On Sun, Jul 17, 2022 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch. He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1]https://lemire.me/blog/2016/06/30/fast-random-shuffling/. I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias. Anyway, file that under go-faster ideas for later.

[1]: https://lemire.me/blog/2016/06/30/fast-random-shuffling/

#12Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#10)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 05:37 schrieb Tom Lane:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

regards, tom lane

Hi Tom,

thank you for your thoughts. There are two reasons for choosing an
int4-only version. I am not familiar with postgres development (yet) and
i was not sure how open you are about such changes to core and if the
proposed feature is considered valuable enough to go into core. The
second reason was ease of implementation. The intarray extension does
not allow any NULL elements in arrays and treats multidimensional arrays
as though they were linear. Which makes the implementation straight
forward, because there are fewer cases to consider.

However, i will take a look at an implementation for anyarray in core.

Martin

#13Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Thomas Munro (#11)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 08:00 schrieb Thomas Munro:

I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch. He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1]. I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias. Anyway, file that under go-faster ideas for later.

[1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

Hi Thomas,

the small bias of random() % n is not a problem for my use case, but
might be for others. Its easily replaceable with

(int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1)

Unfortunately it is a bit slower (on my machine), but thats negligible.

Martin

#14Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: David G. Johnston (#8)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 05:32 schrieb David G. Johnston:

+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column?
+----------
+ t
+(1 row)
+

While small, there is a non-zero chance for both samples to be equal. This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.

Hey David,

you are right. There is a small chance for this test to fail. I wanted
to test, that two invocations produce different results (after all the
main feature of the function). But it can probably go.

Martin

#15Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Thomas Munro (#11)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept. Unfortunatly it is already about 5 times slower than the
specialised version and i am not sure if it is worth going down that road.

Martin

Attachments:

0001-introduce-array_shuffle.patchtext/x-patch; charset=UTF-8; name=0001-introduce-array_shuffle.patchDownload+76-1
#16Thomas Munro
thomas.munro@gmail.com
In reply to: Martin Kalcher (#15)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept. Unfortunatly it is already about 5 times slower than the
specialised version and i am not sure if it is worth going down that road.

Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4. I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'. Or perhaps you can work out the padded size and
use that, to catch a few more types. Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#15)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?

I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept.

This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#16)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Thomas Munro <thomas.munro@gmail.com> writes:

Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4.

I find it impossible to believe that it's worth micro-optimizing
shuffle() to that extent. Now, maybe doing something in that line
in deconstruct_array and construct_array would be worth our time,
as that'd benefit a pretty wide group of functions.

regards, tom lane

#19Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#17)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 18.07.22 um 00:46 schrieb Tom Lane:

This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)

regards, tom lane

deconstruct_array() would destroy the arrays dimensions. I would expect
that shuffle() only shuffles the first dimension and keeps the inner
arrays intact.

Martin

#20Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Thomas Munro (#16)
hackersgeneral
Re: Proposal to introduce a shuffle function to intarray extension

Am 18.07.22 um 00:37 schrieb Thomas Munro:

Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4. I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'. Or perhaps you can work out the padded size and
use that, to catch a few more types. Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.

I made another experimental patch with fast tracks for typelen4 and
typelen8. alignments are not yet considered.

Attachments:

0001-introduce-array_shuffle.patchtext/x-patch; charset=UTF-8; name=0001-introduce-array_shuffle.patchDownload+181-1
#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#19)
hackersgeneral
#22Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#21)
hackersgeneral
#23Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#21)
hackersgeneral
#24John Naylor
john.naylor@enterprisedb.com
In reply to: Martin Kalcher (#23)
hackersgeneral
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#24)
hackersgeneral
#26Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#25)
hackersgeneral
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#26)
hackersgeneral
#28Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#27)
hackersgeneral
#29Robert Haas
robertmhaas@gmail.com
In reply to: Martin Kalcher (#26)
hackersgeneral
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#29)
hackersgeneral
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#30)
hackersgeneral
#32Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#31)
hackersgeneral
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#32)
hackersgeneral
#34David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#33)
hackersgeneral
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#34)
hackersgeneral
#36Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#33)
hackersgeneral
#37Thomas Munro
thomas.munro@gmail.com
In reply to: Martin Kalcher (#28)
hackersgeneral
#38Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Martin Kalcher (#36)
hackersgeneral
#39Aleksander Alekseev
aleksander@timescale.com
In reply to: Martin Kalcher (#38)
hackersgeneral
#40Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Martin Kalcher (#32)
hackersgeneral
#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#35)
hackersgeneral
#42Andrew Dunstan
andrew@dunslane.net
In reply to: Martin Kalcher (#40)
hackersgeneral
#43Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#42)
hackersgeneral
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#43)
hackersgeneral
#45Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#44)
hackersgeneral
#46Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Martin Kalcher (#45)
hackersgeneral
#47Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Dean Rasheed (#46)
hackersgeneral
#48Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Martin Kalcher (#47)
hackersgeneral
#49Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Dean Rasheed (#48)
hackersgeneral
#50Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Dean Rasheed (#46)
hackersgeneral
#51Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Martin Kalcher (#49)
hackersgeneral
#52Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Dean Rasheed (#51)
hackersgeneral
#53Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Martin Kalcher (#52)
hackersgeneral
#54Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#44)
hackersgeneral
#55Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Martin Kalcher (#52)
hackersgeneral
#56Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Martin Kalcher (#55)
hackersgeneral
#57Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Fabien COELHO (#56)
hackersgeneral
#58Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Martin Kalcher (#57)
hackersgeneral
#59Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Fabien COELHO (#58)
hackersgeneral
#60Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Martin Kalcher (#59)
hackersgeneral
#61Andres Freund
andres@anarazel.de
In reply to: Martin Kalcher (#60)
hackersgeneral
#62Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Andres Freund (#61)
hackersgeneral
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#62)
hackersgeneral
#64Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#63)
hackersgeneral
#65Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Martin Kalcher (#64)
hackersgeneral
#66Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#65)
hackersgeneral
#67Martin Kalcher
martin.kalcher@aboutsource.net
In reply to: Tom Lane (#66)
hackersgeneral
#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martin Kalcher (#67)
hackersgeneral
#69Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Tom Lane (#68)
hackersgeneral
#70Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Gregory Stark (as CFM) (#69)
hackersgeneral
#71Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#68)
hackersgeneral
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#71)
hackersgeneral
#73Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#72)
hackersgeneral
#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#73)
hackersgeneral
#75Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#74)
hackersgeneral
#76Salek Talangi
salek.talangi@googlemail.com
In reply to: Daniel Gustafsson (#75)
hackersgeneral