Proposal to introduce a shuffle function to intarray extension

Started by Martin Kalcherover 3 years ago76 messages

martin.kalcher@aboutsource.net

over 3 years ago

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot be
addressed by the extension and are rather slow with plain SQL. Both can
be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering the
members by random() and aggregating them again.

create table numbers (arr int[]);

insert into numbers (arr)
select array_agg(i)
from generate_series(1, 4000000) i;

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select array_agg(n order by random()) arr
from (
select unnest(arr) n from numbers
) plain
) shuffled;

---------------------------------------------------------
{2717290,3093757,2426384} ... {3011871,1402540,1613647}

Time: 2348.961 ms (00:02.349)

I wrote a small extension (see source code below) to see how much we can
gain, when the shuffling is implemented in C and the results speak for
themselves:

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select shuffle(arr) arr from numbers
) shuffled;

----------------------------------------------------
{1313971,3593627,86630} ... {50764,430410,3901128}

Time: 132.151 ms

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Best regards,
Martin Kalcher

Source code of extension mentioned above:

#include "postgres.h"
#include "port.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(shuffle);

void _shuffle(int32 *a, int len);

Datum
shuffle(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P_COPY(0);

int len;

if (array_contains_nulls(a))
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("array must not contain nulls")));

len = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));

if (len > 1)
_shuffle((int32 *) ARR_DATA_PTR(a), len);

PG_RETURN_POINTER(a);
}

void
_shuffle(int32 *a, int len) {
int i, j;
int32 tmp;

for (i = len - 1; i > 0; i--) {
j = random() % (i + 1);
tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
}

Mladen Gogala

gogala.mladen@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#1)

Re: Proposal to introduce a shuffle function to intarray extension

On 7/15/22 04:36, Martin Kalcher wrote:

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot
be addressed by the extension and are rather slow with plain SQL. Both
can be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering
the members by random() and aggregating them again.

Martin, have you considered PL/Python and NumPy module?

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Mladen Gogala (#2)

Re: Proposal to introduce a shuffle function to intarray extension

Am 16.07.22 um 18:53 schrieb Mladen Gogala:

On 7/15/22 04:36, Martin Kalcher wrote:

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot
be addressed by the extension and are rather slow with plain SQL. Both
can be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering
the members by random() and aggregating them again.

Martin, have you considered PL/Python and NumPy module?

Hey Mladen,

thank you for your advice. Unfortunately the performance of shuffling
with NumPy is about the same as with SQL.

create function numpy_shuffle(arr int[])
returns int[]
as $$
import numpy
numpy.random.shuffle(arr)
return arr
$$ language 'plpython3u';

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select numpy_shuffle(arr) arr from numbers
) shuffled;

-------------------------------------------------------
{674026,3306457,1727170} ... {343875,3825484,1235246}

Time: 2315.431 ms (00:02.315)

Am i doing something wrong?

Martin

Mladen Gogala

gogala.mladen@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#3)

Re: Proposal to introduce a shuffle function to intarray extension

On 7/16/22 16:21, Martin Kalcher wrote:

Hey Mladen,

thank you for your advice. Unfortunately the performance of shuffling
with NumPy is about the same as with SQL.

create function numpy_shuffle(arr int[])
returns int[]
as $$
    import numpy
    numpy.random.shuffle(arr)
    return arr
$$ language 'plpython3u';

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
    select numpy_shuffle(arr) arr from numbers
) shuffled;

-------------------------------------------------------
   {674026,3306457,1727170} ... {343875,3825484,1235246}

Time: 2315.431 ms (00:02.315)

Am i doing something wrong?

Martin

Hi Martin,

No, you're doing everything right. I have no solution for you. You may
need to do some C programming or throw a stronger hardware at the
problem. The performance of your processors may be the problem. Good luck!

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#1)

Re: Proposal to introduce a shuffle function to intarray extension

On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Hi Martin,

I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code. For the archives, Martin's initial message to -general
was:

/messages/by-id/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net

The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module. The latter seems like a
reasonable thing to want to me. The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.

https://wiki.postgresql.org/wiki/Submitting_a_Patch

Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function. You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least). Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml. I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Thomas Munro (#5)

1 attachment(s)

Re: Proposal to introduce a shuffle function to intarray extension

Am 16.07.22 um 23:56 schrieb Thomas Munro:

On Fri, Jul 15, 2022 at 8:36 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Hi Martin,

I'm redirecting this to the pgsql-hackers@ mailing list, where we talk
about code. For the archives, Martin's initial message to -general
was:

/messages/by-id/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net

The first question is whether such a thing belongs in an external
extension, or in the contrib/intarray module. The latter seems like a
reasonable thing to want to me. The first step towards that will be
to get your code into .patch format, as in git format-patch or git
diff, that can be applied to the master branch.

https://wiki.postgresql.org/wiki/Submitting_a_Patch

Some initial feedback from me: I'd recommend adding a couple of
comments to the code, like the algorithm name for someone who wants to
read more about it (I think it's a Fisher-Yates shuffle?). You'll
need to have the contrib/intarrays/intarray--1.5--1.6.sql file that
creates the function. You might want to add something to
control/intarray/sql/_int.sql that invokes the function when you run
make check in there (but doesn't display the results, since that'd be
unstable across machines?), just to have 'code coverage' (I mean, it'd
prove it doesn't crash at least). Once details are settled, you'd
also want to add documentation in doc/src/sgml/intarray.sgml. I
understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

Hello Thomas,

Thank you for pointing me towards the correct list.

I do not feel qualified to answer the question wether this belongs in an
external extension or in contrib/intarray. That reason i would like to
see it in contrib/intarray is, that it is lot easier for me to get our
operations team to upgrade our database system, because of new features
we need, than to get them to install a self-maintained extension on our
database servers.

Thank you for your feedback. I tried to address all your points and made
a first patch. Some points are still open:

- Documentation is postponed until further feedback.

- I am not shure about the naming. intarray has generic names like
sort() and uniq() and specialised names like icount(). I guess in case
someone wants to have a general array shuffle it could be accomplished
with function overloading. Or am i wrong here?

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().

Martin

Attachments:

0001-Introduce-shuffle-and-sample-to-intarray-extension.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-shuffle-and-sample-to-intarray-extension.patchDownload

From 2c6c28755416556597188c323604ec975faffa25 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 01:53:05 +0200
Subject: [PATCH] Introduce shuffle() and sample() to intarray extension

* shuffle(arr) returns an array with the elements of the provided array
  in random order
* sample(arr, n) returns an array with n randomly-chosen elements from
  the provided array
---
 contrib/intarray/Makefile               |  4 +-
 contrib/intarray/_int.h                 |  2 +
 contrib/intarray/_int_op.c              | 35 +++++++++++++++
 contrib/intarray/_int_tool.c            | 57 +++++++++++++++++++++++++
 contrib/intarray/expected/_int.out      | 36 ++++++++++++++++
 contrib/intarray/intarray--1.5--1.6.sql | 16 +++++++
 contrib/intarray/intarray.control       |  2 +-
 contrib/intarray/sql/_int.sql           |  6 +++
 8 files changed, 155 insertions(+), 3 deletions(-)
 create mode 100644 contrib/intarray/intarray--1.5--1.6.sql

diff --git a/contrib/intarray/Makefile b/contrib/intarray/Makefile
index 3817c1669a..1a79c359ed 100644
--- a/contrib/intarray/Makefile
+++ b/contrib/intarray/Makefile
@@ -12,8 +12,8 @@ OBJS = \
 	_intbig_gist.o
 
 EXTENSION = intarray
-DATA = intarray--1.4--1.5.sql intarray--1.3--1.4.sql intarray--1.2--1.3.sql \
-	intarray--1.2.sql intarray--1.1--1.2.sql \
+DATA = intarray--1.5--1.6.sql intarray--1.4--1.5.sql intarray--1.3--1.4.sql \
+	intarray--1.2--1.3.sql intarray--1.2.sql intarray--1.1--1.2.sql \
 	intarray--1.0--1.1.sql
 PGFILEDESC = "intarray - functions and operators for arrays of integers"
 
diff --git a/contrib/intarray/_int.h b/contrib/intarray/_int.h
index 304c29525c..eaa7a09ae8 100644
--- a/contrib/intarray/_int.h
+++ b/contrib/intarray/_int.h
@@ -123,6 +123,8 @@ bool		inner_int_overlap(ArrayType *a, ArrayType *b);
 bool		inner_int_contains(ArrayType *a, ArrayType *b);
 ArrayType  *inner_int_union(ArrayType *a, ArrayType *b);
 ArrayType  *inner_int_inter(ArrayType *a, ArrayType *b);
+void		inner_int_shuffle(ArrayType *a);
+ArrayType  *inner_int_sample(ArrayType *a, int n);
 void		rt__int_size(ArrayType *a, float *size);
 void		gensign(BITVECP sign, int *a, int len, int siglen);
 
diff --git a/contrib/intarray/_int_op.c b/contrib/intarray/_int_op.c
index 5b164f6788..8c4edf6ddf 100644
--- a/contrib/intarray/_int_op.c
+++ b/contrib/intarray/_int_op.c
@@ -167,6 +167,8 @@ PG_FUNCTION_INFO_V1(sort);
 PG_FUNCTION_INFO_V1(sort_asc);
 PG_FUNCTION_INFO_V1(sort_desc);
 PG_FUNCTION_INFO_V1(uniq);
+PG_FUNCTION_INFO_V1(shuffle);
+PG_FUNCTION_INFO_V1(sample);
 PG_FUNCTION_INFO_V1(idx);
 PG_FUNCTION_INFO_V1(subarray);
 PG_FUNCTION_INFO_V1(intarray_push_elem);
@@ -255,6 +257,39 @@ uniq(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(a);
 }
 
+Datum
+shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *a = PG_GETARG_ARRAYTYPE_P_COPY(0);
+
+	CHECKARRVALID(a);
+	inner_int_shuffle(a);
+	PG_RETURN_POINTER(a);
+}
+
+Datum
+sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *a = PG_GETARG_ARRAYTYPE_P_COPY(0);
+	ArrayType  *r;
+
+	int	n = PG_GETARG_INT32(1);
+
+	CHECKARRVALID(a);
+
+	if (n >= ARRNELEMS(a))
+		PG_RETURN_POINTER(a);
+
+	if (n < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	r = inner_int_sample(a, n);
+	PG_FREE_IF_COPY(a, 0);
+	PG_RETURN_POINTER(r);
+}
+
 Datum
 idx(PG_FUNCTION_ARGS)
 {
diff --git a/contrib/intarray/_int_tool.c b/contrib/intarray/_int_tool.c
index 8ed4d63fc3..b26c910deb 100644
--- a/contrib/intarray/_int_tool.c
+++ b/contrib/intarray/_int_tool.c
@@ -179,6 +179,63 @@ inner_int_inter(ArrayType *a, ArrayType *b)
 		return resize_intArrayType(r, k);
 }
 
+void
+inner_int_shuffle(ArrayType *a)
+{
+	int *d = ARRPTR(a);
+	int n = ARRNELEMS(a);
+	int i, tmp;
+
+	/*
+	 * Shuffling is done using the Fisher-Yates shuffle algorithm:
+	 * Iterate through the array and swap the current head with a
+	 * randomly-chosen element from behind the current head (possibly
+	 * the head itself).
+	 */
+	while (n > 1)
+	{
+		i = random() % n;
+		tmp = d[i];
+		d[i] = *d;
+		*d = tmp;
+		d++;
+		n--;
+	}
+}
+
+ArrayType *
+inner_int_sample(ArrayType *a, int n)
+{
+	ArrayType *r;
+	int *da,
+		*dr;
+	int na = ARRNELEMS(a);
+
+	n = Min(na, n);
+	r = new_intArrayType(n);
+	da = ARRPTR(a),
+	dr = ARRPTR(r);
+
+	/*
+	 * Choosing samples is done using a variant of the "inside-out"
+	 * Fisher-Yates shuffle algorithm:
+	 * Iterate through the source array, move a randomly-chosen
+	 * element from behind the current head (or the head itself) to
+	 * the destination array and replace it with the current head.
+	 */
+	while (n > 0)
+	{
+		int i = random() % na;
+		*dr = da[i];
+		da[i] = *da;
+		dr++;
+		n--;
+		da++;
+		na--;
+	}
+	return r;
+}
+
 void
 rt__int_size(ArrayType *a, float *size)
 {
diff --git a/contrib/intarray/expected/_int.out b/contrib/intarray/expected/_int.out
index a09d40efa1..cffa683ffe 100644
--- a/contrib/intarray/expected/_int.out
+++ b/contrib/intarray/expected/_int.out
@@ -85,6 +85,42 @@ SELECT subarray('{1234234,-30,-30,234234,-30}',0,-1);
  {1234234,-30,-30,234234}
 (1 row)
 
+SELECT sort(shuffle('{1234234,-30,-30,234234,-30}'));
+             sort             
+------------------------------
+ {-30,-30,-30,234234,1234234}
+(1 row)
+
+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') != shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT icount(sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6));
+ icount 
+--------
+      6
+(1 row)
+
+SELECT icount(uniq(sort(sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6))));
+ icount 
+--------
+      6
+(1 row)
+
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) <@ '{1,2,3,4,5,6,7,8,9,10,11,12}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column? 
+----------
+ t
+(1 row)
+
 SELECT #'{1234234,234234}'::int[];
  ?column? 
 ----------
diff --git a/contrib/intarray/intarray--1.5--1.6.sql b/contrib/intarray/intarray--1.5--1.6.sql
new file mode 100644
index 0000000000..5d5ef8ee01
--- /dev/null
+++ b/contrib/intarray/intarray--1.5--1.6.sql
@@ -0,0 +1,16 @@
+/* contrib/intarray/intarray--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION intarray UPDATE TO '1.6'" to load this file. \quit
+
+-- Add shuffle functions
+
+CREATE FUNCTION shuffle(_int4)
+RETURNS _int4
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION sample(_int4, int4)
+RETURNS _int4
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
diff --git a/contrib/intarray/intarray.control b/contrib/intarray/intarray.control
index c3ff753e2c..6ae8881f15 100644
--- a/contrib/intarray/intarray.control
+++ b/contrib/intarray/intarray.control
@@ -1,6 +1,6 @@
 # intarray extension
 comment = 'functions, operators, and index support for 1-D arrays of integers'
-default_version = '1.5'
+default_version = '1.6'
 module_pathname = '$libdir/_int'
 relocatable = true
 trusted = true
diff --git a/contrib/intarray/sql/_int.sql b/contrib/intarray/sql/_int.sql
index b26fc57e4d..3eb15d5c08 100644
--- a/contrib/intarray/sql/_int.sql
+++ b/contrib/intarray/sql/_int.sql
@@ -18,6 +18,12 @@ SELECT idx('{1234234,-30,-30,234234,-30}',-30);
 SELECT subarray('{1234234,-30,-30,234234,-30}',2,3);
 SELECT subarray('{1234234,-30,-30,234234,-30}',-1,1);
 SELECT subarray('{1234234,-30,-30,234234,-30}',0,-1);
+SELECT sort(shuffle('{1234234,-30,-30,234234,-30}'));
+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') != shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+SELECT icount(sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6));
+SELECT icount(uniq(sort(sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6))));
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) <@ '{1,2,3,4,5,6,7,8,9,10,11,12}';
+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
 
 SELECT #'{1234234,234234}'::int[];
 SELECT '{123,623,445}'::int[] + 1245;
-- 
2.37.1

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#6)

Re: Proposal to introduce a shuffle function to intarray extension

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

Am 16.07.22 um 23:56 schrieb Thomas Munro:

I understand that this is a specialised int[] shuffle, but I wonder if
someone would ever want to have a general array shuffle, and how that
would work, in terms of naming convention etc.

- I am not shure about the naming. intarray has generic names like
sort() and uniq() and specialised names like icount(). I guess in case
someone wants to have a general array shuffle it could be accomplished
with function overloading. Or am i wrong here?

I suppose this is exactly the point Thomas was wondering about: if we
use a generic function name for this, will it cause problems for someone
trying to add a generic function later?

We can investigate that question with a couple of toy functions:

regression=# create function foo(int[]) returns text as 'select ''int[] version''' language sql;
CREATE FUNCTION
regression=# create function foo(anyarray) returns text as 'select ''anyarray version''' language sql;
CREATE FUNCTION
regression=# select foo('{1,2,3}');
ERROR: function foo(unknown) is not unique
LINE 1: select foo('{1,2,3}');
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.

OK, that's not too surprising: with an unknown input there's just not
anything that the parser can use to disambiguate. But this happens
with just about any overloaded name, so I don't think it's a showstopper.

regression=# select foo('{1,2,3}'::int[]);
foo
---------------
int[] version
(1 row)

regression=# select foo('{1,2,3}'::int8[]);
foo
------------------
anyarray version
(1 row)

Good, that's more or less the minimum functionality we should expect.

regression=# select foo('{1,2,3}'::int2[]);
ERROR: function foo(smallint[]) is not unique
LINE 1: select foo('{1,2,3}'::int2[]);
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.

Oh, that's slightly annoying ...

regression=# select foo('{1,2,3}'::oid[]);
foo
------------------
anyarray version
(1 row)

regression=# select foo('{1,2,3}'::text[]);
foo
------------------
anyarray version
(1 row)

regression=# select foo('{1,2,3}'::float8[]);
foo
------------------
anyarray version
(1 row)

I couldn't readily find any case that misbehaves except for int2[].
You can force that to work, at least in one direction:

regression=# select foo('{1,2,3}'::int2[]::int[]);
foo
---------------
int[] version
(1 row)

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version. That's
clearly the easiest-to-remember definition, and it doesn't seem
like the gotchas are bad enough to justify using separate names.

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted.

I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...

regards, tom lane

David G. Johnston

david.g.johnston@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#6)

Re: Proposal to introduce a shuffle function to intarray extension

On Sat, Jul 16, 2022 at 7:25 PM Martin Kalcher <
martin.kalcher@aboutsource.net> wrote:

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted. The
important one is shuffle().

 +SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column?
+----------
+ t
+(1 row)
+

While small, there is a non-zero chance for both samples to be equal. This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.

Same goes for:

+SELECT shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}') !=
shuffle('{1,2,3,4,5,6,7,8,9,10,11,12}');
+ ?column?
+----------
+ t
+(1 row)
+

David J.

David G. Johnston

david.g.johnston@gmail.com

over 3 years ago

In reply to: Tom Lane (#7)

Re: Proposal to introduce a shuffle function to intarray extension

On Sat, Jul 16, 2022 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

- I added a second function sample(), because it is a lot faster to take
some elements from an array than to shuffle the whole array and
slice it. This function can be removed if it is not wanted.

I have no opinion about whether this one is valuable enough to include in
intarray, but I do feel like sample() is a vague name, and easily confused
with marginally-related operations like TABLESAMPLE. Can we think of a
more on-point name? Something like "random_subset" would be pretty
clear, but it's also clunky. It's too late here for me to think of
le mot juste...

choose(input anyarray, size integer, with_replacement boolean default
false, algorithm text default 'default')?

David J.

#10

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Tom Lane (#7)

Re: Proposal to introduce a shuffle function to intarray extension

I wrote:

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

regards, tom lane

#11

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#10)

Re: Proposal to introduce a shuffle function to intarray extension

On Sun, Jul 17, 2022 at 3:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

On the whole, I'd vote for calling it shuffle(), and expecting that
we'd also use that name for any future generic version.

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch. He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1]https://lemire.me/blog/2016/06/30/fast-random-shuffling/. I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias. Anyway, file that under go-faster ideas for later.

[1]: https://lemire.me/blog/2016/06/30/fast-random-shuffling/

#12

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#10)

Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 05:37 schrieb Tom Lane:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

regards, tom lane

Hi Tom,

thank you for your thoughts. There are two reasons for choosing an
int4-only version. I am not familiar with postgres development (yet) and
i was not sure how open you are about such changes to core and if the
proposed feature is considered valuable enough to go into core. The
second reason was ease of implementation. The intarray extension does
not allow any NULL elements in arrays and treats multidimensional arrays
as though they were linear. Which makes the implementation straight
forward, because there are fewer cases to consider.

However, i will take a look at an implementation for anyarray in core.

Martin

#13

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Thomas Munro (#11)

Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 08:00 schrieb Thomas Munro:

I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch. He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1]. I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias. Anyway, file that under go-faster ideas for later.

[1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

Hi Thomas,

the small bias of random() % n is not a problem for my use case, but
might be for others. Its easily replaceable with

(int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1)

Unfortunately it is a bit slower (on my machine), but thats negligible.

Martin

#14

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: David G. Johnston (#8)

Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 05:32 schrieb David G. Johnston:

+SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column?
+----------
+ t
+(1 row)
+
While small, there is a non-zero chance for both samples to be equal. This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.

Hey David,

you are right. There is a small chance for this test to fail. I wanted
to test, that two invocations produce different results (after all the
main feature of the function). But it can probably go.

Martin

#15

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Thomas Munro (#11)

1 attachment(s)

Re: Proposal to introduce a shuffle function to intarray extension

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept. Unfortunatly it is already about 5 times slower than the
specialised version and i am not sure if it is worth going down that road.

Martin

Attachments:

0001-introduce-array_shuffle.patchtext/x-patch; charset=UTF-8; name=0001-introduce-array_shuffle.patchDownload

From 85cb90c79459132c4220b8d78586dffe6e590ba9 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] introduce array_shuffle()

---
 src/backend/utils/adt/arrayfuncs.c | 73 ++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat    |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..b6576ce178 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,76 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	char	   *adat,
+			   *rdat;
+	int		   nitems,
+			   i;
+	Oid 	   elemtype;
+	TypeCacheEntry typentry;
+	struct { char *dat; size_t size; } *refs;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	if (ARR_HASNULL(array))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("shuffling arrays with null elements is not supported")));
+
+	if (ARR_NDIM(array) > 1)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("shuffling of multidimensional arrays is not supported")));
+
+	elemtype = ARR_ELEMTYPE(array);
+	typentry = *lookup_type_cache(elemtype, 0);
+	nitems = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
+	refs = palloc(nitems * (sizeof(char *) + sizeof(size_t)));
+	adat = ARR_DATA_PTR(array);
+
+	/*
+	 * To allow shuffling of arrays with variable length elements, we are going
+	 * to shuffle references to the elements, because swapping variable length
+	 * elements in an array is complicated. In a second step we will iterate the
+	 * shuffled references and copy the elements to the destination array.
+	 * We are using the "inside-out" variant of the Fisher-Yates algorithm to
+	 * produce a shuffled array of references: Append each reference than swap it
+	 * with a radomly-chosen refence.
+	 */
+	for (i = 0; i < nitems; i++)
+	{
+		int j = random() % (i + 1);
+		char *next = att_addlength_pointer(adat, typentry.typlen, adat);
+		next = (char *) att_align_nominal(next, typentry.typalign);
+		refs[i] = refs[j];
+		refs[j].dat = adat;
+		refs[j].size = next - adat;
+		adat = next;
+	}
+
+	result = create_array_envelope(ARR_NDIM(array),
+								   ARR_DIMS(array),
+								   ARR_LBOUND(array),
+								   ARR_SIZE(array),
+								   ARR_ELEMTYPE(array),
+								   array->dataoffset);
+
+	rdat = ARR_DATA_PTR(result);
+
+	/* Collect all references and copy elements to the destination array */
+	for (i = 0; i < nitems; i++)
+	{
+		memcpy(rdat, refs[i].dat, refs[i].size);
+		rdat += refs[i].size;
+	}
+
+	pfree(refs);
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..56aff551d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,9 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '7777', descr => 'shuffle array',
+  proname => 'array_shuffle', proisstrict => 'f', prorettype => 'anyarray',
+  proargtypes => 'anyarray', prosrc => 'array_shuffle' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
-- 
2.37.1

#16

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#15)

Re: Proposal to introduce a shuffle function to intarray extension

On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

Yeah, that seems like a good direction. If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type. Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.

I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept. Unfortunatly it is already about 5 times slower than the
specialised version and i am not sure if it is worth going down that road.

Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4. I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'. Or perhaps you can work out the padded size and
use that, to catch a few more types. Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.

#17

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#15)

Re: Proposal to introduce a shuffle function to intarray extension

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?

I played around with the idea of an anyarray shuffle(). The hard part
was to deal with arrays with variable length elements, as they can not
be swapped easily in place. I solved it by creating an intermediate
array of references to the elements. I'll attach a patch with the proof
of concept.

This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)

regards, tom lane

#18

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#16)

Re: Proposal to introduce a shuffle function to intarray extension

Thomas Munro <thomas.munro@gmail.com> writes:

Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4.

I find it impossible to believe that it's worth micro-optimizing
shuffle() to that extent. Now, maybe doing something in that line
in deconstruct_array and construct_array would be worth our time,
as that'd benefit a pretty wide group of functions.

regards, tom lane

#19

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#17)

Re: Proposal to introduce a shuffle function to intarray extension

Am 18.07.22 um 00:46 schrieb Tom Lane:

This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)

regards, tom lane

deconstruct_array() would destroy the arrays dimensions. I would expect
that shuffle() only shuffles the first dimension and keeps the inner
arrays intact.

Martin

#20

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Thomas Munro (#16)

1 attachment(s)

Re: Proposal to introduce a shuffle function to intarray extension

Am 18.07.22 um 00:37 schrieb Thomas Munro:

Seems OK for a worst case. It must still be a lot faster than doing
it in SQL. Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4. I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'. Or perhaps you can work out the padded size and
use that, to catch a few more types. Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.

I made another experimental patch with fast tracks for typelen4 and
typelen8. alignments are not yet considered.

Attachments:

0001-introduce-array_shuffle.patchtext/x-patch; charset=UTF-8; name=0001-introduce-array_shuffle.patchDownload

From 37269598a1dace99c9059514e0fdcb90b9240ed3 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] introduce array_shuffle()

---
 src/backend/utils/adt/arrayfuncs.c | 178 +++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat    |   3 +
 2 files changed, 181 insertions(+)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..919cb2bfd6 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,181 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+static ArrayType *
+array_shuffle_4(ArrayType *array)
+{
+	ArrayType	*result;
+	int32		*adat,
+				*rdat;
+	int			i,
+				j,
+				nitems;
+
+	result = create_array_envelope(ARR_NDIM(array),
+								   ARR_DIMS(array),
+								   ARR_LBOUND(array),
+								   ARR_SIZE(array),
+								   ARR_ELEMTYPE(array),
+								   array->dataoffset);
+
+	adat = (int32 *) ARR_DATA_PTR(array);
+	rdat = (int32 *) ARR_DATA_PTR(result);
+	nitems = ARR_DIMS(array)[0];
+
+	for (i = 0; i < nitems; i++)
+	{
+		j = random() % (i + 1);
+		rdat[i] = rdat[j];
+		rdat[j] = adat[i];
+	}
+	return result;
+}
+
+static ArrayType *
+array_shuffle_8(ArrayType *array)
+{
+	ArrayType	*result;
+	int64		*adat,
+				*rdat;
+	int			i,
+				j,
+				nitems;
+
+	result = create_array_envelope(ARR_NDIM(array),
+								   ARR_DIMS(array),
+								   ARR_LBOUND(array),
+								   ARR_SIZE(array),
+								   ARR_ELEMTYPE(array),
+								   array->dataoffset);
+
+	adat = (int64 *) ARR_DATA_PTR(array);
+	rdat = (int64 *) ARR_DATA_PTR(result);
+	nitems = ARR_DIMS(array)[0];
+
+	for (i = 0; i < nitems; i++)
+	{
+		j = random() % (i + 1);
+		rdat[i] = rdat[j];
+		rdat[j] = adat[i];
+	}
+	return result;
+}
+
+static ArrayType *
+array_shuffle_generic(ArrayType *array,
+					  int16 elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType	*result;
+	char		*adat,
+				*rdat;
+	int 		i,
+				ndim,
+				*dims,
+				nitems,
+				nelms_per_item;
+
+	struct {
+		char *dat;
+		size_t size;
+	} *refs;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+
+	nelms_per_item = 1;
+
+	for (i = 1; i < ndim; i++)
+		nelms_per_item *= dims[i];
+
+	nitems = dims[0];
+
+	refs = palloc(nitems * (sizeof(char *) + sizeof(size_t)));
+	adat = ARR_DATA_PTR(array);
+
+	/*
+	 * To allow shuffling of arrays with variable length elements, we are going
+	 * to shuffle references to the elements, because swapping variable length
+	 * elements in an array is complicated. In a second step we will iterate the
+	 * shuffled references and copy the elements to the destination array.
+	 * We are using the "inside-out" variant of the Fisher-Yates algorithm to
+	 * produce a shuffled array of references: Append each reference than swap it
+	 * with a radomly-chosen refence.
+	 */
+	for (i = 0; i < nitems; i++)
+	{
+		char *next = array_seek(adat, 0, NULL, nelms_per_item,
+							    elmlen, elmbyval, elmalign);
+		int j = random() % (i + 1);
+		refs[i] = refs[j];
+		refs[j].dat = adat;
+		refs[j].size = next - adat;
+		adat = next;
+	}
+
+	result = create_array_envelope(ARR_NDIM(array),
+								   ARR_DIMS(array),
+								   ARR_LBOUND(array),
+								   ARR_SIZE(array),
+								   ARR_ELEMTYPE(array),
+								   array->dataoffset);
+
+	rdat = ARR_DATA_PTR(result);
+
+	/* Collect all references and copy elements to the destination array */
+	for (i = 0; i < nitems; i++)
+	{
+		memcpy(rdat, refs[i].dat, refs[i].size);
+		rdat += refs[i].size;
+	}
+
+	pfree(refs);
+
+	return result;
+}
+
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType		*array,
+					*result;
+	int16			elmlen;
+	bool			elmbyval;
+	char			elmalign;
+	Oid             elmtyp;
+	TypeCacheEntry	*typentry;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	if (ARR_HASNULL(array))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("shuffling arrays with null elements is not supported")));
+
+	if (ARR_NDIM(array) < 1 || ARR_DIMS(array)[0] < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	if (ARR_NDIM(array) == 1 && elmlen == 4)
+		result = array_shuffle_4(array);
+	else if (ARR_NDIM(array) == 1 && elmlen == 8)
+		result = array_shuffle_8(array);
+	else
+		result = array_shuffle_generic(array, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..56aff551d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,9 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '7777', descr => 'shuffle array',
+  proname => 'array_shuffle', proisstrict => 'f', prorettype => 'anyarray',
+  proargtypes => 'anyarray', prosrc => 'array_shuffle' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
-- 
2.37.1

#21

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#19)

Re: Proposal to introduce a shuffle function to intarray extension

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

Am 18.07.22 um 00:46 schrieb Tom Lane:

This does not look particularly idiomatic, or even type-safe. What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality. Not sure if anyone would care.)

deconstruct_array() would destroy the arrays dimensions.

As I said, you can use construct_md_array if you want to preserve the
array shape.

I would expect that shuffle() only shuffles the first dimension and
keeps the inner arrays intact.

This argument is based on a false premise, ie that Postgres thinks
multidimensional arrays are arrays-of-arrays. They aren't, and
we're not going to start making them so by defining shuffle()
at variance with every other array-manipulating function. Shuffling
the individual elements regardless of array shape is the definition
that's consistent with our existing functionality.

(Having said that, even if we were going to implement it with that
definition, I should think that it'd be easiest to do so on the
array-of-Datums representation produced by deconstruct_array.
That way you don't need to do different things for different element
types.)

regards, tom lane

#22

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#21)

Re: Proposal to introduce a shuffle function to intarray extension

Am 18.07.22 um 01:20 schrieb Tom Lane:

I would expect that shuffle() only shuffles the first dimension and
keeps the inner arrays intact.

This argument is based on a false premise, ie that Postgres thinks
multidimensional arrays are arrays-of-arrays. They aren't, and
we're not going to start making them so by defining shuffle()
at variance with every other array-manipulating function. Shuffling
the individual elements regardless of array shape is the definition
that's consistent with our existing functionality.

Hey Tom,

thank you for clarification. I did not know that. I will make a patch
that is using deconstruct_array().

Martin

#23

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#21)

1 attachment(s)

Re: Proposal to introduce a shuffle function to intarray extension

Am 18.07.22 um 01:20 schrieb Tom Lane:

(Having said that, even if we were going to implement it with that
definition, I should think that it'd be easiest to do so on the
array-of-Datums representation produced by deconstruct_array.
That way you don't need to do different things for different element
types.)

Thank you Tom, here is a patch utilising deconstruct_array(). If we
agree, that this is the way to go, i would like to add array_sample()
(good name?), some test cases, and documentation.

One more question. How do i pick a Oid for the functions?

Martin

Attachments:

0001-introduce-array_shuffle.patchtext/x-patch; charset=UTF-8; name=0001-introduce-array_shuffle.patchDownload

From baec08168357098287342c92672ef97361a91371 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] introduce array_shuffle()

---
 src/backend/utils/adt/arrayfuncs.c | 61 ++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat    |  3 ++
 2 files changed, 64 insertions(+)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..e185c1ef74 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,64 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType		*array,
+					*result;
+	int16			elmlen;
+	bool			elmbyval;
+	char			elmalign;
+	Oid             elmtyp;
+	TypeCacheEntry	*typentry;
+	Datum			*elms,
+					elm;
+	bool			*nuls,
+					nul;
+	int         	nelms,
+					i,
+					j;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	if (ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array)) < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelms);
+
+	for (i = nelms - 1; i > 0; i--)
+	{
+		j = random() % (i + 1);
+		elm = elms[i];
+		nul = nuls[i];
+		elms[i] = elms[j];
+		nuls[i] = nuls[j];
+		elms[j] = elm;
+		nuls[j] = nul;
+	}
+
+	result = construct_md_array(elms, nuls,
+								ARR_NDIM(array), ARR_DIMS(array), ARR_LBOUND(array),
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..56aff551d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,9 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '7777', descr => 'shuffle array',
+  proname => 'array_shuffle', proisstrict => 'f', prorettype => 'anyarray',
+  proargtypes => 'anyarray', prosrc => 'array_shuffle' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
-- 
2.37.1

#24

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Martin Kalcher (#23)

Re: Proposal to introduce a shuffle function to intarray extension

On Mon, Jul 18, 2022 at 2:47 PM Martin Kalcher <
martin.kalcher@aboutsource.net> wrote:

One more question. How do i pick a Oid for the functions?

For this, we recommend running src/include/catalog/unused_oids, and it will
give you a random range to pick from. That reduces the chance of different
patches conflicting with each other. It doesn't really matter what the oid
here is, since at feature freeze a committer will change them anyway.

--
John Naylor
EDB: http://www.enterprisedb.com

#25

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: John Naylor (#24)

Re: Proposal to introduce a shuffle function to intarray extension

John Naylor <john.naylor@enterprisedb.com> writes:

On Mon, Jul 18, 2022 at 2:47 PM Martin Kalcher <
martin.kalcher@aboutsource.net> wrote:

One more question. How do i pick a Oid for the functions?

For this, we recommend running src/include/catalog/unused_oids, and it will
give you a random range to pick from. That reduces the chance of different
patches conflicting with each other. It doesn't really matter what the oid
here is, since at feature freeze a committer will change them anyway.

If you want the nitty gritty details here, see

https://www.postgresql.org/docs/devel/system-catalog-initial-data.html#SYSTEM-CATALOG-OID-ASSIGNMENT

regards, tom lane

#26

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#25)

1 attachment(s)

[PATCH] Introduce array_shuffle() and array_sample()

Thanks for all your feedback and help. I got a patch that i consider
ready for review. It introduces two new functions:

array_shuffle(anyarray) -> anyarray
array_sample(anyarray, integer) -> anyarray

array_shuffle() shuffles an array (obviously). array_sample() picks n
random elements from an array.

Is someone interested in looking at it? What are the next steps?

Martin

Attachments:

0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-array_shuffle-and-array_sample.patchDownload

From 5498bb2d9f1fab4cad56cd0d3a6eeafa24a26c8e Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles to elements of an array.
* array_sample() chooses n elements from an array by random.

Shuffling of arrays can already be accomplished with SQL
using unnest() and array_agg(order by random()). But that is
very slow compared to the new functions.
---
 doc/src/sgml/func.sgml               |  34 ++++++
 src/backend/utils/adt/arrayfuncs.c   | 163 +++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat      |   6 +
 src/test/regress/expected/arrays.out |  30 +++++
 src/test/regress/sql/arrays.sql      |  11 ++
 5 files changed, 244 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..c676031b4a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,40 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter>.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[1,2,3,4,5,6], 3)</literal>
+        <returnvalue>{4,5,1}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the elements of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[1,2,3,4,5,6])</literal>
+        <returnvalue>{4,5,1,3,2,6}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..a6769a8ebf 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,166 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	Datum	   *elms,
+				elm;
+	bool	   *nuls,
+				nul;
+	int			nelms,
+				i,
+				j;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	nelms = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
+
+	/* There is no point in shuffling arrays with less then two elements. */
+	if (nelms < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelms);
+
+	/* Shuffle elements and nulls using Fisher-Yates shuffle algorithm. */
+	for (i = nelms - 1; i > 0; i--)
+	{
+		j = random() % (i + 1);
+		elm = elms[i];
+		nul = nuls[i];
+		elms[i] = elms[j];
+		nuls[i] = nuls[j];
+		elms[j] = elm;
+		nuls[j] = nul;
+	}
+
+	result = construct_md_array(elms, nuls,
+								ARR_NDIM(array), ARR_DIMS(array), ARR_LBOUND(array),
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	Datum	   *elms,
+				elm;
+	bool	   *nuls,
+				nul;
+	int			nelms,
+				rnelms,
+				rdims[1],
+				rlbs[1],
+				i,
+				j;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	nelms = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
+	elmtyp = ARR_ELEMTYPE(array);
+	rnelms = PG_GETARG_INT32(1);
+
+	if (rnelms < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	/* Return an empty array if the requested sample size is zero. */
+	if (rnelms == 0)
+	{
+		PG_FREE_IF_COPY(array, 0);
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtyp));
+	}
+
+	/*
+	 * Return the original array if the requested sample size is greater than
+	 * or equal to its own size.
+	 */
+	if (rnelms >= nelms)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelms);
+
+	/*
+	 * Move rnelms (requested sample size) randomly-chosen elements to the
+	 * beginning of the array, using the Fisher-Yates shuffle algorithm.
+	 */
+	for (i = 0; i < rnelms; i++)
+	{
+		/* random j such that i <= j < nelms */
+		j = (random() % (nelms - i)) + i;
+		elm = elms[i];
+		nul = nuls[i];
+		elms[i] = elms[j];
+		nuls[i] = nuls[j];
+		elms[j] = elm;
+		nuls[j] = nul;
+	}
+
+	/*
+	 * Construct a new array from the first rnelms (requested sample size)
+	 * elements of the shuffled array. Even though we are constructing a
+	 * one-dimensional array, we cannot use construct_array(), because it does
+	 * not support NULL elements.
+	 */
+	rdims[0] = rnelms;
+	rlbs[0] = 1;
+
+	result = construct_md_array(elms, nuls, 1, rdims, rlbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..f51af5ee9f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', proisstrict => 't', prorettype => 'anyarray',
+  proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', proisstrict => 't', prorettype => 'anyarray',
+  proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..4994e6ac62 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,33 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT array_agg(a order by a)
+FROM
+(SELECT unnest(array_shuffle('{NULL,1,2,3,4,5}'::int[]))) v(a);
+    array_agg     
+------------------
+ {1,2,3,4,5,NULL}
+(1 row)
+
+-- array_sample
+SELECT cardinality(array_sample('{NULL,1,2,3,4,5}'::int[], 3));
+ cardinality 
+-------------
+           3
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3) <@ '{1,2,3,4,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 10);
+ array_sample 
+--------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  second parameter must not be negative
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..0234aafa06 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,14 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT array_agg(a order by a)
+FROM
+(SELECT unnest(array_shuffle('{NULL,1,2,3,4,5}'::int[]))) v(a);
+
+-- array_sample
+SELECT cardinality(array_sample('{NULL,1,2,3,4,5}'::int[], 3));
+SELECT array_sample('{1,2,3,4,5}'::int[], 3) <@ '{1,2,3,4,5}'::int[];
+SELECT array_sample('{1,2,3,4,5}'::int[], 10);
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
-- 
2.37.1

#27

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#26)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

Is someone interested in looking at it? What are the next steps?

The preferred thing to do is to add it to our "commitfest" queue,
which will ensure that it gets looked at eventually. The currently
open cycle is 2022-09 [1]https://commitfest.postgresql.org/39/ (see the "New Patch" button there).

I believe you have to have signed up as a community member[2]https://www.postgresql.org/account/login/
in order to put yourself in as the patch author. I think "New Patch"
will work anyway, but it's better to have an author listed.

regards, tom lane

[1]: https://commitfest.postgresql.org/39/
[2]: https://www.postgresql.org/account/login/

#28

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#27)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 18.07.22 um 21:29 schrieb Tom Lane:

The preferred thing to do is to add it to our "commitfest" queue,
which will ensure that it gets looked at eventually. The currently
open cycle is 2022-09 [1] (see the "New Patch" button there).

Thanks Tom, did that. I am not sure if "SQL Commands" is the correct
topic for this patch, but i guess its not that important. I am impressed
by all the infrastructure build around this project.

Martin

#29

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#26)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

Thanks for all your feedback and help. I got a patch that i consider
ready for review. It introduces two new functions:

array_shuffle(anyarray) -> anyarray
array_sample(anyarray, integer) -> anyarray

array_shuffle() shuffles an array (obviously). array_sample() picks n
random elements from an array.

I like this idea.

I think it's questionable whether the behavior of array_shuffle() is
correct for a multi-dimensional array. The implemented behavior is to
keep the dimensions as they were, but permute the elements across all
levels at random. But there are at least two other behaviors that seem
potentially defensible: (1) always return a 1-dimensional array, (2)
shuffle the sub-arrays at the top-level without the possibility of
moving elements within or between sub-arrays. What behavior we decide
is best here should be documented.

array_sample() will return elements in random order when sample_size <
array_size, but in the original order when sample_size >= array_size.
Similarly, it will always return a 1-dimensional array in the former
case, but will keep the original dimensions in the latter case. That
seems pretty hard to defend. I think it should always return a
1-dimensional array with elements in random order, and I think this
should be documented.

I also think you should add test cases involving multi-dimensional
arrays, as well as arrays with non-default bounds. e.g. trying
shuffling or sampling some values like
'[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[]

--
Robert Haas
EDB: http://www.enterprisedb.com

#30

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#29)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

array_shuffle(anyarray) -> anyarray
array_sample(anyarray, integer) -> anyarray

I think it's questionable whether the behavior of array_shuffle() is
correct for a multi-dimensional array. The implemented behavior is to
keep the dimensions as they were, but permute the elements across all
levels at random. But there are at least two other behaviors that seem
potentially defensible: (1) always return a 1-dimensional array, (2)
shuffle the sub-arrays at the top-level without the possibility of
moving elements within or between sub-arrays. What behavior we decide
is best here should be documented.

Martin had originally proposed (2), which I rejected on the grounds
that we don't treat multi-dimensional arrays as arrays-of-arrays for
any other purpose. Maybe we should have, but that ship sailed decades
ago, and I doubt that shuffle() is the place to start changing it.

I could get behind your option (1) though, to make it clearer that
the input array's dimensionality is not of interest. Especially since,
as you say, (1) is pretty much the only sensible choice for array_sample.

I also think you should add test cases involving multi-dimensional
arrays, as well as arrays with non-default bounds. e.g. trying
shuffling or sampling some values like
'[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[]

This'd only matter if we decide not to ignore the input's dimensionality.

regards, tom lane

#31

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Tom Lane (#30)

Re: [PATCH] Introduce array_shuffle() and array_sample()

I wrote:

Martin had originally proposed (2), which I rejected on the grounds
that we don't treat multi-dimensional arrays as arrays-of-arrays for
any other purpose.

Actually, after poking at it for awhile, that's an overstatement.
It's true that the type system doesn't think N-D arrays are
arrays-of-arrays, but there are individual functions/operators that do.

For instance, @> is in the its-a-flat-list-of-elements camp:

regression=# select array[[1,2],[3,4]] @> array[1,3];
?column?
----------
t
(1 row)

but || wants to preserve dimensionality:

regression=# select array[[1,2],[3,4]] || array[1];
ERROR: cannot concatenate incompatible arrays
DETAIL: Arrays with differing dimensions are not compatible for concatenation.

Various other functions dodge the issue by refusing to work on arrays
of more than one dimension.

There seem to be more functions that think arrays are flat than
otherwise, but it's not as black-and-white as I was thinking.

regards, tom lane

#32

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#31)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 18.07.22 um 23:03 schrieb Tom Lane:

I wrote:

Martin had originally proposed (2), which I rejected on the grounds
that we don't treat multi-dimensional arrays as arrays-of-arrays for
any other purpose.

Actually, after poking at it for awhile, that's an overstatement.
It's true that the type system doesn't think N-D arrays are
arrays-of-arrays, but there are individual functions/operators that do.
Thanks Robert for pointing out the inconsistent behavior of

array_sample(). That needs to be fixed.

As Tom's investigation showed, there is no consensus in the code if
multi-dimensional arrays are treated as arrays-of-arrays or not. We need
to decide what should be the correct treatment.

If we go with (1) array_shuffle() and array_sample() should shuffle each
element individually and always return a one-dimensional array.

select array_shuffle('{{1,2},{3,4},{5,6}}');
-----------
{1,4,3,5,6,2}

select array_sample('{{1,2},{3,4},{5,6}}', 3);
----------
{1,4,3}

If we go with (2) both functions should only operate on the first
dimension and shuffle whole subarrays and keep the dimensions intact.

select array_shuffle('{{1,2},{3,4},{5,6}}');
---------------------
{{3,4},{1,2},{5,6}}

select array_sample('{{1,2},{3,4},{5,6}}', 2);
---------------
{{3,4},{1,2}}

I do not feel qualified to make that decision. (2) complicates the code
a bit, but that should not be the main argument here.

Martin

#33

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#32)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

If we go with (1) array_shuffle() and array_sample() should shuffle each
element individually and always return a one-dimensional array.

select array_shuffle('{{1,2},{3,4},{5,6}}');
-----------
{1,4,3,5,6,2}

select array_sample('{{1,2},{3,4},{5,6}}', 3);
----------
{1,4,3}

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact. If you want the behavior shown
above, you can do array_shuffle(array_sample(...)). But if we
randomize it, and that's not what the user wanted, she has no
recourse.

Now, if you're convinced that the set of people wanting
sampling-without-shuffling is the empty set, then making everybody
else call two functions is a loser. But I'm not convinced.
At the least, I'd like to see the argument made why nobody
would want that.

regards, tom lane

#34

David G. Johnston

david.g.johnston@gmail.com

over 3 years ago

In reply to: Tom Lane (#33)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Mon, Jul 18, 2022 at 3:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact. If you want the behavior shown
above, you can do array_shuffle(array_sample(...)). But if we
randomize it, and that's not what the user wanted, she has no
recourse.

And for those that want to know in what order those elements were chosen
they have no recourse in the other setup.

I really think this function needs to grow an algorithm argument that can
be used to specify stuff like ordering, replacement/without-replacement,
etc...just some enums separated by commas that can be added to the call.

David J.

#35

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: David G. Johnston (#34)

Re: [PATCH] Introduce array_shuffle() and array_sample()

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Mon, Jul 18, 2022 at 3:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact. If you want the behavior shown
above, you can do array_shuffle(array_sample(...)). But if we
randomize it, and that's not what the user wanted, she has no
recourse.

And for those that want to know in what order those elements were chosen
they have no recourse in the other setup.

Um ... why is "the order in which the elements were chosen" a concept
we want to expose? ISTM sample() is a black box in which notionally
the decisions could all be made at once.

I really think this function needs to grow an algorithm argument that can
be used to specify stuff like ordering, replacement/without-replacement,
etc...just some enums separated by commas that can be added to the call.

I think you might run out of gold paint somewhere around here. I'm
still not totally convinced we should bother with the sample() function
at all, let alone that it needs algorithm variants. At some point we
say to the user "here's a PL, write what you want for yourself".

regards, tom lane

#36

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#33)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 19.07.22 um 00:18 schrieb Tom Lane:

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact. If you want the behavior shown
above, you can do array_shuffle(array_sample(...)). But if we
randomize it, and that's not what the user wanted, she has no
recourse.

Now, if you're convinced that the set of people wanting
sampling-without-shuffling is the empty set, then making everybody
else call two functions is a loser. But I'm not convinced.
At the least, I'd like to see the argument made why nobody
would want that.

On the contrary! I am pretty sure there are people out there wanting
sampling-without-shuffling. I will think about that.

#37

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#28)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Tue, Jul 19, 2022 at 8:15 AM Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

Am 18.07.22 um 21:29 schrieb Tom Lane:

The preferred thing to do is to add it to our "commitfest" queue,
which will ensure that it gets looked at eventually. The currently
open cycle is 2022-09 [1] (see the "New Patch" button there).

Thanks Tom, did that.

FYI that brought your patch to the attention of this CI robot, which
is showing a couple of warnings. See the FAQ link there for an
explanation of that infrastructure.

http://cfbot.cputube.org/martin-kalcher.html

#38

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Martin Kalcher (#36)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 19.07.22 um 00:52 schrieb Martin Kalcher:

On the contrary! I am pretty sure there are people out there wanting
sampling-without-shuffling. I will think about that.

I gave it some thought. Even though there might be use cases, where a
stable order is desired, i would consider them edge cases, not worth the
additional complexity. I personally would not expect array_sample() to
return elements in any specific order. I looked up some sample()
implementations. None of them makes guarantees about the order of the
resulting array or explicitly states that the resulting array is in
random or selection order.

- Python random.sample [0]https://docs.python.org/3/library/random.html#random.sample
- Ruby Array#sample [1]https://ruby-doc.org/core-3.0.0/Array.html#method-i-sample
- Rust rand::seq::SliceRandom::choose_multiple [2]https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple
- Julia StatsBase.sample [3]https://juliastats.org/StatsBase.jl/stable/sampling/#StatsBase.sample stable order needs explicit request

[0]: https://docs.python.org/3/library/random.html#random.sample
[1]: https://ruby-doc.org/core-3.0.0/Array.html#method-i-sample
[2]: https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple
https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple
[3]: https://juliastats.org/StatsBase.jl/stable/sampling/#StatsBase.sample

#39

Aleksander Alekseev

aleksander@timescale.com

over 3 years ago

In reply to: Martin Kalcher (#38)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Hi Martin,

I didn't look at the code yet but I very much like the idea. Many
thanks for working on this!

It's a pity your patch was too late for the July commitfest. In
September, please keep an eye on cfbot [1]http://cfbot.cputube.org/ to make sure your patch
applies properly.

As Tom's investigation showed, there is no consensus in the code if
multi-dimensional arrays are treated as arrays-of-arrays or not. We need
to decide what should be the correct treatment.

Here are my two cents.

From the user perspective I would expect that by default a
multidimensional array should be treated as an array of arrays. So for
instance:

```
array_shuffle([ [1,2], [3,4], [5,6] ])
```

... should return something like:

```
[ [3,4], [1,2], [5,6] ]
```

Note that the order of the elements in the internal arrays is preserved.

However, I believe there should be an optional argument that overrides
this behavior. For instance:

```
array_shuffle([ [1,2], [3,4], [5,6] ], depth => 2)
```

BTW, while on it, shouldn't we add similar functions for JSON and/or
JSONB? Or is this going to be too much for a single discussion?

[1]: http://cfbot.cputube.org/

--
Best regards,
Aleksander Alekseev

#40

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Martin Kalcher (#32)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 18.07.22 um 23:48 schrieb Martin Kalcher:

If we go with (1) array_shuffle() and array_sample() should shuffle each
element individually and always return a one-dimensional array.

select array_shuffle('{{1,2},{3,4},{5,6}}');
-----------
   {1,4,3,5,6,2}

select array_sample('{{1,2},{3,4},{5,6}}', 3);
----------
   {1,4,3}

If we go with (2) both functions should only operate on the first
dimension and shuffle whole subarrays and keep the dimensions intact.

select array_shuffle('{{1,2},{3,4},{5,6}}');
---------------------
   {{3,4},{1,2},{5,6}}

select array_sample('{{1,2},{3,4},{5,6}}', 2);
---------------
   {{3,4},{1,2}}

Having thought about it, i would go with (2). It gives the user the
ability to decide wether or not array-of-arrays behavior is desired. If
he wants the behavior of (1) he can flatten the array before applying
array_shuffle(). Unfortunately there is no array_flatten() function (at
the moment) and the user would have to work around it with unnest() and
array_agg().

#41

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#35)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Mon, Jul 18, 2022 at 6:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Um ... why is "the order in which the elements were chosen" a concept
we want to expose? ISTM sample() is a black box in which notionally
the decisions could all be made at once.

I agree with that. But I also think it's fine for the elements to be
returned in a shuffled order rather than the original order.

I really think this function needs to grow an algorithm argument that can
be used to specify stuff like ordering, replacement/without-replacement,
etc...just some enums separated by commas that can be added to the call.

I think you might run out of gold paint somewhere around here. I'm
still not totally convinced we should bother with the sample() function
at all, let alone that it needs algorithm variants. At some point we
say to the user "here's a PL, write what you want for yourself".

I don't know what gold paint has to do with anything here, but I agree
that David's proposal seems to be moving the goalposts a very long
way.

The thing is, as Martin points out, these functions already exist in a
bunch of other systems. For one example I've used myself, see
https://underscorejs.org/

I probably wouldn't have called a function to put a list into a random
order "shuffle" in a vacuum, but it seems to be common nomenclature
these days. I believe that if you don't make reference to Fisher-Yates
in the documentation, they kick you out of the cool programmers club.

--
Robert Haas
EDB: http://www.enterprisedb.com

#42

Andrew Dunstan

andrew@dunslane.net

over 3 years ago

In reply to: Martin Kalcher (#40)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On 2022-07-19 Tu 07:15, Martin Kalcher wrote:

Am 18.07.22 um 23:48 schrieb Martin Kalcher:

If we go with (1) array_shuffle() and array_sample() should shuffle
each element individually and always return a one-dimensional array.

   select array_shuffle('{{1,2},{3,4},{5,6}}');
   -----------
    {1,4,3,5,6,2}

   select array_sample('{{1,2},{3,4},{5,6}}', 3);
   ----------
    {1,4,3}

If we go with (2) both functions should only operate on the first
dimension and shuffle whole subarrays and keep the dimensions intact.

   select array_shuffle('{{1,2},{3,4},{5,6}}');
   ---------------------
    {{3,4},{1,2},{5,6}}

   select array_sample('{{1,2},{3,4},{5,6}}', 2);
   ---------------
    {{3,4},{1,2}}

Having thought about it, i would go with (2). It gives the user the
ability to decide wether or not array-of-arrays behavior is desired.
If he wants the behavior of (1) he can flatten the array before
applying array_shuffle(). Unfortunately there is no array_flatten()
function (at the moment) and the user would have to work around it
with unnest() and array_agg().

Why not have an optional second parameter for array_shuffle that
indicates whether or not to flatten? e.g. array_shuffle(my_array,
flatten => true)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#43

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Andrew Dunstan (#42)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Having thought about it, i would go with (2). It gives the user the
ability to decide wether or not array-of-arrays behavior is desired.
If he wants the behavior of (1) he can flatten the array before
applying array_shuffle(). Unfortunately there is no array_flatten()
function (at the moment) and the user would have to work around it
with unnest() and array_agg().

Why not have an optional second parameter for array_shuffle that
indicates whether or not to flatten? e.g. array_shuffle(my_array,
flatten => true)

IMHO, if we think that's something many people are going to want, it
would be better to add an array_flatten() function, because that could
be used for a variety of purposes, whereas an option to this function
can only be used for this function.

--
Robert Haas
EDB: http://www.enterprisedb.com

#44

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#43)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Why not have an optional second parameter for array_shuffle that
indicates whether or not to flatten? e.g. array_shuffle(my_array,
flatten => true)

IMHO, if we think that's something many people are going to want, it
would be better to add an array_flatten() function, because that could
be used for a variety of purposes, whereas an option to this function
can only be used for this function.

Agreed. Whether it's really needed, I dunno --- I don't recall the
issue having come up before.

After taking a second look through
https://www.postgresql.org/docs/current/functions-array.html
it seems like the things that treat arrays as flat even when they
are multi-D are just

* unnest(), which is more or less forced into that position by our
type system: it has to be anyarray returning anyelement, not
anyarray returning anyelement-or-anyarray-depending.

* array_to_string(), which maybe could have done it differently but
can reasonably be considered a variant of unnest().

* The containment/overlap operators, which are kind of their own
thing anyway. Arguably they got this wrong, though I suppose it's
too late to rethink that.

Everything else either explicitly rejects more-than-one-D arrays
or does something that is compatible with thinking of them as
arrays-of-arrays.

So I withdraw my original position. These functions should just
shuffle or select in the array's first dimension, preserving
subarrays. Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.

regards, tom lane

#45

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#44)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 19.07.22 um 16:20 schrieb Tom Lane:

So I withdraw my original position. These functions should just
shuffle or select in the array's first dimension, preserving
subarrays. Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.

Here is a patch with dimension aware array_shuffle() and array_sample().

If you think array_flatten() is desirable, i can take a look at it.
Maybe a second parameter would be nice to specify the target dimension:

select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 1);
-------------------
{1,2,3,4,5,6,7,8}

select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 2);
-----------------------
{{1,2,3,4},{5,6,7,8}}

Martin

Attachments:

0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-array_shuffle-and-array_sample.patchDownload

From 2aa6d72ff0f4d8835ee2f09f8cdf16b7e8005e56 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

Shuffling of arrays can already be accomplished with SQL
using unnest() and array_agg(order by random()). But that is
very slow compared to the new functions. In addition the new functions
are dimension aware.
---
 doc/src/sgml/func.sgml               |  35 +++++
 src/backend/utils/adt/arrayfuncs.c   | 189 +++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat      |   6 +
 src/test/regress/expected/arrays.out |  60 +++++++++
 src/test/regress/sql/arrays.sql      |  17 +++
 5 files changed, 307 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..6b96897244 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter>.
+        The order of the elements in resulting array is unspecified.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{3,4},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..3de1b0db30 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,192 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Produce array with max n random items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtype.  However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &relms, &rnuls, &nelm);
+
+	/* Calculate number of elements per item. */
+	nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1;
+	elms = relms;
+	nuls = rnuls;
+	nitem = dims[0];
+	n = Min(n, nitem);
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		k = (rand() % (nitem - i)) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0; j < nelm; j++)
+		{
+			elm = *elms;
+			nul = *nuls;
+			*elms = elms[k];
+			*nuls = nuls[k];
+			elms[k] = elm;
+			nuls[k] = nul;
+			elms++;
+			nuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(relms, rnuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(relms);
+	pfree(rnuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+
+	if (n < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	elmtyp = ARR_ELEMTYPE(array);
+
+	/* Return an empty array if the requested sample size is zero. */
+	if (n == 0)
+	{
+		PG_FREE_IF_COPY(array, 0);
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtyp));
+	}
+
+	/*
+	 * Return the original array if its size is less than or equal to the
+	 * requested sample size.
+	 */
+	if (ARR_NDIM(array) < 1 || ARR_DIMS(array)[0] <= n)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..f51af5ee9f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', proisstrict => 't', prorettype => 'anyarray',
+  proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', proisstrict => 't', prorettype => 'anyarray',
+  proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..1eb7ef37cf 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,63 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT array_agg(a order by a)
+FROM
+(SELECT unnest(array_shuffle('{NULL,1,2,3,4,5}'::int[]))) v(a);
+    array_agg     
+------------------
+ {1,2,3,4,5,NULL}
+(1 row)
+
+SELECT array_dims(array_shuffle('{{1,2,3},{4,5,6}}'::int[][]));
+ array_dims 
+------------
+ [1:2][1:3]
+(1 row)
+
+SELECT array_dims(array_shuffle('[-1:1]={1,2,3}'::int[]));
+ array_dims 
+------------
+ [-1:1]
+(1 row)
+
+SELECT (array_shuffle('{{1,2},{3,4},{5,6}}'::int[][]))[1:][1] <@ '{1,3,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+-- array_sample
+SELECT cardinality(array_sample('{NULL,1,2,3,4,5}'::int[], 3));
+ cardinality 
+-------------
+           3
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3) <@ '{1,2,3,4,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 10) <@ '{1,2,3,4,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_dims(array_sample('{{1,2},{3,4},{4,5},{6,7}}'::int[][], 3));
+ array_dims 
+------------
+ [1:3][1:2]
+(1 row)
+
+SELECT array_dims(array_sample('[-1:1]={1,2,3}'::int[], 2));
+ array_dims 
+------------
+ [-1:0]
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  second parameter must not be negative
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..5daee9aa72 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,20 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT array_agg(a order by a)
+FROM
+(SELECT unnest(array_shuffle('{NULL,1,2,3,4,5}'::int[]))) v(a);
+
+SELECT array_dims(array_shuffle('{{1,2,3},{4,5,6}}'::int[][]));
+SELECT array_dims(array_shuffle('[-1:1]={1,2,3}'::int[]));
+SELECT (array_shuffle('{{1,2},{3,4},{5,6}}'::int[][]))[1:][1] <@ '{1,3,5}'::int[];
+
+-- array_sample
+SELECT cardinality(array_sample('{NULL,1,2,3,4,5}'::int[], 3));
+SELECT array_sample('{1,2,3,4,5}'::int[], 3) <@ '{1,2,3,4,5}'::int[];
+SELECT array_sample('{1,2,3,4,5}'::int[], 10) <@ '{1,2,3,4,5}'::int[];
+SELECT array_dims(array_sample('{{1,2},{3,4},{4,5},{6,7}}'::int[][], 3));
+SELECT array_dims(array_sample('[-1:1]={1,2,3}'::int[], 2));
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
-- 
2.37.1

#46

Dean Rasheed

dean.a.rasheed@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#45)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Tue, 19 Jul 2022 at 21:21, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

Here is a patch with dimension aware array_shuffle() and array_sample().

+1 for this feature, and this way of handling multi-dimensional arrays.

If you think array_flatten() is desirable, i can take a look at it.

That's not something I've ever wanted -- personally, I rarely use
multi-dimensional arrays in Postgres.

A couple of quick comments on the current patch:

It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.

In my experience, the requirement for sampling with replacement is
about as common as the requirement for sampling without replacement,
so it seems odd to provide one but not the other. Of course, we can
always add a with-replacement function later, and give it a different
name. But maybe array_sample() could be used for both, with a
"with_replacement" boolean parameter?

Regards,
Dean

#47

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Dean Rasheed (#46)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 21.07.22 um 10:41 schrieb Dean Rasheed:

A couple of quick comments on the current patch:

Thank you for your feedback!

It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

CREATE FUNCTION marks functions as VOLATILE by default if not explicitly
marked otherwise. I assumed function definitions in pg_proc.dat have the
same behavior. I will fix that.
https://www.postgresql.org/docs/current/sql-createfunction.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

I agree that we should use pg_prng_uint64_range(). However, in order to
achieve interoperability with setseed() we would have to use
drandom_seed (rather than pg_global_prng_state) as rng state, which is
declared statically in float.c and exclusively used by random(). Do we
want to expose drandom_seed to other functions?

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.

As mentioned above, i assumed the default here is PARALLEL UNSAFE. I'll
fix that.

In my experience, the requirement for sampling with replacement is
about as common as the requirement for sampling without replacement,
so it seems odd to provide one but not the other. Of course, we can
always add a with-replacement function later, and give it a different
name. But maybe array_sample() could be used for both, with a
"with_replacement" boolean parameter?

We can also add a "with_replacement" boolean parameter which is false by
default to array_sample() later. I do not have a strong opinion about
that and will implement it, if desired. Any opinions about
default/no-default?

Martin

#48

Dean Rasheed

dean.a.rasheed@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#47)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Thu, 21 Jul 2022 at 12:15, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

I agree that we should use pg_prng_uint64_range(). However, in order to
achieve interoperability with setseed() we would have to use
drandom_seed (rather than pg_global_prng_state) as rng state, which is
declared statically in float.c and exclusively used by random(). Do we
want to expose drandom_seed to other functions?

Ah, I didn't realise that setseed() and random() were bound up so
tightly. It does feel as though, if we're adding more user-facing
functions that return random sequences, there ought to be a way to
seed them, and I wouldn't want to have separate setseed functions for
each one.

I'm inclined to say that we want a new pg_global_prng_user_state that
is updated by setseed(), and used by random(), array_shuffle(),
array_sample(), and any other user-facing random functions we add
later.

I can also see that others might not like expanding the scope of
setseed() in this way.

Regards,
Dean

#49

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Dean Rasheed (#48)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 21.07.22 um 14:25 schrieb Dean Rasheed:

I'm inclined to say that we want a new pg_global_prng_user_state that
is updated by setseed(), and used by random(), array_shuffle(),
array_sample(), and any other user-facing random functions we add
later.

I like the idea. How would you organize the code? I imagine some sort of
user prng that is encapsulated in something like utils/adt/random.c and
is guaranteed to always be seeded.

Martin

#50

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Dean Rasheed (#46)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 21.07.22 um 10:41 schrieb Dean Rasheed:

It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.

Here is an updated patch that marks the functions VOLATILE PARALLEL
RESTRICTED and uses pg_prng_uint64_range() rather than rand().

Attachments:

0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-array_shuffle-and-array_sample.patchDownload

From 26676802f05d00c31e0b2d5997f61734aa421fca Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

Shuffling of arrays can already be accomplished with SQL
using unnest() and array_agg(order by random()). But that is
very slow compared to the new functions. In addition the new functions
are dimension aware.
---
 doc/src/sgml/func.sgml               |  35 +++++
 src/backend/utils/adt/arrayfuncs.c   | 189 ++++++++++++++++++++++++++-
 src/include/catalog/pg_proc.dat      |   6 +
 src/test/regress/expected/arrays.out |  60 +++++++++
 src/test/regress/sql/arrays.sql      |  17 +++
 5 files changed, 306 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..6b96897244 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter>.
+        The order of the elements in resulting array is unspecified.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{3,4},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..64da980348 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -34,7 +34,7 @@
 #include "utils/memutils.h"
 #include "utils/selfuncs.h"
 #include "utils/typcache.h"
-
+#include "common/pg_prng.h"
 
 /*
  * GUC parameter
@@ -6872,3 +6872,190 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Produce array with max n random items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtype.  However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &relms, &rnuls, &nelm);
+
+	/* Calculate number of elements per item. */
+	nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1;
+	elms = relms;
+	nuls = rnuls;
+	nitem = dims[0];
+	n = Min(n, nitem);
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		k = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, nitem - i - 1) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0; j < nelm; j++)
+		{
+			elm = *elms;
+			nul = *nuls;
+			*elms = elms[k];
+			*nuls = nuls[k];
+			elms[k] = elm;
+			nuls[k] = nul;
+			elms++;
+			nuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(relms, rnuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(relms);
+	pfree(rnuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+
+	if (n < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	elmtyp = ARR_ELEMTYPE(array);
+
+	/* Return an empty array if the requested sample size is zero. */
+	if (n == 0)
+	{
+		PG_FREE_IF_COPY(array, 0);
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtyp));
+	}
+
+	/*
+	 * Return the original array if its size is less than or equal to the
+	 * requested sample size.
+	 */
+	if (ARR_NDIM(array) < 1 || ARR_DIMS(array)[0] <= n)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..9933d10f6d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..1eb7ef37cf 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,63 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT array_agg(a order by a)
+FROM
+(SELECT unnest(array_shuffle('{NULL,1,2,3,4,5}'::int[]))) v(a);
+    array_agg     
+------------------
+ {1,2,3,4,5,NULL}
+(1 row)
+
+SELECT array_dims(array_shuffle('{{1,2,3},{4,5,6}}'::int[][]));
+ array_dims 
+------------
+ [1:2][1:3]
+(1 row)
+
+SELECT array_dims(array_shuffle('[-1:1]={1,2,3}'::int[]));
+ array_dims 
+------------
+ [-1:1]
+(1 row)
+
+SELECT (array_shuffle('{{1,2},{3,4},{5,6}}'::int[][]))[1:][1] <@ '{1,3,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+-- array_sample
+SELECT cardinality(array_sample('{NULL,1,2,3,4,5}'::int[], 3));
+ cardinality 
+-------------
+           3
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3) <@ '{1,2,3,4,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 10) <@ '{1,2,3,4,5}'::int[];
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_dims(array_sample('{{1,2},{3,4},{4,5},{6,7}}'::int[][], 3));
+ array_dims 
+------------
+ [1:3][1:2]
+(1 row)
+
+SELECT array_dims(array_sample('[-1:1]={1,2,3}'::int[], 2));
+ array_dims 
+------------
+ [-1:0]
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  second parameter must not be negative
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..5daee9aa72 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,20 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT array_agg(a order by a)
+FROM
+(SELECT unnest(array_shuffle('{NULL,1,2,3,4,5}'::int[]))) v(a);
+
+SELECT array_dims(array_shuffle('{{1,2,3},{4,5,6}}'::int[][]));
+SELECT array_dims(array_shuffle('[-1:1]={1,2,3}'::int[]));
+SELECT (array_shuffle('{{1,2},{3,4},{5,6}}'::int[][]))[1:][1] <@ '{1,3,5}'::int[];
+
+-- array_sample
+SELECT cardinality(array_sample('{NULL,1,2,3,4,5}'::int[], 3));
+SELECT array_sample('{1,2,3,4,5}'::int[], 3) <@ '{1,2,3,4,5}'::int[];
+SELECT array_sample('{1,2,3,4,5}'::int[], 10) <@ '{1,2,3,4,5}'::int[];
+SELECT array_dims(array_sample('{{1,2},{3,4},{4,5},{6,7}}'::int[][], 3));
+SELECT array_dims(array_sample('[-1:1]={1,2,3}'::int[], 2));
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
-- 
2.37.1

#51

Dean Rasheed

dean.a.rasheed@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#49)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Thu, 21 Jul 2022 at 16:43, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

Am 21.07.22 um 14:25 schrieb Dean Rasheed:

I'm inclined to say that we want a new pg_global_prng_user_state that
is updated by setseed(), and used by random(), array_shuffle(),
array_sample(), and any other user-facing random functions we add
later.

I like the idea. How would you organize the code? I imagine some sort of
user prng that is encapsulated in something like utils/adt/random.c and
is guaranteed to always be seeded.

Something like that, perhaps. I can see 2 ways it could go:

Option 1:
Keep random.c small
- Responsible for initialisation of the user prng on demand
- Expose the user prng state to other code like float.c and arrayfuncs.c

Option 2:
Move all random functions wanting to use the user prng to random.c
- Starting with drandom() and setseed()
- Then, array_shuffle() and array_sample()
- Later, any new SQL-callable random functions we might add
- Keep the user prng state local to random.c

Option 2 seems quite appealing, because it keeps all SQL-callable
random functions together in one place, and keeps the state local,
making it easier to keep track of which functions are using it.

Code like the Fisher-Yates algorithm is more to do with random than it
is to do with arrays, so putting it in random.c seems to make more
sense.

It's possible that some hypothetical random function might need access
to type-specific internals. For example, if we wanted to add a
function to return a random numeric, it would probably have to go in
numeric.c, but that could be solved by having random.c call numeric.c,
passing it the prng state.

Regards,
Dean

#52

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Dean Rasheed (#51)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 22.07.22 um 09:59 schrieb Dean Rasheed:>

Option 1:
Keep random.c small
- Responsible for initialisation of the user prng on demand
- Expose the user prng state to other code like float.c and arrayfuncs.c

Option 2:
Move all random functions wanting to use the user prng to random.c
- Starting with drandom() and setseed()
- Then, array_shuffle() and array_sample()
- Later, any new SQL-callable random functions we might add
- Keep the user prng state local to random.c

Hey Dean,

i came to the same conclusions and went with Option 1 (see patch).
Mainly because most code in utils/adt is organized by type and this way
it is clear, that this is a thin wrapper around pg_prng.

What do you think?

Attachments:

0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-array_shuffle-and-array_sample.patchDownload

From ceda50f1f7f7e0c123de9b2ce2cc7b5d2b2b7db6 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml               |  35 +++++
 src/backend/utils/adt/Makefile       |   1 +
 src/backend/utils/adt/arrayfuncs.c   | 189 ++++++++++++++++++++++++++-
 src/backend/utils/adt/float.c        |  37 +-----
 src/backend/utils/adt/user_prng.c    |  88 +++++++++++++
 src/include/catalog/pg_proc.dat      |   6 +
 src/include/utils/user_prng.h        |  21 +++
 src/test/regress/expected/arrays.out |  58 ++++++++
 src/test/regress/sql/arrays.sql      |  14 ++
 9 files changed, 415 insertions(+), 34 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..6b96897244 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter>.
+        The order of the elements in resulting array is unspecified.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{3,4},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..3a40ccc362 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -32,10 +32,10 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/user_prng.h"
 #include "utils/selfuncs.h"
 #include "utils/typcache.h"
 
-
 /*
  * GUC parameter
  */
@@ -6872,3 +6872,190 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Produce array with max n random items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtype.  However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &relms, &rnuls, &nelm);
+
+	/* Calculate number of elements per item. */
+	nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1;
+	elms = relms;
+	nuls = rnuls;
+	nitem = dims[0];
+	n = Min(n, nitem);
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		k = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0; j < nelm; j++)
+		{
+			elm = *elms;
+			nul = *nuls;
+			*elms = elms[k];
+			*nuls = nuls[k];
+			elms[k] = elm;
+			nuls[k] = nul;
+			elms++;
+			nuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(relms, rnuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(relms);
+	pfree(rnuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+
+	if (n < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	elmtyp = ARR_ELEMTYPE(array);
+
+	/* Return an empty array if the requested sample size is zero. */
+	if (n == 0)
+	{
+		PG_FREE_IF_COPY(array, 0);
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtyp));
+	}
+
+	/*
+	 * Return the original array if its size is less than or equal to the
+	 * requested sample size.
+	 */
+	if (ARR_NDIM(array) < 1 || ARR_DIMS(array)[0] <= n)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index fc8f39a7a9..29dabbbc6a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -21,15 +21,13 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
-#include "common/pg_prng.h"
 #include "common/shortest_dec.h"
 #include "libpq/pqformat.h"
-#include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
+#include "utils/user_prng.h"
 #include "utils/sortsupport.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -64,10 +62,6 @@ float8		degree_c_sixty = 60.0;
 float8		degree_c_one_half = 0.5;
 float8		degree_c_one = 1.0;
 
-/* State for drandom() and setseed() */
-static bool drandom_seed_set = false;
-static pg_prng_state drandom_seed;
-
 /* Local function prototypes */
 static double sind_q1(double x);
 static double cosd_q1(double x);
@@ -2749,30 +2743,8 @@ datanh(PG_FUNCTION_ARGS)
 Datum
 drandom(PG_FUNCTION_ARGS)
 {
-	float8		result;
-
-	/* Initialize random seed, if not done yet in this process */
-	if (unlikely(!drandom_seed_set))
-	{
-		/*
-		 * If possible, initialize the seed using high-quality random bits.
-		 * Should that fail for some reason, we fall back on a lower-quality
-		 * seed based on current time and PID.
-		 */
-		if (unlikely(!pg_prng_strong_seed(&drandom_seed)))
-		{
-			TimestampTz now = GetCurrentTimestamp();
-			uint64		iseed;
-
-			/* Mix the PID with the most predictable bits of the timestamp */
-			iseed = (uint64) now ^ ((uint64) MyProcPid << 32);
-			pg_prng_seed(&drandom_seed, iseed);
-		}
-		drandom_seed_set = true;
-	}
-
-	/* pg_prng_double produces desired result range [0.0 - 1.0) */
-	result = pg_prng_double(&drandom_seed);
+	/* user_prng_double produces desired result range [0.0 - 1.0) */
+	float8		result = user_prng_double();
 
 	PG_RETURN_FLOAT8(result);
 }
@@ -2792,8 +2764,7 @@ setseed(PG_FUNCTION_ARGS)
 				 errmsg("setseed parameter %g is out of allowed range [-1,1]",
 						seed)));
 
-	pg_prng_fseed(&drandom_seed, seed);
-	drandom_seed_set = true;
+	user_prng_seed(seed);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/backend/utils/adt/user_prng.c b/src/backend/utils/adt/user_prng.c
new file mode 100644
index 0000000000..7a42b426cd
--- /dev/null
+++ b/src/backend/utils/adt/user_prng.c
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.c
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/user_prng.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "miscadmin.h"
+#include "utils/user_prng.h"
+#include "utils/timestamp.h"
+
+/*
+ * Process wide state vector used by all user facing random functions,
+ * like random() or array_shuffle().
+ */
+static pg_prng_state user_prng_state;
+static bool user_prng_state_set = false;
+
+/*
+ * Ensure the prng is seeded.
+ */
+static inline void
+user_prng_ensure_seed()
+{
+	/* Initialize prng state, if not done yet in this process. */
+	if (unlikely(!user_prng_state_set))
+	{
+		/*
+		 * If possible, initialize the state using high-quality random bits.
+		 * Should that fail for some reason, we fall back on a lower-quality
+		 * seed based on current time and PID.
+		 */
+		if (unlikely(!pg_prng_strong_seed(&user_prng_state)))
+		{
+			uint64		now = (uint64) GetCurrentTimestamp();
+			uint64		pid = (uint64) MyProcPid;
+
+			/* Mix PID with the most predictable bits of the timestamp. */
+			pg_prng_seed(&user_prng_state, now ^ (pid << 32));
+		}
+
+		user_prng_state_set = true;
+	}
+}
+
+/*
+ * Seed the prng.
+ */
+inline void
+user_prng_seed(float8 seed)
+{
+	pg_prng_fseed(&user_prng_state, seed);
+
+	user_prng_state_set = true;
+}
+
+/*
+ * Select a random double from the range [0.0, 1.0).
+ */
+inline float8
+user_prng_double()
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_double(&user_prng_state);
+}
+
+/*
+ * Select a random uint64 from the range [min, max].
+ */
+inline uint64
+user_prng_uint64_range(uint64 min, uint64 max)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_uint64_range(&user_prng_state, min, max);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..9933d10f6d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/include/utils/user_prng.h b/src/include/utils/user_prng.h
new file mode 100644
index 0000000000..ebae20f0f7
--- /dev/null
+++ b/src/include/utils/user_prng.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.h
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/utils/user_prng.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RANDOM_H
+#define RANDOM_H
+
+extern void user_prng_seed(float8 seed);
+extern float8 user_prng_double();
+extern uint64 user_prng_uint64_range(uint64 min, uint64 max);
+
+#endif
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..ff4fea0923 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,61 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {5,NULL,1,2,4,3}
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {4,1,2,NULL,3,5}
+(1 row)
+
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+             array_shuffle             
+---------------------------------------
+ [-1:2][2:3]={{7,8},{3,4},{1,2},{5,6}}
+(1 row)
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,1,2}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {3,4,1}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 6);
+ array_sample 
+--------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+          array_sample           
+---------------------------------
+ [-1:1][2:3]={{1,2},{5,6},{3,4}}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  second parameter must not be negative
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..3371c04131 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,17 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{1,2,3,4,5}'::int[], 6);
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
-- 
2.37.1

#53

Dean Rasheed

dean.a.rasheed@gmail.com

over 3 years ago

In reply to: Martin Kalcher (#52)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Fri, 22 Jul 2022 at 10:31, Martin Kalcher
<martin.kalcher@aboutsource.net> wrote:

i came to the same conclusions and went with Option 1 (see patch).
Mainly because most code in utils/adt is organized by type and this way
it is clear, that this is a thin wrapper around pg_prng.

What do you think?

Looks fairly neat, on a quick read-through. There's certainly
something to be said for preserving the organisation by type.

Regards,
Dean

#54

Joe Conway

mail@joeconway.com

over 3 years ago

In reply to: Tom Lane (#44)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On 7/19/22 10:20, Tom Lane wrote:

Everything else either explicitly rejects more-than-one-D arrays
or does something that is compatible with thinking of them as
arrays-of-arrays.

I think I am responsible for at least some of those, and I agree that
thinking of MD arrays as arrays-of-arrays is preferable even though they
are not actually that. Long ago[1]/messages/by-id/Pine.LNX.4.44.0306281418020.2178-100000@peter.localdomain Peter E asked me to fix that as I
recall but it was one of those round tuits that I never found.

So I withdraw my original position. These functions should just
shuffle or select in the array's first dimension, preserving
subarrays. Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.

Joe

[1]: /messages/by-id/Pine.LNX.4.44.0306281418020.2178-100000@peter.localdomain
/messages/by-id/Pine.LNX.4.44.0306281418020.2178-100000@peter.localdomain

--
Joe Conway
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#55

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Martin Kalcher (#52)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 22.07.22 um 11:31 schrieb Martin Kalcher:

i came to the same conclusions and went with Option 1 (see patch).
Mainly because most code in utils/adt is organized by type and this way
it is clear, that this is a thin wrapper around pg_prng.

Small patch update. I realized the new functions should live
array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers
and added some comments to the code.

Attachments:

0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-array_shuffle-and-array_sample.patchDownload

From 138777531961c31250074d1c0d87417e31d63656 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml                  |  35 +++++
 src/backend/utils/adt/Makefile          |   1 +
 src/backend/utils/adt/array_userfuncs.c | 182 ++++++++++++++++++++++++
 src/backend/utils/adt/float.c           |  37 +----
 src/backend/utils/adt/user_prng.c       |  87 +++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/include/utils/user_prng.h           |  19 +++
 src/test/regress/expected/arrays.out    |  58 ++++++++
 src/test/regress/sql/arrays.sql         |  14 ++
 9 files changed, 406 insertions(+), 33 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..87df8a9c81 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter>.
+        The order of the elements in resulting array is unspecified.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..e3ba14a17c 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,184 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with max n randomly chosen items from given array in
+ * random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				rnitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &relms, &rnuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	rnitem = Min(n, nitem);		/* target number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	elms = relms;
+	nuls = rnuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < rnitem; i++)
+	{
+		k = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0; j < nelm; j++)
+		{
+			elm = *elms;
+			nul = *nuls;
+			*elms = elms[k];
+			*nuls = nuls[k];
+			elms[k] = elm;
+			nuls[k] = nul;
+			elms++;
+			nuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = rnitem;
+
+	result = construct_md_array(relms, rnuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(relms);
+	pfree(rnuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+
+	if (n < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	/*
+	 * Return the original array if its size is less than or equal to the
+	 * requested sample size.
+	 */
+	if (ARR_NDIM(array) < 1 || ARR_DIMS(array)[0] <= n)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index fc8f39a7a9..29dabbbc6a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -21,15 +21,13 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
-#include "common/pg_prng.h"
 #include "common/shortest_dec.h"
 #include "libpq/pqformat.h"
-#include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
+#include "utils/user_prng.h"
 #include "utils/sortsupport.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -64,10 +62,6 @@ float8		degree_c_sixty = 60.0;
 float8		degree_c_one_half = 0.5;
 float8		degree_c_one = 1.0;
 
-/* State for drandom() and setseed() */
-static bool drandom_seed_set = false;
-static pg_prng_state drandom_seed;
-
 /* Local function prototypes */
 static double sind_q1(double x);
 static double cosd_q1(double x);
@@ -2749,30 +2743,8 @@ datanh(PG_FUNCTION_ARGS)
 Datum
 drandom(PG_FUNCTION_ARGS)
 {
-	float8		result;
-
-	/* Initialize random seed, if not done yet in this process */
-	if (unlikely(!drandom_seed_set))
-	{
-		/*
-		 * If possible, initialize the seed using high-quality random bits.
-		 * Should that fail for some reason, we fall back on a lower-quality
-		 * seed based on current time and PID.
-		 */
-		if (unlikely(!pg_prng_strong_seed(&drandom_seed)))
-		{
-			TimestampTz now = GetCurrentTimestamp();
-			uint64		iseed;
-
-			/* Mix the PID with the most predictable bits of the timestamp */
-			iseed = (uint64) now ^ ((uint64) MyProcPid << 32);
-			pg_prng_seed(&drandom_seed, iseed);
-		}
-		drandom_seed_set = true;
-	}
-
-	/* pg_prng_double produces desired result range [0.0 - 1.0) */
-	result = pg_prng_double(&drandom_seed);
+	/* user_prng_double produces desired result range [0.0 - 1.0) */
+	float8		result = user_prng_double();
 
 	PG_RETURN_FLOAT8(result);
 }
@@ -2792,8 +2764,7 @@ setseed(PG_FUNCTION_ARGS)
 				 errmsg("setseed parameter %g is out of allowed range [-1,1]",
 						seed)));
 
-	pg_prng_fseed(&drandom_seed, seed);
-	drandom_seed_set = true;
+	user_prng_seed(seed);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/backend/utils/adt/user_prng.c b/src/backend/utils/adt/user_prng.c
new file mode 100644
index 0000000000..cac9abe40c
--- /dev/null
+++ b/src/backend/utils/adt/user_prng.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.c
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/user_prng.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "miscadmin.h"
+#include "utils/user_prng.h"
+#include "utils/timestamp.h"
+
+/*
+ * Process wide state vector used by all user facing random functions,
+ * like random() or array_shuffle().
+ */
+static pg_prng_state user_prng_state;
+static bool user_prng_state_set = false;
+
+/*
+ * Ensure the prng is seeded.
+ */
+static inline void
+user_prng_ensure_seed()
+{
+	/* Initialize prng state, if not done yet in this process. */
+	if (unlikely(!user_prng_state_set))
+	{
+		/*
+		 * If possible, initialize the state using high-quality random bits.
+		 * Should that fail for some reason, we fall back on a lower-quality
+		 * seed based on current time and PID.
+		 */
+		if (unlikely(!pg_prng_strong_seed(&user_prng_state)))
+		{
+			uint64		now = (uint64) GetCurrentTimestamp();
+			uint64		pid = (uint64) MyProcPid;
+
+			/* Mix PID with the most predictable bits of the timestamp. */
+			pg_prng_seed(&user_prng_state, now ^ (pid << 32));
+		}
+
+		user_prng_state_set = true;
+	}
+}
+
+/*
+ * Seed the prng.
+ */
+inline void
+user_prng_seed(float8 seed)
+{
+	pg_prng_fseed(&user_prng_state, seed);
+
+	user_prng_state_set = true;
+}
+
+/*
+ * Select a random double from the range [0.0, 1.0).
+ */
+inline float8
+user_prng_double()
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_double(&user_prng_state);
+}
+
+/*
+ * Select a random uint64 from the range [min, max].
+ */
+inline uint64
+user_prng_uint64_range(uint64 min, uint64 max)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_uint64_range(&user_prng_state, min, max);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..9933d10f6d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/include/utils/user_prng.h b/src/include/utils/user_prng.h
new file mode 100644
index 0000000000..1f6df0b6fa
--- /dev/null
+++ b/src/include/utils/user_prng.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.h
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/utils/user_prng.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef USER_PRNG_H
+#define USER_PRNG_H
+
+extern void user_prng_seed(float8 seed);
+extern float8 user_prng_double();
+extern uint64 user_prng_uint64_range(uint64 min, uint64 max);
+
+#endif							/* USER_PRNG_H */
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..ff4fea0923 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,61 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {5,NULL,1,2,4,3}
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {4,1,2,NULL,3,5}
+(1 row)
+
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+             array_shuffle             
+---------------------------------------
+ [-1:2][2:3]={{7,8},{3,4},{1,2},{5,6}}
+(1 row)
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,1,2}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {3,4,1}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], 6);
+ array_sample 
+--------------
+ {1,2,3,4,5}
+(1 row)
+
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+          array_sample           
+---------------------------------
+ [-1:1][2:3]={{1,2},{5,6},{3,4}}
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  second parameter must not be negative
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..3371c04131 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,17 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{1,2,3,4,5}'::int[], 6);
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+SELECT array_sample('{1,2,3,4,5}'::int[], -1); -- fail
-- 
2.37.1

#56

Fabien COELHO

coelho@cri.ensmp.fr

over 3 years ago

In reply to: Martin Kalcher (#55)

Re: [PATCH] Introduce array_shuffle() and array_sample()

i came to the same conclusions and went with Option 1 (see patch). Mainly
because most code in utils/adt is organized by type and this way it is
clear, that this is a thin wrapper around pg_prng.

Small patch update. I realized the new functions should live
array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and
added some comments to the code.

My 0,02ï¿œ about this patch:

Use (void) when declaring no parameters in headers or in functions.

Should the exchange be skipped when i == k?

I do not see the point of having *only* inline functions in a c file. The
external functions should not be inlined?

The random and array shuffling functions share a common state. I'm
wondering whether it should ot should not be so. It seems ok, but then
ISTM that the documentation suggests implicitely that setseed applies to
random() only, which is not the case anymore after the patch.

If more samples are required than the number of elements, it does not
error out. I'm wondering whether it should.

Also, the sampling should not return its result in order when the number
of elements required is the full array, ISTM that it should be shuffled
there as well.

I must say that without significant knowledge of the array internal
implementation, the swap code looks pretty strange. ISTM that the code
would be clearer if pointers and array references style were not
intermixed.

Maybe you could add a test with a 3D array? Some sample with NULLs?

Unrelated: I notice again that (postgre)SQL does not provide a way to
generate random integers. I do not see why not. Should we provide one?

--
Fabien.

#57

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Fabien COELHO (#56)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 24.07.22 um 10:15 schrieb Fabien COELHO:

My 0,02€ about this patch:

Thank you for your feedback. I attached a patch, that addresses most of
your points.

Use (void) when declaring no parameters in headers or in functions.

Done.

Should the exchange be skipped when i == k?

The additional branch is actually slower (on my machine, tested with an
10M element int array) for 1D-arrays, which i assume is the most common
case. Even with a 2D-array with a subarray size of 1M there is no
difference in execution time. i == k should be relatively rare for large
arrays, given a good prng.

I do not see the point of having *only* inline functions in a c file.
The external functions should not be inlined?

Done.

The random and array shuffling functions share a common state. I'm
wondering whether it should ot should not be so. It seems ok, but then
ISTM that the documentation suggests implicitely that setseed applies to
random() only, which is not the case anymore after the patch.

I really like the idea of a prng state owned by the user, that is used
by all user facing random functions, but see that their might be
concerns about this change. I would update the setseed() documentation,
if this proposal is accepted. Do you think my patch should already
include the documentation update?

If more samples are required than the number of elements, it does not
error out. I'm wondering whether it should.

The second argument to array_sample() is treated like a LIMIT, rather
than the actual number of elements. I updated the documentation. My
use-case is: take max random items from an array of unknown size.

Also, the sampling should not return its result in order when the number
of elements required is the full array, ISTM that it should be shuffled
there as well.

You are the second person asking for this. It's done. I am thinking
about ditching array_sample() and replace it with a second signature for
array_shuffle():

array_shuffle(array anyarray) -> anyarray
array_shuffle(array anyarray, limit int) -> anyarray

What do you think?

I must say that without significant knowledge of the array internal
implementation, the swap code looks pretty strange. ISTM that the code
would be clearer if pointers and array references style were not
intermixed.

Done. Went with pointers.

Maybe you could add a test with a 3D array? Some sample with NULLs?

Done.

Unrelated: I notice again that (postgre)SQL does not provide a way to
generate random integers. I do not see why not. Should we provide one?

Maybe. I think it is outside the scope of this patch.

Thank you, Martin

Attachments:

0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=0001-Introduce-array_shuffle-and-array_sample.patchDownload

From d4f3df99cbc4451f7907a7c759a7590d63d8388c Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses max n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml                  |  34 +++++
 src/backend/utils/adt/Makefile          |   1 +
 src/backend/utils/adt/array_userfuncs.c | 181 ++++++++++++++++++++++++
 src/backend/utils/adt/float.c           |  37 +----
 src/backend/utils/adt/user_prng.c       |  87 ++++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/include/utils/user_prng.h           |  19 +++
 src/test/regress/expected/arrays.out    |  70 +++++++++
 src/test/regress/sql/arrays.sql         |  16 +++
 9 files changed, 418 insertions(+), 33 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..c44db6c652 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,40 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>limit</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns max <parameter>limit</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..a5053f5bc2 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,183 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with max limit randomly chosen items from given array in
+ * random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * limit: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_limit(ArrayType *array, int limit,
+					Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				rnitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || limit < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	rnitem = Min(limit, nitem); /* target number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < rnitem; i++)
+	{
+		j = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms = *jelms;
+			*inuls = *jnuls;
+			*jelms = elm;
+			*jnuls = nul;
+			ielms++;
+			inuls++;
+			jelms++;
+			jnuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = rnitem;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_limit(array, n,
+								 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			limit;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	limit = PG_GETARG_INT32(1);
+
+	if (limit < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("second parameter must not be negative")));
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_limit(array, limit,
+								 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index fc8f39a7a9..29dabbbc6a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -21,15 +21,13 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
-#include "common/pg_prng.h"
 #include "common/shortest_dec.h"
 #include "libpq/pqformat.h"
-#include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
+#include "utils/user_prng.h"
 #include "utils/sortsupport.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -64,10 +62,6 @@ float8		degree_c_sixty = 60.0;
 float8		degree_c_one_half = 0.5;
 float8		degree_c_one = 1.0;
 
-/* State for drandom() and setseed() */
-static bool drandom_seed_set = false;
-static pg_prng_state drandom_seed;
-
 /* Local function prototypes */
 static double sind_q1(double x);
 static double cosd_q1(double x);
@@ -2749,30 +2743,8 @@ datanh(PG_FUNCTION_ARGS)
 Datum
 drandom(PG_FUNCTION_ARGS)
 {
-	float8		result;
-
-	/* Initialize random seed, if not done yet in this process */
-	if (unlikely(!drandom_seed_set))
-	{
-		/*
-		 * If possible, initialize the seed using high-quality random bits.
-		 * Should that fail for some reason, we fall back on a lower-quality
-		 * seed based on current time and PID.
-		 */
-		if (unlikely(!pg_prng_strong_seed(&drandom_seed)))
-		{
-			TimestampTz now = GetCurrentTimestamp();
-			uint64		iseed;
-
-			/* Mix the PID with the most predictable bits of the timestamp */
-			iseed = (uint64) now ^ ((uint64) MyProcPid << 32);
-			pg_prng_seed(&drandom_seed, iseed);
-		}
-		drandom_seed_set = true;
-	}
-
-	/* pg_prng_double produces desired result range [0.0 - 1.0) */
-	result = pg_prng_double(&drandom_seed);
+	/* user_prng_double produces desired result range [0.0 - 1.0) */
+	float8		result = user_prng_double();
 
 	PG_RETURN_FLOAT8(result);
 }
@@ -2792,8 +2764,7 @@ setseed(PG_FUNCTION_ARGS)
 				 errmsg("setseed parameter %g is out of allowed range [-1,1]",
 						seed)));
 
-	pg_prng_fseed(&drandom_seed, seed);
-	drandom_seed_set = true;
+	user_prng_seed(seed);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/backend/utils/adt/user_prng.c b/src/backend/utils/adt/user_prng.c
new file mode 100644
index 0000000000..a541e955ef
--- /dev/null
+++ b/src/backend/utils/adt/user_prng.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.c
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/user_prng.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "miscadmin.h"
+#include "utils/user_prng.h"
+#include "utils/timestamp.h"
+
+/*
+ * Process wide state vector used by all user facing random functions,
+ * like random() or array_shuffle().
+ */
+static pg_prng_state user_prng_state;
+static bool user_prng_state_set = false;
+
+/*
+ * Ensure the prng is seeded.
+ */
+static inline void
+user_prng_ensure_seed(void)
+{
+	/* Initialize prng state, if not done yet in this process. */
+	if (unlikely(!user_prng_state_set))
+	{
+		/*
+		 * If possible, initialize the state using high-quality random bits.
+		 * Should that fail for some reason, we fall back on a lower-quality
+		 * seed based on current time and PID.
+		 */
+		if (unlikely(!pg_prng_strong_seed(&user_prng_state)))
+		{
+			uint64		now = (uint64) GetCurrentTimestamp();
+			uint64		pid = (uint64) MyProcPid;
+
+			/* Mix PID with the most predictable bits of the timestamp. */
+			pg_prng_seed(&user_prng_state, now ^ (pid << 32));
+		}
+
+		user_prng_state_set = true;
+	}
+}
+
+/*
+ * Seed the prng.
+ */
+void
+user_prng_seed(float8 seed)
+{
+	pg_prng_fseed(&user_prng_state, seed);
+
+	user_prng_state_set = true;
+}
+
+/*
+ * Select a random double from the range [0.0, 1.0).
+ */
+float8
+user_prng_double(void)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_double(&user_prng_state);
+}
+
+/*
+ * Select a random uint64 from the range [min, max].
+ */
+uint64
+user_prng_uint64_range(uint64 min, uint64 max)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_uint64_range(&user_prng_state, min, max);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..9933d10f6d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/include/utils/user_prng.h b/src/include/utils/user_prng.h
new file mode 100644
index 0000000000..959c1d727d
--- /dev/null
+++ b/src/include/utils/user_prng.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.h
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/utils/user_prng.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef USER_PRNG_H
+#define USER_PRNG_H
+
+extern void user_prng_seed(float8 seed);
+extern float8 user_prng_double(void);
+extern uint64 user_prng_uint64_range(uint64 min, uint64 max);
+
+#endif							/* USER_PRNG_H */
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..da9127a08c 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,73 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {5,NULL,1,2,4,3}
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {4,1,2,NULL,3,5}
+(1 row)
+
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+             array_shuffle             
+---------------------------------------
+ [-1:2][2:3]={{7,8},{3,4},{1,2},{5,6}}
+(1 row)
+
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+                 array_shuffle                  
+------------------------------------------------
+ {{{5,6},{7,8}},{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,NULL,1}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,3,4}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7);
+   array_sample   
+------------------
+ {NULL,2,3,1,5,4}
+(1 row)
+
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+          array_sample           
+---------------------------------
+ [-1:1][2:3]={{1,2},{5,6},{3,4}}
+(1 row)
+
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+           array_sample           
+----------------------------------
+ {{{9,10},{11,12}},{{1,2},{3,4}}}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  second parameter must not be negative
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..ba33d7ba04 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,19 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7);
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
-- 
2.37.1

#58

Fabien COELHO

coelho@cri.ensmp.fr

over 3 years ago

In reply to: Martin Kalcher (#57)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Hello,

Thank you for your feedback. I attached a patch, that addresses most of your
points.

I'll look into it. It would help if the patch could include a version
number at the end.

Should the exchange be skipped when i == k?

The additional branch is actually slower (on my machine, tested with an 10M
element int array) for 1D-arrays, which i assume is the most common case.
Even with a 2D-array with a subarray size of 1M there is no difference in
execution time. i == k should be relatively rare for large arrays, given a
good prng.

Ok, slower is bad.

The random and array shuffling functions share a common state. I'm
wondering whether it should ot should not be so. It seems ok, but then ISTM
that the documentation suggests implicitely that setseed applies to
random() only, which is not the case anymore after the patch.

I really like the idea of a prng state owned by the user, that is used by all
user facing random functions, but see that their might be concerns about this
change. I would update the setseed() documentation, if this proposal is
accepted. Do you think my patch should already include the documentation
update?

Duno. I'm still wondering what it should do. I'm pretty sure that the
documentation should be clear about a shared seed, if any. I do not think
that departing from the standard is a good thing, either.

If more samples are required than the number of elements, it does not error
out. I'm wondering whether it should.

The second argument to array_sample() is treated like a LIMIT, rather than
the actual number of elements. I updated the documentation. My use-case is:
take max random items from an array of unknown size.

Hmmm. ISTM that the use case of wanting "this many" stuff would make more
sense because it is strictier so less likely to result in unforseen
consequences. On principle I do not like not knowing the output size.

If someone wants a limit, they can easily "LEAST(#1 dim size, other
limit)" to get it, it is easy enough with a strict function.

Also, the sampling should not return its result in order when the number of
elements required is the full array, ISTM that it should be shuffled there
as well.

You are the second person asking for this. It's done. I am thinking about
ditching array_sample() and replace it with a second signature for
array_shuffle():

array_shuffle(array anyarray) -> anyarray
array_shuffle(array anyarray, limit int) -> anyarray

What do you think?

I think that shuffle means shuffle, not partial shuffle, so a different
name seems better to me.

--
Fabien.

#59

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Fabien COELHO (#58)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 24.07.22 um 21:42 schrieb Fabien COELHO:

Duno. I'm still wondering what it should do. I'm pretty sure that the
documentation should be clear about a shared seed, if any. I do not
think that departing from the standard is a good thing, either.

Are sure it violates the standard? I could not find anything about it.
The private prng state for random() was introduced in 2018 [0]/messages/by-id/E1gdNAo-00036g-TB@gemulon.postgresql.org. Neither
commit nor discussion mentions any standard compliance.

[0]: /messages/by-id/E1gdNAo-00036g-TB@gemulon.postgresql.org
/messages/by-id/E1gdNAo-00036g-TB@gemulon.postgresql.org

I updated the documentation for setseed().

If someone wants a limit, they can easily "LEAST(#1 dim size, other
limit)" to get it, it is easy enough with a strict function.

Convinced. It errors out now if n is out of bounds.

Martin

Attachments:

v2-0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=v2-0001-Introduce-array_shuffle-and-array_sample.patchDownload

From afb7c022abd26b82a4fd3611313a83f144909554 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v2] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml                  |  40 +++++-
 src/backend/utils/adt/Makefile          |   1 +
 src/backend/utils/adt/array_userfuncs.c | 180 ++++++++++++++++++++++++
 src/backend/utils/adt/float.c           |  37 +----
 src/backend/utils/adt/user_prng.c       |  87 ++++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/include/utils/user_prng.h           |  19 +++
 src/test/regress/expected/arrays.out    |  66 +++++++++
 src/test/regress/sql/arrays.sql         |  16 +++
 9 files changed, 417 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..8c51a17eda 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1820,7 +1820,8 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
         <returnvalue>void</returnvalue>
        </para>
        <para>
-        Sets the seed for subsequent <literal>random()</literal> calls;
+        Sets the seed for subsequent calls to random functions, like
+        <literal>random()</literal> or <literal>array_shuffle()</literal>;
         argument must be between -1.0 and 1.0, inclusive
        </para>
        <para>
@@ -1838,7 +1839,8 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
    applications; see the <xref linkend="pgcrypto"/> module for a more
    secure alternative.
    If <function>setseed()</function> is called, the series of results of
-   subsequent <function>random()</function> calls in the current session
+   subsequent calls to random functions, like <function>random()</function> or
+   <function>array_shuffle()</function>, in the current session
    can be repeated by re-issuing <function>setseed()</function> with the same
    argument.
   </para>
@@ -19395,6 +19397,40 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..d62f1aa4ef 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,182 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		j = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms = *jelms;
+			*inuls = *jnuls;
+			*jelms = elm;
+			*jnuls = nul;
+			ielms++;
+			inuls++;
+			jelms++;
+			jnuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n,
+				nitem;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+	nitem = (ARR_NDIM(array) < 1) ? 0 : ARR_DIMS(array)[0];
+
+	if (n < 0 || n > nitem)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("sample size must be between 0 and %d", nitem)));
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index fc8f39a7a9..29dabbbc6a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -21,15 +21,13 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
-#include "common/pg_prng.h"
 #include "common/shortest_dec.h"
 #include "libpq/pqformat.h"
-#include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
+#include "utils/user_prng.h"
 #include "utils/sortsupport.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -64,10 +62,6 @@ float8		degree_c_sixty = 60.0;
 float8		degree_c_one_half = 0.5;
 float8		degree_c_one = 1.0;
 
-/* State for drandom() and setseed() */
-static bool drandom_seed_set = false;
-static pg_prng_state drandom_seed;
-
 /* Local function prototypes */
 static double sind_q1(double x);
 static double cosd_q1(double x);
@@ -2749,30 +2743,8 @@ datanh(PG_FUNCTION_ARGS)
 Datum
 drandom(PG_FUNCTION_ARGS)
 {
-	float8		result;
-
-	/* Initialize random seed, if not done yet in this process */
-	if (unlikely(!drandom_seed_set))
-	{
-		/*
-		 * If possible, initialize the seed using high-quality random bits.
-		 * Should that fail for some reason, we fall back on a lower-quality
-		 * seed based on current time and PID.
-		 */
-		if (unlikely(!pg_prng_strong_seed(&drandom_seed)))
-		{
-			TimestampTz now = GetCurrentTimestamp();
-			uint64		iseed;
-
-			/* Mix the PID with the most predictable bits of the timestamp */
-			iseed = (uint64) now ^ ((uint64) MyProcPid << 32);
-			pg_prng_seed(&drandom_seed, iseed);
-		}
-		drandom_seed_set = true;
-	}
-
-	/* pg_prng_double produces desired result range [0.0 - 1.0) */
-	result = pg_prng_double(&drandom_seed);
+	/* user_prng_double produces desired result range [0.0 - 1.0) */
+	float8		result = user_prng_double();
 
 	PG_RETURN_FLOAT8(result);
 }
@@ -2792,8 +2764,7 @@ setseed(PG_FUNCTION_ARGS)
 				 errmsg("setseed parameter %g is out of allowed range [-1,1]",
 						seed)));
 
-	pg_prng_fseed(&drandom_seed, seed);
-	drandom_seed_set = true;
+	user_prng_seed(seed);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/backend/utils/adt/user_prng.c b/src/backend/utils/adt/user_prng.c
new file mode 100644
index 0000000000..a541e955ef
--- /dev/null
+++ b/src/backend/utils/adt/user_prng.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.c
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/user_prng.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "miscadmin.h"
+#include "utils/user_prng.h"
+#include "utils/timestamp.h"
+
+/*
+ * Process wide state vector used by all user facing random functions,
+ * like random() or array_shuffle().
+ */
+static pg_prng_state user_prng_state;
+static bool user_prng_state_set = false;
+
+/*
+ * Ensure the prng is seeded.
+ */
+static inline void
+user_prng_ensure_seed(void)
+{
+	/* Initialize prng state, if not done yet in this process. */
+	if (unlikely(!user_prng_state_set))
+	{
+		/*
+		 * If possible, initialize the state using high-quality random bits.
+		 * Should that fail for some reason, we fall back on a lower-quality
+		 * seed based on current time and PID.
+		 */
+		if (unlikely(!pg_prng_strong_seed(&user_prng_state)))
+		{
+			uint64		now = (uint64) GetCurrentTimestamp();
+			uint64		pid = (uint64) MyProcPid;
+
+			/* Mix PID with the most predictable bits of the timestamp. */
+			pg_prng_seed(&user_prng_state, now ^ (pid << 32));
+		}
+
+		user_prng_state_set = true;
+	}
+}
+
+/*
+ * Seed the prng.
+ */
+void
+user_prng_seed(float8 seed)
+{
+	pg_prng_fseed(&user_prng_state, seed);
+
+	user_prng_state_set = true;
+}
+
+/*
+ * Select a random double from the range [0.0, 1.0).
+ */
+float8
+user_prng_double(void)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_double(&user_prng_state);
+}
+
+/*
+ * Select a random uint64 from the range [min, max].
+ */
+uint64
+user_prng_uint64_range(uint64 min, uint64 max)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_uint64_range(&user_prng_state, min, max);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..9933d10f6d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/include/utils/user_prng.h b/src/include/utils/user_prng.h
new file mode 100644
index 0000000000..959c1d727d
--- /dev/null
+++ b/src/include/utils/user_prng.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.h
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/utils/user_prng.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef USER_PRNG_H
+#define USER_PRNG_H
+
+extern void user_prng_seed(float8 seed);
+extern float8 user_prng_double(void);
+extern uint64 user_prng_uint64_range(uint64 min, uint64 max);
+
+#endif							/* USER_PRNG_H */
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index ce6f3a65f9..bd6061fa1a 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2445,3 +2445,69 @@ SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {5,NULL,1,2,4,3}
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {4,1,2,NULL,3,5}
+(1 row)
+
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+             array_shuffle             
+---------------------------------------
+ [-1:2][2:3]={{7,8},{3,4},{1,2},{5,6}}
+(1 row)
+
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+                 array_shuffle                  
+------------------------------------------------
+ {{{5,6},{7,8}},{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,NULL,1}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,3,4}
+(1 row)
+
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+          array_sample           
+---------------------------------
+ [-1:1][2:3]={{1,2},{3,4},{5,6}}
+(1 row)
+
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+           array_sample           
+----------------------------------
+ {{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  sample size must be between 0 and 6
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7); --fail
+ERROR:  sample size must be between 0 and 6
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index f774faf856..09a3ecff61 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -754,3 +754,19 @@ FROM
 
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
+
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7); --fail
-- 
2.37.1

#60

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Martin Kalcher (#59)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Patch update without merge conflicts.

Martin

Attachments:

v3-0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=v3-0001-Introduce-array_shuffle-and-array_sample.patchDownload

From 0ecffcf3ed2eb59d045941b69bb86a34b93f3391 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v3] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml                  |  40 +++++-
 src/backend/utils/adt/Makefile          |   1 +
 src/backend/utils/adt/array_userfuncs.c | 176 ++++++++++++++++++++++++
 src/backend/utils/adt/float.c           |  37 +----
 src/backend/utils/adt/user_prng.c       |  87 ++++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/include/utils/user_prng.h           |  19 +++
 src/test/regress/expected/arrays.out    |  66 +++++++++
 src/test/regress/sql/arrays.sql         |  16 +++
 9 files changed, 413 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 053d4dc650..ef7c001fe7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1820,7 +1820,8 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
         <returnvalue>void</returnvalue>
        </para>
        <para>
-        Sets the seed for subsequent <literal>random()</literal> calls;
+        Sets the seed for subsequent calls to random functions, like
+        <literal>random()</literal> or <literal>array_shuffle()</literal>;
         argument must be between -1.0 and 1.0, inclusive
        </para>
        <para>
@@ -1838,7 +1839,8 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
    applications; see the <xref linkend="pgcrypto"/> module for a more
    secure alternative.
    If <function>setseed()</function> is called, the series of results of
-   subsequent <function>random()</function> calls in the current session
+   subsequent calls to random functions, like <function>random()</function> or
+   <function>array_shuffle()</function>, in the current session
    can be repeated by re-issuing <function>setseed()</function> with the same
    argument.
    Without any prior <function>setseed()</function> call in the same
@@ -19398,6 +19400,40 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..c4a2117df7 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		j = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms++ = *jelms;
+			*inuls++ = *jnuls;
+			*jelms++ = elm;
+			*jnuls++ = nul;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n,
+				nitem;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+	nitem = (ARR_NDIM(array) < 1) ? 0 : ARR_DIMS(array)[0];
+
+	if (n < 0 || n > nitem)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("sample size must be between 0 and %d", nitem)));
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index fc8f39a7a9..29dabbbc6a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -21,15 +21,13 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
-#include "common/pg_prng.h"
 #include "common/shortest_dec.h"
 #include "libpq/pqformat.h"
-#include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
+#include "utils/user_prng.h"
 #include "utils/sortsupport.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -64,10 +62,6 @@ float8		degree_c_sixty = 60.0;
 float8		degree_c_one_half = 0.5;
 float8		degree_c_one = 1.0;
 
-/* State for drandom() and setseed() */
-static bool drandom_seed_set = false;
-static pg_prng_state drandom_seed;
-
 /* Local function prototypes */
 static double sind_q1(double x);
 static double cosd_q1(double x);
@@ -2749,30 +2743,8 @@ datanh(PG_FUNCTION_ARGS)
 Datum
 drandom(PG_FUNCTION_ARGS)
 {
-	float8		result;
-
-	/* Initialize random seed, if not done yet in this process */
-	if (unlikely(!drandom_seed_set))
-	{
-		/*
-		 * If possible, initialize the seed using high-quality random bits.
-		 * Should that fail for some reason, we fall back on a lower-quality
-		 * seed based on current time and PID.
-		 */
-		if (unlikely(!pg_prng_strong_seed(&drandom_seed)))
-		{
-			TimestampTz now = GetCurrentTimestamp();
-			uint64		iseed;
-
-			/* Mix the PID with the most predictable bits of the timestamp */
-			iseed = (uint64) now ^ ((uint64) MyProcPid << 32);
-			pg_prng_seed(&drandom_seed, iseed);
-		}
-		drandom_seed_set = true;
-	}
-
-	/* pg_prng_double produces desired result range [0.0 - 1.0) */
-	result = pg_prng_double(&drandom_seed);
+	/* user_prng_double produces desired result range [0.0 - 1.0) */
+	float8		result = user_prng_double();
 
 	PG_RETURN_FLOAT8(result);
 }
@@ -2792,8 +2764,7 @@ setseed(PG_FUNCTION_ARGS)
 				 errmsg("setseed parameter %g is out of allowed range [-1,1]",
 						seed)));
 
-	pg_prng_fseed(&drandom_seed, seed);
-	drandom_seed_set = true;
+	user_prng_seed(seed);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/backend/utils/adt/user_prng.c b/src/backend/utils/adt/user_prng.c
new file mode 100644
index 0000000000..a541e955ef
--- /dev/null
+++ b/src/backend/utils/adt/user_prng.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.c
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/user_prng.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "miscadmin.h"
+#include "utils/user_prng.h"
+#include "utils/timestamp.h"
+
+/*
+ * Process wide state vector used by all user facing random functions,
+ * like random() or array_shuffle().
+ */
+static pg_prng_state user_prng_state;
+static bool user_prng_state_set = false;
+
+/*
+ * Ensure the prng is seeded.
+ */
+static inline void
+user_prng_ensure_seed(void)
+{
+	/* Initialize prng state, if not done yet in this process. */
+	if (unlikely(!user_prng_state_set))
+	{
+		/*
+		 * If possible, initialize the state using high-quality random bits.
+		 * Should that fail for some reason, we fall back on a lower-quality
+		 * seed based on current time and PID.
+		 */
+		if (unlikely(!pg_prng_strong_seed(&user_prng_state)))
+		{
+			uint64		now = (uint64) GetCurrentTimestamp();
+			uint64		pid = (uint64) MyProcPid;
+
+			/* Mix PID with the most predictable bits of the timestamp. */
+			pg_prng_seed(&user_prng_state, now ^ (pid << 32));
+		}
+
+		user_prng_state_set = true;
+	}
+}
+
+/*
+ * Seed the prng.
+ */
+void
+user_prng_seed(float8 seed)
+{
+	pg_prng_fseed(&user_prng_state, seed);
+
+	user_prng_state_set = true;
+}
+
+/*
+ * Select a random double from the range [0.0, 1.0).
+ */
+float8
+user_prng_double(void)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_double(&user_prng_state);
+}
+
+/*
+ * Select a random uint64 from the range [min, max].
+ */
+uint64
+user_prng_uint64_range(uint64 min, uint64 max)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_uint64_range(&user_prng_state, min, max);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index be47583122..a5fa527061 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/include/utils/user_prng.h b/src/include/utils/user_prng.h
new file mode 100644
index 0000000000..959c1d727d
--- /dev/null
+++ b/src/include/utils/user_prng.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.h
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/utils/user_prng.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef USER_PRNG_H
+#define USER_PRNG_H
+
+extern void user_prng_seed(float8 seed);
+extern float8 user_prng_double(void);
+extern uint64 user_prng_uint64_range(uint64 min, uint64 max);
+
+#endif							/* USER_PRNG_H */
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 97920f38c2..67e6ca9ab2 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2447,3 +2447,69 @@ SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
 ERROR:  number of elements to trim must be between 0 and 0
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {5,NULL,1,2,4,3}
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {4,1,2,NULL,3,5}
+(1 row)
+
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+             array_shuffle             
+---------------------------------------
+ [-1:2][2:3]={{7,8},{3,4},{1,2},{5,6}}
+(1 row)
+
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+                 array_shuffle                  
+------------------------------------------------
+ {{{5,6},{7,8}},{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,NULL,1}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,3,4}
+(1 row)
+
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+          array_sample           
+---------------------------------
+ [-1:1][2:3]={{1,2},{3,4},{5,6}}
+(1 row)
+
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+           array_sample           
+----------------------------------
+ {{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  sample size must be between 0 and 6
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7); --fail
+ERROR:  sample size must be between 0 and 6
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index 791af5c0ce..98c3993d9c 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -755,3 +755,19 @@ FROM
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
+
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7); --fail
-- 
2.37.1

#61

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Martin Kalcher (#60)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Hi,

On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote:

Patch update without merge conflicts.

Due to the merge of the meson based build, this patch needs to be adjusted. See
https://cirrus-ci.com/build/6580671765282816
Looks like it'd just be adding user_prng.c to
src/backend/utils/adt/meson.build's list.

Greetings,

Andres Freund

#62

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Andres Freund (#61)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 22.09.22 um 17:23 schrieb Andres Freund:

Hi,

On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote:

Patch update without merge conflicts.

Due to the merge of the meson based build, this patch needs to be adjusted. See
https://cirrus-ci.com/build/6580671765282816
Looks like it'd just be adding user_prng.c to
src/backend/utils/adt/meson.build's list.

Greetings,

Andres Freund

Hi Andres,

thanks for the heads up. Adjusted patch is attached.

- Martin

Attachments:

v4-0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=v4-0001-Introduce-array_shuffle-and-array_sample.patchDownload

From 3c4a939abf8d29bcf666e49ecb042ade42b36c2c Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v4] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml                  |  40 +++++-
 src/backend/utils/adt/Makefile          |   1 +
 src/backend/utils/adt/array_userfuncs.c | 176 ++++++++++++++++++++++++
 src/backend/utils/adt/float.c           |  37 +----
 src/backend/utils/adt/meson.build       |   1 +
 src/backend/utils/adt/user_prng.c       |  87 ++++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/include/utils/user_prng.h           |  19 +++
 src/test/regress/expected/arrays.out    |  66 +++++++++
 src/test/regress/sql/arrays.sql         |  16 +++
 10 files changed, 414 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e1fe4fec57..2a96fc323a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1820,7 +1820,8 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
         <returnvalue>void</returnvalue>
        </para>
        <para>
-        Sets the seed for subsequent <literal>random()</literal> calls;
+        Sets the seed for subsequent calls to random functions, like
+        <literal>random()</literal> or <literal>array_shuffle()</literal>;
         argument must be between -1.0 and 1.0, inclusive
        </para>
        <para>
@@ -1838,7 +1839,8 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
    applications; see the <xref linkend="pgcrypto"/> module for a more
    secure alternative.
    If <function>setseed()</function> is called, the series of results of
-   subsequent <function>random()</function> calls in the current session
+   subsequent calls to random functions, like <function>random()</function> or
+   <function>array_shuffle()</function>, in the current session
    can be repeated by re-issuing <function>setseed()</function> with the same
    argument.
    Without any prior <function>setseed()</function> call in the same
@@ -18468,6 +18470,40 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 0de0bbb1b8..17b57a5569 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -110,6 +110,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..c4a2117df7 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		j = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms++ = *jelms;
+			*inuls++ = *jnuls;
+			*jelms++ = elm;
+			*jnuls++ = nul;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n,
+				nitem;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+	nitem = (ARR_NDIM(array) < 1) ? 0 : ARR_DIMS(array)[0];
+
+	if (n < 0 || n > nitem)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("sample size must be between 0 and %d", nitem)));
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index fc8f39a7a9..29dabbbc6a 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -21,15 +21,13 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
-#include "common/pg_prng.h"
 #include "common/shortest_dec.h"
 #include "libpq/pqformat.h"
-#include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/float.h"
 #include "utils/fmgrprotos.h"
+#include "utils/user_prng.h"
 #include "utils/sortsupport.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -64,10 +62,6 @@ float8		degree_c_sixty = 60.0;
 float8		degree_c_one_half = 0.5;
 float8		degree_c_one = 1.0;
 
-/* State for drandom() and setseed() */
-static bool drandom_seed_set = false;
-static pg_prng_state drandom_seed;
-
 /* Local function prototypes */
 static double sind_q1(double x);
 static double cosd_q1(double x);
@@ -2749,30 +2743,8 @@ datanh(PG_FUNCTION_ARGS)
 Datum
 drandom(PG_FUNCTION_ARGS)
 {
-	float8		result;
-
-	/* Initialize random seed, if not done yet in this process */
-	if (unlikely(!drandom_seed_set))
-	{
-		/*
-		 * If possible, initialize the seed using high-quality random bits.
-		 * Should that fail for some reason, we fall back on a lower-quality
-		 * seed based on current time and PID.
-		 */
-		if (unlikely(!pg_prng_strong_seed(&drandom_seed)))
-		{
-			TimestampTz now = GetCurrentTimestamp();
-			uint64		iseed;
-
-			/* Mix the PID with the most predictable bits of the timestamp */
-			iseed = (uint64) now ^ ((uint64) MyProcPid << 32);
-			pg_prng_seed(&drandom_seed, iseed);
-		}
-		drandom_seed_set = true;
-	}
-
-	/* pg_prng_double produces desired result range [0.0 - 1.0) */
-	result = pg_prng_double(&drandom_seed);
+	/* user_prng_double produces desired result range [0.0 - 1.0) */
+	float8		result = user_prng_double();
 
 	PG_RETURN_FLOAT8(result);
 }
@@ -2792,8 +2764,7 @@ setseed(PG_FUNCTION_ARGS)
 				 errmsg("setseed parameter %g is out of allowed range [-1,1]",
 						seed)));
 
-	pg_prng_fseed(&drandom_seed, seed);
-	drandom_seed_set = true;
+	user_prng_seed(seed);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index ed9ceadfef..0372f5e4b7 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -95,6 +95,7 @@ backend_sources += files(
   'tsvector.c',
   'tsvector_op.c',
   'tsvector_parser.c',
+  'user_prng.c',
   'uuid.c',
   'varbit.c',
   'varchar.c',
diff --git a/src/backend/utils/adt/user_prng.c b/src/backend/utils/adt/user_prng.c
new file mode 100644
index 0000000000..a541e955ef
--- /dev/null
+++ b/src/backend/utils/adt/user_prng.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.c
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/user_prng.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "miscadmin.h"
+#include "utils/user_prng.h"
+#include "utils/timestamp.h"
+
+/*
+ * Process wide state vector used by all user facing random functions,
+ * like random() or array_shuffle().
+ */
+static pg_prng_state user_prng_state;
+static bool user_prng_state_set = false;
+
+/*
+ * Ensure the prng is seeded.
+ */
+static inline void
+user_prng_ensure_seed(void)
+{
+	/* Initialize prng state, if not done yet in this process. */
+	if (unlikely(!user_prng_state_set))
+	{
+		/*
+		 * If possible, initialize the state using high-quality random bits.
+		 * Should that fail for some reason, we fall back on a lower-quality
+		 * seed based on current time and PID.
+		 */
+		if (unlikely(!pg_prng_strong_seed(&user_prng_state)))
+		{
+			uint64		now = (uint64) GetCurrentTimestamp();
+			uint64		pid = (uint64) MyProcPid;
+
+			/* Mix PID with the most predictable bits of the timestamp. */
+			pg_prng_seed(&user_prng_state, now ^ (pid << 32));
+		}
+
+		user_prng_state_set = true;
+	}
+}
+
+/*
+ * Seed the prng.
+ */
+void
+user_prng_seed(float8 seed)
+{
+	pg_prng_fseed(&user_prng_state, seed);
+
+	user_prng_state_set = true;
+}
+
+/*
+ * Select a random double from the range [0.0, 1.0).
+ */
+float8
+user_prng_double(void)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_double(&user_prng_state);
+}
+
+/*
+ * Select a random uint64 from the range [min, max].
+ */
+uint64
+user_prng_uint64_range(uint64 min, uint64 max)
+{
+	user_prng_ensure_seed();
+
+	return pg_prng_uint64_range(&user_prng_state, min, max);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..00c32c58e7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/include/utils/user_prng.h b/src/include/utils/user_prng.h
new file mode 100644
index 0000000000..959c1d727d
--- /dev/null
+++ b/src/include/utils/user_prng.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * user_prng.h
+ *	  Random number generator for user facing functions.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * src/include/utils/user_prng.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef USER_PRNG_H
+#define USER_PRNG_H
+
+extern void user_prng_seed(float8 seed);
+extern float8 user_prng_double(void);
+extern uint64 user_prng_uint64_range(uint64 min, uint64 max);
+
+#endif							/* USER_PRNG_H */
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 97920f38c2..67e6ca9ab2 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2447,3 +2447,69 @@ SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
 ERROR:  number of elements to trim must be between 0 and 0
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {5,NULL,1,2,4,3}
+(1 row)
+
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+  array_shuffle   
+------------------
+ {4,1,2,NULL,3,5}
+(1 row)
+
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+             array_shuffle             
+---------------------------------------
+ [-1:2][2:3]={{7,8},{3,4},{1,2},{5,6}}
+(1 row)
+
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+                 array_shuffle                  
+------------------------------------------------
+ {{{5,6},{7,8}},{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+ setseed 
+---------
+ 
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,NULL,1}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+ array_sample 
+--------------
+ {5,3,4}
+(1 row)
+
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+          array_sample           
+---------------------------------
+ [-1:1][2:3]={{1,2},{3,4},{5,6}}
+(1 row)
+
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+           array_sample           
+----------------------------------
+ {{{1,2},{3,4}},{{9,10},{11,12}}}
+(1 row)
+
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+ERROR:  sample size must be between 0 and 6
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7); --fail
+ERROR:  sample size must be between 0 and 6
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index 791af5c0ce..98c3993d9c 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -755,3 +755,19 @@ FROM
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
+
+-- array_shuffle
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('{NULL,1,2,3,4,5}'::int[]);
+SELECT array_shuffle('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[]);
+SELECT array_shuffle('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]);
+
+-- array_sample
+SELECT setseed(0.1); -- ensure predictable results
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 3);
+SELECT array_sample('[-1:2][2:3]={{1,2},{3,4},{5,6},{7,8}}'::int[], 3);
+SELECT array_sample('{{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2);
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], -1); -- fail
+SELECT array_sample('{NULL,1,2,3,4,5}'::int[], 7); --fail
-- 
2.37.3

#63

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#62)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

[ v4-0001-Introduce-array_shuffle-and-array_sample.patch ]

I think this idea of exporting drandom()'s PRNG for all and sundry
to use is completely misguided. If we go down that path we'll
be right back in the swamp that we were in when we used random(3),
namely that (a) it's not clear what affects what, and (b) to the
extent that there are such interferences, it could be bad, maybe
even a security problem. We very intentionally decoupled drandom()
from the rest of the world at commit 6645ad6bd, and I'm not ready
to unlearn that lesson.

With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose. I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.

regards, tom lane

#64

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#63)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 26.09.22 um 22:16 schrieb Tom Lane:

With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose. I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.

Thanks for your thoughts, Tom. I have a couple of questions. Should we
introduce a new seed function for the new PRNG state, used by
array_shuffle() and array_sample()? What would be a good name? Or should
those functions use pg_global_prng_state? Is it safe to assume, that
pg_global_prng_state is seeded?

Martin

#65

Fabien COELHO

coelho@cri.ensmp.fr

over 3 years ago

In reply to: Martin Kalcher (#64)

Re: [PATCH] Introduce array_shuffle() and array_sample()

With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose. I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.

Thanks for your thoughts, Tom. I have a couple of questions. Should we
introduce a new seed function for the new PRNG state, used by array_shuffle()
and array_sample()? What would be a good name? Or should those functions use
pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is
seeded?

I'd suggest to use the existing global state. The global state should be
seeded at the process start, AFAIKR.

--
Fabien.

#66

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Fabien COELHO (#65)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Fabien COELHO <coelho@cri.ensmp.fr> writes:

Thanks for your thoughts, Tom. I have a couple of questions. Should we
introduce a new seed function for the new PRNG state, used by array_shuffle()
and array_sample()? What would be a good name? Or should those functions use
pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is
seeded?

I'd suggest to use the existing global state. The global state should be
seeded at the process start, AFAIKR.

It is seeded at process start, yes. If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine. (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

regards, tom lane

#67

Martin Kalcher

martin.kalcher@aboutsource.net

over 3 years ago

In reply to: Tom Lane (#66)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Am 28.09.22 um 16:18 schrieb Tom Lane:

It is seeded at process start, yes. If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine. (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

regards, tom lane

New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

Martin

Attachments:

v5-0001-Introduce-array_shuffle-and-array_sample.patchtext/x-patch; charset=UTF-8; name=v5-0001-Introduce-array_shuffle-and-array_sample.patchDownload

From b9433564f925521f5f6bcebd7cd74a3e12f4f354 Mon Sep 17 00:00:00 2001
From: Martin Kalcher <martin.kalcher@aboutsource.net>
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v5] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.
---
 doc/src/sgml/func.sgml                  |  34 +++++
 src/backend/utils/adt/array_userfuncs.c | 176 ++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/test/regress/expected/arrays.out    |  54 ++++++++
 src/test/regress/sql/arrays.sql         |  14 ++
 5 files changed, 284 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e1fe4fec57..6b2a3d16f6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18468,6 +18468,40 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..4bf28da5e4 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -14,6 +14,7 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+				Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem,
+				i,
+				j,
+				k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms++ = *jelms;
+			*inuls++ = *jnuls;
+			*jelms++ = elm;
+			*jnuls++ = nul;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	n = ARR_DIMS(array)[0];
+
+	/* There is no point in shuffling arrays with less then two items. */
+	if (n < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n,
+				nitem;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	n = PG_GETARG_INT32(1);
+	nitem = (ARR_NDIM(array) < 1) ? 0 : ARR_DIMS(array)[0];
+
+	if (n < 0 || n > nitem)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("sample size must be between 0 and %d", nitem)));
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	result = array_shuffle_n(array, n,
+							 elmtyp, elmlen, elmbyval, elmalign);
+
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a07e737a33..00c32c58e7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proparallel => 'r', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 97920f38c2..9dd365f22d 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2447,3 +2447,57 @@ SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
 ERROR:  number of elements to trim must be between 0 and 0
+-- array_shuffle
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) <@ '{1,2,3,4,5,6}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) @> '{1,2,3,4,5,6}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_dims(array_shuffle('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[]));
+ array_dims  
+-------------
+ [-1:2][2:3]
+(1 row)
+
+SELECT array_dims(array_shuffle('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]));
+   array_dims    
+-----------------
+ [1:3][1:2][1:2]
+(1 row)
+
+-- array_sample
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 3) <@ '{1,2,3,4,5,6}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_length(array_sample('{1,2,3,4,5,6}'::int[], 3), 1);
+ array_length 
+--------------
+            3
+(1 row)
+
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-------------
+ [-1:1][2:3]
+(1 row)
+
+SELECT array_dims(array_sample('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2));
+   array_dims    
+-----------------
+ [1:2][1:2][1:2]
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5,6}'::int[], -1); -- fail
+ERROR:  sample size must be between 0 and 6
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 7); --fail
+ERROR:  sample size must be between 0 and 6
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index 791af5c0ce..766798bc62 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -755,3 +755,17 @@ FROM
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
+
+-- array_shuffle
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) <@ '{1,2,3,4,5,6}';
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) @> '{1,2,3,4,5,6}';
+SELECT array_dims(array_shuffle('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[]));
+SELECT array_dims(array_shuffle('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]));
+
+-- array_sample
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 3) <@ '{1,2,3,4,5,6}';
+SELECT array_length(array_sample('{1,2,3,4,5,6}'::int[], 3), 1);
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+SELECT array_dims(array_sample('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2));
+SELECT array_sample('{1,2,3,4,5,6}'::int[], -1); -- fail
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 7); --fail
-- 
2.37.3

#68

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Martin Kalcher (#67)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today. I find this behavior a bit
surprising:

+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-------------
+ [-1:1][2:3]
+(1 row)

I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

Some other comments:

+ Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.

What's "selection order"? And this probably shouldn't just rely
on the example to describe what happens with multi-D arrays.
Writing "elements" seems somewhere between confusing and wrong.

* Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n,
and let it pull out what it needs. Less duplicate code that way.

* I find array_shuffle_n drastically undercommented, and what comments
it has are pretty misleading, eg

+ /* Swap all elements in item (i) with elements in item (j). */

j is *not* the index of the second item to be swapped. You could make
it so, and that might be more readable:

j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1);
jelms = elms + j * nelm;
jnuls = nuls + j * nelm;

But if you want the code to stay as it is, this comment needs work.

* I think good style for SQL-callable C functions is to make the arguments
clear at the top:

+array_sample(PG_FUNCTION_ARGS)
+{
+    ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+    int         n = PG_GETARG_INT32(1);
+    ArrayType  *result;
+    ... other declarations as needed ...

We can't quite make normal C declaration style work, but that's a poor
excuse for not making the argument list visible as directly as possible.

* I wouldn't bother with the PG_FREE_IF_COPY calls. Those are generally
only used in btree comparison functions, in which there's a need to not
leak memory.

* I wonder if we really need these to be proparallel = 'r'. If we let
a worker process execute them, they will take numbers from the worker's
pg_global_prng_state seeding not the parent process's seeding, but why
is that a problem? In the prior version where you were tying them
to the parent's drandom() sequence, proparallel = 'r' made sense;
but now I think it's unnecessary.

regards, tom lane

#69

Gregory Stark (as CFM)

stark.cfm@gmail.com

almost 3 years ago

In reply to: Tom Lane (#68)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On Thu, 29 Sept 2022 at 15:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today. I find this behavior a bit
surprising:

It looks like this patch received useful feedback and it wouldn't take
much to push it over the line. But it's been Waiting On Author since
last September.

Martin, any chance of getting these last bits of feedback resolved so
it can be Ready for Commit?

--
Gregory Stark
As Commitfest Manager

#70

Gregory Stark (as CFM)

stark.cfm@gmail.com

almost 3 years ago

In reply to: Gregory Stark (as CFM) (#69)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Given that there's been no updates since September 22 I'm going to
make this patch Returned with Feedback. The patch can be resurrected
when there's more work done -- don't forget to move it to the new CF
when you do that.

--
Gregory Stark
As Commitfest Manager

#71

Daniel Gustafsson

daniel@yesql.se

almost 3 years ago

In reply to: Tom Lane (#68)

1 attachment(s)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martin Kalcher <martin.kalcher@aboutsource.net> writes:

New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today.

Since this seems pretty close to going in, and seems like quite useful
functions, I took a look to see if I could get it across the line (although I
noticed that CFM beat me to the clock in sending this =)).

I find this behavior a bit surprising:
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-------------
+ [-1:1][2:3]
+(1 row)
I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

I might be daft but I'm not sure I follow why not preserving here, can you
explain?

The rest of your comments have been addressed in the attached v6 I think
(although I'm pretty sure the docs part is just as bad now, explaining these in
concise words is hard, will take another look with fresh eyes tomorrow).

--
Daniel Gustafsson

Attachments:

v6-0001-Introduce-array_shuffle-and-array_sample.patchapplication/octet-stream; name=v6-0001-Introduce-array_shuffle-and-array_sample.patch; x-unix-mode=0644Download

From 0885d5d25f525dfd2d816f83425197f429d7be7f Mon Sep 17 00:00:00 2001
From: Daniel Gustafsson <dgustafsson@postgresql.org>
Date: Mon, 3 Apr 2023 23:21:09 +0200
Subject: [PATCH v6] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.
TODO: better commitmessage

Author: Martin Kalcher <martin.kalcher@aboutsource.net>
---
 doc/src/sgml/func.sgml                  |  39 ++++++
 src/backend/utils/adt/array_userfuncs.c | 161 ++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat         |   6 +
 src/test/regress/expected/arrays.out    |  54 ++++++++
 src/test/regress/sql/arrays.sql         |  14 +++
 5 files changed, 274 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 918a492234..9bc1d5d356 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18779,6 +18779,45 @@ SELECT NULLIF(value, '(none)') ...
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_sample</primary>
+        </indexterm>
+        <function>array_sample</function> ( <parameter>array</parameter> <type>anyarray</type>, <parameter>n</parameter> <type>integer</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Returns an array with an upper-bound of <parameter>n</parameter>
+        randomly chosen dimensions from <parameter>array</parameter>.
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[1,2,3,4,5,6], 3)</literal>
+        <returnvalue>{2,6,1}</returnvalue>
+       </para>
+       <para>
+        <literal>array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)</literal>
+        <returnvalue>{{5,6},{1,2}}</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>array_shuffle</primary>
+        </indexterm>
+        <function>array_shuffle</function> ( <type>anyarray</type> )
+        <returnvalue>anyarray</returnvalue>
+       </para>
+       <para>
+        Shuffles the first dimension of the array.
+       </para>
+       <para>
+        <literal>array_shuffle(ARRAY[[1,2],[3,4],[5,6]])</literal>
+        <returnvalue>{{5,6},{1,2},{3,4}}</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm id="function-array-to-string">
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index 80750191d8..8ee569c4ff 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -15,6 +15,7 @@
 #include "catalog/pg_type.h"
 #include "libpq/pqformat.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/datum.h"
@@ -1525,3 +1526,163 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * array_shuffle_n
+ *		Return a copy of array with n randomly chosen items.
+ *
+ * The number of items must not exceed the size of the first dimension of the
+ * array.
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n, Oid elmtyp, TypeCacheEntry *typentry)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+				rdims[MAXDIM],
+				nelm,
+				nitem;
+	Datum		elm,
+			   *elms,
+			   *ielms;
+	bool		nul,
+			   *nuls,
+			   *inuls;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	/* If the target array is empty, exit fast */
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+					  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;				/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (int i = 0; i < n; i++)
+	{
+		int		j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, nitem - i - 1) * nelm;
+		Datum  *jelms = ielms + j;
+		bool   *jnuls = inuls + j;
+
+		/*
+		 * Swap elements in item (i) with elements in the (jelms) item which is
+		 * chosen by the random offset of (j).
+		 */
+		for (int k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms++ = *jelms;
+			*inuls++ = *jnuls;
+			*jelms++ = elm;
+			*jnuls++ = nul;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+								elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * array_shuffle
+ *
+ * Returns an array with the same dimensions as the input array, with its
+ * elements in random order.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+	ArrayType  *result;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+
+	/*
+	 * There is no point in shuffling empty arrays or arrays with less than
+	 * two items.
+	 */
+	if (ARR_NDIM(array) < 1 || ARR_DIMS(array)[0] < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+
+	result = array_shuffle_n(array, ARR_DIMS(array)[0], elmtyp, typentry);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * array_sample
+ *
+ * Returns an array of the same dimensionality with an upper-bound of n with
+ * randomly chosen items from the input array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+	int			n = PG_GETARG_INT32(1);
+	ArrayType  *result;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			nitem;
+
+	nitem = (ARR_NDIM(array) < 1) ? 0 : ARR_DIMS(array)[0];
+
+	if (n < 0 || n > nitem)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("sample size must be between 0 and %d", nitem)));
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+
+	result = array_shuffle_n(array, n, elmtyp, typentry);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9f2642201..660681ce52 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1728,6 +1728,12 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '8464', descr => 'shuffle array',
+  proname => 'array_shuffle', provolatile => 'v', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray', prosrc => 'array_shuffle' },
+{ oid => '8465', descr => 'take samples from array',
+  proname => 'array_sample', provolatile => 'v', proisstrict => 't',
+  prorettype => 'anyarray', proargtypes => 'anyarray int4', prosrc => 'array_sample' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index bfaf125187..565ba6ea24 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -2472,3 +2472,57 @@ SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 ERROR:  number of elements to trim must be between 0 and 3
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
 ERROR:  number of elements to trim must be between 0 and 0
+-- array_shuffle
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) <@ '{1,2,3,4,5,6}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) @> '{1,2,3,4,5,6}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_dims(array_shuffle('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[]));
+ array_dims  
+-------------
+ [-1:2][2:3]
+(1 row)
+
+SELECT array_dims(array_shuffle('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]));
+   array_dims    
+-----------------
+ [1:3][1:2][1:2]
+(1 row)
+
+-- array_sample
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 3) <@ '{1,2,3,4,5,6}';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT array_length(array_sample('{1,2,3,4,5,6}'::int[], 3), 1);
+ array_length 
+--------------
+            3
+(1 row)
+
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-------------
+ [-1:1][2:3]
+(1 row)
+
+SELECT array_dims(array_sample('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2));
+   array_dims    
+-----------------
+ [1:2][1:2][1:2]
+(1 row)
+
+SELECT array_sample('{1,2,3,4,5,6}'::int[], -1); -- fail
+ERROR:  sample size must be between 0 and 6
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 7); --fail
+ERROR:  sample size must be between 0 and 6
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index 094937ba63..f1375621e0 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -761,3 +761,17 @@ FROM
 SELECT trim_array(ARRAY[1, 2, 3], -1); -- fail
 SELECT trim_array(ARRAY[1, 2, 3], 10); -- fail
 SELECT trim_array(ARRAY[]::int[], 1); -- fail
+
+-- array_shuffle
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) <@ '{1,2,3,4,5,6}';
+SELECT array_shuffle('{1,2,3,4,5,6}'::int[]) @> '{1,2,3,4,5,6}';
+SELECT array_dims(array_shuffle('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[]));
+SELECT array_dims(array_shuffle('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[]));
+
+-- array_sample
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 3) <@ '{1,2,3,4,5,6}';
+SELECT array_length(array_sample('{1,2,3,4,5,6}'::int[], 3), 1);
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+SELECT array_dims(array_sample('{{{1,2},{3,NULL}},{{5,6},{7,8}},{{9,10},{11,12}}}'::int[], 2));
+SELECT array_sample('{1,2,3,4,5,6}'::int[], -1); -- fail
+SELECT array_sample('{1,2,3,4,5,6}'::int[], 7); --fail
-- 
2.32.1 (Apple Git-133)

#72

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Daniel Gustafsson (#71)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Daniel Gustafsson <daniel@yesql.se> writes:

On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I find this behavior a bit surprising:
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-------------
+ [-1:1][2:3]
+(1 row)
I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

I might be daft but I'm not sure I follow why not preserving here, can you
explain?

Because array_sample selects only some of the (first level) array
elements, those elements are typically not going to have the same
indexes in the output as they did in the input. So I don't see why
it would be useful to preserve the same lower-bound index. It does
make sense to preserve the lower-order index bounds ([2:3] in this
example) because we are including or not including those array
slices as a whole.

regards, tom lane

#73

Daniel Gustafsson

daniel@yesql.se

almost 3 years ago

In reply to: Tom Lane (#72)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On 3 Apr 2023, at 23:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:
On 29 Sep 2022, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I find this behavior a bit surprising:
+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-------------
+ [-1:1][2:3]
+(1 row)
I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.
I might be daft but I'm not sure I follow why not preserving here, can you
explain?

Because array_sample selects only some of the (first level) array
elements, those elements are typically not going to have the same
indexes in the output as they did in the input. So I don't see why
it would be useful to preserve the same lower-bound index. It does
make sense to preserve the lower-order index bounds ([2:3] in this
example) because we are including or not including those array
slices as a whole.

Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like
this tomorrow.

--
Daniel Gustafsson

#74

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Daniel Gustafsson (#73)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Daniel Gustafsson <daniel@yesql.se> writes:

Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like
this tomorrow.

Since we're running out of time, I took the liberty of fixing and
pushing this.

regards, tom lane

#75

Daniel Gustafsson

daniel@yesql.se

almost 3 years ago

In reply to: Tom Lane (#74)

Re: [PATCH] Introduce array_shuffle() and array_sample()

On 7 Apr 2023, at 17:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like
this tomorrow.

Since we're running out of time, I took the liberty of fixing and
pushing this.

Great, thanks!

--
Daniel Gustafsson

#76

Salek Talangi

salek.talangi@googlemail.com

over 2 years ago

In reply to: Daniel Gustafsson (#75)

Re: [PATCH] Introduce array_shuffle() and array_sample()

Hi all,

reading this blog post
https://www.depesz.com/2023/04/18/waiting-for-postgresql-16-add-array_sample-and-array_shuffle-functions/
I became aware of the new feature and had a look at it and the commit
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=888f2ea0a81ff171087bdd1c5c1eeda3b78d73d4

To me the description
/*
* Shuffle array using Fisher-Yates algorithm. Scan the array and swap
* current item (nelm datums starting at ielms) with a randomly chosen
* later item (nelm datums starting at jelms) in each iteration. We can
* stop once we've done n iterations; then first n items are the result.
*/

seems wrong. For n = 1 the returned item could never be the 1st item of the
array (see "randomly chosen later item").
If this really is the case then the result is not really random. But to me
it seems j later can be 0 (making it not really "later"), so this might
only be a documentation issue.

Best regards
Salek Talangi

Am Mi., 19. Apr. 2023 um 13:48 Uhr schrieb Daniel Gustafsson <
daniel@yesql.se>:

Show quoted text

On 7 Apr 2023, at 17:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch

like

this tomorrow.

Since we're running out of time, I took the liberty of fixing and
pushing this.

Great, thanks!

--
Daniel Gustafsson