New function normal_rand_array function to contrib/tablefunc.

Started by Andy Fanalmost 2 years ago30 messageshackers

zhihui.fan1213@gmail.com

almost 2 years ago

Here is a new function which could produce an array of numbers with a
controllable array length and duplicated elements in these arrays. I
used it when working with gin index, and I think it would be helpful for
others as well, so here it is.

select * from normal_rand_array(5, 10, 1.8::numeric, 3.5::numeric);
normal_rand_array
-----------------------------------------------
{3.3,2.3,2.7,3.2,2.0,2.7,3.4,2.7,2.3,2.9}
{3.3,1.8,2.9,3.4,2.0,1.8,2.0,3.5,2.8,2.5}
{2.1,1.9,2.3,1.9,2.5,2.7,2.4,2.9,1.8}
{2.3,2.5,2.4,2.7,2.7,2.3,2.9,3.3,3.3,1.9,3.5}
{2.8,3.4,2.7,1.8,3.3,2.3,2.2,3.5,2.6,2.5}
(5 rows)

select * from normal_rand_array(5, 10, 1.8::int4, 3.5::int4);
normal_rand_array
-------------------------------------
{3,2,2,3,4,2}
{2,4,2,3,3,3,3,2,2,3,3,2,3,2}
{2,4,3}
{4,2,3,4,2,4,2,2,3,4,3,3,2,4,4,2,3}
{4,3,3,4,3,3,4,2,4}
(5 rows)

the 5 means it needs to produce 5 rows in total and the 10 is the
average array length, and 1.8 is the minvalue for the random function
and 3.5 is the maxvalue.

--
Best Regards
Andy Fan

Stepan Neretin

fenixrnd@mail.ru

almost 2 years ago

In reply to: Andy Fan (#1)

Re: New function normal_rand_array function to contrib/tablefunc.

It looks useful, for example, it can be used in sorting tests to make them more interesting. I just have one question. Why are you using SRF_IS_FIRST CALL and not _PG_init?
Best regards, Stepan Neretin.

Jim Jones

jim.jones@uni-muenster.de

almost 2 years ago

In reply to: Andy Fan (#1)

Re: New function normal_rand_array function to contrib/tablefunc.

Hi Andy

On 08.06.24 08:05, Andy Fan wrote:

Here is a new function which could produce an array of numbers with a
controllable array length and duplicated elements in these arrays. I
used it when working with gin index, and I think it would be helpful for
others as well, so here it is.

select * from normal_rand_array(5, 10, 1.8::numeric, 3.5::numeric);
normal_rand_array
-----------------------------------------------
{3.3,2.3,2.7,3.2,2.0,2.7,3.4,2.7,2.3,2.9}
{3.3,1.8,2.9,3.4,2.0,1.8,2.0,3.5,2.8,2.5}
{2.1,1.9,2.3,1.9,2.5,2.7,2.4,2.9,1.8}
{2.3,2.5,2.4,2.7,2.7,2.3,2.9,3.3,3.3,1.9,3.5}
{2.8,3.4,2.7,1.8,3.3,2.3,2.2,3.5,2.6,2.5}
(5 rows)

select * from normal_rand_array(5, 10, 1.8::int4, 3.5::int4);
normal_rand_array
-------------------------------------
{3,2,2,3,4,2}
{2,4,2,3,3,3,3,2,2,3,3,2,3,2}
{2,4,3}
{4,2,3,4,2,4,2,2,3,4,3,3,2,4,4,2,3}
{4,3,3,4,3,3,4,2,4}
(5 rows)

the 5 means it needs to produce 5 rows in total and the 10 is the
average array length, and 1.8 is the minvalue for the random function
and 3.5 is the maxvalue.

When either minval or maxval exceeds int4 the function cannot be
executed/found

SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);

ERROR: function normal_rand_array(integer, integer, integer, bigint)
does not exist
LINE 1: SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
^
HINT: No function matches the given name and argument types. You might
need to add explicit type casts.
---

SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42);

ERROR: function normal_rand_array(integer, integer, bigint, integer)
does not exist
LINE 1: SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42);
^
HINT: No function matches the given name and argument types. You might
need to add explicit type casts.
---

However, when both are int8 it works fine:

SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42::bigint);

normal_rand_array
--------------------------------------------------
{29,38,31,10,23,39,9,32}
{8,39,19,31,29,15,17,15,36,20,33,19}
{15,18,42,19}
{16,31,33,11,14,20,24,9,12,17,22,42,41,24,11,41}
{15,11,36,8,28,37}
(5 rows)
---

Is it the expected behaviour?

In some cases the function returns an empty array. Is it also expected?

SELECT count(*)
FROM normal_rand_array(100000, 10, 8, 42) i
WHERE array_length(i,1) IS NULL;

count
-------
4533
(1 row)

In both cases, perhaps mentioning these behaviors in the docs would
avoid some confusion.

Thanks!

Best,

--
Jim

Dean Rasheed

dean.a.rasheed@gmail.com

almost 2 years ago

In reply to: Jim Jones (#3)

Re: New function normal_rand_array function to contrib/tablefunc.

On Tue, 2 Jul 2024 at 11:18, Jim Jones <jim.jones@uni-muenster.de> wrote:

When either minval or maxval exceeds int4 the function cannot be
executed/found

SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);

ERROR: function normal_rand_array(integer, integer, integer, bigint)
does not exist
LINE 1: SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
^
HINT: No function matches the given name and argument types. You might
need to add explicit type casts.

This could be solved by defining separate functions for each supported
type, rather than one function with type anyelement. Internally, they
could continue to share common code, of course.

In some cases the function returns an empty array. Is it also expected?

Perhaps it would be useful to have separate minimum and maximum array
length arguments, rather than a mean array length argument.

Actually, I find the current behaviour somewhat counterintuitive. Only
after reading the source code did I realise what it's actually doing,
which is this:

Row 1: array of random length in range [0, meanarraylen]
Row 2: array of length 2*meanarraylen - length of previous array
Row 3: array of random length in range [0, meanarraylen]
Row 4: array of length 2*meanarraylen - length of previous array
...

That's far from obvious (it's certainly not documented) and I don't
think it's a particularly good way of achieving a specified mean array
length, because only half the lengths are random.

One thing that's definitely needed is more documentation. It should
have its own new subsection, like the other tablefunc functions.

Something else that confused me is why this function is called
normal_rand_array(). The underlying random functions that it's calling
return random values from a uniform distribution, not a normal
distribution. Arrays of normally distributed random values might also
be useful, but that's not what this patch is doing.

Also, the function accepts float8 minval and maxval arguments, and
then simply ignores them and returns random float8 values in the range
[0,1), which is highly counterintuitive.

My suggestion would be to mirror the signatures of the core random()
functions more closely, and have this:

1). rand_array(numvals int, minlen int, maxlen int)
returns setof float8[]

2). rand_array(numvals int, minlen int, maxlen int,
minval int, maxval int)
returns setof int[]

3). rand_array(numvals int, minlen int, maxlen int,
minval bigint, maxval bigint)
returns setof bigint[]

4). rand_array(numvals int, minlen int, maxlen int,
minval numeric, maxval numeric)
returns setof numeric[]

Also, I'd recommend giving the function arguments names in SQL, like
this, since that makes them easier to use.

Something else that's not obvious is that this patch is relying on the
core random functions, which means that it's using the same PRNG state
as those functions. That's probably OK, but it should be documented,
because it's different from tablefunc's normal_rand() function, which
uses pg_global_prng_state. In particular, what this means is that
calling setseed() will affect the output of these new functions, but
not normal_rand(). Incidentally, that makes writing tests much simpler
-- just call setseed() first and the output will be repeatable.

Regards,
Dean

Andy Fan

zhihui.fan1213@gmail.com

almost 2 years ago

In reply to: Dean Rasheed (#4)

Re: New function normal_rand_array function to contrib/tablefunc.

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

My suggestion would be to mirror the signatures of the core random()
functions more closely, and have this:

1). rand_array(numvals int, minlen int, maxlen int)
returns setof float8[]

2). rand_array(numvals int, minlen int, maxlen int,
minval int, maxval int)
returns setof int[]

3). rand_array(numvals int, minlen int, maxlen int,
minval bigint, maxval bigint)
returns setof bigint[]

4). rand_array(numvals int, minlen int, maxlen int,
minval numeric, maxval numeric)
returns setof numeric[]

this is indeed a more clean and correct APIs, I will use the above ones
in the next version. Thanks for the suggestion.

It is just not clear to me how verbose the document should to be, and
where the document should be, tablefunc.sgml, the comment above the
function or the places just the code? In many cases you provides above
or below are just implementation details, not the API declaimed purpose.

Something else that's not obvious is that this patch is relying on the
core random functions, which means that it's using the same PRNG state
as those functions. That's probably OK, but it should be documented,
because it's different from tablefunc's normal_rand() function, which
uses pg_global_prng_state.

My above question applies to this comment.

In particular, what this means is that
calling setseed() will affect the output of these new functions, but
not normal_rand(). Incidentally, that makes writing tests much simpler
-- just call setseed() first and the output will be repeatable.

Good to know this user case. for example, should this be documented?

In some cases the function returns an empty array. Is it also expected?

Perhaps it would be useful to have separate minimum and maximum array
length arguments, rather than a mean array length argument.

I'm not sure which one is better, but main user case of this function
for testing pupose, so it I think minimum and maximum array length is
good for me too.

Actually, I find the current behaviour somewhat counterintuitive. Only
after reading the source code did I realise what it's actually doing,
which is this:

Row 1: array of random length in range [0, meanarraylen]
Row 2: array of length 2*meanarraylen - length of previous array
Row 3: array of random length in range [0, meanarraylen]
Row 4: array of length 2*meanarraylen - length of previous array
...

That's far from obvious (it's certainly not documented) and I don't
think it's a particularly good way of achieving a specified mean array
length, because only half the lengths are random.

I'm not sure how does this matter in real user case.

One thing that's definitely needed is more documentation. It should
have its own new subsection, like the other tablefunc functions.

is the documentaion for the '2*meanarraylen - lastarraylen'?

and What is new subsection, do you mean anything wrong in
'tablefunc.sgml', I did have some issue to run 'make html', but the
error exists before my patch, so I change the document carefully without
testing it. do you know how to fix the below error in 'make html'?

$/usr/bin/xsltproc --nonet --path . --path . --stringparam pg.version '18devel' stylesheet.xsl postgres-full.xml

I/O error : Attempt to load network entity http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
warning: failed to load external entity "http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl"
compilation error: file stylesheet.xsl line 6 element import
xsl:import : unable to load http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
I/O error : Attempt to load network entity http://docbook.sourceforge.net/release/xsl/current/common/entities.ent
stylesheet-html-common.xsl:4: warning: failed to load external entity "http://docbook.sourceforge.net/release/xsl/current/common/entities.ent"
%common.entities;
^
stylesheet-html-common.xsl:124: parser error : Entity 'primary' not defined
translate(substring(&primary;, 1, 1),

Something else that confused me is why this function is called
normal_rand_array(). The underlying random functions that it's calling
return random values from a uniform distribution, not a normal
distribution. Arrays of normally distributed random values might also
be useful, but that's not what this patch is doing.

OK, you are right, your new names should be better.

Also, the function accepts float8 minval and maxval arguments, and
then simply ignores them and returns random float8 values in the range
[0,1), which is highly counterintuitive.

This is a obvious bug and it only exists in float8 case IIUC, will fix
it in the next version.

--
Best Regards
Andy Fan

Andy Fan

zhihui.fan1213@gmail.com

almost 2 years ago

In reply to: Andy Fan (#1)

Re: New function normal_rand_array function to contrib/tablefunc.

Andy Fan <zhihuifan1213@163.com> writes:

(just noticed this reply is sent to Jim privately, re-sent it to
public.)

Hi Jim,

When either minval or maxval exceeds int4 the function cannot be
executed/found

SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);

ERROR: function normal_rand_array(integer, integer, integer, bigint)
does not exist
LINE 1: SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
^
HINT: No function matches the given name and argument types. You might
need to add explicit type casts.
---

SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42);

ERROR: function normal_rand_array(integer, integer, bigint, integer)
does not exist
LINE 1: SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42);
^
HINT: No function matches the given name and argument types. You might
need to add explicit type casts.
---

However, when both are int8 it works fine:

I defined the function as below:

postgres=# \df normal_rand_array
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+-------------------+------------------+------------------------------------------+------
public | normal_rand_array | SETOF anyarray | integer, integer, anyelement, anyelement | func
(1 row)

so it is required that the 3nd and 4th argument should have the same
data type, that's why your first 2 test case failed and the third one
works. and I also think we should not add a test case / document for
this since the behavior of 'anyelement' system.

This issue can be fixed with the new API defined suggested by Dean.

SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42::bigint);

normal_rand_array
--------------------------------------------------
{29,38,31,10,23,39,9,32}
{8,39,19,31,29,15,17,15,36,20,33,19}
{15,18,42,19}
{16,31,33,11,14,20,24,9,12,17,22,42,41,24,11,41}
{15,11,36,8,28,37}
(5 rows)
---

Is it the expected behaviour?

Yes, see the above statements.

In some cases the function returns an empty array. Is it also expected?

SELECT count(*)
FROM normal_rand_array(100000, 10, 8, 42) i
WHERE array_length(i,1) IS NULL;

count
-------
4533
(1 row)

Yes, by design I think it is a feature which could generate [] case
which should be used a special case for testing, and at the
implementation side, the [] means the length is 0 which is caused by I
choose the 'len' by random [0 .. len * 2], so 0 is possible and doesn't
confict with the declared behavior.

In both cases, perhaps mentioning these behaviors in the docs would
avoid some confusion.

hmm, It doesn't take some big effort to add them, but I'm feeling that
would make the document a bit of too verbose/detailed.

Sorry for the late respone!

--
Best Regards
Andy Fan

Import Notes

Reply to msg id not found: 871q46u560.fsf@163.com

Dean Rasheed

dean.a.rasheed@gmail.com

almost 2 years ago

In reply to: Andy Fan (#5)

Re: New function normal_rand_array function to contrib/tablefunc.

On Wed, 17 Jul 2024 at 07:29, Andy Fan <zhihuifan1213@163.com> wrote:

It is just not clear to me how verbose the document should to be, and
where the document should be, tablefunc.sgml, the comment above the
function or the places just the code? In many cases you provides above
or below are just implementation details, not the API declaimed purpose.

Something else that's not obvious is that this patch is relying on the
core random functions, which means that it's using the same PRNG state
as those functions. That's probably OK, but it should be documented,
because it's different from tablefunc's normal_rand() function, which
uses pg_global_prng_state.

My above question applies to this comment.

One thing that's definitely needed is more documentation. It should
have its own new subsection, like the other tablefunc functions.

I was really referring to the SGML docs. Try to follow the style used
for the existing functions in tablefunc.sgml -- so in addition to
adding the row to the table at the top, also add one or more sections
further down the page to give more details, and example output.
Something like this:

https://www.postgresql.org/docs/current/tablefunc.html#TABLEFUNC-FUNCTIONS-NORMAL-RAND

That would be a good place to mention that setseed() can be used to
produce repeatable results.

I did have some issue to run 'make html', but the
error exists before my patch, so I change the document carefully without
testing it. do you know how to fix the below error in 'make html'?

$/usr/bin/xsltproc --nonet --path . --path . --stringparam pg.version '18devel' stylesheet.xsl postgres-full.xml

I/O error : Attempt to load network entity http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
warning: failed to load external entity "http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl"
compilation error: file stylesheet.xsl line 6 element import
xsl:import : unable to load http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
I/O error : Attempt to load network entity http://docbook.sourceforge.net/release/xsl/current/common/entities.ent
stylesheet-html-common.xsl:4: warning: failed to load external entity "http://docbook.sourceforge.net/release/xsl/current/common/entities.ent"
%common.entities;
^
stylesheet-html-common.xsl:124: parser error : Entity 'primary' not defined
translate(substring(&primary;, 1, 1),

This looks like you're missing a required package. Try installing
docbook-xsl or docbook-xsl-stylesheets or something similar (the
package name varies depending on your distro).

Regards,
Dean

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Dean Rasheed (#7)

Re: New function normal_rand_array function to contrib/tablefunc.

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

Hello Dean,

I did have some issue to run 'make html', but the
error exists before my patch, so I change the document carefully without
testing it. do you know how to fix the below error in 'make html'?

$/usr/bin/xsltproc --nonet --path . --path . --stringparam pg.version '18devel' stylesheet.xsl postgres-full.xml

I/O error : Attempt to load network entity http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
warning: failed to load external entity
"http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl"

This looks like you're missing a required package. Try installing
docbook-xsl or docbook-xsl-stylesheets or something similar (the
package name varies depending on your distro).

This does work, thank you!

My suggestion would be to mirror the signatures of the core random()
functions more closely, and have this:

1). rand_array(numvals int, minlen int, maxlen int)
returns setof float8[]

..>

4). rand_array(numvals int, minlen int, maxlen int,
minval numeric, maxval numeric)
returns setof numeric[]

this is indeed a more clean and correct APIs, I will use the above ones
in the next version. Thanks for the suggestion.

I followed your suggestion in the new attached version. They are not
only some cleaner APIs for user and but also some cleaner implementation
in core, Thank for this suggestion as well.

Sorry for the late response, just my new posistion is bit of busy that I
don't have enough time on community work.

--
Best Regards
Andy Fan

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Andy Fan (#8)

Re: New function normal_rand_array function to contrib/tablefunc.

Andy Fan <zhihuifan1213@163.com> writes:

My suggestion would be to mirror the signatures of the core random()
functions more closely, and have this:

1). rand_array(numvals int, minlen int, maxlen int)
returns setof float8[]

..>

4). rand_array(numvals int, minlen int, maxlen int,
minval numeric, maxval numeric)
returns setof numeric[]

this is indeed a more clean and correct APIs, I will use the above ones
in the next version. Thanks for the suggestion.

I followed your suggestion in the new attached version. They are not
only some cleaner APIs for user and but also some cleaner implementation
in core, Thank for this suggestion as well.

A new version is attached, nothing changed except replace
PG_GETARG_INT16 with PG_GETARG_INT32. PG_GETARG_INT16 is a copy-paste
error.

--
Best Regards
Andy Fan

#10

Japin Li

japinli@hotmail.com

over 1 year ago

In reply to: Andy Fan (#9)

Re: New function normal_rand_array function to contrib/tablefunc.

On Tue, 27 Aug 2024 at 16:43, Andy Fan <zhihuifan1213@163.com> wrote:

Andy Fan <zhihuifan1213@163.com> writes:

My suggestion would be to mirror the signatures of the core random()
functions more closely, and have this:

1). rand_array(numvals int, minlen int, maxlen int)
returns setof float8[]

..>

4). rand_array(numvals int, minlen int, maxlen int,
minval numeric, maxval numeric)
returns setof numeric[]

this is indeed a more clean and correct APIs, I will use the above ones
in the next version. Thanks for the suggestion.

I followed your suggestion in the new attached version. They are not
only some cleaner APIs for user and but also some cleaner implementation
in core, Thank for this suggestion as well.

A new version is attached, nothing changed except replace
PG_GETARG_INT16 with PG_GETARG_INT32. PG_GETARG_INT16 is a copy-paste
error.

Thanks for updating the patch. Here are some comments.

+	if (minlen >= maxlen)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("minlen must be greater than maxlen.")));

There error message should be "minlen must be smaller than maxlen", right?

+	if (minlen < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("minlen and maxlen must be greater than zero.")));

Here the minlen might be zero, so the error message is incorrect.
How about use "minlen must be greater than or equal to zero"?

--
Regrads,
Japin Li

#11

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Japin Li (#10)

Re: New function normal_rand_array function to contrib/tablefunc.

Japin Li <japinli@hotmail.com> writes:

Thanks for updating the patch. Here are some comments.
+	if (minlen >= maxlen)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("minlen must be greater than maxlen.")));
There error message should be "minlen must be smaller than maxlen", right?
+	if (minlen < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("minlen and maxlen must be greater than zero.")));
Here the minlen might be zero, so the error message is incorrect.
How about use "minlen must be greater than or equal to zero"?

Yes, you are right. A new version is attached, thanks for checking!

--
Best Regards
Andy Fan

#12

Japin Li

japinli@hotmail.com

over 1 year ago

In reply to: Andy Fan (#11)

Re: New function normal_rand_array function to contrib/tablefunc.

On Wed, 28 Aug 2024 at 12:27, Andy Fan <zhihuifan1213@163.com> wrote:

Japin Li <japinli@hotmail.com> writes:
Thanks for updating the patch. Here are some comments.
+	if (minlen >= maxlen)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("minlen must be greater than maxlen.")));
There error message should be "minlen must be smaller than maxlen", right?
+	if (minlen < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("minlen and maxlen must be greater than zero.")));
Here the minlen might be zero, so the error message is incorrect.
How about use "minlen must be greater than or equal to zero"?
Yes, you are right. A new version is attached, thanks for checking!

Nitpick, the minlen is smaller than maxlen, so the maxlen cannot be zero.

After giving it some more thought, it would also be helpful if maxlen is
equal to minlen.

For example, I want have each row has four items, I can use the following

SELECT rand_array(10, 4, 4, 50::int, 80::int);

OTOH, I find the range bound uses "less than or equal to", how about
replacing "smaller" with "less"?

--
Regrads,
Japin Li

#13

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Japin Li (#12)

Re: New function normal_rand_array function to contrib/tablefunc.

Japin Li <japinli@hotmail.com> writes:

On Wed, 28 Aug 2024 at 12:27, Andy Fan <zhihuifan1213@163.com> wrote:

Japin Li <japinli@hotmail.com> writes:

Nitpick, the minlen is smaller than maxlen, so the maxlen cannot be zero.

After giving it some more thought, it would also be helpful if maxlen is
equal to minlen.

For example, I want have each row has four items, I can use the following

SELECT rand_array(10, 4, 4, 50::int, 80::int);

Yes, that's a valid usage. the new vesion is attached. I have changed
the the commit entry [1] from "Waiting on Author" to "Needs review".

OTOH, I find the range bound uses "less than or equal to", how about
replacing "smaller" with "less"?

--
Best Regards
Andy Fan

#14

Dean Rasheed

dean.a.rasheed@gmail.com

over 1 year ago

In reply to: Andy Fan (#13)

Re: New function normal_rand_array function to contrib/tablefunc.

On Thu, 29 Aug 2024 at 05:39, Andy Fan <zhihuifan1213@163.com> wrote:

Yes, that's a valid usage. the new vesion is attached. I have changed
the the commit entry [1] from "Waiting on Author" to "Needs review".

Note that this needs a rebase, following commit 4681ad4b2f.

Here are a few more review comments:

1). In the tests:

+select setseed(0.8);
+select rand_array(10, 0, 3, 50::int, 80::int);
+select setseed(0.8);
+select rand_array(10, 0, 3, 50::bigint, 80::bigint);
+select setseed(0.8);
+select rand_array(10, 0, 3, 50::float8, 80::float8);
+select setseed(0.8);
+select rand_array(10, 0, 3, 50::float4, 80::float4);
+select setseed(0.8);
+select rand_array(10, 0, 3, 50::numeric, 80::numeric);

this should really have a comment block to distinguish these new tests
from the preceeding normal_rand() tests.

2). It's worth also having tests for the error cases.

3). It's only necessary to call setseed() once to get a reproducible
set of results from then on.

4). I'd use setseed(0) or setseed(0.5), since those values are exactly
representable as double precision values, unlike 0.8, ensuring that it
works the same on all platforms. It might not matter, but why take the
risk?

5). The float8 case still has minimum and maximum value arguments that
it just ignores. It should either not take those arguments, or it
should respect them, and try to return float8 values in the requested
range. I suspect that trying to return float8 values in the requested
range would be hard, if not impossible, due to the inexact nature of
double precision arithmetic. That's why I suggested earlier having the
float8 function not take minval/maxval arguments, and just return
values in the range 0 to 1.

6). This new function:

+static Datum
+rand_array_internal(FunctionCallInfo fcinfo, Oid datatype)
+{

should have a comment block. In particular, it should document what
its inputs and outputs are.

7). This code:

+    int num_tuples = PG_GETARG_INT32(0);
+    int minlen = PG_GETARG_INT32(1);
+    int maxlen = PG_GETARG_INT32(2);
+    Datum minval = PG_GETARG_DATUM(3),
+        maxval = PG_GETARG_DATUM(4);
+    rand_array_fctx *fctx;
+
+    if (datatype == INT4OID)
+        random_fn_oid = F_RANDOM_INT4_INT4;
+    else if (datatype == INT8OID)
+        random_fn_oid = F_RANDOM_INT8_INT8;
+    else if    (datatype == FLOAT8OID)
+        random_fn_oid = F_RANDOM_;
+    else if (datatype == NUMERICOID)
+        random_fn_oid = F_RANDOM_NUMERIC_NUMERIC;
+    else
+        elog(ERROR, "unsupported type %d for rand_array function.",
+             datatype);
+
+    if (num_tuples < 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("number of rows cannot be negative")));
+
+    if (minlen > maxlen)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("minlen must not be greater than maxlen.")));
+
+    if (minlen < 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("minlen and maxlen must be greater than or
equal to zero.")));

should be inside the "if (SRF_IS_FIRSTCALL())" block. There's no point
repeating all those checks for every call.

8). I think it would be neater to code the "if (datatype == INT4OID)
... else if ..." as a switch statement.

9). I would swap the last 2 bound checks round, and simplify the error
messages as follows:

if (minlen < 0)
ereport(ERROR,
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("minlen must be greater than or equal to zero"));

if (maxlen < minlen)
ereport(ERROR,
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("maxlen must be greater than or equal to minlen"));

Also note that primary error messages should not end with a period.
See https://www.postgresql.org/docs/current/error-style-guide.html

10). In this error:

+        elog(ERROR, "unsupported type %d for rand_array function.",
+             datatype);

"datatype" is of type Oid, which is unsigned, and so it should use
"%u" not "%d". Also, as above, it should not end with a period, so it
should be:

+        elog(ERROR, "unsupported type %u for rand_array function",
+             datatype);

11). If the float8 case is made to not have minval/maxval arguments, this code:

+ Datum minval = PG_GETARG_DATUM(3),
+ maxval = PG_GETARG_DATUM(4);

and the FunctionCallInfo setup code needs to be different for the
float8 case, in order to avoid reading and setting undefined
arguments. Perhaps introduce a "need_val_bounds" boolean variable,
based on the datatype.

12). This code:

+        random_len_fcinfo->args[0].value = minlen;
+        random_len_fcinfo->args[1].value = maxlen;

should really be:

+        random_len_fcinfo->args[0].value = Int32GetDatum(minlen);
+        random_len_fcinfo->args[1].value = Int32GetDatum(maxlen);

It amounts to the same thing, but documents the fact that it's
converting an integer to a Datum.

13). These new functions are significantly under-documented,
especially when compared to all the other functions on
https://www.postgresql.org/docs/current/tablefunc.html

They really should have their own subsection, along the same lines as
"F.41.1.1. Normal_rand", explaining what the functions do, what their
arguments mean, and giving a couple of usage examples.

Regards,
Dean

#15

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Dean Rasheed (#14)

Re: New function normal_rand_array function to contrib/tablefunc.

Hello Dean,

Thanks for the detailed feedback! Here is the rebased version.

1). In the tests:

+select setseed(0.8);
+select rand_array(10, 0, 3, 50::int, 80::int);

this should really have a comment block to distinguish these new tests
from the preceeding normal_rand() tests.

3). It's only necessary to call setseed() once to get a reproducible
set of results from then on.

4). I'd use setseed(0) or setseed(0.5), since those values are exactly
representable as double precision values, unlike 0.8, ensuring that it
works the same on all platforms. It might not matter, but why take the
risk?

All Done, very interesting about the reason why the 0.8 is bad.

2). It's worth also having tests for the error cases.

After the code refactor, looks the ERROR is not reachable, so I added
both Assert(false) and elog(ERROR) with no test case for that.

5). The float8 case still has minimum and maximum value arguments that
it just ignores. It should either not take those arguments, or it
should respect them, and try to return float8 values in the requested
range. I suspect that trying to return float8 values in the requested
range would be hard, if not impossible, due to the inexact nature of
double precision arithmetic. That's why I suggested earlier having the
float8 function not take minval/maxval arguments, and just return
values in the range 0 to 1.

11). If the float8 case is made to not have minval/maxval arguments, this code:

+ Datum minval = PG_GETARG_DATUM(3),
+ maxval = PG_GETARG_DATUM(4);

and the FunctionCallInfo setup code needs to be different for the
float8 case, in order to avoid reading and setting undefined
arguments. Perhaps introduce a "need_val_bounds" boolean variable,
based on the datatype.

I should have noticed the float8 issue after reading your first
reply.. Actually I don't have speical reason for which I have to support
float8. At least when I working on the patch, I want some toast and
non-toast data type, so supporting both int/bigint and numeric should be
good, at least for my purpose. So in the attached version, I removed the
support for float8 for simplification.

6). This new function:
+static Datum
+rand_array_internal(FunctionCallInfo fcinfo, Oid datatype)
+{
should have a comment block. In particular, it should document what
its inputs and outputs are.

Done.

7). This code:

+    int num_tuples = PG_GETARG_INT32(0);
+    int minlen = PG_GETARG_INT32(1);
+    int maxlen = PG_GETARG_INT32(2);
+    Datum minval = PG_GETARG_DATUM(3),
+        maxval = PG_GETARG_DATUM(4);
+    rand_array_fctx *fctx;
+
+    if (datatype == INT4OID)
+        random_fn_oid = F_RANDOM_INT4_INT4;
+    else if (datatype == INT8OID)

...

should be inside the "if (SRF_IS_FIRSTCALL())" block. There's no point
repeating all those checks for every call.

Done.

8). I think it would be neater to code the "if (datatype == INT4OID)
... else if ..." as a switch statement.

Done.

9). I would swap the last 2 bound checks round, and simplify the error
messages as follows:

Done.

Also note that primary error messages should not end with a period.
See https://www.postgresql.org/docs/current/error-style-guide.html

Thanks for sharing this!

10). In this error:
+        elog(ERROR, "unsupported type %d for rand_array function.",
+             datatype);
"datatype" is of type Oid, which is unsigned, and so it should use
"%u" not "%d". Also, as above, it should not end with a period, so it
should be:
+        elog(ERROR, "unsupported type %u for rand_array function",
+             datatype);

Done.

12). This code:
+        random_len_fcinfo->args[0].value = minlen;
+        random_len_fcinfo->args[1].value = maxlen;
should really be:
+        random_len_fcinfo->args[0].value = Int32GetDatum(minlen);
+        random_len_fcinfo->args[1].value = Int32GetDatum(maxlen);
It amounts to the same thing, but documents the fact that it's
converting an integer to a Datum.

Done.

13). These new functions are significantly under-documented,
especially when compared to all the other functions on
https://www.postgresql.org/docs/current/tablefunc.html

They really should have their own subsection, along the same lines as
"F.41.1.1. Normal_rand", explaining what the functions do, what their
arguments mean, and giving a couple of usage examples.

Done, very impresive feedback and I know why PostgreSQL can always have
user friendly documentation now.

--
Best Regards
Andy Fan

#16

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Dean Rasheed (#14)

Re: New function normal_rand_array function to contrib/tablefunc.

10). In this error:
+        elog(ERROR, "unsupported type %d for rand_array function.",
+             datatype);
"datatype" is of type Oid, which is unsigned, and so it should use
"%u" not "%d". Also, as above, it should not end with a period, so it
should be:
+        elog(ERROR, "unsupported type %u for rand_array function",
+             datatype);

I remember my IDE could detect such issue before, but failed this
time. I just checked more today and it is still failing. Looks the
checker is not stable and I can't find out the reason so far.

main.c

#include <stdio.h>

int main(int argc, char *argv[])
{
unsigned int i = 0;
int x = 2;
printf("i = %d\n", i);
printf("i = %u\n", x);
return 0;
}

All the following commands succeed without any warnings.

clang -O0 -g main.c -o main -Wall -Wformat
gcc -g main.c -o main -Wall -Wformat

scan-build clang -g main.c -o main -Wall -Wformat
cppcheck main.c

clang: 18.1.6
gcc: 13.3.0

Only "cppcheck --enable=all main.c" catch the warnning.

Any hints on this will be appreicated.

--
Best Regards
Andy Fan

#17

Dean Rasheed

dean.a.rasheed@gmail.com

over 1 year ago

In reply to: Andy Fan (#16)

Re: New function normal_rand_array function to contrib/tablefunc.

On Sat, 26 Oct 2024 at 01:51, Andy Fan <zhihuifan1213@163.com> wrote:

10). In this error:
+        elog(ERROR, "unsupported type %d for rand_array function.",
+             datatype);
"datatype" is of type Oid, which is unsigned, and so it should use
"%u" not "%d". Also, as above, it should not end with a period, so it
should be:
+        elog(ERROR, "unsupported type %u for rand_array function",
+             datatype);
All the following commands succeed without any warnings.

clang -O0 -g main.c -o main -Wall -Wformat
gcc -g main.c -o main -Wall -Wformat

This can be detected in gcc with -Wformat plus -Wformat-signedness flags.

I see that this has been discussed before (e.g., [1]/messages/by-id/CA+hUKGJ7EQm9extQAgrFZNNUKqRT8Vv5t1tKqA-5nEcYn0+wNA@mail.gmail.com), but it doesn't
look like those patches were committed, and there are still many such
warnings, if you try compiling postgres with those flags.

[1]: /messages/by-id/CA+hUKGJ7EQm9extQAgrFZNNUKqRT8Vv5t1tKqA-5nEcYn0+wNA@mail.gmail.com

I don't know if anyone has any plans to pick up that work again, but
in any case, it seems wise to not add more.

Regards,
Dean

#18

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Dean Rasheed (#17)

Re: New function normal_rand_array function to contrib/tablefunc.

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On Sat, 26 Oct 2024 at 01:51, Andy Fan <zhihuifan1213@163.com> wrote:
10). In this error:
+        elog(ERROR, "unsupported type %d for rand_array function.",
+             datatype);
"datatype" is of type Oid, which is unsigned, and so it should use
"%u" not "%d". Also, as above, it should not end with a period, so it
should be:
+        elog(ERROR, "unsupported type %u for rand_array function",
+             datatype);
All the following commands succeed without any warnings.

clang -O0 -g main.c -o main -Wall -Wformat
gcc -g main.c -o main -Wall -Wformat
This can be detected in gcc with -Wformat plus -Wformat-signedness
flags.

Yes, this one works. I didn't realize we have "-Wformat-signedness"
subsection after we already have "-Wformat". I have added this one into
my toolset.

For recording purpose, clang doesn't support this option until now.

I see that this has been discussed before (e.g., [1]), but it doesn't
look like those patches were committed, and there are still many such
warnings, if you try compiling postgres with those flags.

OK, Thanks for the information. Currently I add the "c/c++-gcc" checker
for my c file, it just trigger when I am writting a file. so the
warnings in other places probably doesn't bother me.

[1]
/messages/by-id/CA+hUKGJ7EQm9extQAgrFZNNUKqRT8Vv5t1tKqA-5nEcYn0+wNA@mail.gmail.com

I don't know if anyone has any plans to pick up that work again.

I can take that very soon.

but in any case, it seems wise to not add more.

Very true.

Thank you for all your helps on this series.

--
Best Regards
Andy Fan

#19

Dean Rasheed

dean.a.rasheed@gmail.com

over 1 year ago

In reply to: Andy Fan (#15)

Re: New function normal_rand_array function to contrib/tablefunc.

On Wed, 16 Oct 2024 at 08:43, Andy Fan <zhihuifan1213@163.com> wrote:

Thanks for the detailed feedback! Here is the rebased version.

I took another look at this and I think it's in reasonable shape.

I'm attaching an update, rebasing it on top of 9be4e5d293.

Also it was missing a required update to the meson.build file --
that's the immediate cause of the other cfbot failures.

The rest is just cosmetic tidying up, fixing indentation, tweaking
comments, and the like. I also hacked on the docs a bit -- the
synopsis only listed one of the new function signatures for some
reason. After fixing that, I think it's sufficient to just list one
usage example.

Regards,
Dean

#20

Japin Li

japinli@hotmail.com

over 1 year ago

In reply to: Dean Rasheed (#19)

Re: New function normal_rand_array function to contrib/tablefunc.

On Fri, 01 Nov 2024 at 09:33, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Wed, 16 Oct 2024 at 08:43, Andy Fan <zhihuifan1213@163.com> wrote:

Thanks for the detailed feedback! Here is the rebased version.

I took another look at this and I think it's in reasonable shape.

I'm attaching an update, rebasing it on top of 9be4e5d293.

Also it was missing a required update to the meson.build file --
that's the immediate cause of the other cfbot failures.

The rest is just cosmetic tidying up, fixing indentation, tweaking
comments, and the like. I also hacked on the docs a bit -- the
synopsis only listed one of the new function signatures for some
reason. After fixing that, I think it's sufficient to just list one
usage example.

LGTM expect there is a warning when applying the patch.

Applying: tablefunc: Add rand_array() functions.
.git/rebase-apply/patch:475: trailing whitespace.
rand_array
warning: 1 line adds whitespace errors.

The other looks good to me.

--
Regrads,
Japin Li

#21

Andy Fan

zhihui.fan1213@gmail.com

over 1 year ago

In reply to: Dean Rasheed (#19)

#22