Add error functions: erf() and erfc()

Started by Dean Rasheedalmost 3 years ago9 messages
#1Dean Rasheed
dean.a.rasheed@gmail.com
2 attachment(s)

Now that we have random_normal(), it seems like it would be useful to
add the error functions erf() and erfc(), which I think are
potentially useful to the people who will find random_normal() useful,
and possibly others.

An immediate use for erf() is that it allows us to do a
Kolmogorov-Smirnov test for random_normal(), similar to the one for
random().

Both of these functions are defined in POSIX and C99, so in theory
they should be available on all platforms. If that turns out not to be
the case, then there's a commonly used implementation (e.g., see [1]https://github.com/lsds/musl/blob/master/src/math/erf.c),
which we could include. I played around with that (replacing the
direct bit manipulation stuff with frexp()/ldexp(), see pg_erf.c
attached), and it appeared to be accurate to +/-1 ULP across the full
range of inputs. Hopefully we won't need that though.

I tested this on a couple of different platforms and found I needed to
reduce extra_float_digits to -1 to get the regression tests to pass
consistently, due to rounding errors. It wouldn't surprise me if that
needs to be reduced further, though perhaps it's not necessary to have
so many tests (I included one test value from each branch, while
testing the hand-rolled implementation).

Regards,
Dean

[1]: https://github.com/lsds/musl/blob/master/src/math/erf.c

Attachments:

erf.patchtext/x-patch; charset=US-ASCII; name=erf.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 0cbdf63..e30e285
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1289,6 +1289,41 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</r
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
+         <primary>erf</primary>
+        </indexterm>
+        <function>erf</function> ( <type>double precision</type> )
+        <returnvalue>double precision</returnvalue>
+       </para>
+       <para>
+        Error function
+       </para>
+       <para>
+        <literal>erf(1.0)</literal>
+        <returnvalue>0.8427007929497149</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>erfc</primary>
+        </indexterm>
+        <function>erfc</function> ( <type>double precision</type> )
+        <returnvalue>double precision</returnvalue>
+       </para>
+       <para>
+        Complementary error function (<literal>1 - erf(x)</literal>, without
+        loss of precision for large inputs)
+       </para>
+       <para>
+        <literal>erfc(1.0)</literal>
+        <returnvalue>0.15729920705028513</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
          <primary>exp</primary>
         </indexterm>
         <function>exp</function> ( <type>numeric</type> )
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index d290b4c..aa79487
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2742,6 +2742,53 @@ datanh(PG_FUNCTION_ARGS)
 }
 
 
+/* ========== ERROR FUNCTIONS ========== */
+
+
+/*
+ *		derf			- returns the error function: erf(arg1)
+ */
+Datum
+derf(PG_FUNCTION_ARGS)
+{
+	float8		arg1 = PG_GETARG_FLOAT8(0);
+	float8		result;
+
+	/*
+	 * For erf, we don't need an errno check because it never overflows.
+	 */
+	result = erf(arg1);
+
+	if (unlikely(isinf(result)))
+		float_overflow_error();
+
+	PG_RETURN_FLOAT8(result);
+}
+
+/*
+ *		derfc			- returns the complementary error function: 1 - erf(arg1)
+ */
+Datum
+derfc(PG_FUNCTION_ARGS)
+{
+	float8		arg1 = PG_GETARG_FLOAT8(0);
+	float8		result;
+
+	/*
+	 * For erfc, we don't need an errno check because it never overflows.
+	 */
+	result = erfc(arg1);
+
+	if (unlikely(isinf(result)))
+		float_overflow_error();
+
+	PG_RETURN_FLOAT8(result);
+}
+
+
+/* ========== RANDOM FUNCTIONS ========== */
+
+
 /*
  * initialize_drandom_seed - initialize drandom_seed if not yet done
  */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
new file mode 100644
index e2a7642..4ae8fef
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3465,6 +3465,13 @@
   proname => 'atanh', prorettype => 'float8', proargtypes => 'float8',
   prosrc => 'datanh' },
 
+{ oid => '8788', descr => 'error function',
+  proname => 'erf', prorettype => 'float8', proargtypes => 'float8',
+  prosrc => 'derf' },
+{ oid => '8789', descr => 'complementary error function',
+  proname => 'erfc', prorettype => 'float8', proargtypes => 'float8',
+  prosrc => 'derfc' },
+
 { oid => '1618',
   proname => 'interval_mul', prorettype => 'interval',
   proargtypes => 'interval float8', prosrc => 'interval_mul' },
diff --git a/src/test/regress/expected/float8.out b/src/test/regress/expected/float8.out
new file mode 100644
index 2b25784..0e89e24
--- a/src/test/regress/expected/float8.out
+++ b/src/test/regress/expected/float8.out
@@ -790,6 +790,45 @@ SELECT atanh(float8 'nan');
    NaN
 (1 row)
 
+-- error functions
+-- we run these with extra_float_digits = -1, to get consistently rounded
+-- results on all platforms.
+SET extra_float_digits = -1;
+SELECT x,
+       erf(x),
+       erfc(x)
+FROM (VALUES (float8 '-infinity'),
+      (-28), (-6), (-3.4), (-2.1), (-1.1), (-0.45),
+      (-1.2e-9), (-2.3e-13), (-1.2e-17), (0),
+      (1.2e-17), (2.3e-13), (1.2e-9),
+      (0.45), (1.1), (2.1), (3.4), (6), (28),
+      (float8 'infinity'), (float8 'nan')) AS t(x);
+     x     |         erf          |        erfc         
+-----------+----------------------+---------------------
+ -Infinity |                   -1 |                   2
+       -28 |                   -1 |                   2
+        -6 |                   -1 |                   2
+      -3.4 |    -0.99999847800664 |     1.9999984780066
+      -2.1 |    -0.99702053334367 |     1.9970205333437
+      -1.1 |    -0.88020506957408 |     1.8802050695741
+     -0.45 |    -0.47548171978692 |     1.4754817197869
+  -1.2e-09 | -1.3540550005146e-09 |     1.0000000013541
+  -2.3e-13 | -2.5952720843197e-13 |     1.0000000000003
+  -1.2e-17 | -1.3540550005146e-17 |                   1
+         0 |                    0 |                   1
+   1.2e-17 |  1.3540550005146e-17 |                   1
+   2.3e-13 |  2.5952720843197e-13 |    0.99999999999974
+   1.2e-09 |  1.3540550005146e-09 |    0.99999999864595
+      0.45 |     0.47548171978692 |    0.52451828021308
+       1.1 |     0.88020506957408 |    0.11979493042592
+       2.1 |     0.99702053334367 |   0.002979466656333
+       3.4 |     0.99999847800664 | 1.5219933628623e-06
+         6 |                    1 | 2.1519736712499e-17
+        28 |                    1 |                   0
+  Infinity |                    1 |                   0
+       NaN |                  NaN |                 NaN
+(22 rows)
+
 RESET extra_float_digits;
 -- test for over- and underflow
 INSERT INTO FLOAT8_TBL(f1) VALUES ('10e400');
diff --git a/src/test/regress/expected/random.out b/src/test/regress/expected/random.out
new file mode 100644
index 6dbb43a..2235907
--- a/src/test/regress/expected/random.out
+++ b/src/test/regress/expected/random.out
@@ -88,6 +88,38 @@ GROUP BY r;
  -10 |   100
 (1 row)
 
+-- Check standard normal distribution using the Kolmogorov-Smirnov test.
+CREATE FUNCTION ks_test_normal_random()
+RETURNS boolean AS
+$$
+DECLARE
+  n int := 1000;        -- Number of samples
+  c float8 := 1.94947;  -- Critical value for 99.9% confidence
+  ok boolean;
+BEGIN
+  ok := (
+    WITH samples AS (
+      SELECT random_normal() r FROM generate_series(1, n) ORDER BY 1
+    ), indexed_samples AS (
+      SELECT (row_number() OVER())-1.0 i, r FROM samples
+    )
+    SELECT max(abs((1+erf(r/sqrt(2)))/2 - i/n)) < c / sqrt(n)
+    FROM indexed_samples
+  );
+  RETURN ok;
+END
+$$
+LANGUAGE plpgsql;
+-- As above, ks_test_normal_random() returns true about 99.9%
+-- of the time, so try it 3 times and accept if any test passes.
+SELECT ks_test_normal_random() OR
+       ks_test_normal_random() OR
+       ks_test_normal_random() AS standard_normal;
+ standard_normal 
+-----------------
+ t
+(1 row)
+
 -- setseed() should produce a reproducible series of random() values.
 SELECT setseed(0.5);
  setseed 
diff --git a/src/test/regress/sql/float8.sql b/src/test/regress/sql/float8.sql
new file mode 100644
index c276a53..de99ef0
--- a/src/test/regress/sql/float8.sql
+++ b/src/test/regress/sql/float8.sql
@@ -229,6 +229,20 @@ SELECT atanh(float8 'infinity');
 SELECT atanh(float8 '-infinity');
 SELECT atanh(float8 'nan');
 
+-- error functions
+-- we run these with extra_float_digits = -1, to get consistently rounded
+-- results on all platforms.
+SET extra_float_digits = -1;
+SELECT x,
+       erf(x),
+       erfc(x)
+FROM (VALUES (float8 '-infinity'),
+      (-28), (-6), (-3.4), (-2.1), (-1.1), (-0.45),
+      (-1.2e-9), (-2.3e-13), (-1.2e-17), (0),
+      (1.2e-17), (2.3e-13), (1.2e-9),
+      (0.45), (1.1), (2.1), (3.4), (6), (28),
+      (float8 'infinity'), (float8 'nan')) AS t(x);
+
 RESET extra_float_digits;
 
 -- test for over- and underflow
diff --git a/src/test/regress/sql/random.sql b/src/test/regress/sql/random.sql
new file mode 100644
index 6e99e2e..14cc76b
--- a/src/test/regress/sql/random.sql
+++ b/src/test/regress/sql/random.sql
@@ -70,6 +70,36 @@ SELECT r, count(*)
 FROM (SELECT random_normal(-10, 0) r FROM generate_series(1, 100)) ss
 GROUP BY r;
 
+-- Check standard normal distribution using the Kolmogorov-Smirnov test.
+
+CREATE FUNCTION ks_test_normal_random()
+RETURNS boolean AS
+$$
+DECLARE
+  n int := 1000;        -- Number of samples
+  c float8 := 1.94947;  -- Critical value for 99.9% confidence
+  ok boolean;
+BEGIN
+  ok := (
+    WITH samples AS (
+      SELECT random_normal() r FROM generate_series(1, n) ORDER BY 1
+    ), indexed_samples AS (
+      SELECT (row_number() OVER())-1.0 i, r FROM samples
+    )
+    SELECT max(abs((1+erf(r/sqrt(2)))/2 - i/n)) < c / sqrt(n)
+    FROM indexed_samples
+  );
+  RETURN ok;
+END
+$$
+LANGUAGE plpgsql;
+
+-- As above, ks_test_normal_random() returns true about 99.9%
+-- of the time, so try it 3 times and accept if any test passes.
+SELECT ks_test_normal_random() OR
+       ks_test_normal_random() OR
+       ks_test_normal_random() AS standard_normal;
+
 -- setseed() should produce a reproducible series of random() values.
 
 SELECT setseed(0.5);
pg_erf.ctext/x-csrc; charset=US-ASCII; name=pg_erf.cDownload
#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Dean Rasheed (#1)
Re: Add error functions: erf() and erfc()

On Mon, Feb 27, 2023 at 12:54:35PM +0000, Dean Rasheed wrote:

+	/*
+	 * For erf, we don't need an errno check because it never overflows.
+	 */
+	/*
+	 * For erfc, we don't need an errno check because it never overflows.
+	 */

The man pages for these seem to indicate that underflow can occur. Do we
need to check for that?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Dean Rasheed (#1)
Re: Add error functions: erf() and erfc()

On Tue, Feb 28, 2023 at 1:54 AM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Now that we have random_normal(), it seems like it would be useful to
add the error functions erf() and erfc(), which I think are
potentially useful to the people who will find random_normal() useful,
and possibly others.

An immediate use for erf() is that it allows us to do a
Kolmogorov-Smirnov test for random_normal(), similar to the one for
random().

Both of these functions are defined in POSIX and C99, so in theory
they should be available on all platforms. If that turns out not to be
the case, then there's a commonly used implementation (e.g., see [1]),
which we could include. I played around with that (replacing the
direct bit manipulation stuff with frexp()/ldexp(), see pg_erf.c
attached), and it appeared to be accurate to +/-1 ULP across the full
range of inputs. Hopefully we won't need that though.

Hi,

No comment on the maths, but I'm pretty sure we won't need a fallback
implementation. That stuff goes back to the math libraries of 80s
Unixes, even though it didn't make it into C until '99. I just
checked the man pages for all our target systems and they all show it.
(There might be some portability details around the tgmath.h versions
on some systems, eg to support different float sizes, I dunno, but
you're using the plain math.h versions.)

I wonder if the SQL standard has anything to say about these, or the
related normal CDF. I can't check currently but I doubt it, based on
searches and other systems' manuals.

Two related functions that also arrived in C99 are lgamma() and
tgamma(). If you'll excuse the digression, that reminded me of
something I was trying to figure out once, for a practical problem.
My statistics knowledge is extremely patchy, but I have been trying to
up my benchmarking game, and that led to a bit of remedial reading on
Student's t tests and related stuff. A few shaven yaks later, I
understood that you could probably (if you like pain) do that sort of
stuff inside PostgreSQL using our existing aggregates, if you took the
approach of ministat[1]https://man.freebsd.org/cgi/man.cgi?query=ministat. That tool has a table of critical values
inside it, indexed by degrees-of-freedom (1-100) and confidence level
(80, 90, 95, 98, 99, 99.5), and one could probably write SQL queries
that spit out an answer like "p is less than 5%, ship it!", if we
stole that table. But what if I actually want to know p? Of course
you can do all that good stuff very easily with tools like R, SciPy
etc and maybe that's the best way to do it. But Oracle, and I think
several other analytics-focused SQL systems, can do it in a very easy
built-in way. I think to get at that you probably need the t CDF, and
in there[2]https://www.mathworks.com/help/stats/tcdf.html I see... Γ(). Huh.

[1]: https://man.freebsd.org/cgi/man.cgi?query=ministat
[2]: https://www.mathworks.com/help/stats/tcdf.html

#4Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Nathan Bossart (#2)
Re: Add error functions: erf() and erfc()

On Wed, 8 Mar 2023 at 20:11, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Feb 27, 2023 at 12:54:35PM +0000, Dean Rasheed wrote:

+     /*
+      * For erf, we don't need an errno check because it never overflows.
+      */
+     /*
+      * For erfc, we don't need an errno check because it never overflows.
+      */

The man pages for these seem to indicate that underflow can occur. Do we
need to check for that?

No, I don't think so. The docs indicate that if an underflow occurs,
the correct result (after rounding) should be returned, so I think we
should just return that result (as we do for tanh(), for example).

Regards,
Dean

#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Dean Rasheed (#4)
Re: Add error functions: erf() and erfc()

On Wed, Mar 08, 2023 at 11:29:12PM +0000, Dean Rasheed wrote:

On Wed, 8 Mar 2023 at 20:11, Nathan Bossart <nathandbossart@gmail.com> wrote:

The man pages for these seem to indicate that underflow can occur. Do we
need to check for that?

No, I don't think so. The docs indicate that if an underflow occurs,
the correct result (after rounding) should be returned, so I think we
should just return that result (as we do for tanh(), for example).

Makes sense.

I'm also wondering about whether we need the isinf() checks. IIUC that
should never happen, but maybe you added that "just in case."

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#6Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Thomas Munro (#3)
Re: Add error functions: erf() and erfc()

On Wed, 8 Mar 2023 at 23:24, Thomas Munro <thomas.munro@gmail.com> wrote:

No comment on the maths, but I'm pretty sure we won't need a fallback
implementation. That stuff goes back to the math libraries of 80s
Unixes, even though it didn't make it into C until '99. I just
checked the man pages for all our target systems and they all show it.
(There might be some portability details around the tgmath.h versions
on some systems, eg to support different float sizes, I dunno, but
you're using the plain math.h versions.)

Thanks for checking. Hopefully they will be available everywhere.

I think what's more likely to happen is that the tests will reveal
implementation variations, as they did when the hyperbolic functions
were added, and it'll be necessary to adjust or remove some of the
test cases. When I originally wrote those tests, I picked a value from
each branch that the FreeBSD implementation handled differently, but I
think that's overkill. If the purpose of the tests is just to confirm
that the right C library function has been exposed, they could
probably be pared all the way down to just testing erf(1) and erfc(1),
but it might be useful to first see what platform variations exist.

Two related functions that also arrived in C99 are lgamma() and
tgamma(). If you'll excuse the digression, that reminded me of
something I was trying to figure out once, for a practical problem.
My statistics knowledge is extremely patchy, but I have been trying to
up my benchmarking game, and that led to a bit of remedial reading on
Student's t tests and related stuff. A few shaven yaks later, I
understood that you could probably (if you like pain) do that sort of
stuff inside PostgreSQL using our existing aggregates, if you took the
approach of ministat[1]. That tool has a table of critical values
inside it, indexed by degrees-of-freedom (1-100) and confidence level
(80, 90, 95, 98, 99, 99.5), and one could probably write SQL queries
that spit out an answer like "p is less than 5%, ship it!", if we
stole that table. But what if I actually want to know p? Of course
you can do all that good stuff very easily with tools like R, SciPy
etc and maybe that's the best way to do it. But Oracle, and I think
several other analytics-focused SQL systems, can do it in a very easy
built-in way. I think to get at that you probably need the t CDF, and
in there[2] I see... Γ(). Huh.

Hmm, possibly having the gamma function would be independently useful
for other things too. I don't want to get side-tracked though.

Regards,
Dean

#7Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Nathan Bossart (#5)
Re: Add error functions: erf() and erfc()

On Thu, 9 Mar 2023 at 00:13, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Wed, Mar 08, 2023 at 11:29:12PM +0000, Dean Rasheed wrote:

On Wed, 8 Mar 2023 at 20:11, Nathan Bossart <nathandbossart@gmail.com> wrote:

The man pages for these seem to indicate that underflow can occur. Do we
need to check for that?

No, I don't think so. The docs indicate that if an underflow occurs,
the correct result (after rounding) should be returned, so I think we
should just return that result (as we do for tanh(), for example).

Makes sense.

I'm also wondering about whether we need the isinf() checks. IIUC that
should never happen, but maybe you added that "just in case."

I copied those from dtanh(), otherwise I probably wouldn't have
bothered. erf() is always in the range [-1, 1], just like tanh(), so
it should never overflow, but maybe it can happen in a broken
implementation.

Regards,
Dean

#8Nathan Bossart
nathandbossart@gmail.com
In reply to: Dean Rasheed (#7)
Re: Add error functions: erf() and erfc()

On Thu, Mar 09, 2023 at 12:27:47AM +0000, Dean Rasheed wrote:

On Thu, 9 Mar 2023 at 00:13, Nathan Bossart <nathandbossart@gmail.com> wrote:

I'm also wondering about whether we need the isinf() checks. IIUC that
should never happen, but maybe you added that "just in case."

I copied those from dtanh(), otherwise I probably wouldn't have
bothered. erf() is always in the range [-1, 1], just like tanh(), so
it should never overflow, but maybe it can happen in a broken
implementation.

Okay. This looks good to me, then.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Dean Rasheed (#6)
Re: Add error functions: erf() and erfc()

On Thu, Mar 9, 2023 at 1:16 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Wed, 8 Mar 2023 at 23:24, Thomas Munro <thomas.munro@gmail.com> wrote:

... But Oracle, and I think
several other analytics-focused SQL systems, can do it in a very easy
built-in way. I think to get at that you probably need the t CDF, and
in there[2] I see... Γ(). Huh.

Hmm, possibly having the gamma function would be independently useful
for other things too. I don't want to get side-tracked though.

I guess if we did want to add some nice easy-to-use hypothesis testing
tools to PostgreSQL, then perhaps gamma wouldn't actually be needed
from SQL, but it might be used inside C code for something higher
level like tcdf()[1]https://stats.stackexchange.com/questions/394978/how-to-approximate-the-student-t-cdf-at-a-point-without-the-hypergeometric-funct, or even very high level like
t_test_independent_agg(s1, s2) etc. Anyway, just thought I'd mention
those in passing, as I see they arrived together; sorry for getting
off topic.

[1]: https://stats.stackexchange.com/questions/394978/how-to-approximate-the-student-t-cdf-at-a-point-without-the-hypergeometric-funct