Optimising numeric division

Started by Dean Rasheedover 1 year ago9 messages

dean.a.rasheed@gmail.com

over 1 year ago

4 attachment(s)

Currently numeric.c has 2 separate functions that implement numeric
division, div_var() and div_var_fast(). Since div_var_fast() is only
approximate, it is only used in the transcendental functions, where a
slightly inaccurate result is OK. In all other cases, div_var() is
used.

Aside from the pain of having to maintain 2 separate functions,
div_var() is very slow for large inputs. In fact, it's well over an
order of magnitude slower than mul_var() for similar sized inputs. For
example:

create temp table foo(x numeric, y numeric, xy numeric);
insert into foo select x, y, x*y from
(select random(1e49999, 1e50000-1), random(1e49999, 1e50000-1)
from generate_series(1,100)) g(x,y);

select count(*) from foo where x*y != xy; -- 840 ms

select count(*) from foo where xy/x != y; -- 36364 ms

select count(*) from foo where xy/y != x; -- 36515 ms

The attached patch attempts to resolve those issues by replacing
div_var() and div_var_fast() with a single function intended to be
faster than both the originals.

The new version of div_var() has an extra "exact" boolean parameter,
so that it can operate in the faster, approximate mode when used by
the transcendental functions, but regardless of the value of this
parameter, it adopts the algorithm used by div_var_fast(), using
float-point arithmetic to approximate each quotient digit. The
difference is that, if an exact result is requested, it uses a larger
working dividend, so that it can subtract off complete multiples of
the divisor for each quotient digit (while still delaying the
propagation of carries). This then gives the remainder, which can be
used to check the quotient and correct it, if it's inaccurate.

In addition, like mul_var(), div_var() now does the computation in
base NBASE^2, using 64-bit integers for the working dividend array,
which halves the number of iterations in the outer loop and reduces
the frequency of carry-propagation passes.

In the testing I've done so far, this is up to around 20 times faster
than the old version of div_var() when "exact" is true (it computes
each of the queries above in around 1.6 seconds), and it's up to
around 2-3 times faster than div_var_fast() when "exact" is false.

In addition, I found that it's significantly faster than
div_var_int64() for 3 and 4 digit divisors, so I've removed that as
well.

Overall, this reduces numeric.c by around 300 lines, and means that
we'll only have one function to maintain.

Patch 0001 is the patch itself. 0002 is just for testing purposes --
it exposes both old functions and the new one as SQL-callable
functions with all their arguments, so they can be compared.

I'm also attaching the performance test script I used, and the output
it produced.

Something else these test results reveal is the answer to the
question: under what circumstances is the approximate computation
faster than the exact one? The answer seems to be: whenever var2 has
more than around 12 NBASE digits, regardless of the size of var1.

Thinking about that, it makes sense based on the following reasoning:
The exact computation subtracts off a complete multiple of the divisor
for each result digit, so the overall cost is roughly proportional to

res_ndigitpairs * var2ndigitpairs

The approximate calculation computes an additional DIV_GUARD_DIGITS
result digits (4 NBASE digits, or 2 NBASE^2 digits), so there's an
additional cost of around 2 * var2ndigitpairs, but it ignores digits
in the working dividend beyond the guard digits, which means that, for
later result digits, it is able to truncate the subtraction step. The
size of the truncation increases from 1 to var2ndigitpairs-1 as "qi"
increases, so the overall net saving is roughly

1 + 2 + 3 + ... + (var2ndigitpairs-1) - 2 * var2ndigitpairs

The first part of that is just an arithmetic series, so this equals

var2ndigitpairs * (var2ndigitpairs - 1) / 2 - 2 * var2ndigitpairs

= var2ndigitpairs * (var2ndigitpairs - 5) / 2

That suggests there will be an overall saving once var2ndigitpairs is
larger than 5, which is roughly consistent with observations. It's
understandable that it's a little larger in practice, due to other
overheads in the outer loop.

So I'm inclined to add something like

if (var2ndigitpairs <= 6)
exact = true;

near the start, to try to get the best performance in all cases.

I also tested how common it was that a correction needed to be applied
to the quotient. Using random inputs, the quotient was one too small
in the final place roughly 1 in every 2000 - 100000 times (varying
depending on the size of the inputs), and it was never off by more
than one, or too large. A simple example where the estimated quotient
is one too small is this:

select div(5399133729, 5399133729);

This initially rounds the wrong way, due to loss of precision in the
floating-point numbers.

It is also possible to trigger the other case (an estimated quotient
digit that is one too large) with carefully crafted inputs of this
form:

-- 53705623790171816464 = 7328412092 * 7328412092
select div(53705623790171816464 - 1, 7328412092) = 7328412092 - 1;

Again, I wasn't able to cause the intermediate result to be off by
more than one using these kinds of inputs.

I'm inclined to add those examples to the regression tests, just so
that we have coverage of those 2 branches of the correction code.

Regards,
Dean

Attachments:

v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patchDownload

From b349524354969a3d79ee724920e107f391c5e751 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Fri, 23 Aug 2024 09:54:27 +0100
Subject: [PATCH v1 1/2] Optimise numeric division using base-NBASE^2
 arithmetic.

Formerly there were two internal functions in numeric.c to perform
numeric division, div_var() and div_var_fast(). div_var() performed
division exactly to a specified rscale using Knuth's long division
algorithm, while div_var_fast() used a different algorithm which
approximated each quotient digit using floating-point arithmetic, and
performed a truncated computation with DIV_GUARD_DIGITS extra digits.
div_var_fast() could be many times faster than div_var(), but did not
guarantee correct results in all cases, and was therefore only
suitable for use in transcendental functions, where small errors are
acceptable.

This commit merges div_var() and div_var_fast() together into a single
function with an extra "exact" boolean parameter for cases when the
caller is OK with an approximate result. The new function uses the
same algorithm as the old div_var_fast() function, except that when
"exact" is true, it does not truncate the computation with
DIV_GUARD_DIGITS extra digits, but instead performs the full-precision
computation, subtracting off complete multiples of the divisor for
each quotient digit. However, it is able to retain most of the
performance benefits of div_var_fast(), by delaying the propagation of
carries.

Since this may still lead to an inaccurate result, when "exact" is
true, it then inspects the remainder and uses that to adjust the
quotient, if necessary, to make it correct. In practice, the quotient
rarely needs to be adjusted, and never by more than one in the final
digit, though it's difficult to prove that, so the code allows for
larger adjustments, just in case.

In addition, use base-NBASE^2 arithmetic and a 64-bit dividend array,
similar to mul_var(), so that the number of iterations of the outer
loop is roughly halved. In practice, this makes div_var() up to around
20 times as fast as the old Knuth algorithm when "exact" is true, and
up to around 2 or 3 times as fast as div_var_fast() when "exact" is
false. Additionally, it is significantly faster than div_var_int64()
for 3 and 4 digit divisors, so get rid of that too.
---
 src/backend/utils/adt/numeric.c | 941 +++++++++++---------------------
 1 file changed, 323 insertions(+), 618 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 44d88e9007..7d5038f575 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -60,7 +60,7 @@
  * NBASE that's less than sqrt(INT_MAX), in practice we are only interested
  * in NBASE a power of ten, so that I/O conversions and decimal rounding
  * are easy.  Also, it's actually more efficient if NBASE is rather less than
- * sqrt(INT_MAX), so that there is "headroom" for mul_var and div_var_fast to
+ * sqrt(INT_MAX), so that there is "headroom" for mul_var and div_var to
  * postpone processing carries.
  *
  * Values of NBASE other than 10000 are considered of historical interest only
@@ -563,16 +563,9 @@ static void mul_var(const NumericVar *var1, const NumericVar *var2,
 static void mul_var_short(const NumericVar *var1, const NumericVar *var2,
 						  NumericVar *result);
 static void div_var(const NumericVar *var1, const NumericVar *var2,
-					NumericVar *result,
-					int rscale, bool round);
-static void div_var_fast(const NumericVar *var1, const NumericVar *var2,
-						 NumericVar *result, int rscale, bool round);
+					NumericVar *result, int rscale, bool round, bool exact);
 static void div_var_int(const NumericVar *var, int ival, int ival_weight,
 						NumericVar *result, int rscale, bool round);
-#ifdef HAVE_INT128
-static void div_var_int64(const NumericVar *var, int64 ival, int ival_weight,
-						  NumericVar *result, int rscale, bool round);
-#endif
 static int	select_div_scale(const NumericVar *var1, const NumericVar *var2);
 static void mod_var(const NumericVar *var1, const NumericVar *var2,
 					NumericVar *result);
@@ -1958,7 +1951,7 @@ compute_bucket(Numeric operand, Numeric bound1, Numeric bound2,
 
 	mul_var(&operand_var, count_var, &operand_var,
 			operand_var.dscale + count_var->dscale);
-	div_var(&operand_var, &bound2_var, result_var, 0, false);
+	div_var(&operand_var, &bound2_var, result_var, 0, false, true);
 	add_var(result_var, &const_one, result_var);
 
 	free_var(&bound1_var);
@@ -3240,7 +3233,7 @@ numeric_div_opt_error(Numeric num1, Numeric num2, bool *have_error)
 	/*
 	 * Do the divide and return the result
 	 */
-	div_var(&arg1, &arg2, &result, rscale, true);
+	div_var(&arg1, &arg2, &result, rscale, true, true);
 
 	res = make_result_opt_error(&result, have_error);
 
@@ -3329,7 +3322,7 @@ numeric_div_trunc(PG_FUNCTION_ARGS)
 	/*
 	 * Do the divide and return the result
 	 */
-	div_var(&arg1, &arg2, &result, 0, false);
+	div_var(&arg1, &arg2, &result, 0, false, true);
 
 	res = make_result(&result);
 
@@ -3600,7 +3593,7 @@ numeric_lcm(PG_FUNCTION_ARGS)
 	else
 	{
 		gcd_var(&arg1, &arg2, &result);
-		div_var(&arg1, &result, &result, 0, false);
+		div_var(&arg1, &result, &result, 0, false, true);
 		mul_var(&arg2, &result, &result, arg2.dscale);
 		result.sign = NUMERIC_POS;
 	}
@@ -6272,7 +6265,7 @@ numeric_stddev_internal(NumericAggState *state,
 		else
 			mul_var(&vN, &vN, &vNminus1, 0);	/* N * N */
 		rscale = select_div_scale(&vsumX2, &vNminus1);
-		div_var(&vsumX2, &vNminus1, &vsumX, rscale, true);	/* variance */
+		div_var(&vsumX2, &vNminus1, &vsumX, rscale, true, true);	/* variance */
 		if (!variance)
 			sqrt_var(&vsumX, &vsumX, rscale);	/* stddev */
 
@@ -7690,7 +7683,7 @@ get_str_from_var_sci(const NumericVar *var, int rscale)
 	init_var(&tmp_var);
 
 	power_ten_int(exponent, &tmp_var);
-	div_var(var, &tmp_var, &tmp_var, rscale, true);
+	div_var(var, &tmp_var, &tmp_var, rscale, true, true);
 	sig_out = get_str_from_var(&tmp_var);
 
 	free_var(&tmp_var);
@@ -9229,32 +9222,48 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 /*
  * div_var() -
  *
- *	Division on variable level. Quotient of var1 / var2 is stored in result.
- *	The quotient is figured to exactly rscale fractional digits.
- *	If round is true, it is rounded at the rscale'th digit; if false, it
- *	is truncated (towards zero) at that digit.
+ *	Compute the quotient var1 / var2 to rscale fractional digits.
+ *
+ *	If "round" is true, the result is rounded at the rscale'th digit; if
+ *	false, it is truncated (towards zero) at that digit.
+ *
+ *	If "exact" is true, the exact result is computed to the specified rscale;
+ *	if false, successive quotient digits are approximated up to rscale plus
+ *	DIV_GUARD_DIGITS extra digits, ignoring all contributions from digits to
+ *	the right of that, before rounding or truncating to the specified rscale.
+ *	This can be significantly faster, and usually gives the same result as the
+ *	exact computation, but it may occasionally be off by one in the final
+ *	digit, if contributions from the ignored digits would have propagated
+ *	through the guard digits.  This is good enough for the transcendental
+ *	functions, where small errors are acceptable.
  */
 static void
 div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
-		int rscale, bool round)
+		int rscale, bool round, bool exact)
 {
-	int			div_ndigits;
-	int			res_ndigits;
+	int			var1ndigits = var1->ndigits;
+	int			var2ndigits = var2->ndigits;
 	int			res_sign;
 	int			res_weight;
-	int			carry;
-	int			borrow;
-	int			divisor1;
-	int			divisor2;
-	NumericDigit *dividend;
-	NumericDigit *divisor;
+	int			res_ndigits;
+	int			var1ndigitpairs;
+	int			var2ndigitpairs;
+	int			res_ndigitpairs;
+	int			div_ndigitpairs;
+	int64	   *dividend;
+	int32	   *divisor;
+	double		fdivisor,
+				fdivisorinverse,
+				fdividend,
+				fquotient;
+	int64		maxdiv;
+	int			qi;
+	int32		qdigit;
+	int64		carry;
+	int64		newdig;
+	int64	   *remainder;
 	NumericDigit *res_digits;
 	int			i;
-	int			j;
-
-	/* copy these values into local vars for speed in inner loop */
-	int			var1ndigits = var1->ndigits;
-	int			var2ndigits = var2->ndigits;
 
 	/*
 	 * First of all division by zero check; we must not be handed an
@@ -9268,9 +9277,6 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	/*
 	 * If the divisor has just one or two digits, delegate to div_var_int(),
 	 * which uses fast short division.
-	 *
-	 * Similarly, on platforms with 128-bit integer support, delegate to
-	 * div_var_int64() for divisors with three or four digits.
 	 */
 	if (var2ndigits <= 2)
 	{
@@ -9290,26 +9296,6 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
 		return;
 	}
-#ifdef HAVE_INT128
-	if (var2ndigits <= 4)
-	{
-		int64		idivisor;
-		int			idivisor_weight;
-
-		idivisor = var2->digits[0];
-		idivisor_weight = var2->weight;
-		for (i = 1; i < var2ndigits; i++)
-		{
-			idivisor = idivisor * NBASE + var2->digits[i];
-			idivisor_weight--;
-		}
-		if (var2->sign == NUMERIC_NEG)
-			idivisor = -idivisor;
-
-		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
-		return;
-	}
-#endif
 
 	/*
 	 * Otherwise, perform full long division.
@@ -9332,7 +9318,7 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		res_sign = NUMERIC_POS;
 	else
 		res_sign = NUMERIC_NEG;
-	res_weight = var1->weight - var2->weight;
+	res_weight = var1->weight - var2->weight + 1;
 	/* The number of accurate result digits we need to produce: */
 	res_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
 	/* ... but always at least 1 */
@@ -9340,476 +9326,210 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	/* If rounding needed, figure one more digit to ensure correct result */
 	if (round)
 		res_ndigits++;
+	/* Add guard digits for roundoff error when producing approx result */
+	if (!exact)
+		res_ndigits += DIV_GUARD_DIGITS;
 
 	/*
-	 * The working dividend normally requires res_ndigits + var2ndigits
-	 * digits, but make it at least var1ndigits so we can load all of var1
-	 * into it.  (There will be an additional digit dividend[0] in the
-	 * dividend space, but for consistency with Knuth's notation we don't
-	 * count that in div_ndigits.)
-	 */
-	div_ndigits = res_ndigits + var2ndigits;
-	div_ndigits = Max(div_ndigits, var1ndigits);
-
-	/*
-	 * We need a workspace with room for the working dividend (div_ndigits+1
-	 * digits) plus room for the possibly-normalized divisor (var2ndigits
-	 * digits).  It is convenient also to have a zero at divisor[0] with the
-	 * actual divisor data in divisor[1 .. var2ndigits].  Transferring the
-	 * digits into the workspace also allows us to realloc the result (which
-	 * might be the same as either input var) before we begin the main loop.
-	 * Note that we use palloc0 to ensure that divisor[0], dividend[0], and
-	 * any additional dividend positions beyond var1ndigits, start out 0.
-	 */
-	dividend = (NumericDigit *)
-		palloc0((div_ndigits + var2ndigits + 2) * sizeof(NumericDigit));
-	divisor = dividend + (div_ndigits + 1);
-	memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
-	memcpy(divisor + 1, var2->digits, var2ndigits * sizeof(NumericDigit));
-
-	/*
-	 * Now we can realloc the result to hold the generated quotient digits.
+	 * The computation itself is done using base-NBASE^2 arithmetic, so we
+	 * actually process the input digits in pairs, producing base-NBASE^2
+	 * intermediate result digits.  This significantly improves performance,
+	 * since the computation is O(N^2) in the number of input digits, and
+	 * working in base NBASE^2 effectively halves "N".
 	 */
-	alloc_var(result, res_ndigits);
-	res_digits = result->digits;
+	var1ndigitpairs = (var1ndigits + 1) / 2;
+	var2ndigitpairs = (var2ndigits + 1) / 2;
+	res_ndigitpairs = (res_ndigits + 1) / 2;
+	res_ndigits = 2 * res_ndigitpairs;
 
 	/*
-	 * The full multiple-place algorithm is taken from Knuth volume 2,
-	 * Algorithm 4.3.1D.
+	 * We do the arithmetic in an array "dividend[]" of signed 64-bit
+	 * integers.  Since PG_INT64_MAX is much larger than NBASE^4, this gives
+	 * us a lot of headroom to avoid normalizing carries immediately.
+	 *
+	 * When performing an exact computation, the working dividend requires
+	 * res_ndigitpairs + var2ndigitpairs digits.  If var1 is larger than that,
+	 * the extra digits do not contribute to the result, and are ignored.
 	 *
-	 * We need the first divisor digit to be >= NBASE/2.  If it isn't, make it
-	 * so by scaling up both the divisor and dividend by the factor "d".  (The
-	 * reason for allocating dividend[0] above is to leave room for possible
-	 * carry here.)
+	 * When performing an approximate computation, the working dividend only
+	 * requires res_ndigitpairs digits (which includes the extra guard
+	 * digits).  All input digits beyond that are ignored.
 	 */
-	if (divisor[1] < HALF_NBASE)
+	if (exact)
 	{
-		int			d = NBASE / (divisor[1] + 1);
-
-		carry = 0;
-		for (i = var2ndigits; i > 0; i--)
-		{
-			carry += divisor[i] * d;
-			divisor[i] = carry % NBASE;
-			carry = carry / NBASE;
-		}
-		Assert(carry == 0);
-		carry = 0;
-		/* at this point only var1ndigits of dividend can be nonzero */
-		for (i = var1ndigits; i >= 0; i--)
-		{
-			carry += dividend[i] * d;
-			dividend[i] = carry % NBASE;
-			carry = carry / NBASE;
-		}
-		Assert(carry == 0);
-		Assert(divisor[1] >= HALF_NBASE);
+		div_ndigitpairs = res_ndigitpairs + var2ndigitpairs;
+		var1ndigitpairs = Min(var1ndigitpairs, div_ndigitpairs);
 	}
-	/* First 2 divisor digits are used repeatedly in main loop */
-	divisor1 = divisor[1];
-	divisor2 = divisor[2];
-
-	/*
-	 * Begin the main loop.  Each iteration of this loop produces the j'th
-	 * quotient digit by dividing dividend[j .. j + var2ndigits] by the
-	 * divisor; this is essentially the same as the common manual procedure
-	 * for long division.
-	 */
-	for (j = 0; j < res_ndigits; j++)
+	else
 	{
-		/* Estimate quotient digit from the first two dividend digits */
-		int			next2digits = dividend[j] * NBASE + dividend[j + 1];
-		int			qhat;
-
-		/*
-		 * If next2digits are 0, then quotient digit must be 0 and there's no
-		 * need to adjust the working dividend.  It's worth testing here to
-		 * fall out ASAP when processing trailing zeroes in a dividend.
-		 */
-		if (next2digits == 0)
-		{
-			res_digits[j] = 0;
-			continue;
-		}
-
-		if (dividend[j] == divisor1)
-			qhat = NBASE - 1;
-		else
-			qhat = next2digits / divisor1;
-
-		/*
-		 * Adjust quotient digit if it's too large.  Knuth proves that after
-		 * this step, the quotient digit will be either correct or just one
-		 * too large.  (Note: it's OK to use dividend[j+2] here because we
-		 * know the divisor length is at least 2.)
-		 */
-		while (divisor2 * qhat >
-			   (next2digits - qhat * divisor1) * NBASE + dividend[j + 2])
-			qhat--;
-
-		/* As above, need do nothing more when quotient digit is 0 */
-		if (qhat > 0)
-		{
-			NumericDigit *dividend_j = &dividend[j];
-
-			/*
-			 * Multiply the divisor by qhat, and subtract that from the
-			 * working dividend.  The multiplication and subtraction are
-			 * folded together here, noting that qhat <= NBASE (since it might
-			 * be one too large), and so the intermediate result "tmp_result"
-			 * is in the range [-NBASE^2, NBASE - 1], and "borrow" is in the
-			 * range [0, NBASE].
-			 */
-			borrow = 0;
-			for (i = var2ndigits; i >= 0; i--)
-			{
-				int			tmp_result;
-
-				tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
-				borrow = (NBASE - 1 - tmp_result) / NBASE;
-				dividend_j[i] = tmp_result + borrow * NBASE;
-			}
-
-			/*
-			 * If we got a borrow out of the top dividend digit, then indeed
-			 * qhat was one too large.  Fix it, and add back the divisor to
-			 * correct the working dividend.  (Knuth proves that this will
-			 * occur only about 3/NBASE of the time; hence, it's a good idea
-			 * to test this code with small NBASE to be sure this section gets
-			 * exercised.)
-			 */
-			if (borrow)
-			{
-				qhat--;
-				carry = 0;
-				for (i = var2ndigits; i >= 0; i--)
-				{
-					carry += dividend_j[i] + divisor[i];
-					if (carry >= NBASE)
-					{
-						dividend_j[i] = carry - NBASE;
-						carry = 1;
-					}
-					else
-					{
-						dividend_j[i] = carry;
-						carry = 0;
-					}
-				}
-				/* A carry should occur here to cancel the borrow above */
-				Assert(carry == 1);
-			}
-		}
-
-		/* And we're done with this quotient digit */
-		res_digits[j] = qhat;
+		div_ndigitpairs = res_ndigitpairs;
+		var1ndigitpairs = Min(var1ndigitpairs, div_ndigitpairs);
+		var2ndigitpairs = Min(var2ndigitpairs, div_ndigitpairs);
 	}
 
-	pfree(dividend);
-
-	/*
-	 * Finally, round or truncate the result to the requested precision.
-	 */
-	result->weight = res_weight;
-	result->sign = res_sign;
-
-	/* Round or truncate to target rscale (and set result->dscale) */
-	if (round)
-		round_var(result, rscale);
-	else
-		trunc_var(result, rscale);
-
-	/* Strip leading and trailing zeroes */
-	strip_var(result);
-}
-
-
-/*
- * div_var_fast() -
- *
- *	This has the same API as div_var, but is implemented using the division
- *	algorithm from the "FM" library, rather than Knuth's schoolbook-division
- *	approach.  This is significantly faster but can produce inaccurate
- *	results, because it sometimes has to propagate rounding to the left,
- *	and so we can never be entirely sure that we know the requested digits
- *	exactly.  We compute DIV_GUARD_DIGITS extra digits, but there is
- *	no certainty that that's enough.  We use this only in the transcendental
- *	function calculation routines, where everything is approximate anyway.
- *
- *	Although we provide a "round" argument for consistency with div_var,
- *	it is unwise to use this function with round=false.  In truncation mode
- *	it is possible to get a result with no significant digits, for example
- *	with rscale=0 we might compute 0.99999... and truncate that to 0 when
- *	the correct answer is 1.
- */
-static void
-div_var_fast(const NumericVar *var1, const NumericVar *var2,
-			 NumericVar *result, int rscale, bool round)
-{
-	int			div_ndigits;
-	int			load_ndigits;
-	int			res_sign;
-	int			res_weight;
-	int		   *div;
-	int			qdigit;
-	int			carry;
-	int			maxdiv;
-	int			newdig;
-	NumericDigit *res_digits;
-	double		fdividend,
-				fdivisor,
-				fdivisorinverse,
-				fquotient;
-	int			qi;
-	int			i;
-
-	/* copy these values into local vars for speed in inner loop */
-	int			var1ndigits = var1->ndigits;
-	int			var2ndigits = var2->ndigits;
-	NumericDigit *var1digits = var1->digits;
-	NumericDigit *var2digits = var2->digits;
-
 	/*
-	 * First of all division by zero check; we must not be handed an
-	 * unnormalized divisor.
-	 */
-	if (var2ndigits == 0 || var2digits[0] == 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_DIVISION_BY_ZERO),
-				 errmsg("division by zero")));
-
-	/*
-	 * If the divisor has just one or two digits, delegate to div_var_int(),
-	 * which uses fast short division.
+	 * Allocate room for the working dividend (div_ndigitpairs 64-bit digits)
+	 * plus the divisor (var2ndigitpairs 32-bit base-NBASE^2 digits).
 	 *
-	 * Similarly, on platforms with 128-bit integer support, delegate to
-	 * div_var_int64() for divisors with three or four digits.
+	 * For convenience, we allocate one extra dividend digit, which is set to
+	 * zero and not counted in div_ndigitpairs.  This is used when expanding
+	 * the remainder down, and also so that we don't need to worry about
+	 * reading off the end of the dividend array when computing fdividend.
 	 */
-	if (var2ndigits <= 2)
-	{
-		int			idivisor;
-		int			idivisor_weight;
+	dividend = (int64 *) palloc((div_ndigitpairs + 1) * sizeof(int64) +
+								var2ndigitpairs * sizeof(int32));
+	divisor = (int32 *) (dividend + div_ndigitpairs + 1);
 
-		idivisor = var2->digits[0];
-		idivisor_weight = var2->weight;
-		if (var2ndigits == 2)
-		{
-			idivisor = idivisor * NBASE + var2->digits[1];
-			idivisor_weight--;
-		}
-		if (var2->sign == NUMERIC_NEG)
-			idivisor = -idivisor;
+	/* load var1 into dividend[0 .. var1ndigitpairs-1], zeroing the rest */
+	for (i = 0; i < var1ndigitpairs - 1; i++)
+		dividend[i] = var1->digits[2 * i] * NBASE + var1->digits[2 * i + 1];
 
-		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
-		return;
-	}
-#ifdef HAVE_INT128
-	if (var2ndigits <= 4)
-	{
-		int64		idivisor;
-		int			idivisor_weight;
-
-		idivisor = var2->digits[0];
-		idivisor_weight = var2->weight;
-		for (i = 1; i < var2ndigits; i++)
-		{
-			idivisor = idivisor * NBASE + var2->digits[i];
-			idivisor_weight--;
-		}
-		if (var2->sign == NUMERIC_NEG)
-			idivisor = -idivisor;
-
-		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
-		return;
-	}
-#endif
+	if (2 * i + 1 < var1ndigits)
+		dividend[i] = var1->digits[2 * i] * NBASE + var1->digits[2 * i + 1];
+	else
+		dividend[i] = var1->digits[2 * i] * NBASE;
 
-	/*
-	 * Otherwise, perform full long division.
-	 */
+	memset(dividend + i + 1, 0, (div_ndigitpairs - i) * sizeof(int64));
 
-	/* Result zero check */
-	if (var1ndigits == 0)
-	{
-		zero_var(result);
-		result->dscale = rscale;
-		return;
-	}
+	/* load var2 into divisor[0 .. var2ndigitpairs-1] */
+	for (i = 0; i < var2ndigitpairs - 1; i++)
+		divisor[i] = var2->digits[2 * i] * NBASE + var2->digits[2 * i + 1];
 
-	/*
-	 * Determine the result sign, weight and number of digits to calculate
-	 */
-	if (var1->sign == var2->sign)
-		res_sign = NUMERIC_POS;
+	if (2 * i + 1 < var2ndigits)
+		divisor[i] = var2->digits[2 * i] * NBASE + var2->digits[2 * i + 1];
 	else
-		res_sign = NUMERIC_NEG;
-	res_weight = var1->weight - var2->weight + 1;
-	/* The number of accurate result digits we need to produce: */
-	div_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
-	/* Add guard digits for roundoff error */
-	div_ndigits += DIV_GUARD_DIGITS;
-	if (div_ndigits < DIV_GUARD_DIGITS)
-		div_ndigits = DIV_GUARD_DIGITS;
-
-	/*
-	 * We do the arithmetic in an array "div[]" of signed int's.  Since
-	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
-	 * to avoid normalizing carries immediately.
-	 *
-	 * We start with div[] containing one zero digit followed by the
-	 * dividend's digits (plus appended zeroes to reach the desired precision
-	 * including guard digits).  Each step of the main loop computes an
-	 * (approximate) quotient digit and stores it into div[], removing one
-	 * position of dividend space.  A final pass of carry propagation takes
-	 * care of any mistaken quotient digits.
-	 *
-	 * Note that div[] doesn't necessarily contain all of the digits from the
-	 * dividend --- the desired precision plus guard digits might be less than
-	 * the dividend's precision.  This happens, for example, in the square
-	 * root algorithm, where we typically divide a 2N-digit number by an
-	 * N-digit number, and only require a result with N digits of precision.
-	 */
-	div = (int *) palloc0((div_ndigits + 1) * sizeof(int));
-	load_ndigits = Min(div_ndigits, var1ndigits);
-	for (i = 0; i < load_ndigits; i++)
-		div[i + 1] = var1digits[i];
+		divisor[i] = var2->digits[2 * i] * NBASE;
 
 	/*
 	 * We estimate each quotient digit using floating-point arithmetic, taking
-	 * the first four digits of the (current) dividend and divisor.  This must
-	 * be float to avoid overflow.  The quotient digits will generally be off
-	 * by no more than one from the exact answer.
-	 */
-	fdivisor = (double) var2digits[0];
-	for (i = 1; i < 4; i++)
-	{
-		fdivisor *= NBASE;
-		if (i < var2ndigits)
-			fdivisor += (double) var2digits[i];
-	}
+	 * the first 2 base-NBASE^2 digits of the (current) dividend and divisor.
+	 * This must be float to avoid overflow.
+	 *
+	 * Since the floating-point dividend and divisor use at least 4 base-NBASE
+	 * input digits, they include roughly 40-53 bits of information from their
+	 * respective inputs (assuming NBASE is 10000), which fits well in IEEE
+	 * double-precision variables.  In addition, the relative error in the
+	 * floating-point quotient digit will be less than around 1/NBASE^3, so
+	 * the estimated base-NBASE^2 quotient digit will typically be correct,
+	 * and should not be off by more than one from the correct value.
+	 */
+	fdivisor = (double) divisor[0] * NBASE_SQR;
+	if (var2ndigitpairs > 1)
+		fdivisor += (double) divisor[1];
 	fdivisorinverse = 1.0 / fdivisor;
 
 	/*
-	 * maxdiv tracks the maximum possible absolute value of any div[] entry;
-	 * when this threatens to exceed INT_MAX, we take the time to propagate
-	 * carries.  Furthermore, we need to ensure that overflow doesn't occur
-	 * during the carry propagation passes either.  The carry values may have
-	 * an absolute value as high as INT_MAX/NBASE + 1, so really we must
-	 * normalize when digits threaten to exceed INT_MAX - INT_MAX/NBASE - 1.
+	 * maxdiv tracks the maximum possible absolute value of any dividend[]
+	 * entry; when this threatens to exceed PG_INT64_MAX, we take the time to
+	 * propagate carries.  Furthermore, we need to ensure that overflow
+	 * doesn't occur during the carry propagation passes either.  The carry
+	 * values may have an absolute value as high as PG_INT64_MAX/NBASE^2 + 1,
+	 * so really we must normalize when digits threaten to exceed PG_INT64_MAX
+	 * - PG_INT64_MAX/NBASE^2 - 1.
 	 *
 	 * To avoid overflow in maxdiv itself, it represents the max absolute
-	 * value divided by NBASE-1, ie, at the top of the loop it is known that
-	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+	 * value divided by NBASE^2-1, ie, at the top of the loop it is known that
+	 * no dividend[] entry has an absolute value exceeding maxdiv *
+	 * (NBASE^2-1).
 	 *
-	 * Actually, though, that holds good only for div[] entries after div[qi];
-	 * the adjustment done at the bottom of the loop may cause div[qi + 1] to
-	 * exceed the maxdiv limit, so that div[qi] in the next iteration is
-	 * beyond the limit.  This does not cause problems, as explained below.
+	 * Actually, though, that holds good only for dividend[] entries after
+	 * dividend[qi]; the adjustment done at the bottom of the loop may cause
+	 * dividend[qi + 1] to exceed the maxdiv limit, so that dividend[qi] in
+	 * the next iteration is beyond the limit.  This does not cause problems,
+	 * as explained below.
 	 */
 	maxdiv = 1;
 
 	/*
-	 * Outer loop computes next quotient digit, which will go into div[qi]
+	 * Outer loop computes next quotient digit, which goes in dividend[qi].
 	 */
-	for (qi = 0; qi < div_ndigits; qi++)
+	for (qi = 0; qi < res_ndigitpairs; qi++)
 	{
 		/* Approximate the current dividend value */
-		fdividend = (double) div[qi];
-		for (i = 1; i < 4; i++)
-		{
-			fdividend *= NBASE;
-			if (qi + i <= div_ndigits)
-				fdividend += (double) div[qi + i];
-		}
+		fdividend = (double) dividend[qi] * NBASE_SQR;
+		fdividend += (double) dividend[qi + 1];
+
 		/* Compute the (approximate) quotient digit */
 		fquotient = fdividend * fdivisorinverse;
-		qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
-			(((int) fquotient) - 1);	/* truncate towards -infinity */
+		qdigit = (fquotient >= 0.0) ? ((int32) fquotient) :
+			(((int32) fquotient) - 1);	/* truncate towards -infinity */
 
 		if (qdigit != 0)
 		{
 			/* Do we need to normalize now? */
-			maxdiv += abs(qdigit);
-			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
+			maxdiv += i64abs(qdigit);
+			if (maxdiv > (PG_INT64_MAX - PG_INT64_MAX / NBASE_SQR - 1) / (NBASE_SQR - 1))
 			{
 				/*
-				 * Yes, do it.  Note that if var2ndigits is much smaller than
-				 * div_ndigits, we can save a significant amount of effort
-				 * here by noting that we only need to normalise those div[]
-				 * entries touched where prior iterations subtracted multiples
-				 * of the divisor.
+				 * Yes, do it.  Note that if var2ndigitpairs is much smaller
+				 * than div_ndigitpairs, we can save a significant amount of
+				 * effort here by noting that we only need to normalise those
+				 * dividend[] entries touched where prior iterations
+				 * subtracted multiples of the divisor.
 				 */
 				carry = 0;
-				for (i = Min(qi + var2ndigits - 2, div_ndigits); i > qi; i--)
+				for (i = Min(qi + var2ndigitpairs - 2, div_ndigitpairs - 1); i > qi; i--)
 				{
-					newdig = div[i] + carry;
+					newdig = dividend[i] + carry;
 					if (newdig < 0)
 					{
-						carry = -((-newdig - 1) / NBASE) - 1;
-						newdig -= carry * NBASE;
+						carry = -((-newdig - 1) / NBASE_SQR) - 1;
+						newdig -= carry * NBASE_SQR;
 					}
-					else if (newdig >= NBASE)
+					else if (newdig >= NBASE_SQR)
 					{
-						carry = newdig / NBASE;
-						newdig -= carry * NBASE;
+						carry = newdig / NBASE_SQR;
+						newdig -= carry * NBASE_SQR;
 					}
 					else
 						carry = 0;
-					div[i] = newdig;
+					dividend[i] = newdig;
 				}
-				newdig = div[qi] + carry;
-				div[qi] = newdig;
+				dividend[qi] += carry;
 
 				/*
-				 * All the div[] digits except possibly div[qi] are now in the
-				 * range 0..NBASE-1.  We do not need to consider div[qi] in
-				 * the maxdiv value anymore, so we can reset maxdiv to 1.
+				 * All the dividend[] digits except possibly dividend[qi] are
+				 * now in the range 0..NBASE^2-1.  We do not need to consider
+				 * dividend[qi] in the maxdiv value anymore, so we can reset
+				 * maxdiv to 1.
 				 */
 				maxdiv = 1;
 
 				/*
 				 * Recompute the quotient digit since new info may have
-				 * propagated into the top four dividend digits
+				 * propagated into the top two dividend digits.
 				 */
-				fdividend = (double) div[qi];
-				for (i = 1; i < 4; i++)
-				{
-					fdividend *= NBASE;
-					if (qi + i <= div_ndigits)
-						fdividend += (double) div[qi + i];
-				}
-				/* Compute the (approximate) quotient digit */
+				fdividend = (double) dividend[qi] * NBASE_SQR;
+				fdividend += (double) dividend[qi + 1];
 				fquotient = fdividend * fdivisorinverse;
-				qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
-					(((int) fquotient) - 1);	/* truncate towards -infinity */
-				maxdiv += abs(qdigit);
+				qdigit = (fquotient >= 0.0) ? ((int32) fquotient) :
+					(((int32) fquotient) - 1);	/* truncate towards -infinity */
+
+				maxdiv += i64abs(qdigit);
 			}
 
 			/*
 			 * Subtract off the appropriate multiple of the divisor.
 			 *
-			 * The digits beyond div[qi] cannot overflow, because we know they
-			 * will fall within the maxdiv limit.  As for div[qi] itself, note
-			 * that qdigit is approximately trunc(div[qi] / vardigits[0]),
-			 * which would make the new value simply div[qi] mod vardigits[0].
-			 * The lower-order terms in qdigit can change this result by not
-			 * more than about twice INT_MAX/NBASE, so overflow is impossible.
+			 * The digits beyond dividend[qi] cannot overflow, because we know
+			 * they will fall within the maxdiv limit.  As for dividend[qi]
+			 * itself, note that qdigit is approximately trunc(dividend[qi] /
+			 * divisor[0]), which would make the new value simply dividend[qi] *
+			 * mod divisor[0].  The lower-order terms in qdigit can change
+			 * this result by not more than about twice PG_INT64_MAX/NBASE^2,
+			 * so overflow is impossible.
 			 *
 			 * This inner loop is the performance bottleneck for division, so
 			 * code it in the same way as the inner loop of mul_var() so that
-			 * it can be auto-vectorized.  We cast qdigit to NumericDigit
-			 * before multiplying to allow the compiler to generate more
-			 * efficient code (using 16-bit multiplication), which is safe
-			 * since we know that the quotient digit is off by at most one, so
-			 * there is no overflow risk.
+			 * it can be auto-vectorized.
 			 */
 			if (qdigit != 0)
 			{
-				int			istop = Min(var2ndigits, div_ndigits - qi + 1);
-				int		   *div_qi = &div[qi];
+				int			istop = Min(var2ndigitpairs, div_ndigitpairs - qi);
+				int64	   *dividend_qi = &dividend[qi];
 
 				for (i = 0; i < istop; i++)
-					div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i];
+					dividend_qi[i] -= (int64) qdigit * divisor[i];
 			}
 		}
 
@@ -9818,78 +9538,180 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2,
 		 * Fold it into the next digit position.
 		 *
 		 * There is no risk of overflow here, although proving that requires
-		 * some care.  Much as with the argument for div[qi] not overflowing,
-		 * if we consider the first two terms in the numerator and denominator
-		 * of qdigit, we can see that the final value of div[qi + 1] will be
-		 * approximately a remainder mod (vardigits[0]*NBASE + vardigits[1]).
-		 * Accounting for the lower-order terms is a bit complicated but ends
-		 * up adding not much more than INT_MAX/NBASE to the possible range.
-		 * Thus, div[qi + 1] cannot overflow here, and in its role as div[qi]
-		 * in the next loop iteration, it can't be large enough to cause
-		 * overflow in the carry propagation step (if any), either.
+		 * some care.  Much as with the argument for dividend[qi] not
+		 * overflowing, if we consider the first two terms in the numerator
+		 * and denominator of qdigit, we can see that the final value of
+		 * dividend[qi + 1] will be approximately a remainder mod
+		 * (divisor[0]*NBASE^2 + divisor[1]).  Accounting for the lower-order
+		 * terms is a bit complicated but ends up adding not much more than
+		 * PG_INT64_MAX/NBASE^2 to the possible range.  Thus, dividend[qi + 1]
+		 * cannot overflow here, and in its role as dividend[qi] in the next
+		 * loop iteration, it can't be large enough to cause overflow in the
+		 * carry propagation step (if any), either.
 		 *
-		 * But having said that: div[qi] can be more than INT_MAX/NBASE, as
-		 * noted above, which means that the product div[qi] * NBASE *can*
-		 * overflow.  When that happens, adding it to div[qi + 1] will always
-		 * cause a canceling overflow so that the end result is correct.  We
-		 * could avoid the intermediate overflow by doing the multiplication
-		 * and addition in int64 arithmetic, but so far there appears no need.
+		 * But having said that: dividend[qi] can be more than
+		 * PG_INT64_MAX/NBASE^2, as noted above, which means that the product
+		 * dividend[qi] * NBASE^2 *can* overflow.  When that happens, adding
+		 * it to dividend[qi + 1] will always cause a canceling overflow so
+		 * that the end result is correct.  We could avoid the intermediate
+		 * overflow by doing the multiplication and addition using unsigned
+		 * int64 arithmetic, which is modulo 2^64, but so far there appears no
+		 * need.
 		 */
-		div[qi + 1] += div[qi] * NBASE;
+		dividend[qi + 1] += dividend[qi] * NBASE_SQR;
 
-		div[qi] = qdigit;
+		dividend[qi] = qdigit;
 	}
 
 	/*
-	 * Approximate and store the last quotient digit (div[div_ndigits])
+	 * If an exact result was requested, use the remainder to correct the
+	 * approximate quotient.  The remainder is in dividend[], immediately
+	 * after the quotient digits.  Note, however, that although the remainder
+	 * starts at dividend[qi], the first digit is the result of folding two
+	 * remainder digits into one above, so it needs to be expanded down by one
+	 * digit when it is normalized.  Doing so shifts the least significant
+	 * base-NBASE^2 digit out of the working dividend (which is why we
+	 * allocated space for one extra digit above).  For the purposes of
+	 * correcting the quotient, only the first var2ndigitpairs base-NBASE^2
+	 * digits of the remainder matter.
 	 */
-	fdividend = (double) div[qi];
-	for (i = 1; i < 4; i++)
-		fdividend *= NBASE;
-	fquotient = fdividend * fdivisorinverse;
-	qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
-		(((int) fquotient) - 1);	/* truncate towards -infinity */
-	div[qi] = qdigit;
+	if (exact)
+	{
+		/* Normalize the remainder, expanding it down by one digit */
+		remainder = &dividend[qi];
+		carry = 0;
+		for (i = var2ndigitpairs - 1; i >= 0; i--)
+		{
+			newdig = remainder[i] + carry;
+			if (newdig < 0)
+			{
+				carry = -((-newdig - 1) / NBASE_SQR) - 1;
+				newdig -= carry * NBASE_SQR;
+			}
+			else if (newdig >= NBASE_SQR)
+			{
+				carry = newdig / NBASE_SQR;
+				newdig -= carry * NBASE_SQR;
+			}
+			else
+				carry = 0;
+			remainder[i + 1] = newdig;
+		}
+		remainder[0] = carry;
+
+		if (remainder[0] < 0)
+		{
+			/* remainder < 0; quotient is too large */
+			do
+			{
+				/* Add divisor to remainder */
+				carry = 0;
+				for (i = var2ndigitpairs - 1; i > 0; i--)
+				{
+					newdig = remainder[i] + divisor[i] + carry;
+					if (newdig >= NBASE_SQR)
+					{
+						remainder[i] = newdig - NBASE_SQR;
+						carry = 1;
+					}
+					else
+					{
+						remainder[i] = newdig;
+						carry = 0;
+					}
+				}
+				remainder[0] += divisor[0] + carry;
+
+				/* Subtract 1 from quotient (propagating carries later) */
+				dividend[qi - 1]--;
+
+			} while (remainder[0] < 0);
+		}
+		else
+		{
+			/* remainder >= 0 */
+			while (true)
+			{
+				bool		less = false;
+
+				/* Is remainder < divisor? */
+				for (i = 0; i < var2ndigitpairs; i++)
+				{
+					if (remainder[i] < divisor[i])
+					{
+						less = true;
+						break;
+					}
+					if (remainder[i] > divisor[i])
+						break;	/* remainder > divisor */
+				}
+				if (less)
+					break;
+
+				/* Subtract divisor from remainder */
+				carry = 0;
+				for (i = var2ndigitpairs - 1; i > 0; i--)
+				{
+					newdig = remainder[i] - divisor[i] + carry;
+					if (newdig < 0)
+					{
+						remainder[i] = newdig + NBASE_SQR;
+						carry = -1;
+					}
+					else
+					{
+						remainder[i] = newdig;
+						carry = 0;
+					}
+				}
+				remainder[0] = remainder[0] - divisor[0] + carry;
+
+				/* Add 1 to quotient (propagating carries later) */
+				dividend[qi - 1]++;
+			}
+		}
+	}
 
 	/*
-	 * Because the quotient digits might be off by one, some of them might be
-	 * -1 or NBASE at this point.  The represented value is correct in a
-	 * mathematical sense, but it doesn't look right.  We do a final carry
-	 * propagation pass to normalize the digits, which we combine with storing
-	 * the result digits into the output.  Note that this is still done at
-	 * full precision w/guard digits.
+	 * Because the quotient digits were estimates that might have been off by
+	 * one (and we didn't bother propagating carries when adjusting the
+	 * quotient above), some quotient digits might be out of range, so do a
+	 * final carry propagation pass to normalize back to base NBASE^2, and
+	 * construct the base-NBASE result digits.  Note that this is still done
+	 * at full precision w/guard digits.
 	 */
-	alloc_var(result, div_ndigits + 1);
+	alloc_var(result, res_ndigits);
 	res_digits = result->digits;
 	carry = 0;
-	for (i = div_ndigits; i >= 0; i--)
+	for (i = res_ndigitpairs - 1; i >= 0; i--)
 	{
-		newdig = div[i] + carry;
+		newdig = dividend[i] + carry;
 		if (newdig < 0)
 		{
-			carry = -((-newdig - 1) / NBASE) - 1;
-			newdig -= carry * NBASE;
+			carry = -((-newdig - 1) / NBASE_SQR) - 1;
+			newdig -= carry * NBASE_SQR;
 		}
-		else if (newdig >= NBASE)
+		else if (newdig >= NBASE_SQR)
 		{
-			carry = newdig / NBASE;
-			newdig -= carry * NBASE;
+			carry = newdig / NBASE_SQR;
+			newdig -= carry * NBASE_SQR;
 		}
 		else
 			carry = 0;
-		res_digits[i] = newdig;
+		res_digits[2 * i + 1] = (NumericDigit) ((uint32) newdig % NBASE);
+		res_digits[2 * i] = (NumericDigit) ((uint32) newdig / NBASE);
 	}
 	Assert(carry == 0);
 
-	pfree(div);
+	pfree(dividend);
 
 	/*
-	 * Finally, round the result to the requested precision.
+	 * Finally, round or truncate the result to the requested precision.
 	 */
 	result->weight = res_weight;
 	result->sign = res_sign;
 
-	/* Round to target rscale (and set result->dscale) */
+	/* Round or truncate to target rscale (and set result->dscale) */
 	if (round)
 		round_var(result, rscale);
 	else
@@ -10012,123 +9834,6 @@ div_var_int(const NumericVar *var, int ival, int ival_weight,
 }
 
 
-#ifdef HAVE_INT128
-/*
- * div_var_int64() -
- *
- *	Divide a numeric variable by a 64-bit integer with the specified weight.
- *	The quotient var / (ival * NBASE^ival_weight) is stored in result.
- *
- *	This duplicates the logic in div_var_int(), so any changes made there
- *	should be made here too.
- */
-static void
-div_var_int64(const NumericVar *var, int64 ival, int ival_weight,
-			  NumericVar *result, int rscale, bool round)
-{
-	NumericDigit *var_digits = var->digits;
-	int			var_ndigits = var->ndigits;
-	int			res_sign;
-	int			res_weight;
-	int			res_ndigits;
-	NumericDigit *res_buf;
-	NumericDigit *res_digits;
-	uint64		divisor;
-	int			i;
-
-	/* Guard against division by zero */
-	if (ival == 0)
-		ereport(ERROR,
-				errcode(ERRCODE_DIVISION_BY_ZERO),
-				errmsg("division by zero"));
-
-	/* Result zero check */
-	if (var_ndigits == 0)
-	{
-		zero_var(result);
-		result->dscale = rscale;
-		return;
-	}
-
-	/*
-	 * Determine the result sign, weight and number of digits to calculate.
-	 * The weight figured here is correct if the emitted quotient has no
-	 * leading zero digits; otherwise strip_var() will fix things up.
-	 */
-	if (var->sign == NUMERIC_POS)
-		res_sign = ival > 0 ? NUMERIC_POS : NUMERIC_NEG;
-	else
-		res_sign = ival > 0 ? NUMERIC_NEG : NUMERIC_POS;
-	res_weight = var->weight - ival_weight;
-	/* The number of accurate result digits we need to produce: */
-	res_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
-	/* ... but always at least 1 */
-	res_ndigits = Max(res_ndigits, 1);
-	/* If rounding needed, figure one more digit to ensure correct result */
-	if (round)
-		res_ndigits++;
-
-	res_buf = digitbuf_alloc(res_ndigits + 1);
-	res_buf[0] = 0;				/* spare digit for later rounding */
-	res_digits = res_buf + 1;
-
-	/*
-	 * Now compute the quotient digits.  This is the short division algorithm
-	 * described in Knuth volume 2, section 4.3.1 exercise 16, except that we
-	 * allow the divisor to exceed the internal base.
-	 *
-	 * In this algorithm, the carry from one digit to the next is at most
-	 * divisor - 1.  Therefore, while processing the next digit, carry may
-	 * become as large as divisor * NBASE - 1, and so it requires a 128-bit
-	 * integer if this exceeds PG_UINT64_MAX.
-	 */
-	divisor = i64abs(ival);
-
-	if (divisor <= PG_UINT64_MAX / NBASE)
-	{
-		/* carry cannot overflow 64 bits */
-		uint64		carry = 0;
-
-		for (i = 0; i < res_ndigits; i++)
-		{
-			carry = carry * NBASE + (i < var_ndigits ? var_digits[i] : 0);
-			res_digits[i] = (NumericDigit) (carry / divisor);
-			carry = carry % divisor;
-		}
-	}
-	else
-	{
-		/* carry may exceed 64 bits */
-		uint128		carry = 0;
-
-		for (i = 0; i < res_ndigits; i++)
-		{
-			carry = carry * NBASE + (i < var_ndigits ? var_digits[i] : 0);
-			res_digits[i] = (NumericDigit) (carry / divisor);
-			carry = carry % divisor;
-		}
-	}
-
-	/* Store the quotient in result */
-	digitbuf_free(result->buf);
-	result->ndigits = res_ndigits;
-	result->buf = res_buf;
-	result->digits = res_digits;
-	result->weight = res_weight;
-	result->sign = res_sign;
-
-	/* Round or truncate to target rscale (and set result->dscale) */
-	if (round)
-		round_var(result, rscale);
-	else
-		trunc_var(result, rscale);
-
-	/* Strip leading/trailing zeroes */
-	strip_var(result);
-}
-#endif
-
-
 /*
  * Default scale selection for division
  *
@@ -10216,7 +9921,7 @@ mod_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
 	 * div_var can be persuaded to give us trunc(x/y) directly.
 	 * ----------
 	 */
-	div_var(var1, var2, &tmp, 0, false);
+	div_var(var1, var2, &tmp, 0, false, true);
 
 	mul_var(var2, &tmp, &tmp, var2->dscale);
 
@@ -10243,11 +9948,11 @@ div_mod_var(const NumericVar *var1, const NumericVar *var2,
 	init_var(&r);
 
 	/*
-	 * Use div_var_fast() to get an initial estimate for the integer quotient.
-	 * This might be inaccurate (per the warning in div_var_fast's comments),
-	 * but we can correct it below.
+	 * Use div_var() with exact = false to get an initial estimate for the
+	 * integer quotient (truncated towards zero).  This might be slightly
+	 * inaccurate, but we correct it below.
 	 */
-	div_var_fast(var1, var2, &q, 0, false);
+	div_var(var1, var2, &q, 0, false, false);
 
 	/* Compute initial estimate of remainder using the quotient estimate. */
 	mul_var(var2, &q, &r, var2->dscale);
@@ -11190,7 +10895,7 @@ ln_var(const NumericVar *arg, NumericVar *result, int rscale)
 
 	sub_var(&x, &const_one, result);
 	add_var(&x, &const_one, &elem);
-	div_var_fast(result, &elem, result, local_rscale, true);
+	div_var(result, &elem, result, local_rscale, true, false);
 	set_var_from_var(result, &xx);
 	mul_var(result, result, &x, local_rscale);
 
@@ -11274,7 +10979,7 @@ log_var(const NumericVar *base, const NumericVar *num, NumericVar *result)
 	ln_var(num, &ln_num, ln_num_rscale);
 
 	/* Divide and round to the required scale */
-	div_var_fast(&ln_num, &ln_base, result, rscale, true);
+	div_var(&ln_num, &ln_base, result, rscale, true, false);
 
 	free_var(&ln_num);
 	free_var(&ln_base);
@@ -11536,7 +11241,7 @@ power_var_int(const NumericVar *base, int exp, int exp_dscale,
 			round_var(result, rscale);
 			return;
 		case -1:
-			div_var(&const_one, base, result, rscale, true);
+			div_var(&const_one, base, result, rscale, true, true);
 			return;
 		case 2:
 			mul_var(base, base, result, rscale);
@@ -11644,7 +11349,7 @@ power_var_int(const NumericVar *base, int exp, int exp_dscale,
 
 	/* Compensate for input sign, and round to requested rscale */
 	if (neg)
-		div_var_fast(&const_one, result, result, rscale, true);
+		div_var(&const_one, result, result, rscale, true, false);
 	else
 		round_var(result, rscale);
 }
-- 
2.43.0

v1-0002-Test-code-for-div_var.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Test-code-for-div_var.patchDownload

From 8d0af4cd9621989b4787b0f0e86d16f332f78757 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Fri, 23 Aug 2024 09:58:53 +0100
Subject: [PATCH v1 2/2] Test code for div_var().

This adds 3 SQL-callable functions for comparing the old div_var() and
div_var_fast() functions with the new div_var() function:

- div_knuth(num1 numeric, num2 numeric, rscale int, round bool)
- div_fast(num1 numeric, num2 numeric, rscale int, round bool)
- div_new(num1 numeric, num2 numeric, rscale int, round bool, exact bool)

Not intended to be pushed to master.
---
 src/backend/utils/adt/numeric.c | 1181 +++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat |   12 +
 2 files changed, 1193 insertions(+)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 7d5038f575..10ab454aae 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -12328,3 +12328,1184 @@ accum_sum_combine(NumericSumAccum *accum, NumericSumAccum *accum2)
 
 	free_var(&tmp_var);
 }
+
+
+/* ================= Test code, not intended for commit ================= */
+
+
+extern Numeric numeric_div_knuth_opt_error(Numeric num1, Numeric num2,
+										   int rscale, bool round,
+										   bool *have_error);
+extern Numeric numeric_div_fast_opt_error(Numeric num1, Numeric num2,
+										  int rscale, bool round,
+										  bool *have_error);
+extern Numeric numeric_div_new_opt_error(Numeric num1, Numeric num2,
+										 int rscale, bool round, bool exact,
+										 bool *have_error);
+
+
+#ifdef HAVE_INT128
+/*
+ * div_var_int64() -
+ *
+ *	Divide a numeric variable by a 64-bit integer with the specified weight.
+ *	The quotient var / (ival * NBASE^ival_weight) is stored in result.
+ *
+ *	This duplicates the logic in div_var_int(), so any changes made there
+ *	should be made here too.
+ */
+static void
+div_var_int64(const NumericVar *var, int64 ival, int ival_weight,
+			  NumericVar *result, int rscale, bool round)
+{
+	NumericDigit *var_digits = var->digits;
+	int			var_ndigits = var->ndigits;
+	int			res_sign;
+	int			res_weight;
+	int			res_ndigits;
+	NumericDigit *res_buf;
+	NumericDigit *res_digits;
+	uint64		divisor;
+	int			i;
+
+	/* Guard against division by zero */
+	if (ival == 0)
+		ereport(ERROR,
+				errcode(ERRCODE_DIVISION_BY_ZERO),
+				errmsg("division by zero"));
+
+	/* Result zero check */
+	if (var_ndigits == 0)
+	{
+		zero_var(result);
+		result->dscale = rscale;
+		return;
+	}
+
+	/*
+	 * Determine the result sign, weight and number of digits to calculate.
+	 * The weight figured here is correct if the emitted quotient has no
+	 * leading zero digits; otherwise strip_var() will fix things up.
+	 */
+	if (var->sign == NUMERIC_POS)
+		res_sign = ival > 0 ? NUMERIC_POS : NUMERIC_NEG;
+	else
+		res_sign = ival > 0 ? NUMERIC_NEG : NUMERIC_POS;
+	res_weight = var->weight - ival_weight;
+	/* The number of accurate result digits we need to produce: */
+	res_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
+	/* ... but always at least 1 */
+	res_ndigits = Max(res_ndigits, 1);
+	/* If rounding needed, figure one more digit to ensure correct result */
+	if (round)
+		res_ndigits++;
+
+	res_buf = digitbuf_alloc(res_ndigits + 1);
+	res_buf[0] = 0;				/* spare digit for later rounding */
+	res_digits = res_buf + 1;
+
+	/*
+	 * Now compute the quotient digits.  This is the short division algorithm
+	 * described in Knuth volume 2, section 4.3.1 exercise 16, except that we
+	 * allow the divisor to exceed the internal base.
+	 *
+	 * In this algorithm, the carry from one digit to the next is at most
+	 * divisor - 1.  Therefore, while processing the next digit, carry may
+	 * become as large as divisor * NBASE - 1, and so it requires a 128-bit
+	 * integer if this exceeds PG_UINT64_MAX.
+	 */
+	divisor = i64abs(ival);
+
+	if (divisor <= PG_UINT64_MAX / NBASE)
+	{
+		/* carry cannot overflow 64 bits */
+		uint64		carry = 0;
+
+		for (i = 0; i < res_ndigits; i++)
+		{
+			carry = carry * NBASE + (i < var_ndigits ? var_digits[i] : 0);
+			res_digits[i] = (NumericDigit) (carry / divisor);
+			carry = carry % divisor;
+		}
+	}
+	else
+	{
+		/* carry may exceed 64 bits */
+		uint128		carry = 0;
+
+		for (i = 0; i < res_ndigits; i++)
+		{
+			carry = carry * NBASE + (i < var_ndigits ? var_digits[i] : 0);
+			res_digits[i] = (NumericDigit) (carry / divisor);
+			carry = carry % divisor;
+		}
+	}
+
+	/* Store the quotient in result */
+	digitbuf_free(result->buf);
+	result->ndigits = res_ndigits;
+	result->buf = res_buf;
+	result->digits = res_digits;
+	result->weight = res_weight;
+	result->sign = res_sign;
+
+	/* Round or truncate to target rscale (and set result->dscale) */
+	if (round)
+		round_var(result, rscale);
+	else
+		trunc_var(result, rscale);
+
+	/* Strip leading/trailing zeroes */
+	strip_var(result);
+}
+#endif
+
+
+/*
+ * div_var_knuth() -
+ *
+ *	Division on variable level. Quotient of var1 / var2 is stored in result.
+ *	The quotient is figured to exactly rscale fractional digits.
+ *	If round is true, it is rounded at the rscale'th digit; if false, it
+ *	is truncated (towards zero) at that digit.
+ */
+static void
+div_var_knuth(const NumericVar *var1, const NumericVar *var2,
+			  NumericVar *result, int rscale, bool round)
+{
+	int			div_ndigits;
+	int			res_ndigits;
+	int			res_sign;
+	int			res_weight;
+	int			carry;
+	int			borrow;
+	int			divisor1;
+	int			divisor2;
+	NumericDigit *dividend;
+	NumericDigit *divisor;
+	NumericDigit *res_digits;
+	int			i;
+	int			j;
+
+	/* copy these values into local vars for speed in inner loop */
+	int			var1ndigits = var1->ndigits;
+	int			var2ndigits = var2->ndigits;
+
+	/*
+	 * First of all division by zero check; we must not be handed an
+	 * unnormalized divisor.
+	 */
+	if (var2ndigits == 0 || var2->digits[0] == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DIVISION_BY_ZERO),
+				 errmsg("division by zero")));
+
+	/*
+	 * If the divisor has just one or two digits, delegate to div_var_int(),
+	 * which uses fast short division.
+	 *
+	 * Similarly, on platforms with 128-bit integer support, delegate to
+	 * div_var_int64() for divisors with three or four digits.
+	 */
+	if (var2ndigits <= 2)
+	{
+		int			idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		if (var2ndigits == 2)
+		{
+			idivisor = idivisor * NBASE + var2->digits[1];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#ifdef HAVE_INT128
+	if (var2ndigits <= 4)
+	{
+		int64		idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		for (i = 1; i < var2ndigits; i++)
+		{
+			idivisor = idivisor * NBASE + var2->digits[i];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#endif
+
+	/*
+	 * Otherwise, perform full long division.
+	 */
+
+	/* Result zero check */
+	if (var1ndigits == 0)
+	{
+		zero_var(result);
+		result->dscale = rscale;
+		return;
+	}
+
+	/*
+	 * Determine the result sign, weight and number of digits to calculate.
+	 * The weight figured here is correct if the emitted quotient has no
+	 * leading zero digits; otherwise strip_var() will fix things up.
+	 */
+	if (var1->sign == var2->sign)
+		res_sign = NUMERIC_POS;
+	else
+		res_sign = NUMERIC_NEG;
+	res_weight = var1->weight - var2->weight;
+	/* The number of accurate result digits we need to produce: */
+	res_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
+	/* ... but always at least 1 */
+	res_ndigits = Max(res_ndigits, 1);
+	/* If rounding needed, figure one more digit to ensure correct result */
+	if (round)
+		res_ndigits++;
+
+	/*
+	 * The working dividend normally requires res_ndigits + var2ndigits
+	 * digits, but make it at least var1ndigits so we can load all of var1
+	 * into it.  (There will be an additional digit dividend[0] in the
+	 * dividend space, but for consistency with Knuth's notation we don't
+	 * count that in div_ndigits.)
+	 */
+	div_ndigits = res_ndigits + var2ndigits;
+	div_ndigits = Max(div_ndigits, var1ndigits);
+
+	/*
+	 * We need a workspace with room for the working dividend (div_ndigits+1
+	 * digits) plus room for the possibly-normalized divisor (var2ndigits
+	 * digits).  It is convenient also to have a zero at divisor[0] with the
+	 * actual divisor data in divisor[1 .. var2ndigits].  Transferring the
+	 * digits into the workspace also allows us to realloc the result (which
+	 * might be the same as either input var) before we begin the main loop.
+	 * Note that we use palloc0 to ensure that divisor[0], dividend[0], and
+	 * any additional dividend positions beyond var1ndigits, start out 0.
+	 */
+	dividend = (NumericDigit *)
+		palloc0((div_ndigits + var2ndigits + 2) * sizeof(NumericDigit));
+	divisor = dividend + (div_ndigits + 1);
+	memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
+	memcpy(divisor + 1, var2->digits, var2ndigits * sizeof(NumericDigit));
+
+	/*
+	 * Now we can realloc the result to hold the generated quotient digits.
+	 */
+	alloc_var(result, res_ndigits);
+	res_digits = result->digits;
+
+	/*
+	 * The full multiple-place algorithm is taken from Knuth volume 2,
+	 * Algorithm 4.3.1D.
+	 *
+	 * We need the first divisor digit to be >= NBASE/2.  If it isn't, make it
+	 * so by scaling up both the divisor and dividend by the factor "d".  (The
+	 * reason for allocating dividend[0] above is to leave room for possible
+	 * carry here.)
+	 */
+	if (divisor[1] < HALF_NBASE)
+	{
+		int			d = NBASE / (divisor[1] + 1);
+
+		carry = 0;
+		for (i = var2ndigits; i > 0; i--)
+		{
+			carry += divisor[i] * d;
+			divisor[i] = carry % NBASE;
+			carry = carry / NBASE;
+		}
+		Assert(carry == 0);
+		carry = 0;
+		/* at this point only var1ndigits of dividend can be nonzero */
+		for (i = var1ndigits; i >= 0; i--)
+		{
+			carry += dividend[i] * d;
+			dividend[i] = carry % NBASE;
+			carry = carry / NBASE;
+		}
+		Assert(carry == 0);
+		Assert(divisor[1] >= HALF_NBASE);
+	}
+	/* First 2 divisor digits are used repeatedly in main loop */
+	divisor1 = divisor[1];
+	divisor2 = divisor[2];
+
+	/*
+	 * Begin the main loop.  Each iteration of this loop produces the j'th
+	 * quotient digit by dividing dividend[j .. j + var2ndigits] by the
+	 * divisor; this is essentially the same as the common manual procedure
+	 * for long division.
+	 */
+	for (j = 0; j < res_ndigits; j++)
+	{
+		/* Estimate quotient digit from the first two dividend digits */
+		int			next2digits = dividend[j] * NBASE + dividend[j + 1];
+		int			qhat;
+
+		/*
+		 * If next2digits are 0, then quotient digit must be 0 and there's no
+		 * need to adjust the working dividend.  It's worth testing here to
+		 * fall out ASAP when processing trailing zeroes in a dividend.
+		 */
+		if (next2digits == 0)
+		{
+			res_digits[j] = 0;
+			continue;
+		}
+
+		if (dividend[j] == divisor1)
+			qhat = NBASE - 1;
+		else
+			qhat = next2digits / divisor1;
+
+		/*
+		 * Adjust quotient digit if it's too large.  Knuth proves that after
+		 * this step, the quotient digit will be either correct or just one
+		 * too large.  (Note: it's OK to use dividend[j+2] here because we
+		 * know the divisor length is at least 2.)
+		 */
+		while (divisor2 * qhat >
+			   (next2digits - qhat * divisor1) * NBASE + dividend[j + 2])
+			qhat--;
+
+		/* As above, need do nothing more when quotient digit is 0 */
+		if (qhat > 0)
+		{
+			NumericDigit *dividend_j = &dividend[j];
+
+			/*
+			 * Multiply the divisor by qhat, and subtract that from the
+			 * working dividend.  The multiplication and subtraction are
+			 * folded together here, noting that qhat <= NBASE (since it might
+			 * be one too large), and so the intermediate result "tmp_result"
+			 * is in the range [-NBASE^2, NBASE - 1], and "borrow" is in the
+			 * range [0, NBASE].
+			 */
+			borrow = 0;
+			for (i = var2ndigits; i >= 0; i--)
+			{
+				int			tmp_result;
+
+				tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
+				borrow = (NBASE - 1 - tmp_result) / NBASE;
+				dividend_j[i] = tmp_result + borrow * NBASE;
+			}
+
+			/*
+			 * If we got a borrow out of the top dividend digit, then indeed
+			 * qhat was one too large.  Fix it, and add back the divisor to
+			 * correct the working dividend.  (Knuth proves that this will
+			 * occur only about 3/NBASE of the time; hence, it's a good idea
+			 * to test this code with small NBASE to be sure this section gets
+			 * exercised.)
+			 */
+			if (borrow)
+			{
+				qhat--;
+				carry = 0;
+				for (i = var2ndigits; i >= 0; i--)
+				{
+					carry += dividend_j[i] + divisor[i];
+					if (carry >= NBASE)
+					{
+						dividend_j[i] = carry - NBASE;
+						carry = 1;
+					}
+					else
+					{
+						dividend_j[i] = carry;
+						carry = 0;
+					}
+				}
+				/* A carry should occur here to cancel the borrow above */
+				Assert(carry == 1);
+			}
+		}
+
+		/* And we're done with this quotient digit */
+		res_digits[j] = qhat;
+	}
+
+	pfree(dividend);
+
+	/*
+	 * Finally, round or truncate the result to the requested precision.
+	 */
+	result->weight = res_weight;
+	result->sign = res_sign;
+
+	/* Round or truncate to target rscale (and set result->dscale) */
+	if (round)
+		round_var(result, rscale);
+	else
+		trunc_var(result, rscale);
+
+	/* Strip leading and trailing zeroes */
+	strip_var(result);
+}
+
+
+/*
+ * div_var_fast() -
+ *
+ *	This has the same API as div_var, but is implemented using the division
+ *	algorithm from the "FM" library, rather than Knuth's schoolbook-division
+ *	approach.  This is significantly faster but can produce inaccurate
+ *	results, because it sometimes has to propagate rounding to the left,
+ *	and so we can never be entirely sure that we know the requested digits
+ *	exactly.  We compute DIV_GUARD_DIGITS extra digits, but there is
+ *	no certainty that that's enough.  We use this only in the transcendental
+ *	function calculation routines, where everything is approximate anyway.
+ *
+ *	Although we provide a "round" argument for consistency with div_var,
+ *	it is unwise to use this function with round=false.  In truncation mode
+ *	it is possible to get a result with no significant digits, for example
+ *	with rscale=0 we might compute 0.99999... and truncate that to 0 when
+ *	the correct answer is 1.
+ */
+static void
+div_var_fast(const NumericVar *var1, const NumericVar *var2,
+			 NumericVar *result, int rscale, bool round)
+{
+	int			div_ndigits;
+	int			load_ndigits;
+	int			res_sign;
+	int			res_weight;
+	int		   *div;
+	int			qdigit;
+	int			carry;
+	int			maxdiv;
+	int			newdig;
+	NumericDigit *res_digits;
+	double		fdividend,
+				fdivisor,
+				fdivisorinverse,
+				fquotient;
+	int			qi;
+	int			i;
+
+	/* copy these values into local vars for speed in inner loop */
+	int			var1ndigits = var1->ndigits;
+	int			var2ndigits = var2->ndigits;
+	NumericDigit *var1digits = var1->digits;
+	NumericDigit *var2digits = var2->digits;
+
+	/*
+	 * First of all division by zero check; we must not be handed an
+	 * unnormalized divisor.
+	 */
+	if (var2ndigits == 0 || var2digits[0] == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DIVISION_BY_ZERO),
+				 errmsg("division by zero")));
+
+	/*
+	 * If the divisor has just one or two digits, delegate to div_var_int(),
+	 * which uses fast short division.
+	 *
+	 * Similarly, on platforms with 128-bit integer support, delegate to
+	 * div_var_int64() for divisors with three or four digits.
+	 */
+	if (var2ndigits <= 2)
+	{
+		int			idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		if (var2ndigits == 2)
+		{
+			idivisor = idivisor * NBASE + var2->digits[1];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#ifdef HAVE_INT128
+	if (var2ndigits <= 4)
+	{
+		int64		idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		for (i = 1; i < var2ndigits; i++)
+		{
+			idivisor = idivisor * NBASE + var2->digits[i];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#endif
+
+	/*
+	 * Otherwise, perform full long division.
+	 */
+
+	/* Result zero check */
+	if (var1ndigits == 0)
+	{
+		zero_var(result);
+		result->dscale = rscale;
+		return;
+	}
+
+	/*
+	 * Determine the result sign, weight and number of digits to calculate
+	 */
+	if (var1->sign == var2->sign)
+		res_sign = NUMERIC_POS;
+	else
+		res_sign = NUMERIC_NEG;
+	res_weight = var1->weight - var2->weight + 1;
+	/* The number of accurate result digits we need to produce: */
+	div_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
+	/* Add guard digits for roundoff error */
+	div_ndigits += DIV_GUARD_DIGITS;
+	if (div_ndigits < DIV_GUARD_DIGITS)
+		div_ndigits = DIV_GUARD_DIGITS;
+
+	/*
+	 * We do the arithmetic in an array "div[]" of signed int's.  Since
+	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+	 * to avoid normalizing carries immediately.
+	 *
+	 * We start with div[] containing one zero digit followed by the
+	 * dividend's digits (plus appended zeroes to reach the desired precision
+	 * including guard digits).  Each step of the main loop computes an
+	 * (approximate) quotient digit and stores it into div[], removing one
+	 * position of dividend space.  A final pass of carry propagation takes
+	 * care of any mistaken quotient digits.
+	 *
+	 * Note that div[] doesn't necessarily contain all of the digits from the
+	 * dividend --- the desired precision plus guard digits might be less than
+	 * the dividend's precision.  This happens, for example, in the square
+	 * root algorithm, where we typically divide a 2N-digit number by an
+	 * N-digit number, and only require a result with N digits of precision.
+	 */
+	div = (int *) palloc0((div_ndigits + 1) * sizeof(int));
+	load_ndigits = Min(div_ndigits, var1ndigits);
+	for (i = 0; i < load_ndigits; i++)
+		div[i + 1] = var1digits[i];
+
+	/*
+	 * We estimate each quotient digit using floating-point arithmetic, taking
+	 * the first four digits of the (current) dividend and divisor.  This must
+	 * be float to avoid overflow.  The quotient digits will generally be off
+	 * by no more than one from the exact answer.
+	 */
+	fdivisor = (double) var2digits[0];
+	for (i = 1; i < 4; i++)
+	{
+		fdivisor *= NBASE;
+		if (i < var2ndigits)
+			fdivisor += (double) var2digits[i];
+	}
+	fdivisorinverse = 1.0 / fdivisor;
+
+	/*
+	 * maxdiv tracks the maximum possible absolute value of any div[] entry;
+	 * when this threatens to exceed INT_MAX, we take the time to propagate
+	 * carries.  Furthermore, we need to ensure that overflow doesn't occur
+	 * during the carry propagation passes either.  The carry values may have
+	 * an absolute value as high as INT_MAX/NBASE + 1, so really we must
+	 * normalize when digits threaten to exceed INT_MAX - INT_MAX/NBASE - 1.
+	 *
+	 * To avoid overflow in maxdiv itself, it represents the max absolute
+	 * value divided by NBASE-1, ie, at the top of the loop it is known that
+	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+	 *
+	 * Actually, though, that holds good only for div[] entries after div[qi];
+	 * the adjustment done at the bottom of the loop may cause div[qi + 1] to
+	 * exceed the maxdiv limit, so that div[qi] in the next iteration is
+	 * beyond the limit.  This does not cause problems, as explained below.
+	 */
+	maxdiv = 1;
+
+	/*
+	 * Outer loop computes next quotient digit, which will go into div[qi]
+	 */
+	for (qi = 0; qi < div_ndigits; qi++)
+	{
+		/* Approximate the current dividend value */
+		fdividend = (double) div[qi];
+		for (i = 1; i < 4; i++)
+		{
+			fdividend *= NBASE;
+			if (qi + i <= div_ndigits)
+				fdividend += (double) div[qi + i];
+		}
+		/* Compute the (approximate) quotient digit */
+		fquotient = fdividend * fdivisorinverse;
+		qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
+			(((int) fquotient) - 1);	/* truncate towards -infinity */
+
+		if (qdigit != 0)
+		{
+			/* Do we need to normalize now? */
+			maxdiv += abs(qdigit);
+			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
+			{
+				/*
+				 * Yes, do it.  Note that if var2ndigits is much smaller than
+				 * div_ndigits, we can save a significant amount of effort
+				 * here by noting that we only need to normalise those div[]
+				 * entries touched where prior iterations subtracted multiples
+				 * of the divisor.
+				 */
+				carry = 0;
+				for (i = Min(qi + var2ndigits - 2, div_ndigits); i > qi; i--)
+				{
+					newdig = div[i] + carry;
+					if (newdig < 0)
+					{
+						carry = -((-newdig - 1) / NBASE) - 1;
+						newdig -= carry * NBASE;
+					}
+					else if (newdig >= NBASE)
+					{
+						carry = newdig / NBASE;
+						newdig -= carry * NBASE;
+					}
+					else
+						carry = 0;
+					div[i] = newdig;
+				}
+				newdig = div[qi] + carry;
+				div[qi] = newdig;
+
+				/*
+				 * All the div[] digits except possibly div[qi] are now in the
+				 * range 0..NBASE-1.  We do not need to consider div[qi] in
+				 * the maxdiv value anymore, so we can reset maxdiv to 1.
+				 */
+				maxdiv = 1;
+
+				/*
+				 * Recompute the quotient digit since new info may have
+				 * propagated into the top four dividend digits
+				 */
+				fdividend = (double) div[qi];
+				for (i = 1; i < 4; i++)
+				{
+					fdividend *= NBASE;
+					if (qi + i <= div_ndigits)
+						fdividend += (double) div[qi + i];
+				}
+				/* Compute the (approximate) quotient digit */
+				fquotient = fdividend * fdivisorinverse;
+				qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
+					(((int) fquotient) - 1);	/* truncate towards -infinity */
+				maxdiv += abs(qdigit);
+			}
+
+			/*
+			 * Subtract off the appropriate multiple of the divisor.
+			 *
+			 * The digits beyond div[qi] cannot overflow, because we know they
+			 * will fall within the maxdiv limit.  As for div[qi] itself, note
+			 * that qdigit is approximately trunc(div[qi] / vardigits[0]),
+			 * which would make the new value simply div[qi] mod vardigits[0].
+			 * The lower-order terms in qdigit can change this result by not
+			 * more than about twice INT_MAX/NBASE, so overflow is impossible.
+			 *
+			 * This inner loop is the performance bottleneck for division, so
+			 * code it in the same way as the inner loop of mul_var() so that
+			 * it can be auto-vectorized.  We cast qdigit to NumericDigit
+			 * before multiplying to allow the compiler to generate more
+			 * efficient code (using 16-bit multiplication), which is safe
+			 * since we know that the quotient digit is off by at most one, so
+			 * there is no overflow risk.
+			 */
+			if (qdigit != 0)
+			{
+				int			istop = Min(var2ndigits, div_ndigits - qi + 1);
+				int		   *div_qi = &div[qi];
+
+				for (i = 0; i < istop; i++)
+					div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i];
+			}
+		}
+
+		/*
+		 * The dividend digit we are about to replace might still be nonzero.
+		 * Fold it into the next digit position.
+		 *
+		 * There is no risk of overflow here, although proving that requires
+		 * some care.  Much as with the argument for div[qi] not overflowing,
+		 * if we consider the first two terms in the numerator and denominator
+		 * of qdigit, we can see that the final value of div[qi + 1] will be
+		 * approximately a remainder mod (vardigits[0]*NBASE + vardigits[1]).
+		 * Accounting for the lower-order terms is a bit complicated but ends
+		 * up adding not much more than INT_MAX/NBASE to the possible range.
+		 * Thus, div[qi + 1] cannot overflow here, and in its role as div[qi]
+		 * in the next loop iteration, it can't be large enough to cause
+		 * overflow in the carry propagation step (if any), either.
+		 *
+		 * But having said that: div[qi] can be more than INT_MAX/NBASE, as
+		 * noted above, which means that the product div[qi] * NBASE *can*
+		 * overflow.  When that happens, adding it to div[qi + 1] will always
+		 * cause a canceling overflow so that the end result is correct.  We
+		 * could avoid the intermediate overflow by doing the multiplication
+		 * and addition in int64 arithmetic, but so far there appears no need.
+		 */
+		div[qi + 1] += div[qi] * NBASE;
+
+		div[qi] = qdigit;
+	}
+
+	/*
+	 * Approximate and store the last quotient digit (div[div_ndigits])
+	 */
+	fdividend = (double) div[qi];
+	for (i = 1; i < 4; i++)
+		fdividend *= NBASE;
+	fquotient = fdividend * fdivisorinverse;
+	qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
+		(((int) fquotient) - 1);	/* truncate towards -infinity */
+	div[qi] = qdigit;
+
+	/*
+	 * Because the quotient digits might be off by one, some of them might be
+	 * -1 or NBASE at this point.  The represented value is correct in a
+	 * mathematical sense, but it doesn't look right.  We do a final carry
+	 * propagation pass to normalize the digits, which we combine with storing
+	 * the result digits into the output.  Note that this is still done at
+	 * full precision w/guard digits.
+	 */
+	alloc_var(result, div_ndigits + 1);
+	res_digits = result->digits;
+	carry = 0;
+	for (i = div_ndigits; i >= 0; i--)
+	{
+		newdig = div[i] + carry;
+		if (newdig < 0)
+		{
+			carry = -((-newdig - 1) / NBASE) - 1;
+			newdig -= carry * NBASE;
+		}
+		else if (newdig >= NBASE)
+		{
+			carry = newdig / NBASE;
+			newdig -= carry * NBASE;
+		}
+		else
+			carry = 0;
+		res_digits[i] = newdig;
+	}
+	Assert(carry == 0);
+
+	pfree(div);
+
+	/*
+	 * Finally, round the result to the requested precision.
+	 */
+	result->weight = res_weight;
+	result->sign = res_sign;
+
+	/* Round to target rscale (and set result->dscale) */
+	if (round)
+		round_var(result, rscale);
+	else
+		trunc_var(result, rscale);
+
+	/* Strip leading and trailing zeroes */
+	strip_var(result);
+}
+
+
+Numeric
+numeric_div_knuth_opt_error(Numeric num1, Numeric num2, int rscale,
+							bool round, bool *have_error)
+{
+	NumericVar	arg1;
+	NumericVar	arg2;
+	NumericVar	result;
+	Numeric		res;
+
+	if (have_error)
+		*have_error = false;
+
+	/*
+	 * Handle NaN and infinities
+	 */
+	if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+	{
+		if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+			return make_result(&const_nan);
+		if (NUMERIC_IS_PINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_pinf);
+				case -1:
+					return make_result(&const_ninf);
+			}
+			Assert(false);
+		}
+		if (NUMERIC_IS_NINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* -Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_ninf);
+				case -1:
+					return make_result(&const_pinf);
+			}
+			Assert(false);
+		}
+		/* by here, num1 must be finite, so num2 is not */
+
+		/*
+		 * POSIX would have us return zero or minus zero if num1 is zero, and
+		 * otherwise throw an underflow error.  But the numeric type doesn't
+		 * really do underflow, so let's just return zero.
+		 */
+		return make_result(&const_zero);
+	}
+
+	/*
+	 * Unpack the arguments
+	 */
+	init_var_from_num(num1, &arg1);
+	init_var_from_num(num2, &arg2);
+
+	init_var(&result);
+
+	/*
+	 * If "have_error" is provided, check for division by zero here
+	 */
+	if (have_error && (arg2.ndigits == 0 || arg2.digits[0] == 0))
+	{
+		*have_error = true;
+		return NULL;
+	}
+
+	/*
+	 * Do the divide and return the result
+	 */
+	div_var_knuth(&arg1, &arg2, &result, rscale, round);
+
+	res = make_result_opt_error(&result, have_error);
+
+	free_var(&result);
+
+	return res;
+}
+
+
+Numeric
+numeric_div_fast_opt_error(Numeric num1, Numeric num2, int rscale,
+						   bool round, bool *have_error)
+{
+	NumericVar	arg1;
+	NumericVar	arg2;
+	NumericVar	result;
+	Numeric		res;
+
+	if (have_error)
+		*have_error = false;
+
+	/*
+	 * Handle NaN and infinities
+	 */
+	if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+	{
+		if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+			return make_result(&const_nan);
+		if (NUMERIC_IS_PINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_pinf);
+				case -1:
+					return make_result(&const_ninf);
+			}
+			Assert(false);
+		}
+		if (NUMERIC_IS_NINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* -Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_ninf);
+				case -1:
+					return make_result(&const_pinf);
+			}
+			Assert(false);
+		}
+		/* by here, num1 must be finite, so num2 is not */
+
+		/*
+		 * POSIX would have us return zero or minus zero if num1 is zero, and
+		 * otherwise throw an underflow error.  But the numeric type doesn't
+		 * really do underflow, so let's just return zero.
+		 */
+		return make_result(&const_zero);
+	}
+
+	/*
+	 * Unpack the arguments
+	 */
+	init_var_from_num(num1, &arg1);
+	init_var_from_num(num2, &arg2);
+
+	init_var(&result);
+
+	/*
+	 * If "have_error" is provided, check for division by zero here
+	 */
+	if (have_error && (arg2.ndigits == 0 || arg2.digits[0] == 0))
+	{
+		*have_error = true;
+		return NULL;
+	}
+
+	/*
+	 * Do the divide and return the result
+	 */
+	div_var_fast(&arg1, &arg2, &result, rscale, round);
+
+	res = make_result_opt_error(&result, have_error);
+
+	free_var(&result);
+
+	return res;
+}
+
+
+Numeric
+numeric_div_new_opt_error(Numeric num1, Numeric num2, int rscale,
+						  bool round, bool exact, bool *have_error)
+{
+	NumericVar	arg1;
+	NumericVar	arg2;
+	NumericVar	result;
+	Numeric		res;
+
+	if (have_error)
+		*have_error = false;
+
+	/*
+	 * Handle NaN and infinities
+	 */
+	if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+	{
+		if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+			return make_result(&const_nan);
+		if (NUMERIC_IS_PINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_pinf);
+				case -1:
+					return make_result(&const_ninf);
+			}
+			Assert(false);
+		}
+		if (NUMERIC_IS_NINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* -Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_ninf);
+				case -1:
+					return make_result(&const_pinf);
+			}
+			Assert(false);
+		}
+		/* by here, num1 must be finite, so num2 is not */
+
+		/*
+		 * POSIX would have us return zero or minus zero if num1 is zero, and
+		 * otherwise throw an underflow error.  But the numeric type doesn't
+		 * really do underflow, so let's just return zero.
+		 */
+		return make_result(&const_zero);
+	}
+
+	/*
+	 * Unpack the arguments
+	 */
+	init_var_from_num(num1, &arg1);
+	init_var_from_num(num2, &arg2);
+
+	init_var(&result);
+
+	/*
+	 * If "have_error" is provided, check for division by zero here
+	 */
+	if (have_error && (arg2.ndigits == 0 || arg2.digits[0] == 0))
+	{
+		*have_error = true;
+		return NULL;
+	}
+
+	/*
+	 * Do the divide and return the result
+	 */
+	div_var(&arg1, &arg2, &result, rscale, round, exact);
+
+	res = make_result_opt_error(&result, have_error);
+
+	free_var(&result);
+
+	return res;
+}
+
+
+/*
+ * numeric_div_knuth() -
+ *
+ *	Divide one numeric into another using the old Knuth algorithm
+ */
+Datum
+numeric_div_knuth(PG_FUNCTION_ARGS)
+{
+	Numeric		num1 = PG_GETARG_NUMERIC(0);
+	Numeric		num2 = PG_GETARG_NUMERIC(1);
+	int			rscale = PG_GETARG_INT32(2);
+	bool		round = PG_GETARG_BOOL(3);
+	Numeric		res;
+
+	res = numeric_div_knuth_opt_error(num1, num2, rscale, round, NULL);
+
+	PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_div_fast() -
+ *
+ *	Divide one numeric into another using the old "fast" algorithm
+ */
+Datum
+numeric_div_fast(PG_FUNCTION_ARGS)
+{
+	Numeric		num1 = PG_GETARG_NUMERIC(0);
+	Numeric		num2 = PG_GETARG_NUMERIC(1);
+	int			rscale = PG_GETARG_INT32(2);
+	bool		round = PG_GETARG_BOOL(3);
+	Numeric		res;
+
+	res = numeric_div_fast_opt_error(num1, num2, rscale, round, NULL);
+
+	PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_div_new() -
+ *
+ *	Divide one numeric into another using the new algorithm
+ */
+Datum
+numeric_div_new(PG_FUNCTION_ARGS)
+{
+	Numeric		num1 = PG_GETARG_NUMERIC(0);
+	Numeric		num2 = PG_GETARG_NUMERIC(1);
+	int			rscale = PG_GETARG_INT32(2);
+	bool		round = PG_GETARG_BOOL(3);
+	bool		exact = PG_GETARG_BOOL(4);
+	Numeric		res;
+
+	res = numeric_div_new_opt_error(num1, num2, rscale, round, exact, NULL);
+
+	PG_RETURN_NUMERIC(res);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4abc6d9526..b4c09f5f25 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4501,6 +4501,18 @@
 { oid => '1727',
   proname => 'numeric_div', prorettype => 'numeric',
   proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
+{ oid => '9090', descr => 'test numeric division using old algorithm by Knuth',
+  proname => 'div_knuth', prorettype => 'numeric',
+  proargnames => '{num1,num2,rscale,round}',
+  proargtypes => 'numeric numeric int4 bool', prosrc => 'numeric_div_knuth' },
+{ oid => '9091', descr => 'test numeric division using old div_var_fast algorithm',
+  proname => 'div_fast', prorettype => 'numeric',
+  proargnames => '{num1,num2,rscale,round}',
+  proargtypes => 'numeric numeric int4 bool', prosrc => 'numeric_div_fast' },
+{ oid => '9092', descr => 'test numeric division using new algorithm',
+  proname => 'div_new', prorettype => 'numeric',
+  proargnames => '{num1,num2,rscale,round,exact}',
+  proargtypes => 'numeric numeric int4 bool bool', prosrc => 'numeric_div_new' },
 { oid => '1728', descr => 'modulus',
   proname => 'mod', prorettype => 'numeric', proargtypes => 'numeric numeric',
   prosrc => 'numeric_mod' },
-- 
2.43.0

perf_test.sqlapplication/sql; name=perf_test.sqlDownload

perf_test.outapplication/octet-stream; name=perf_test.outDownload

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Dean Rasheed (#1)

3 attachment(s)

Re: Optimising numeric division

On Fri, Aug 23, 2024, at 15:49, Dean Rasheed wrote:

Currently numeric.c has 2 separate functions that implement numeric
division, div_var() and div_var_fast(). Since div_var_fast() is only
approximate, it is only used in the transcendental functions, where a
slightly inaccurate result is OK. In all other cases, div_var() is
used.

...

The attached patch attempts to resolve those issues by replacing
div_var() and div_var_fast() with a single function intended to be
faster than both the originals.

...

In addition, like mul_var(), div_var() now does the computation in
base NBASE^2, using 64-bit integers for the working dividend array,
which halves the number of iterations in the outer loop and reduces
the frequency of carry-propagation passes.

...

In the testing I've done so far, this is up to around 20 times faster
than the old version of div_var() when "exact" is true (it computes
each of the queries above in around 1.6 seconds), and it's up to
around 2-3 times faster than div_var_fast() when "exact" is false.

In addition, I found that it's significantly faster than
div_var_int64() for 3 and 4 digit divisors, so I've removed that as
well.

...

Overall, this reduces numeric.c by around 300 lines, and means that
we'll only have one function to maintain.

Impressive simplifications and optimizations.
Very happy to see reuse of the NBASE^2 trick.

Patch 0001 is the patch itself. 0002 is just for testing purposes --
it exposes both old functions and the new one as SQL-callable
functions with all their arguments, so they can be compared.

I'm also attaching the performance test script I used, and the output
it produced.

I've had an initial look at the code and it looks straight-forward,
thanks to most of the complicated parts of the changed code is just
change of NBASE to NBASE_SQR.

I think the comments are of high quality.

I've run perf_test.sql on my three machines, without any errors.
Output files attached.

Regards,
Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#2)

Re: Optimising numeric division

On Fri, Aug 23, 2024, at 21:21, Joel Jacobson wrote:

On Fri, Aug 23, 2024, at 15:49, Dean Rasheed wrote:

The attached patch attempts to resolve those issues by replacing
div_var() and div_var_fast() with a single function intended to be
faster than both the originals.

...

I've had an initial look at the code and it looks straight-forward,
thanks to most of the complicated parts of the changed code is just
change of NBASE to NBASE_SQR.

This part seems to be entirely new:

```
if (remainder[0] < 0)
{
/* remainder < 0; quotient is too large */
...
}
else
{
/* remainder >= 0 */
...
}
```

Maybe add some additional comments here? Or maybe it's fine already as is,
not sure what I think. Nothing specific here that is extra complicated,
but there are four nested branches, so quite a lot in total to keep track of.

Regards,
Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#2)

Re: Optimising numeric division

On Fri, Aug 23, 2024, at 21:21, Joel Jacobson wrote:

Attachments:
* perf_test-M3 Max.out
* perf_test-Intel Core i9-14900K.out
* perf_test-AMD Ryzen 9 7950X3D.out

Here are some additional benchmarks from pg-catbench:

AMD Ryzen 9 7950X3D:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' order by 1,2;
var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+----------+----------+--------
1 | 1 | 43 ns | 39 ns | 11 ns | -3.3 ns | -8 | 0
1 | 2 | 47 ns | 43 ns | 11 ns | -3.2 ns | -7 | 0
1 | 3 | 55 ns | 89 ns | 18 ns | 33 ns | 60 | 2
1 | 4 | 76 ns | 93 ns | 20 ns | 16 ns | 22 | 1
1 | 8 | 190 ns | 98 ns | 36 ns | -94 ns | -49 | 3
1 | 16 | 280 ns | 120 ns | 52 ns | -160 ns | -57 | 3
1 | 32 | 490 ns | 140 ns | 98 ns | -350 ns | -72 | 4
1 | 64 | 780 ns | 190 ns | 110 ns | -590 ns | -75 | 5
1 | 128 | 1.4 µs | 330 ns | 85 ns | -1.1 µs | -77 | 13
1 | 256 | 590 ns | 210 ns | 120 ns | -380 ns | -65 | 3
1 | 512 | 1.6 µs | 430 ns | 460 ns | -1.2 µs | -74 | 3
1 | 1024 | 2.8 µs | 820 ns | 1 µs | -2 µs | -71 | 2
1 | 2048 | 6.6 µs | 1.4 µs | 1.9 µs | -5.2 µs | -78 | 3
1 | 4096 | 11 µs | 2.8 µs | 2.5 µs | -8.5 µs | -76 | 3
1 | 8192 | 25 µs | 5.6 µs | 7.4 µs | -20 µs | -78 | 3
1 | 16384 | 53 µs | 15 µs | 15 µs | -37 µs | -71 | 3
2 | 2 | 49 ns | 49 ns | 12 ns | 2e-10 s | 0 | 0
2 | 3 | 54 ns | 93 ns | 16 ns | 39 ns | 73 | 2
2 | 4 | 86 ns | 97 ns | 19 ns | 10 ns | 12 | 1
2 | 8 | 200 ns | 89 ns | 36 ns | -110 ns | -56 | 3
2 | 16 | 340 ns | 120 ns | 63 ns | -210 ns | -63 | 3
2 | 32 | 460 ns | 130 ns | 84 ns | -320 ns | -71 | 4
2 | 64 | 770 ns | 210 ns | 68 ns | -560 ns | -73 | 8
2 | 128 | 1.5 µs | 330 ns | 290 ns | -1.2 µs | -78 | 4
2 | 256 | 1.2 µs | 380 ns | 160 ns | -870 ns | -70 | 5
2 | 512 | 1.9 µs | 520 ns | 510 ns | -1.4 µs | -73 | 3
2 | 1024 | 3.9 µs | 1 µs | 1.2 µs | -2.9 µs | -74 | 2
2 | 2048 | 7.8 µs | 2.1 µs | 2.5 µs | -5.7 µs | -74 | 2
2 | 4096 | 17 µs | 5.3 µs | 2 µs | -12 µs | -69 | 6
2 | 8192 | 30 µs | 7.7 µs | 8.5 µs | -22 µs | -74 | 3
2 | 16384 | 35 µs | 7.3 µs | 7.2 µs | -27 µs | -79 | 4
3 | 3 | 52 ns | 88 ns | 16 ns | 36 ns | 69 | 2
3 | 4 | 78 ns | 97 ns | 24 ns | 19 ns | 24 | 1
3 | 8 | 210 ns | 94 ns | 38 ns | -120 ns | -56 | 3
3 | 16 | 300 ns | 130 ns | 53 ns | -170 ns | -57 | 3
3 | 32 | 510 ns | 140 ns | 95 ns | -370 ns | -72 | 4
3 | 64 | 800 ns | 230 ns | 100 ns | -570 ns | -72 | 6
3 | 128 | 1.4 µs | 290 ns | 210 ns | -1.1 µs | -79 | 5
3 | 256 | 900 ns | 270 ns | 310 ns | -630 ns | -70 | 2
3 | 512 | 1.4 µs | 470 ns | 550 ns | -940 ns | -67 | 2
3 | 1024 | 2.2 µs | 510 ns | 460 ns | -1.7 µs | -77 | 4
3 | 2048 | 5.5 µs | 1.6 µs | 2.1 µs | -4 µs | -72 | 2
3 | 4096 | 13 µs | 3.8 µs | 3.7 µs | -9.3 µs | -71 | 3
3 | 8192 | 29 µs | 7.6 µs | 8.3 µs | -22 µs | -74 | 3
3 | 16384 | 43 µs | 11 µs | 17 µs | -32 µs | -74 | 2
4 | 4 | 85 ns | 92 ns | 20 ns | 6.6 ns | 8 | 0
4 | 8 | 200 ns | 89 ns | 37 ns | -110 ns | -56 | 3
4 | 16 | 320 ns | 120 ns | 61 ns | -200 ns | -61 | 3
4 | 32 | 470 ns | 140 ns | 88 ns | -320 ns | -69 | 4
4 | 64 | 800 ns | 220 ns | 120 ns | -580 ns | -72 | 5
4 | 128 | 1.5 µs | 330 ns | 250 ns | -1.2 µs | -78 | 5
4 | 256 | 890 ns | 340 ns | 310 ns | -550 ns | -61 | 2
4 | 512 | 1.2 µs | 460 ns | 330 ns | -730 ns | -62 | 2
4 | 1024 | 2.5 µs | 790 ns | 860 ns | -1.7 µs | -69 | 2
4 | 2048 | 3.5 µs | 950 ns | 400 ns | -2.6 µs | -73 | 7
4 | 4096 | 13 µs | 4 µs | 3.8 µs | -9 µs | -69 | 2
4 | 8192 | 30 µs | 7.7 µs | 8.2 µs | -22 µs | -74 | 3
4 | 16384 | 61 µs | 20 µs | 7.1 µs | -42 µs | -68 | 6
8 | 8 | 200 ns | 94 ns | 38 ns | -110 ns | -53 | 3
8 | 16 | 330 ns | 110 ns | 63 ns | -210 ns | -66 | 3
8 | 32 | 480 ns | 140 ns | 86 ns | -340 ns | -71 | 4
8 | 64 | 800 ns | 210 ns | 49 ns | -590 ns | -74 | 12
8 | 128 | 1.6 µs | 320 ns | 190 ns | -1.2 µs | -79 | 7
8 | 256 | 1.6 µs | 460 ns | 290 ns | -1.1 µs | -71 | 4
8 | 512 | 1.9 µs | 570 ns | 550 ns | -1.3 µs | -70 | 2
8 | 1024 | 4 µs | 1 µs | 1.1 µs | -3 µs | -75 | 3
8 | 2048 | 6.6 µs | 1.4 µs | 2.4 µs | -5.2 µs | -78 | 2
8 | 4096 | 19 µs | 5.2 µs | 870 ns | -14 µs | -73 | 16
8 | 8192 | 22 µs | 5.8 µs | 8.2 µs | -16 µs | -73 | 2
8 | 16384 | 50 µs | 11 µs | 14 µs | -39 µs | -78 | 3
16 | 16 | 310 ns | 130 ns | 57 ns | -180 ns | -59 | 3
16 | 32 | 460 ns | 160 ns | 63 ns | -310 ns | -66 | 5
16 | 64 | 820 ns | 210 ns | 130 ns | -610 ns | -74 | 5
16 | 128 | 1.4 µs | 310 ns | 180 ns | -1.1 µs | -78 | 6
16 | 256 | 2.8 µs | 510 ns | 150 ns | -2.3 µs | -82 | 15
16 | 512 | 1.2 µs | 320 ns | 340 ns | -840 ns | -72 | 2
16 | 1024 | 4.6 µs | 1.2 µs | 980 ns | -3.4 µs | -73 | 3
16 | 2048 | 7.7 µs | 1.9 µs | 2.4 µs | -5.8 µs | -75 | 2
16 | 4096 | 15 µs | 4.1 µs | 4.3 µs | -11 µs | -72 | 3
16 | 8192 | 14 µs | 3.6 µs | 480 ns | -10 µs | -74 | 21
16 | 16384 | 28 µs | 6.9 µs | 430 ns | -21 µs | -75 | 48
32 | 32 | 570 ns | 150 ns | 120 ns | -420 ns | -74 | 4
32 | 64 | 860 ns | 220 ns | 120 ns | -640 ns | -74 | 5
32 | 128 | 1.4 µs | 350 ns | 160 ns | -1.1 µs | -76 | 7
32 | 256 | 2.9 µs | 530 ns | 420 ns | -2.4 µs | -82 | 6
32 | 512 | 2.3 µs | 750 ns | 410 ns | -1.6 µs | -67 | 4
32 | 1024 | 3.7 µs | 1 µs | 1 µs | -2.7 µs | -73 | 3
32 | 2048 | 5.4 µs | 1.6 µs | 2.2 µs | -3.8 µs | -70 | 2
32 | 4096 | 11 µs | 1.8 µs | 1.7 µs | -9.2 µs | -83 | 5
32 | 8192 | 25 µs | 6.1 µs | 7.5 µs | -19 µs | -76 | 3
32 | 16384 | 59 µs | 15 µs | 17 µs | -44 µs | -74 | 3
64 | 64 | 830 ns | 230 ns | 150 ns | -610 ns | -73 | 4
64 | 128 | 1.4 µs | 330 ns | 100 ns | -1.1 µs | -77 | 11
64 | 256 | 2.9 µs | 520 ns | 170 ns | -2.3 µs | -82 | 14
64 | 512 | 1.7 µs | 540 ns | 530 ns | -1.2 µs | -69 | 2
64 | 1024 | 2.7 µs | 770 ns | 580 ns | -2 µs | -72 | 3
64 | 2048 | 4.3 µs | 1 µs | 950 ns | -3.3 µs | -77 | 3
64 | 4096 | 17 µs | 3.8 µs | 2.5 µs | -13 µs | -77 | 5
64 | 8192 | 38 µs | 9.9 µs | 1.3 µs | -28 µs | -74 | 21
64 | 16384 | 66 µs | 15 µs | 10 µs | -51 µs | -77 | 5
128 | 128 | 1.6 µs | 380 ns | 180 ns | -1.2 µs | -76 | 7
128 | 256 | 2.6 µs | 530 ns | 100 ns | -2.1 µs | -80 | 20
128 | 512 | 1.2 µs | 380 ns | 300 ns | -840 ns | -69 | 3
128 | 1024 | 5.3 µs | 1.3 µs | 1 µs | -4 µs | -75 | 4
128 | 2048 | 5.5 µs | 980 ns | 960 ns | -4.5 µs | -82 | 5
128 | 4096 | 13 µs | 2.8 µs | 3.9 µs | -10 µs | -78 | 3
128 | 8192 | 18 µs | 5.9 µs | 5.2 µs | -12 µs | -68 | 2
128 | 16384 | 59 µs | 15 µs | 9.3 µs | -44 µs | -75 | 5
256 | 256 | 3.3 µs | 590 ns | 540 ns | -2.7 µs | -82 | 5
256 | 512 | 1.2 µs | 410 ns | 340 ns | -830 ns | -67 | 2
256 | 1024 | 2.9 µs | 810 ns | 1.1 µs | -2.1 µs | -72 | 2
256 | 2048 | 9.1 µs | 2.4 µs | 1.7 µs | -6.7 µs | -74 | 4
256 | 4096 | 13 µs | 3.8 µs | 3.7 µs | -9.4 µs | -71 | 2
256 | 8192 | 30 µs | 8 µs | 4.7 µs | -22 µs | -73 | 5
256 | 16384 | 43 µs | 11 µs | 17 µs | -32 µs | -74 | 2
512 | 512 | 6.6 µs | 1 µs | 790 ns | -5.6 µs | -84 | 7
512 | 1024 | 4.4 µs | 980 ns | 1.4 µs | -3.4 µs | -78 | 2
512 | 2048 | 9.6 µs | 2 µs | 2.2 µs | -7.6 µs | -79 | 4
512 | 4096 | 9.3 µs | 1.9 µs | 2.1 µs | -7.4 µs | -79 | 3
512 | 8192 | 27 µs | 7.9 µs | 7.8 µs | -19 µs | -70 | 2
512 | 16384 | 60 µs | 15 µs | 17 µs | -44 µs | -74 | 3
1024 | 1024 | 12 µs | 2 µs | 960 ns | -9.9 µs | -83 | 10
1024 | 2048 | 4.7 µs | 1.5 µs | 1.2 µs | -3.2 µs | -67 | 3
1024 | 4096 | 11 µs | 3.1 µs | 4.6 µs | -8.2 µs | -73 | 2
1024 | 8192 | 22 µs | 5.8 µs | 8.8 µs | -17 µs | -74 | 2
1024 | 16384 | 35 µs | 7.4 µs | 7.5 µs | -28 µs | -79 | 4
2048 | 2048 | 24 µs | 3.8 µs | 1.5 µs | -20 µs | -84 | 13
2048 | 4096 | 17 µs | 4.1 µs | 5 µs | -13 µs | -75 | 3
2048 | 8192 | 27 µs | 7.9 µs | 8 µs | -19 µs | -71 | 2
2048 | 16384 | 45 µs | 12 µs | 18 µs | -33 µs | -74 | 2
4096 | 4096 | 51 µs | 8.3 µs | 1.7 µs | -42 µs | -84 | 24
4096 | 8192 | 28 µs | 7.5 µs | 8.6 µs | -21 µs | -73 | 2
4096 | 16384 | 28 µs | 8.4 µs | 1.1 µs | -19 µs | -70 | 17
8192 | 8192 | 80 µs | 14 µs | 1.3 µs | -66 µs | -82 | 49
8192 | 16384 | 66 µs | 16 µs | 20 µs | -50 µs | -76 | 3
16384 | 16384 | 200 µs | 30 µs | 2.4 µs | -170 µs | -85 | 71
(136 rows)

Since microbenchmark results are not normally distributed,
the sigmas and stddev columns unfortunately don't say much at all,
they are just an attempt to give some form of indication of variance.
Any ideas on better indicators appreciated.

Here are the same report for Intel Core i9-14900K:

select x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where summary like 'Optimise numeric division using base-NBASE^2 arithmetic%' and function_name = 'numeric_div' order by 1,2;
var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
-------------+-------------+--------+--------+---------------+------------+----------+--------
1 | 1 | 72 ns | 72 ns | 6.8 ns | -1.9e-10 s | 0 | 0
1 | 2 | 76 ns | 77 ns | 8.5 ns | 1.1 ns | 1 | 0
1 | 3 | 87 ns | 120 ns | 10 ns | 38 ns | 43 | 4
1 | 4 | 98 ns | 120 ns | 12 ns | 27 ns | 27 | 2
1 | 8 | 340 ns | 130 ns | 19 ns | -200 ns | -60 | 11
1 | 16 | 500 ns | 160 ns | 34 ns | -350 ns | -69 | 10
1 | 32 | 850 ns | 200 ns | 81 ns | -660 ns | -77 | 8
1 | 64 | 1.6 µs | 300 ns | 130 ns | -1.3 µs | -82 | 11
1 | 128 | 3.2 µs | 520 ns | 180 ns | -2.7 µs | -84 | 15
1 | 256 | 2.2 µs | 570 ns | 360 ns | -1.6 µs | -74 | 5
1 | 512 | 4.4 µs | 1.1 µs | 1.2 µs | -3.3 µs | -75 | 3
1 | 1024 | 4.9 µs | 1 µs | 1 µs | -3.9 µs | -79 | 4
1 | 2048 | 15 µs | 4.2 µs | 4.1 µs | -11 µs | -72 | 3
1 | 4096 | 42 µs | 11 µs | 3.3 µs | -31 µs | -75 | 10
1 | 8192 | 86 µs | 22 µs | 4.5 µs | -65 µs | -75 | 15
1 | 16384 | 65 µs | 17 µs | 4 µs | -47 µs | -73 | 12
2 | 2 | 78 ns | 83 ns | 8.2 ns | 5.1 ns | 7 | 1
2 | 3 | 89 ns | 120 ns | 12 ns | 29 ns | 33 | 3
2 | 4 | 97 ns | 130 ns | 13 ns | 29 ns | 30 | 2
2 | 8 | 310 ns | 130 ns | 30 ns | -180 ns | -58 | 6
2 | 16 | 520 ns | 160 ns | 34 ns | -360 ns | -69 | 11
2 | 32 | 980 ns | 210 ns | 14 ns | -780 ns | -79 | 57
2 | 64 | 1.6 µs | 300 ns | 150 ns | -1.3 µs | -81 | 9
2 | 128 | 3.1 µs | 530 ns | 270 ns | -2.6 µs | -83 | 10
2 | 256 | 2.8 µs | 710 ns | 150 ns | -2.1 µs | -74 | 14
2 | 512 | 4.2 µs | 1.1 µs | 720 ns | -3.1 µs | -73 | 4
2 | 1024 | 6.7 µs | 2.2 µs | 1.5 µs | -4.5 µs | -67 | 3
2 | 2048 | 7.7 µs | 2 µs | 690 ns | -5.7 µs | -74 | 8
2 | 4096 | 20 µs | 4 µs | 4 µs | -16 µs | -81 | 4
2 | 8192 | 65 µs | 13 µs | 12 µs | -52 µs | -80 | 4
2 | 16384 | 100 µs | 26 µs | 39 µs | -78 µs | -75 | 2
3 | 3 | 87 ns | 130 ns | 10 ns | 39 ns | 45 | 4
3 | 4 | 100 ns | 130 ns | 13 ns | 27 ns | 27 | 2
3 | 8 | 330 ns | 120 ns | 21 ns | -210 ns | -63 | 10
3 | 16 | 470 ns | 160 ns | 38 ns | -310 ns | -66 | 8
3 | 32 | 910 ns | 190 ns | 72 ns | -720 ns | -79 | 10
3 | 64 | 1.7 µs | 300 ns | 130 ns | -1.4 µs | -83 | 11
3 | 128 | 3.2 µs | 510 ns | 250 ns | -2.7 µs | -84 | 11
3 | 256 | 1.9 µs | 570 ns | 530 ns | -1.4 µs | -71 | 3
3 | 512 | 3.3 µs | 770 ns | 1.2 µs | -2.5 µs | -77 | 2
3 | 1024 | 7.6 µs | 2 µs | 2.2 µs | -5.6 µs | -73 | 3
3 | 2048 | 12 µs | 3 µs | 4.8 µs | -9.4 µs | -76 | 2
3 | 4096 | 30 µs | 8.4 µs | 8.7 µs | -21 µs | -72 | 2
3 | 8192 | 68 µs | 17 µs | 19 µs | -51 µs | -75 | 3
3 | 16384 | 100 µs | 26 µs | 39 µs | -76 µs | -75 | 2
4 | 4 | 100 ns | 130 ns | 13 ns | 28 ns | 28 | 2
4 | 8 | 300 ns | 130 ns | 33 ns | -170 ns | -56 | 5
4 | 16 | 510 ns | 160 ns | 35 ns | -350 ns | -69 | 10
4 | 32 | 910 ns | 210 ns | 77 ns | -700 ns | -77 | 9
4 | 64 | 1.7 µs | 310 ns | 97 ns | -1.4 µs | -82 | 14
4 | 128 | 3.1 µs | 490 ns | 270 ns | -2.6 µs | -84 | 10
4 | 256 | 1.9 µs | 440 ns | 530 ns | -1.4 µs | -77 | 3
4 | 512 | 2.8 µs | 800 ns | 730 ns | -2 µs | -71 | 3
4 | 1024 | 9.4 µs | 2.1 µs | 1.6 µs | -7.3 µs | -77 | 5
4 | 2048 | 13 µs | 3 µs | 4.9 µs | -9.7 µs | -76 | 2
4 | 4096 | 34 µs | 8.4 µs | 9.9 µs | -26 µs | -75 | 3
4 | 8192 | 61 µs | 17 µs | 17 µs | -44 µs | -73 | 3
4 | 16384 | 120 µs | 26 µs | 34 µs | -89 µs | -77 | 3
8 | 8 | 310 ns | 120 ns | 33 ns | -180 ns | -59 | 6
8 | 16 | 540 ns | 170 ns | 30 ns | -360 ns | -68 | 12
8 | 32 | 930 ns | 200 ns | 75 ns | -730 ns | -79 | 10
8 | 64 | 1.7 µs | 310 ns | 120 ns | -1.4 µs | -82 | 12
8 | 128 | 3.3 µs | 510 ns | 270 ns | -2.7 µs | -84 | 10
8 | 256 | 4.7 µs | 880 ns | 330 ns | -3.8 µs | -81 | 11
8 | 512 | 3.7 µs | 800 ns | 1 µs | -2.9 µs | -79 | 3
8 | 1024 | 6.3 µs | 1.5 µs | 2.5 µs | -4.8 µs | -76 | 2
8 | 2048 | 14 µs | 3 µs | 4.2 µs | -11 µs | -79 | 3
8 | 4096 | 29 µs | 6.2 µs | 8.6 µs | -23 µs | -79 | 3
8 | 8192 | 32 µs | 8.3 µs | 2 µs | -24 µs | -74 | 12
8 | 16384 | 120 µs | 25 µs | 34 µs | -94 µs | -79 | 3
16 | 16 | 490 ns | 160 ns | 55 ns | -330 ns | -67 | 6
16 | 32 | 1 µs | 200 ns | 38 ns | -810 ns | -80 | 21
16 | 64 | 1.6 µs | 310 ns | 130 ns | -1.3 µs | -81 | 10
16 | 128 | 3.2 µs | 530 ns | 250 ns | -2.7 µs | -83 | 11
16 | 256 | 6.4 µs | 950 ns | 440 ns | -5.5 µs | -85 | 13
16 | 512 | 3.1 µs | 780 ns | 1.2 µs | -2.3 µs | -74 | 2
16 | 1024 | 5.2 µs | 1.5 µs | 1.4 µs | -3.8 µs | -72 | 3
16 | 2048 | 22 µs | 5.2 µs | 1.1 µs | -16 µs | -76 | 15
16 | 4096 | 29 µs | 8.4 µs | 8.1 µs | -21 µs | -71 | 3
16 | 8192 | 76 µs | 17 µs | 12 µs | -59 µs | -78 | 5
16 | 16384 | 120 µs | 26 µs | 12 µs | -90 µs | -77 | 8
32 | 32 | 970 ns | 220 ns | 98 ns | -740 ns | -77 | 8
32 | 64 | 1.7 µs | 310 ns | 140 ns | -1.3 µs | -81 | 10
32 | 128 | 3.3 µs | 520 ns | 280 ns | -2.8 µs | -84 | 10
32 | 256 | 6.9 µs | 980 ns | 250 ns | -5.9 µs | -86 | 23
32 | 512 | 3.7 µs | 790 ns | 1.1 µs | -2.9 µs | -79 | 3
32 | 1024 | 6.2 µs | 1.6 µs | 2.5 µs | -4.7 µs | -75 | 2
32 | 2048 | 22 µs | 5.3 µs | 840 ns | -17 µs | -76 | 20
32 | 4096 | 20 µs | 4 µs | 4.3 µs | -16 µs | -80 | 4
32 | 8192 | 58 µs | 13 µs | 17 µs | -45 µs | -78 | 3
32 | 16384 | 140 µs | 34 µs | 19 µs | -100 µs | -75 | 5
64 | 64 | 2 µs | 340 ns | 74 ns | -1.6 µs | -83 | 22
64 | 128 | 3.6 µs | 550 ns | 190 ns | -3.1 µs | -85 | 17
64 | 256 | 6.1 µs | 930 ns | 600 ns | -5.2 µs | -85 | 9
64 | 512 | 3.3 µs | 790 ns | 1.3 µs | -2.5 µs | -76 | 2
64 | 1024 | 9.5 µs | 2 µs | 1.5 µs | -7.4 µs | -79 | 5
64 | 2048 | 13 µs | 3.1 µs | 4.9 µs | -9.4 µs | -75 | 2
64 | 4096 | 39 µs | 11 µs | 4.3 µs | -29 µs | -73 | 7
64 | 8192 | 50 µs | 12 µs | 19 µs | -38 µs | -75 | 2
64 | 16384 | 64 µs | 17 µs | 4.5 µs | -47 µs | -73 | 10
128 | 128 | 3 µs | 540 ns | 210 ns | -2.5 µs | -82 | 12
128 | 256 | 6.9 µs | 1 µs | 390 ns | -5.9 µs | -85 | 15
128 | 512 | 2.7 µs | 810 ns | 730 ns | -1.9 µs | -70 | 3
128 | 1024 | 5.1 µs | 1.6 µs | 1.5 µs | -3.5 µs | -68 | 2
128 | 2048 | 20 µs | 5.2 µs | 2.4 µs | -15 µs | -74 | 6
128 | 4096 | 26 µs | 6.1 µs | 9.8 µs | -19 µs | -76 | 2
128 | 8192 | 60 µs | 17 µs | 17 µs | -44 µs | -72 | 3
128 | 16384 | 100 µs | 26 µs | 39 µs | -77 µs | -74 | 2
256 | 256 | 5.9 µs | 1 µs | 370 ns | -4.9 µs | -83 | 13
256 | 512 | 4.2 µs | 850 ns | 1.2 µs | -3.4 µs | -80 | 3
256 | 1024 | 12 µs | 2.6 µs | 180 ns | -9.3 µs | -78 | 51
256 | 2048 | 20 µs | 5 µs | 2.4 µs | -15 µs | -75 | 6
256 | 4096 | 30 µs | 6.3 µs | 8.8 µs | -24 µs | -79 | 3
256 | 8192 | 69 µs | 17 µs | 19 µs | -52 µs | -75 | 3
256 | 16384 | 150 µs | 35 µs | 23 µs | -120 µs | -78 | 5
512 | 512 | 14 µs | 2.1 µs | 1.1 µs | -12 µs | -85 | 11
512 | 1024 | 9.7 µs | 1.8 µs | 1.4 µs | -7.9 µs | -82 | 6
512 | 2048 | 10 µs | 2.1 µs | 2.4 µs | -8.3 µs | -79 | 3
512 | 4096 | 20 µs | 4.2 µs | 4.5 µs | -16 µs | -79 | 4
512 | 8192 | 88 µs | 21 µs | 4.6 µs | -67 µs | -76 | 15
512 | 16384 | 140 µs | 34 µs | 38 µs | -100 µs | -76 | 3
1024 | 1024 | 25 µs | 4 µs | 2.7 µs | -21 µs | -84 | 8
1024 | 2048 | 16 µs | 4.2 µs | 5.2 µs | -12 µs | -74 | 2
1024 | 4096 | 40 µs | 8.1 µs | 6.7 µs | -32 µs | -80 | 5
1024 | 8192 | 80 µs | 17 µs | 12 µs | -63 µs | -79 | 5
1024 | 16384 | 120 µs | 35 µs | 35 µs | -89 µs | -72 | 3
2048 | 2048 | 55 µs | 8 µs | 5.2 µs | -46 µs | -85 | 9
2048 | 4096 | 28 µs | 6.2 µs | 6 µs | -22 µs | -78 | 4
2048 | 8192 | 53 µs | 13 µs | 21 µs | -40 µs | -76 | 2
2048 | 16384 | 100 µs | 26 µs | 40 µs | -76 µs | -74 | 2
4096 | 4096 | 110 µs | 16 µs | 8.8 µs | -95 µs | -86 | 11
4096 | 8192 | 43 µs | 13 µs | 12 µs | -30 µs | -69 | 2
4096 | 16384 | 150 µs | 36 µs | 42 µs | -110 µs | -76 | 3
8192 | 8192 | 210 µs | 32 µs | 22 µs | -180 µs | -85 | 8
8192 | 16384 | 160 µs | 34 µs | 47 µs | -120 µs | -79 | 3
16384 | 16384 | 370 µs | 61 µs | 23 µs | -310 µs | -84 | 14
(136 rows)

Out of these, some appears to be slower, but not sure if actually so,
might just be noise, since quite few sigmas, and like said above,
the sigmas isn't very scientific since the data can't be assumed
to be normally distributed:

AMD Ryzen 9 7950X3D:

Intel Core i9-14900K:

Quite similar (var1ndigits,var2ndigits) pairs that seems slower between
the two CPUs, so maybe it actually is a slowdown.

These benchmark results were obtained by comparing
numeric_div() between HEAD (ff59d5d) and with
v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
applied.

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

This is how many microseconds 512 iterations took
when comparing HEAD vs v1-0001:

AMD Ryzen 9 7950X3D:
{21,31,31,21,21,21,32,34,21,21,22,35,36,35,39,21,35,21,21,30,21,21,21,20,31,36,22,22,37,33} -- HEAD (ff59d5d)
{49,60,32,56,48,55,48,48,32,32,48,48,32,63,48,49,49,56,48,49,33,47,32,47,33,55,55,56,33,32} -- v1-0001

Intel Core i9-14900K:
{45,36,46,45,49,45,46,49,46,46,46,45,49,49,45,33,46,46,36,49,45,49,46,45,46,49,45,49,46,45} -- HEAD (ff59d5d)
{70,63,70,70,70,51,63,45,69,69,63,70,70,69,70,62,69,70,63,70,69,69,63,69,64,70,69,63,64,51} -- v1-0001

n=30 (3 random vars * 10 measurements)

(The reason why Intel is slower than AMD is because the Intel is running at a fixed CPU frequency.)

Regards,
Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#4)

Re: Optimising numeric division

On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

Here is a more traditional benchmark,
which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:

SELECT setseed(0);
CREATE TABLE t AS
SELECT
random(111111111111::numeric,999999999999::numeric) AS var1,
random(111111111111::numeric,999999999999::numeric) AS var2
FROM generate_series(1,1e7);
EXPLAIN ANALYZE SELECT SUM(var1/var2) FROM t;

/*
* Intel Core i9-14900K
*/

-- HEAD (ff59d5d)
Execution Time: 575.141 ms
Execution Time: 572.179 ms
Execution Time: 571.394 ms

-- v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
Execution Time: 672.983 ms
Execution Time: 603.031 ms
Execution Time: 620.736 ms

/*
* AMD Ryzen 9 7950X3D
*/

-- HEAD (ff59d5d)
Execution Time: 561.349 ms
Execution Time: 516.365 ms
Execution Time: 510.782 ms

-- v1-0001-Optimise-numeric-division-using-base-NBASE-2-arit.patch
Execution Time: 659.049 ms
Execution Time: 607.035 ms
Execution Time: 600.026 ms

Regards,
Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#5)

Re: Optimising numeric division

On Sat, Aug 24, 2024, at 01:35, Joel Jacobson wrote:

On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

Here is a more traditional benchmark,
which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:

I tested just adding back div_var_int64, and it seems to help.

-- Intel Core i9-14900K:

select summary, x var1ndigits,y var2ndigits,a_avg,b_avg,pooled_stddev,abs_diff,rel_diff,sigmas from catbench.vreport where function_name = 'numeric_div' and summary like 'Add back div_var_int64%' and sigmas > 1 order by x,y;
summary | var1ndigits | var2ndigits | a_avg | b_avg | pooled_stddev | abs_diff | rel_diff | sigmas
--------------------------+-------------+-------------+--------+--------+---------------+----------+----------+--------
Add back div_var_int64. | 1 | 3 | 120 ns | 85 ns | 11 ns | -40 ns | -32 | 4
Add back div_var_int64. | 1 | 4 | 120 ns | 97 ns | 11 ns | -28 ns | -23 | 3
Add back div_var_int64. | 2 | 3 | 120 ns | 89 ns | 11 ns | -29 ns | -25 | 3
Add back div_var_int64. | 2 | 4 | 130 ns | 94 ns | 14 ns | -32 ns | -25 | 2
Add back div_var_int64. | 3 | 3 | 130 ns | 85 ns | 11 ns | -41 ns | -32 | 4
Add back div_var_int64. | 3 | 4 | 130 ns | 99 ns | 13 ns | -29 ns | -23 | 2
Add back div_var_int64. | 4 | 4 | 130 ns | 100 ns | 12 ns | -28 ns | -22 | 2
(7 rows)

Regards,
Joel

Dean Rasheed

dean.a.rasheed@gmail.com

over 1 year ago

In reply to: Joel Jacobson (#6)

2 attachment(s)

Re: Optimising numeric division

On Sat, 24 Aug 2024 at 08:26, Joel Jacobson <joel@compiler.org> wrote:

On Sat, Aug 24, 2024, at 01:35, Joel Jacobson wrote:

On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

Here is a more traditional benchmark,
which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:

I tested just adding back div_var_int64, and it seems to help.

Thanks for testing.

There does appear to be quite a lot of variability between platforms
over whether or not div_var_int64() is a win for 3 and 4 digit
divisors. Since this patch is primarily about improving div_var()'s
long division algorithm, it's probably best for it to not touch that,
so I've put div_var_int64() back in for now. We could possibly
investigate whether it can be improved separately.

Looking at your other test results, they seem to confirm my previous
observation that exact mode is faster than approximate mode for
var2ndigits <= 12 or so, so I've added code to do that.

I also expanded on the comments for the quotient-correction code a bit.

Regards,
Dean

Attachments:

v2-0001-Speed-up-numeric-division-by-always-using-the-fas.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Speed-up-numeric-division-by-always-using-the-fas.patchDownload

From 0b5ae26beb3b7cd0c68c566f01c527e59a264386 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Fri, 23 Aug 2024 09:54:27 +0100
Subject: [PATCH v2 1/2] Speed up numeric division by always using the "fast"
 algorithm.

Formerly there were two internal functions in numeric.c to perform
numeric division, div_var() and div_var_fast(). div_var() performed
division exactly to a specified rscale using Knuth's long division
algorithm, while div_var_fast() used the algorithm from the "FM"
library, which approximates each quotient digit using floating-point
arithmetic, and computes a truncated quotient with DIV_GUARD_DIGITS
extra digits. div_var_fast() could be many times faster than
div_var(), but did not guarantee correct results in all cases, and was
therefore only suitable for use in transcendental functions, where
small errors are acceptable.

This commit merges div_var() and div_var_fast() together into a single
function with an extra "exact" boolean parameter, which can be set to
false if the caller is OK with an approximate result. The new function
uses the faster algorithm from the "FM" library, except that when
"exact" is true, it does not truncate the computation with
DIV_GUARD_DIGITS extra digits, but instead performs the full-precision
computation, subtracting off complete multiples of the divisor for
each quotient digit. However, it is able to retain most of the
performance benefits of div_var_fast(), by delaying the propagation of
carries, allowing the inner loop to be auto-vectorized.

Since this may still lead to an inaccurate result, when "exact" is
true, it then inspects the remainder and uses that to adjust the
quotient, if necessary, to make it correct. In practice, the quotient
rarely needs to be adjusted, and never by more than one in the final
digit, though it's difficult to prove that, so the code allows for
larger adjustments, just in case.

In addition, use base-NBASE^2 arithmetic and a 64-bit dividend array,
similar to mul_var(), so that the number of iterations of the outer
loop is roughly halved. Together with the faster algorithm, this makes
div_var() up to around 20 times as fast as the old Knuth algorithm
when "exact" is true, and using base-NBASE^2 arithmetic makes it up to
2 or 3 times as fast as the old div_var_fast() function when "exact"
is false.
---
 src/backend/utils/adt/numeric.c | 829 ++++++++++++++------------------
 1 file changed, 352 insertions(+), 477 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 44d88e9007..46ac04035e 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -60,7 +60,7 @@
  * NBASE that's less than sqrt(INT_MAX), in practice we are only interested
  * in NBASE a power of ten, so that I/O conversions and decimal rounding
  * are easy.  Also, it's actually more efficient if NBASE is rather less than
- * sqrt(INT_MAX), so that there is "headroom" for mul_var and div_var_fast to
+ * sqrt(INT_MAX), so that there is "headroom" for mul_var and div_var to
  * postpone processing carries.
  *
  * Values of NBASE other than 10000 are considered of historical interest only
@@ -563,10 +563,7 @@ static void mul_var(const NumericVar *var1, const NumericVar *var2,
 static void mul_var_short(const NumericVar *var1, const NumericVar *var2,
 						  NumericVar *result);
 static void div_var(const NumericVar *var1, const NumericVar *var2,
-					NumericVar *result,
-					int rscale, bool round);
-static void div_var_fast(const NumericVar *var1, const NumericVar *var2,
-						 NumericVar *result, int rscale, bool round);
+					NumericVar *result, int rscale, bool round, bool exact);
 static void div_var_int(const NumericVar *var, int ival, int ival_weight,
 						NumericVar *result, int rscale, bool round);
 #ifdef HAVE_INT128
@@ -1958,7 +1955,7 @@ compute_bucket(Numeric operand, Numeric bound1, Numeric bound2,
 
 	mul_var(&operand_var, count_var, &operand_var,
 			operand_var.dscale + count_var->dscale);
-	div_var(&operand_var, &bound2_var, result_var, 0, false);
+	div_var(&operand_var, &bound2_var, result_var, 0, false, true);
 	add_var(result_var, &const_one, result_var);
 
 	free_var(&bound1_var);
@@ -3240,7 +3237,7 @@ numeric_div_opt_error(Numeric num1, Numeric num2, bool *have_error)
 	/*
 	 * Do the divide and return the result
 	 */
-	div_var(&arg1, &arg2, &result, rscale, true);
+	div_var(&arg1, &arg2, &result, rscale, true, true);
 
 	res = make_result_opt_error(&result, have_error);
 
@@ -3329,7 +3326,7 @@ numeric_div_trunc(PG_FUNCTION_ARGS)
 	/*
 	 * Do the divide and return the result
 	 */
-	div_var(&arg1, &arg2, &result, 0, false);
+	div_var(&arg1, &arg2, &result, 0, false, true);
 
 	res = make_result(&result);
 
@@ -3600,7 +3597,7 @@ numeric_lcm(PG_FUNCTION_ARGS)
 	else
 	{
 		gcd_var(&arg1, &arg2, &result);
-		div_var(&arg1, &result, &result, 0, false);
+		div_var(&arg1, &result, &result, 0, false, true);
 		mul_var(&arg2, &result, &result, arg2.dscale);
 		result.sign = NUMERIC_POS;
 	}
@@ -6272,7 +6269,7 @@ numeric_stddev_internal(NumericAggState *state,
 		else
 			mul_var(&vN, &vN, &vNminus1, 0);	/* N * N */
 		rscale = select_div_scale(&vsumX2, &vNminus1);
-		div_var(&vsumX2, &vNminus1, &vsumX, rscale, true);	/* variance */
+		div_var(&vsumX2, &vNminus1, &vsumX, rscale, true, true);	/* variance */
 		if (!variance)
 			sqrt_var(&vsumX, &vsumX, rscale);	/* stddev */
 
@@ -7690,7 +7687,7 @@ get_str_from_var_sci(const NumericVar *var, int rscale)
 	init_var(&tmp_var);
 
 	power_ten_int(exponent, &tmp_var);
-	div_var(var, &tmp_var, &tmp_var, rscale, true);
+	div_var(var, &tmp_var, &tmp_var, rscale, true, true);
 	sig_out = get_str_from_var(&tmp_var);
 
 	free_var(&tmp_var);
@@ -9229,32 +9226,48 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 /*
  * div_var() -
  *
- *	Division on variable level. Quotient of var1 / var2 is stored in result.
- *	The quotient is figured to exactly rscale fractional digits.
- *	If round is true, it is rounded at the rscale'th digit; if false, it
- *	is truncated (towards zero) at that digit.
+ *	Compute the quotient var1 / var2 to rscale fractional digits.
+ *
+ *	If "round" is true, the result is rounded at the rscale'th digit; if
+ *	false, it is truncated (towards zero) at that digit.
+ *
+ *	If "exact" is true, the exact result is computed to the specified rscale;
+ *	if false, successive quotient digits are approximated up to rscale plus
+ *	DIV_GUARD_DIGITS extra digits, ignoring all contributions from digits to
+ *	the right of that, before rounding or truncating to the specified rscale.
+ *	This can be significantly faster, and usually gives the same result as the
+ *	exact computation, but it may occasionally be off by one in the final
+ *	digit, if contributions from the ignored digits would have propagated
+ *	through the guard digits.  This is good enough for the transcendental
+ *	functions, where small errors are acceptable.
  */
 static void
 div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
-		int rscale, bool round)
+		int rscale, bool round, bool exact)
 {
-	int			div_ndigits;
-	int			res_ndigits;
+	int			var1ndigits = var1->ndigits;
+	int			var2ndigits = var2->ndigits;
 	int			res_sign;
 	int			res_weight;
-	int			carry;
-	int			borrow;
-	int			divisor1;
-	int			divisor2;
-	NumericDigit *dividend;
-	NumericDigit *divisor;
+	int			res_ndigits;
+	int			var1ndigitpairs;
+	int			var2ndigitpairs;
+	int			res_ndigitpairs;
+	int			div_ndigitpairs;
+	int64	   *dividend;
+	int32	   *divisor;
+	double		fdivisor,
+				fdivisorinverse,
+				fdividend,
+				fquotient;
+	int64		maxdiv;
+	int			qi;
+	int32		qdigit;
+	int64		carry;
+	int64		newdig;
+	int64	   *remainder;
 	NumericDigit *res_digits;
 	int			i;
-	int			j;
-
-	/* copy these values into local vars for speed in inner loop */
-	int			var1ndigits = var1->ndigits;
-	int			var2ndigits = var2->ndigits;
 
 	/*
 	 * First of all division by zero check; we must not be handed an
@@ -9268,9 +9281,6 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	/*
 	 * If the divisor has just one or two digits, delegate to div_var_int(),
 	 * which uses fast short division.
-	 *
-	 * Similarly, on platforms with 128-bit integer support, delegate to
-	 * div_var_int64() for divisors with three or four digits.
 	 */
 	if (var2ndigits <= 2)
 	{
@@ -9323,6 +9333,23 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		return;
 	}
 
+	/*
+	 * The approximate computation can be significantly faster than the exact
+	 * one, since the working dividend is var2ndigitpairs base-NBASE^2 digits
+	 * shorter below.  However, that comes with the tradeoff of computing
+	 * DIV_GUARD_DIGITS extra base-NBASE result digits.  Ignoring all other
+	 * overheads, that suggests that, in theory, the approximate computation
+	 * will only be faster than the exact one when var2ndigits is greater than
+	 * 2 * (DIV_GUARD_DIGITS + 1), independent of the size of var1.
+	 *
+	 * Thus, we're better off doing an exact computation when var2 is shorter
+	 * than this.  Empirically, it has been found that the exact threshold is
+	 * a little higher (likely due to other overheads in the outer division
+	 * loop below).
+	 */
+	if (var2ndigits <= 2 * (DIV_GUARD_DIGITS + 2))
+		exact = true;
+
 	/*
 	 * Determine the result sign, weight and number of digits to calculate.
 	 * The weight figured here is correct if the emitted quotient has no
@@ -9332,7 +9359,7 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		res_sign = NUMERIC_POS;
 	else
 		res_sign = NUMERIC_NEG;
-	res_weight = var1->weight - var2->weight;
+	res_weight = var1->weight - var2->weight + 1;
 	/* The number of accurate result digits we need to produce: */
 	res_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
 	/* ... but always at least 1 */
@@ -9340,476 +9367,210 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	/* If rounding needed, figure one more digit to ensure correct result */
 	if (round)
 		res_ndigits++;
+	/* Add guard digits for roundoff error when producing approx result */
+	if (!exact)
+		res_ndigits += DIV_GUARD_DIGITS;
 
 	/*
-	 * The working dividend normally requires res_ndigits + var2ndigits
-	 * digits, but make it at least var1ndigits so we can load all of var1
-	 * into it.  (There will be an additional digit dividend[0] in the
-	 * dividend space, but for consistency with Knuth's notation we don't
-	 * count that in div_ndigits.)
-	 */
-	div_ndigits = res_ndigits + var2ndigits;
-	div_ndigits = Max(div_ndigits, var1ndigits);
-
-	/*
-	 * We need a workspace with room for the working dividend (div_ndigits+1
-	 * digits) plus room for the possibly-normalized divisor (var2ndigits
-	 * digits).  It is convenient also to have a zero at divisor[0] with the
-	 * actual divisor data in divisor[1 .. var2ndigits].  Transferring the
-	 * digits into the workspace also allows us to realloc the result (which
-	 * might be the same as either input var) before we begin the main loop.
-	 * Note that we use palloc0 to ensure that divisor[0], dividend[0], and
-	 * any additional dividend positions beyond var1ndigits, start out 0.
-	 */
-	dividend = (NumericDigit *)
-		palloc0((div_ndigits + var2ndigits + 2) * sizeof(NumericDigit));
-	divisor = dividend + (div_ndigits + 1);
-	memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
-	memcpy(divisor + 1, var2->digits, var2ndigits * sizeof(NumericDigit));
-
-	/*
-	 * Now we can realloc the result to hold the generated quotient digits.
+	 * The computation itself is done using base-NBASE^2 arithmetic, so we
+	 * actually process the input digits in pairs, producing base-NBASE^2
+	 * intermediate result digits.  This significantly improves performance,
+	 * since the computation is O(N^2) in the number of input digits, and
+	 * working in base NBASE^2 effectively halves "N".
 	 */
-	alloc_var(result, res_ndigits);
-	res_digits = result->digits;
+	var1ndigitpairs = (var1ndigits + 1) / 2;
+	var2ndigitpairs = (var2ndigits + 1) / 2;
+	res_ndigitpairs = (res_ndigits + 1) / 2;
+	res_ndigits = 2 * res_ndigitpairs;
 
 	/*
-	 * The full multiple-place algorithm is taken from Knuth volume 2,
-	 * Algorithm 4.3.1D.
+	 * We do the arithmetic in an array "dividend[]" of signed 64-bit
+	 * integers.  Since PG_INT64_MAX is much larger than NBASE^4, this gives
+	 * us a lot of headroom to avoid normalizing carries immediately.
+	 *
+	 * When performing an exact computation, the working dividend requires
+	 * res_ndigitpairs + var2ndigitpairs digits.  If var1 is larger than that,
+	 * the extra digits do not contribute to the result, and are ignored.
 	 *
-	 * We need the first divisor digit to be >= NBASE/2.  If it isn't, make it
-	 * so by scaling up both the divisor and dividend by the factor "d".  (The
-	 * reason for allocating dividend[0] above is to leave room for possible
-	 * carry here.)
+	 * When performing an approximate computation, the working dividend only
+	 * requires res_ndigitpairs digits (which includes the extra guard
+	 * digits).  All input digits beyond that are ignored.
 	 */
-	if (divisor[1] < HALF_NBASE)
+	if (exact)
 	{
-		int			d = NBASE / (divisor[1] + 1);
-
-		carry = 0;
-		for (i = var2ndigits; i > 0; i--)
-		{
-			carry += divisor[i] * d;
-			divisor[i] = carry % NBASE;
-			carry = carry / NBASE;
-		}
-		Assert(carry == 0);
-		carry = 0;
-		/* at this point only var1ndigits of dividend can be nonzero */
-		for (i = var1ndigits; i >= 0; i--)
-		{
-			carry += dividend[i] * d;
-			dividend[i] = carry % NBASE;
-			carry = carry / NBASE;
-		}
-		Assert(carry == 0);
-		Assert(divisor[1] >= HALF_NBASE);
+		div_ndigitpairs = res_ndigitpairs + var2ndigitpairs;
+		var1ndigitpairs = Min(var1ndigitpairs, div_ndigitpairs);
 	}
-	/* First 2 divisor digits are used repeatedly in main loop */
-	divisor1 = divisor[1];
-	divisor2 = divisor[2];
-
-	/*
-	 * Begin the main loop.  Each iteration of this loop produces the j'th
-	 * quotient digit by dividing dividend[j .. j + var2ndigits] by the
-	 * divisor; this is essentially the same as the common manual procedure
-	 * for long division.
-	 */
-	for (j = 0; j < res_ndigits; j++)
+	else
 	{
-		/* Estimate quotient digit from the first two dividend digits */
-		int			next2digits = dividend[j] * NBASE + dividend[j + 1];
-		int			qhat;
-
-		/*
-		 * If next2digits are 0, then quotient digit must be 0 and there's no
-		 * need to adjust the working dividend.  It's worth testing here to
-		 * fall out ASAP when processing trailing zeroes in a dividend.
-		 */
-		if (next2digits == 0)
-		{
-			res_digits[j] = 0;
-			continue;
-		}
-
-		if (dividend[j] == divisor1)
-			qhat = NBASE - 1;
-		else
-			qhat = next2digits / divisor1;
-
-		/*
-		 * Adjust quotient digit if it's too large.  Knuth proves that after
-		 * this step, the quotient digit will be either correct or just one
-		 * too large.  (Note: it's OK to use dividend[j+2] here because we
-		 * know the divisor length is at least 2.)
-		 */
-		while (divisor2 * qhat >
-			   (next2digits - qhat * divisor1) * NBASE + dividend[j + 2])
-			qhat--;
-
-		/* As above, need do nothing more when quotient digit is 0 */
-		if (qhat > 0)
-		{
-			NumericDigit *dividend_j = &dividend[j];
-
-			/*
-			 * Multiply the divisor by qhat, and subtract that from the
-			 * working dividend.  The multiplication and subtraction are
-			 * folded together here, noting that qhat <= NBASE (since it might
-			 * be one too large), and so the intermediate result "tmp_result"
-			 * is in the range [-NBASE^2, NBASE - 1], and "borrow" is in the
-			 * range [0, NBASE].
-			 */
-			borrow = 0;
-			for (i = var2ndigits; i >= 0; i--)
-			{
-				int			tmp_result;
-
-				tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
-				borrow = (NBASE - 1 - tmp_result) / NBASE;
-				dividend_j[i] = tmp_result + borrow * NBASE;
-			}
-
-			/*
-			 * If we got a borrow out of the top dividend digit, then indeed
-			 * qhat was one too large.  Fix it, and add back the divisor to
-			 * correct the working dividend.  (Knuth proves that this will
-			 * occur only about 3/NBASE of the time; hence, it's a good idea
-			 * to test this code with small NBASE to be sure this section gets
-			 * exercised.)
-			 */
-			if (borrow)
-			{
-				qhat--;
-				carry = 0;
-				for (i = var2ndigits; i >= 0; i--)
-				{
-					carry += dividend_j[i] + divisor[i];
-					if (carry >= NBASE)
-					{
-						dividend_j[i] = carry - NBASE;
-						carry = 1;
-					}
-					else
-					{
-						dividend_j[i] = carry;
-						carry = 0;
-					}
-				}
-				/* A carry should occur here to cancel the borrow above */
-				Assert(carry == 1);
-			}
-		}
-
-		/* And we're done with this quotient digit */
-		res_digits[j] = qhat;
+		div_ndigitpairs = res_ndigitpairs;
+		var1ndigitpairs = Min(var1ndigitpairs, div_ndigitpairs);
+		var2ndigitpairs = Min(var2ndigitpairs, div_ndigitpairs);
 	}
 
-	pfree(dividend);
-
-	/*
-	 * Finally, round or truncate the result to the requested precision.
-	 */
-	result->weight = res_weight;
-	result->sign = res_sign;
-
-	/* Round or truncate to target rscale (and set result->dscale) */
-	if (round)
-		round_var(result, rscale);
-	else
-		trunc_var(result, rscale);
-
-	/* Strip leading and trailing zeroes */
-	strip_var(result);
-}
-
-
-/*
- * div_var_fast() -
- *
- *	This has the same API as div_var, but is implemented using the division
- *	algorithm from the "FM" library, rather than Knuth's schoolbook-division
- *	approach.  This is significantly faster but can produce inaccurate
- *	results, because it sometimes has to propagate rounding to the left,
- *	and so we can never be entirely sure that we know the requested digits
- *	exactly.  We compute DIV_GUARD_DIGITS extra digits, but there is
- *	no certainty that that's enough.  We use this only in the transcendental
- *	function calculation routines, where everything is approximate anyway.
- *
- *	Although we provide a "round" argument for consistency with div_var,
- *	it is unwise to use this function with round=false.  In truncation mode
- *	it is possible to get a result with no significant digits, for example
- *	with rscale=0 we might compute 0.99999... and truncate that to 0 when
- *	the correct answer is 1.
- */
-static void
-div_var_fast(const NumericVar *var1, const NumericVar *var2,
-			 NumericVar *result, int rscale, bool round)
-{
-	int			div_ndigits;
-	int			load_ndigits;
-	int			res_sign;
-	int			res_weight;
-	int		   *div;
-	int			qdigit;
-	int			carry;
-	int			maxdiv;
-	int			newdig;
-	NumericDigit *res_digits;
-	double		fdividend,
-				fdivisor,
-				fdivisorinverse,
-				fquotient;
-	int			qi;
-	int			i;
-
-	/* copy these values into local vars for speed in inner loop */
-	int			var1ndigits = var1->ndigits;
-	int			var2ndigits = var2->ndigits;
-	NumericDigit *var1digits = var1->digits;
-	NumericDigit *var2digits = var2->digits;
-
 	/*
-	 * First of all division by zero check; we must not be handed an
-	 * unnormalized divisor.
-	 */
-	if (var2ndigits == 0 || var2digits[0] == 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_DIVISION_BY_ZERO),
-				 errmsg("division by zero")));
-
-	/*
-	 * If the divisor has just one or two digits, delegate to div_var_int(),
-	 * which uses fast short division.
+	 * Allocate room for the working dividend (div_ndigitpairs 64-bit digits)
+	 * plus the divisor (var2ndigitpairs 32-bit base-NBASE^2 digits).
 	 *
-	 * Similarly, on platforms with 128-bit integer support, delegate to
-	 * div_var_int64() for divisors with three or four digits.
+	 * For convenience, we allocate one extra dividend digit, which is set to
+	 * zero and not counted in div_ndigitpairs.  This is used when expanding
+	 * the remainder down, and also so that we don't need to worry about
+	 * reading off the end of the dividend array when computing fdividend.
 	 */
-	if (var2ndigits <= 2)
-	{
-		int			idivisor;
-		int			idivisor_weight;
+	dividend = (int64 *) palloc((div_ndigitpairs + 1) * sizeof(int64) +
+								var2ndigitpairs * sizeof(int32));
+	divisor = (int32 *) (dividend + div_ndigitpairs + 1);
 
-		idivisor = var2->digits[0];
-		idivisor_weight = var2->weight;
-		if (var2ndigits == 2)
-		{
-			idivisor = idivisor * NBASE + var2->digits[1];
-			idivisor_weight--;
-		}
-		if (var2->sign == NUMERIC_NEG)
-			idivisor = -idivisor;
+	/* load var1 into dividend[0 .. var1ndigitpairs-1], zeroing the rest */
+	for (i = 0; i < var1ndigitpairs - 1; i++)
+		dividend[i] = var1->digits[2 * i] * NBASE + var1->digits[2 * i + 1];
 
-		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
-		return;
-	}
-#ifdef HAVE_INT128
-	if (var2ndigits <= 4)
-	{
-		int64		idivisor;
-		int			idivisor_weight;
-
-		idivisor = var2->digits[0];
-		idivisor_weight = var2->weight;
-		for (i = 1; i < var2ndigits; i++)
-		{
-			idivisor = idivisor * NBASE + var2->digits[i];
-			idivisor_weight--;
-		}
-		if (var2->sign == NUMERIC_NEG)
-			idivisor = -idivisor;
-
-		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
-		return;
-	}
-#endif
+	if (2 * i + 1 < var1ndigits)
+		dividend[i] = var1->digits[2 * i] * NBASE + var1->digits[2 * i + 1];
+	else
+		dividend[i] = var1->digits[2 * i] * NBASE;
 
-	/*
-	 * Otherwise, perform full long division.
-	 */
+	memset(dividend + i + 1, 0, (div_ndigitpairs - i) * sizeof(int64));
 
-	/* Result zero check */
-	if (var1ndigits == 0)
-	{
-		zero_var(result);
-		result->dscale = rscale;
-		return;
-	}
+	/* load var2 into divisor[0 .. var2ndigitpairs-1] */
+	for (i = 0; i < var2ndigitpairs - 1; i++)
+		divisor[i] = var2->digits[2 * i] * NBASE + var2->digits[2 * i + 1];
 
-	/*
-	 * Determine the result sign, weight and number of digits to calculate
-	 */
-	if (var1->sign == var2->sign)
-		res_sign = NUMERIC_POS;
+	if (2 * i + 1 < var2ndigits)
+		divisor[i] = var2->digits[2 * i] * NBASE + var2->digits[2 * i + 1];
 	else
-		res_sign = NUMERIC_NEG;
-	res_weight = var1->weight - var2->weight + 1;
-	/* The number of accurate result digits we need to produce: */
-	div_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
-	/* Add guard digits for roundoff error */
-	div_ndigits += DIV_GUARD_DIGITS;
-	if (div_ndigits < DIV_GUARD_DIGITS)
-		div_ndigits = DIV_GUARD_DIGITS;
-
-	/*
-	 * We do the arithmetic in an array "div[]" of signed int's.  Since
-	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
-	 * to avoid normalizing carries immediately.
-	 *
-	 * We start with div[] containing one zero digit followed by the
-	 * dividend's digits (plus appended zeroes to reach the desired precision
-	 * including guard digits).  Each step of the main loop computes an
-	 * (approximate) quotient digit and stores it into div[], removing one
-	 * position of dividend space.  A final pass of carry propagation takes
-	 * care of any mistaken quotient digits.
-	 *
-	 * Note that div[] doesn't necessarily contain all of the digits from the
-	 * dividend --- the desired precision plus guard digits might be less than
-	 * the dividend's precision.  This happens, for example, in the square
-	 * root algorithm, where we typically divide a 2N-digit number by an
-	 * N-digit number, and only require a result with N digits of precision.
-	 */
-	div = (int *) palloc0((div_ndigits + 1) * sizeof(int));
-	load_ndigits = Min(div_ndigits, var1ndigits);
-	for (i = 0; i < load_ndigits; i++)
-		div[i + 1] = var1digits[i];
+		divisor[i] = var2->digits[2 * i] * NBASE;
 
 	/*
 	 * We estimate each quotient digit using floating-point arithmetic, taking
-	 * the first four digits of the (current) dividend and divisor.  This must
-	 * be float to avoid overflow.  The quotient digits will generally be off
-	 * by no more than one from the exact answer.
-	 */
-	fdivisor = (double) var2digits[0];
-	for (i = 1; i < 4; i++)
-	{
-		fdivisor *= NBASE;
-		if (i < var2ndigits)
-			fdivisor += (double) var2digits[i];
-	}
+	 * the first 2 base-NBASE^2 digits of the (current) dividend and divisor.
+	 * This must be float to avoid overflow.
+	 *
+	 * Since the floating-point dividend and divisor use at least 4 base-NBASE
+	 * input digits, they include roughly 40-53 bits of information from their
+	 * respective inputs (assuming NBASE is 10000), which fits well in IEEE
+	 * double-precision variables.  In addition, the relative error in the
+	 * floating-point quotient digit will be less than around 1/NBASE^3, so
+	 * the estimated base-NBASE^2 quotient digit will typically be correct,
+	 * and should not be off by more than one from the correct value.
+	 */
+	fdivisor = (double) divisor[0] * NBASE_SQR;
+	if (var2ndigitpairs > 1)
+		fdivisor += (double) divisor[1];
 	fdivisorinverse = 1.0 / fdivisor;
 
 	/*
-	 * maxdiv tracks the maximum possible absolute value of any div[] entry;
-	 * when this threatens to exceed INT_MAX, we take the time to propagate
-	 * carries.  Furthermore, we need to ensure that overflow doesn't occur
-	 * during the carry propagation passes either.  The carry values may have
-	 * an absolute value as high as INT_MAX/NBASE + 1, so really we must
-	 * normalize when digits threaten to exceed INT_MAX - INT_MAX/NBASE - 1.
+	 * maxdiv tracks the maximum possible absolute value of any dividend[]
+	 * entry; when this threatens to exceed PG_INT64_MAX, we take the time to
+	 * propagate carries.  Furthermore, we need to ensure that overflow
+	 * doesn't occur during the carry propagation passes either.  The carry
+	 * values may have an absolute value as high as PG_INT64_MAX/NBASE^2 + 1,
+	 * so really we must normalize when digits threaten to exceed PG_INT64_MAX
+	 * - PG_INT64_MAX/NBASE^2 - 1.
 	 *
 	 * To avoid overflow in maxdiv itself, it represents the max absolute
-	 * value divided by NBASE-1, ie, at the top of the loop it is known that
-	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+	 * value divided by NBASE^2-1, ie, at the top of the loop it is known that
+	 * no dividend[] entry has an absolute value exceeding maxdiv *
+	 * (NBASE^2-1).
 	 *
-	 * Actually, though, that holds good only for div[] entries after div[qi];
-	 * the adjustment done at the bottom of the loop may cause div[qi + 1] to
-	 * exceed the maxdiv limit, so that div[qi] in the next iteration is
-	 * beyond the limit.  This does not cause problems, as explained below.
+	 * Actually, though, that holds good only for dividend[] entries after
+	 * dividend[qi]; the adjustment done at the bottom of the loop may cause
+	 * dividend[qi + 1] to exceed the maxdiv limit, so that dividend[qi] in
+	 * the next iteration is beyond the limit.  This does not cause problems,
+	 * as explained below.
 	 */
 	maxdiv = 1;
 
 	/*
-	 * Outer loop computes next quotient digit, which will go into div[qi]
+	 * Outer loop computes next quotient digit, which goes in dividend[qi].
 	 */
-	for (qi = 0; qi < div_ndigits; qi++)
+	for (qi = 0; qi < res_ndigitpairs; qi++)
 	{
 		/* Approximate the current dividend value */
-		fdividend = (double) div[qi];
-		for (i = 1; i < 4; i++)
-		{
-			fdividend *= NBASE;
-			if (qi + i <= div_ndigits)
-				fdividend += (double) div[qi + i];
-		}
+		fdividend = (double) dividend[qi] * NBASE_SQR;
+		fdividend += (double) dividend[qi + 1];
+
 		/* Compute the (approximate) quotient digit */
 		fquotient = fdividend * fdivisorinverse;
-		qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
-			(((int) fquotient) - 1);	/* truncate towards -infinity */
+		qdigit = (fquotient >= 0.0) ? ((int32) fquotient) :
+			(((int32) fquotient) - 1);	/* truncate towards -infinity */
 
 		if (qdigit != 0)
 		{
 			/* Do we need to normalize now? */
-			maxdiv += abs(qdigit);
-			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
+			maxdiv += i64abs(qdigit);
+			if (maxdiv > (PG_INT64_MAX - PG_INT64_MAX / NBASE_SQR - 1) / (NBASE_SQR - 1))
 			{
 				/*
-				 * Yes, do it.  Note that if var2ndigits is much smaller than
-				 * div_ndigits, we can save a significant amount of effort
-				 * here by noting that we only need to normalise those div[]
-				 * entries touched where prior iterations subtracted multiples
-				 * of the divisor.
+				 * Yes, do it.  Note that if var2ndigitpairs is much smaller
+				 * than div_ndigitpairs, we can save a significant amount of
+				 * effort here by noting that we only need to normalise those
+				 * dividend[] entries touched where prior iterations
+				 * subtracted multiples of the divisor.
 				 */
 				carry = 0;
-				for (i = Min(qi + var2ndigits - 2, div_ndigits); i > qi; i--)
+				for (i = Min(qi + var2ndigitpairs - 2, div_ndigitpairs - 1); i > qi; i--)
 				{
-					newdig = div[i] + carry;
+					newdig = dividend[i] + carry;
 					if (newdig < 0)
 					{
-						carry = -((-newdig - 1) / NBASE) - 1;
-						newdig -= carry * NBASE;
+						carry = -((-newdig - 1) / NBASE_SQR) - 1;
+						newdig -= carry * NBASE_SQR;
 					}
-					else if (newdig >= NBASE)
+					else if (newdig >= NBASE_SQR)
 					{
-						carry = newdig / NBASE;
-						newdig -= carry * NBASE;
+						carry = newdig / NBASE_SQR;
+						newdig -= carry * NBASE_SQR;
 					}
 					else
 						carry = 0;
-					div[i] = newdig;
+					dividend[i] = newdig;
 				}
-				newdig = div[qi] + carry;
-				div[qi] = newdig;
+				dividend[qi] += carry;
 
 				/*
-				 * All the div[] digits except possibly div[qi] are now in the
-				 * range 0..NBASE-1.  We do not need to consider div[qi] in
-				 * the maxdiv value anymore, so we can reset maxdiv to 1.
+				 * All the dividend[] digits except possibly dividend[qi] are
+				 * now in the range 0..NBASE^2-1.  We do not need to consider
+				 * dividend[qi] in the maxdiv value anymore, so we can reset
+				 * maxdiv to 1.
 				 */
 				maxdiv = 1;
 
 				/*
 				 * Recompute the quotient digit since new info may have
-				 * propagated into the top four dividend digits
+				 * propagated into the top two dividend digits.
 				 */
-				fdividend = (double) div[qi];
-				for (i = 1; i < 4; i++)
-				{
-					fdividend *= NBASE;
-					if (qi + i <= div_ndigits)
-						fdividend += (double) div[qi + i];
-				}
-				/* Compute the (approximate) quotient digit */
+				fdividend = (double) dividend[qi] * NBASE_SQR;
+				fdividend += (double) dividend[qi + 1];
 				fquotient = fdividend * fdivisorinverse;
-				qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
-					(((int) fquotient) - 1);	/* truncate towards -infinity */
-				maxdiv += abs(qdigit);
+				qdigit = (fquotient >= 0.0) ? ((int32) fquotient) :
+					(((int32) fquotient) - 1);	/* truncate towards -infinity */
+
+				maxdiv += i64abs(qdigit);
 			}
 
 			/*
 			 * Subtract off the appropriate multiple of the divisor.
 			 *
-			 * The digits beyond div[qi] cannot overflow, because we know they
-			 * will fall within the maxdiv limit.  As for div[qi] itself, note
-			 * that qdigit is approximately trunc(div[qi] / vardigits[0]),
-			 * which would make the new value simply div[qi] mod vardigits[0].
-			 * The lower-order terms in qdigit can change this result by not
-			 * more than about twice INT_MAX/NBASE, so overflow is impossible.
+			 * The digits beyond dividend[qi] cannot overflow, because we know
+			 * they will fall within the maxdiv limit.  As for dividend[qi]
+			 * itself, note that qdigit is approximately trunc(dividend[qi] /
+			 * divisor[0]), which would make the new value simply dividend[qi] *
+			 * mod divisor[0].  The lower-order terms in qdigit can change
+			 * this result by not more than about twice PG_INT64_MAX/NBASE^2,
+			 * so overflow is impossible.
 			 *
 			 * This inner loop is the performance bottleneck for division, so
 			 * code it in the same way as the inner loop of mul_var() so that
-			 * it can be auto-vectorized.  We cast qdigit to NumericDigit
-			 * before multiplying to allow the compiler to generate more
-			 * efficient code (using 16-bit multiplication), which is safe
-			 * since we know that the quotient digit is off by at most one, so
-			 * there is no overflow risk.
+			 * it can be auto-vectorized.
 			 */
 			if (qdigit != 0)
 			{
-				int			istop = Min(var2ndigits, div_ndigits - qi + 1);
-				int		   *div_qi = &div[qi];
+				int			istop = Min(var2ndigitpairs, div_ndigitpairs - qi);
+				int64	   *dividend_qi = &dividend[qi];
 
 				for (i = 0; i < istop; i++)
-					div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i];
+					dividend_qi[i] -= (int64) qdigit * divisor[i];
 			}
 		}
 
@@ -9818,78 +9579,192 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2,
 		 * Fold it into the next digit position.
 		 *
 		 * There is no risk of overflow here, although proving that requires
-		 * some care.  Much as with the argument for div[qi] not overflowing,
-		 * if we consider the first two terms in the numerator and denominator
-		 * of qdigit, we can see that the final value of div[qi + 1] will be
-		 * approximately a remainder mod (vardigits[0]*NBASE + vardigits[1]).
-		 * Accounting for the lower-order terms is a bit complicated but ends
-		 * up adding not much more than INT_MAX/NBASE to the possible range.
-		 * Thus, div[qi + 1] cannot overflow here, and in its role as div[qi]
-		 * in the next loop iteration, it can't be large enough to cause
-		 * overflow in the carry propagation step (if any), either.
+		 * some care.  Much as with the argument for dividend[qi] not
+		 * overflowing, if we consider the first two terms in the numerator
+		 * and denominator of qdigit, we can see that the final value of
+		 * dividend[qi + 1] will be approximately a remainder mod
+		 * (divisor[0]*NBASE^2 + divisor[1]).  Accounting for the lower-order
+		 * terms is a bit complicated but ends up adding not much more than
+		 * PG_INT64_MAX/NBASE^2 to the possible range.  Thus, dividend[qi + 1]
+		 * cannot overflow here, and in its role as dividend[qi] in the next
+		 * loop iteration, it can't be large enough to cause overflow in the
+		 * carry propagation step (if any), either.
 		 *
-		 * But having said that: div[qi] can be more than INT_MAX/NBASE, as
-		 * noted above, which means that the product div[qi] * NBASE *can*
-		 * overflow.  When that happens, adding it to div[qi + 1] will always
-		 * cause a canceling overflow so that the end result is correct.  We
-		 * could avoid the intermediate overflow by doing the multiplication
-		 * and addition in int64 arithmetic, but so far there appears no need.
+		 * But having said that: dividend[qi] can be more than
+		 * PG_INT64_MAX/NBASE^2, as noted above, which means that the product
+		 * dividend[qi] * NBASE^2 *can* overflow.  When that happens, adding
+		 * it to dividend[qi + 1] will always cause a canceling overflow so
+		 * that the end result is correct.  We could avoid the intermediate
+		 * overflow by doing the multiplication and addition using unsigned
+		 * int64 arithmetic, which is modulo 2^64, but so far there appears no
+		 * need.
 		 */
-		div[qi + 1] += div[qi] * NBASE;
+		dividend[qi + 1] += dividend[qi] * NBASE_SQR;
 
-		div[qi] = qdigit;
+		dividend[qi] = qdigit;
 	}
 
 	/*
-	 * Approximate and store the last quotient digit (div[div_ndigits])
+	 * If an exact result was requested, use the remainder to correct the
+	 * approximate quotient.  The remainder is in dividend[], immediately
+	 * after the quotient digits.  Note, however, that although the remainder
+	 * starts at dividend[qi], the first digit is the result of folding two
+	 * remainder digits into one above, so it needs to be expanded down by one
+	 * digit when it is normalized.  Doing so shifts the least significant
+	 * base-NBASE^2 digit out of the working dividend (which is why we
+	 * allocated space for one extra digit above).  For the purposes of
+	 * correcting the quotient, only the first var2ndigitpairs base-NBASE^2
+	 * digits of the remainder matter.
 	 */
-	fdividend = (double) div[qi];
-	for (i = 1; i < 4; i++)
-		fdividend *= NBASE;
-	fquotient = fdividend * fdivisorinverse;
-	qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
-		(((int) fquotient) - 1);	/* truncate towards -infinity */
-	div[qi] = qdigit;
+	if (exact)
+	{
+		/* Normalize the remainder, expanding it down by one digit */
+		remainder = &dividend[qi];
+		carry = 0;
+		for (i = var2ndigitpairs - 1; i >= 0; i--)
+		{
+			newdig = remainder[i] + carry;
+			if (newdig < 0)
+			{
+				carry = -((-newdig - 1) / NBASE_SQR) - 1;
+				newdig -= carry * NBASE_SQR;
+			}
+			else if (newdig >= NBASE_SQR)
+			{
+				carry = newdig / NBASE_SQR;
+				newdig -= carry * NBASE_SQR;
+			}
+			else
+				carry = 0;
+			remainder[i + 1] = newdig;
+		}
+		remainder[0] = carry;
+
+		if (remainder[0] < 0)
+		{
+			/*
+			 * The remainder is negative, so the approximate quotient is too
+			 * large.  Correct by reducing the quotient by one and adding the
+			 * divisor to the remainder until the remainder is positive.  We
+			 * expect the quotient to be off by at most one, which has been
+			 * borne out in all testing, but not conclusively proven, so we
+			 * allow for larger corrections, just in case.
+			 */
+			do
+			{
+				/* Add the divisor to the remainder */
+				carry = 0;
+				for (i = var2ndigitpairs - 1; i > 0; i--)
+				{
+					newdig = remainder[i] + divisor[i] + carry;
+					if (newdig >= NBASE_SQR)
+					{
+						remainder[i] = newdig - NBASE_SQR;
+						carry = 1;
+					}
+					else
+					{
+						remainder[i] = newdig;
+						carry = 0;
+					}
+				}
+				remainder[0] += divisor[0] + carry;
+
+				/* Subtract 1 from the quotient (propagating carries later) */
+				dividend[qi - 1]--;
+
+			} while (remainder[0] < 0);
+		}
+		else
+		{
+			/*
+			 * The remainder is nonnegative.  If it's greater than or equal to
+			 * the divisor, then the approximate quotient is too small and
+			 * must be corrected.  As above, we don't expect to have to apply
+			 * more than one correction, but allow for it just in case.
+			 */
+			while (true)
+			{
+				bool		less = false;
+
+				/* Is remainder < divisor? */
+				for (i = 0; i < var2ndigitpairs; i++)
+				{
+					if (remainder[i] < divisor[i])
+					{
+						less = true;
+						break;
+					}
+					if (remainder[i] > divisor[i])
+						break;	/* remainder > divisor */
+				}
+				if (less)
+					break;		/* quotient is correct */
+
+				/* Subtract the divisor from the remainder */
+				carry = 0;
+				for (i = var2ndigitpairs - 1; i > 0; i--)
+				{
+					newdig = remainder[i] - divisor[i] + carry;
+					if (newdig < 0)
+					{
+						remainder[i] = newdig + NBASE_SQR;
+						carry = -1;
+					}
+					else
+					{
+						remainder[i] = newdig;
+						carry = 0;
+					}
+				}
+				remainder[0] = remainder[0] - divisor[0] + carry;
+
+				/* Add 1 to the quotient (propagating carries later) */
+				dividend[qi - 1]++;
+			}
+		}
+	}
 
 	/*
-	 * Because the quotient digits might be off by one, some of them might be
-	 * -1 or NBASE at this point.  The represented value is correct in a
-	 * mathematical sense, but it doesn't look right.  We do a final carry
-	 * propagation pass to normalize the digits, which we combine with storing
-	 * the result digits into the output.  Note that this is still done at
-	 * full precision w/guard digits.
+	 * Because the quotient digits were estimates that might have been off by
+	 * one (and we didn't bother propagating carries when adjusting the
+	 * quotient above), some quotient digits might be out of range, so do a
+	 * final carry propagation pass to normalize back to base NBASE^2, and
+	 * construct the base-NBASE result digits.  Note that this is still done
+	 * at full precision w/guard digits.
 	 */
-	alloc_var(result, div_ndigits + 1);
+	alloc_var(result, res_ndigits);
 	res_digits = result->digits;
 	carry = 0;
-	for (i = div_ndigits; i >= 0; i--)
+	for (i = res_ndigitpairs - 1; i >= 0; i--)
 	{
-		newdig = div[i] + carry;
+		newdig = dividend[i] + carry;
 		if (newdig < 0)
 		{
-			carry = -((-newdig - 1) / NBASE) - 1;
-			newdig -= carry * NBASE;
+			carry = -((-newdig - 1) / NBASE_SQR) - 1;
+			newdig -= carry * NBASE_SQR;
 		}
-		else if (newdig >= NBASE)
+		else if (newdig >= NBASE_SQR)
 		{
-			carry = newdig / NBASE;
-			newdig -= carry * NBASE;
+			carry = newdig / NBASE_SQR;
+			newdig -= carry * NBASE_SQR;
 		}
 		else
 			carry = 0;
-		res_digits[i] = newdig;
+		res_digits[2 * i + 1] = (NumericDigit) ((uint32) newdig % NBASE);
+		res_digits[2 * i] = (NumericDigit) ((uint32) newdig / NBASE);
 	}
 	Assert(carry == 0);
 
-	pfree(div);
+	pfree(dividend);
 
 	/*
-	 * Finally, round the result to the requested precision.
+	 * Finally, round or truncate the result to the requested precision.
 	 */
 	result->weight = res_weight;
 	result->sign = res_sign;
 
-	/* Round to target rscale (and set result->dscale) */
+	/* Round or truncate to target rscale (and set result->dscale) */
 	if (round)
 		round_var(result, rscale);
 	else
@@ -10216,7 +10091,7 @@ mod_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
 	 * div_var can be persuaded to give us trunc(x/y) directly.
 	 * ----------
 	 */
-	div_var(var1, var2, &tmp, 0, false);
+	div_var(var1, var2, &tmp, 0, false, true);
 
 	mul_var(var2, &tmp, &tmp, var2->dscale);
 
@@ -10243,11 +10118,11 @@ div_mod_var(const NumericVar *var1, const NumericVar *var2,
 	init_var(&r);
 
 	/*
-	 * Use div_var_fast() to get an initial estimate for the integer quotient.
-	 * This might be inaccurate (per the warning in div_var_fast's comments),
-	 * but we can correct it below.
+	 * Use div_var() with exact = false to get an initial estimate for the
+	 * integer quotient (truncated towards zero).  This might be slightly
+	 * inaccurate, but we correct it below.
 	 */
-	div_var_fast(var1, var2, &q, 0, false);
+	div_var(var1, var2, &q, 0, false, false);
 
 	/* Compute initial estimate of remainder using the quotient estimate. */
 	mul_var(var2, &q, &r, var2->dscale);
@@ -11190,7 +11065,7 @@ ln_var(const NumericVar *arg, NumericVar *result, int rscale)
 
 	sub_var(&x, &const_one, result);
 	add_var(&x, &const_one, &elem);
-	div_var_fast(result, &elem, result, local_rscale, true);
+	div_var(result, &elem, result, local_rscale, true, false);
 	set_var_from_var(result, &xx);
 	mul_var(result, result, &x, local_rscale);
 
@@ -11274,7 +11149,7 @@ log_var(const NumericVar *base, const NumericVar *num, NumericVar *result)
 	ln_var(num, &ln_num, ln_num_rscale);
 
 	/* Divide and round to the required scale */
-	div_var_fast(&ln_num, &ln_base, result, rscale, true);
+	div_var(&ln_num, &ln_base, result, rscale, true, false);
 
 	free_var(&ln_num);
 	free_var(&ln_base);
@@ -11536,7 +11411,7 @@ power_var_int(const NumericVar *base, int exp, int exp_dscale,
 			round_var(result, rscale);
 			return;
 		case -1:
-			div_var(&const_one, base, result, rscale, true);
+			div_var(&const_one, base, result, rscale, true, true);
 			return;
 		case 2:
 			mul_var(base, base, result, rscale);
@@ -11644,7 +11519,7 @@ power_var_int(const NumericVar *base, int exp, int exp_dscale,
 
 	/* Compensate for input sign, and round to requested rscale */
 	if (neg)
-		div_var_fast(&const_one, result, result, rscale, true);
+		div_var(&const_one, result, result, rscale, true, false);
 	else
 		round_var(result, rscale);
 }
-- 
2.43.0

v2-0002-Test-code-for-div_var.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Test-code-for-div_var.patchDownload

From 9f7350a04c04c88a587f151e1cd6a7c55d9aabc8 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rasheed@gmail.com>
Date: Fri, 23 Aug 2024 09:58:53 +0100
Subject: [PATCH v2 2/2] Test code for div_var().

This adds 3 SQL-callable functions for comparing the old div_var() and
div_var_fast() functions with the new div_var() function:

- div_knuth(num1 numeric, num2 numeric, rscale int, round bool)
- div_fast(num1 numeric, num2 numeric, rscale int, round bool)
- div_new(num1 numeric, num2 numeric, rscale int, round bool, exact bool)

Not intended to be pushed to master.
---
 src/backend/utils/adt/numeric.c | 1064 +++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat |   12 +
 2 files changed, 1076 insertions(+)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 46ac04035e..b59403c23f 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -12498,3 +12498,1067 @@ accum_sum_combine(NumericSumAccum *accum, NumericSumAccum *accum2)
 
 	free_var(&tmp_var);
 }
+
+
+/* ================= Test code, not intended for commit ================= */
+
+
+extern Numeric numeric_div_knuth_opt_error(Numeric num1, Numeric num2,
+										   int rscale, bool round,
+										   bool *have_error);
+extern Numeric numeric_div_fast_opt_error(Numeric num1, Numeric num2,
+										  int rscale, bool round,
+										  bool *have_error);
+extern Numeric numeric_div_new_opt_error(Numeric num1, Numeric num2,
+										 int rscale, bool round, bool exact,
+										 bool *have_error);
+
+
+/*
+ * div_var_knuth() -
+ *
+ *	Division on variable level. Quotient of var1 / var2 is stored in result.
+ *	The quotient is figured to exactly rscale fractional digits.
+ *	If round is true, it is rounded at the rscale'th digit; if false, it
+ *	is truncated (towards zero) at that digit.
+ */
+static void
+div_var_knuth(const NumericVar *var1, const NumericVar *var2,
+			  NumericVar *result, int rscale, bool round)
+{
+	int			div_ndigits;
+	int			res_ndigits;
+	int			res_sign;
+	int			res_weight;
+	int			carry;
+	int			borrow;
+	int			divisor1;
+	int			divisor2;
+	NumericDigit *dividend;
+	NumericDigit *divisor;
+	NumericDigit *res_digits;
+	int			i;
+	int			j;
+
+	/* copy these values into local vars for speed in inner loop */
+	int			var1ndigits = var1->ndigits;
+	int			var2ndigits = var2->ndigits;
+
+	/*
+	 * First of all division by zero check; we must not be handed an
+	 * unnormalized divisor.
+	 */
+	if (var2ndigits == 0 || var2->digits[0] == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DIVISION_BY_ZERO),
+				 errmsg("division by zero")));
+
+	/*
+	 * If the divisor has just one or two digits, delegate to div_var_int(),
+	 * which uses fast short division.
+	 *
+	 * Similarly, on platforms with 128-bit integer support, delegate to
+	 * div_var_int64() for divisors with three or four digits.
+	 */
+	if (var2ndigits <= 2)
+	{
+		int			idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		if (var2ndigits == 2)
+		{
+			idivisor = idivisor * NBASE + var2->digits[1];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#ifdef HAVE_INT128
+	if (var2ndigits <= 4)
+	{
+		int64		idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		for (i = 1; i < var2ndigits; i++)
+		{
+			idivisor = idivisor * NBASE + var2->digits[i];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#endif
+
+	/*
+	 * Otherwise, perform full long division.
+	 */
+
+	/* Result zero check */
+	if (var1ndigits == 0)
+	{
+		zero_var(result);
+		result->dscale = rscale;
+		return;
+	}
+
+	/*
+	 * Determine the result sign, weight and number of digits to calculate.
+	 * The weight figured here is correct if the emitted quotient has no
+	 * leading zero digits; otherwise strip_var() will fix things up.
+	 */
+	if (var1->sign == var2->sign)
+		res_sign = NUMERIC_POS;
+	else
+		res_sign = NUMERIC_NEG;
+	res_weight = var1->weight - var2->weight;
+	/* The number of accurate result digits we need to produce: */
+	res_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
+	/* ... but always at least 1 */
+	res_ndigits = Max(res_ndigits, 1);
+	/* If rounding needed, figure one more digit to ensure correct result */
+	if (round)
+		res_ndigits++;
+
+	/*
+	 * The working dividend normally requires res_ndigits + var2ndigits
+	 * digits, but make it at least var1ndigits so we can load all of var1
+	 * into it.  (There will be an additional digit dividend[0] in the
+	 * dividend space, but for consistency with Knuth's notation we don't
+	 * count that in div_ndigits.)
+	 */
+	div_ndigits = res_ndigits + var2ndigits;
+	div_ndigits = Max(div_ndigits, var1ndigits);
+
+	/*
+	 * We need a workspace with room for the working dividend (div_ndigits+1
+	 * digits) plus room for the possibly-normalized divisor (var2ndigits
+	 * digits).  It is convenient also to have a zero at divisor[0] with the
+	 * actual divisor data in divisor[1 .. var2ndigits].  Transferring the
+	 * digits into the workspace also allows us to realloc the result (which
+	 * might be the same as either input var) before we begin the main loop.
+	 * Note that we use palloc0 to ensure that divisor[0], dividend[0], and
+	 * any additional dividend positions beyond var1ndigits, start out 0.
+	 */
+	dividend = (NumericDigit *)
+		palloc0((div_ndigits + var2ndigits + 2) * sizeof(NumericDigit));
+	divisor = dividend + (div_ndigits + 1);
+	memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
+	memcpy(divisor + 1, var2->digits, var2ndigits * sizeof(NumericDigit));
+
+	/*
+	 * Now we can realloc the result to hold the generated quotient digits.
+	 */
+	alloc_var(result, res_ndigits);
+	res_digits = result->digits;
+
+	/*
+	 * The full multiple-place algorithm is taken from Knuth volume 2,
+	 * Algorithm 4.3.1D.
+	 *
+	 * We need the first divisor digit to be >= NBASE/2.  If it isn't, make it
+	 * so by scaling up both the divisor and dividend by the factor "d".  (The
+	 * reason for allocating dividend[0] above is to leave room for possible
+	 * carry here.)
+	 */
+	if (divisor[1] < HALF_NBASE)
+	{
+		int			d = NBASE / (divisor[1] + 1);
+
+		carry = 0;
+		for (i = var2ndigits; i > 0; i--)
+		{
+			carry += divisor[i] * d;
+			divisor[i] = carry % NBASE;
+			carry = carry / NBASE;
+		}
+		Assert(carry == 0);
+		carry = 0;
+		/* at this point only var1ndigits of dividend can be nonzero */
+		for (i = var1ndigits; i >= 0; i--)
+		{
+			carry += dividend[i] * d;
+			dividend[i] = carry % NBASE;
+			carry = carry / NBASE;
+		}
+		Assert(carry == 0);
+		Assert(divisor[1] >= HALF_NBASE);
+	}
+	/* First 2 divisor digits are used repeatedly in main loop */
+	divisor1 = divisor[1];
+	divisor2 = divisor[2];
+
+	/*
+	 * Begin the main loop.  Each iteration of this loop produces the j'th
+	 * quotient digit by dividing dividend[j .. j + var2ndigits] by the
+	 * divisor; this is essentially the same as the common manual procedure
+	 * for long division.
+	 */
+	for (j = 0; j < res_ndigits; j++)
+	{
+		/* Estimate quotient digit from the first two dividend digits */
+		int			next2digits = dividend[j] * NBASE + dividend[j + 1];
+		int			qhat;
+
+		/*
+		 * If next2digits are 0, then quotient digit must be 0 and there's no
+		 * need to adjust the working dividend.  It's worth testing here to
+		 * fall out ASAP when processing trailing zeroes in a dividend.
+		 */
+		if (next2digits == 0)
+		{
+			res_digits[j] = 0;
+			continue;
+		}
+
+		if (dividend[j] == divisor1)
+			qhat = NBASE - 1;
+		else
+			qhat = next2digits / divisor1;
+
+		/*
+		 * Adjust quotient digit if it's too large.  Knuth proves that after
+		 * this step, the quotient digit will be either correct or just one
+		 * too large.  (Note: it's OK to use dividend[j+2] here because we
+		 * know the divisor length is at least 2.)
+		 */
+		while (divisor2 * qhat >
+			   (next2digits - qhat * divisor1) * NBASE + dividend[j + 2])
+			qhat--;
+
+		/* As above, need do nothing more when quotient digit is 0 */
+		if (qhat > 0)
+		{
+			NumericDigit *dividend_j = &dividend[j];
+
+			/*
+			 * Multiply the divisor by qhat, and subtract that from the
+			 * working dividend.  The multiplication and subtraction are
+			 * folded together here, noting that qhat <= NBASE (since it might
+			 * be one too large), and so the intermediate result "tmp_result"
+			 * is in the range [-NBASE^2, NBASE - 1], and "borrow" is in the
+			 * range [0, NBASE].
+			 */
+			borrow = 0;
+			for (i = var2ndigits; i >= 0; i--)
+			{
+				int			tmp_result;
+
+				tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
+				borrow = (NBASE - 1 - tmp_result) / NBASE;
+				dividend_j[i] = tmp_result + borrow * NBASE;
+			}
+
+			/*
+			 * If we got a borrow out of the top dividend digit, then indeed
+			 * qhat was one too large.  Fix it, and add back the divisor to
+			 * correct the working dividend.  (Knuth proves that this will
+			 * occur only about 3/NBASE of the time; hence, it's a good idea
+			 * to test this code with small NBASE to be sure this section gets
+			 * exercised.)
+			 */
+			if (borrow)
+			{
+				qhat--;
+				carry = 0;
+				for (i = var2ndigits; i >= 0; i--)
+				{
+					carry += dividend_j[i] + divisor[i];
+					if (carry >= NBASE)
+					{
+						dividend_j[i] = carry - NBASE;
+						carry = 1;
+					}
+					else
+					{
+						dividend_j[i] = carry;
+						carry = 0;
+					}
+				}
+				/* A carry should occur here to cancel the borrow above */
+				Assert(carry == 1);
+			}
+		}
+
+		/* And we're done with this quotient digit */
+		res_digits[j] = qhat;
+	}
+
+	pfree(dividend);
+
+	/*
+	 * Finally, round or truncate the result to the requested precision.
+	 */
+	result->weight = res_weight;
+	result->sign = res_sign;
+
+	/* Round or truncate to target rscale (and set result->dscale) */
+	if (round)
+		round_var(result, rscale);
+	else
+		trunc_var(result, rscale);
+
+	/* Strip leading and trailing zeroes */
+	strip_var(result);
+}
+
+
+/*
+ * div_var_fast() -
+ *
+ *	This has the same API as div_var, but is implemented using the division
+ *	algorithm from the "FM" library, rather than Knuth's schoolbook-division
+ *	approach.  This is significantly faster but can produce inaccurate
+ *	results, because it sometimes has to propagate rounding to the left,
+ *	and so we can never be entirely sure that we know the requested digits
+ *	exactly.  We compute DIV_GUARD_DIGITS extra digits, but there is
+ *	no certainty that that's enough.  We use this only in the transcendental
+ *	function calculation routines, where everything is approximate anyway.
+ *
+ *	Although we provide a "round" argument for consistency with div_var,
+ *	it is unwise to use this function with round=false.  In truncation mode
+ *	it is possible to get a result with no significant digits, for example
+ *	with rscale=0 we might compute 0.99999... and truncate that to 0 when
+ *	the correct answer is 1.
+ */
+static void
+div_var_fast(const NumericVar *var1, const NumericVar *var2,
+			 NumericVar *result, int rscale, bool round)
+{
+	int			div_ndigits;
+	int			load_ndigits;
+	int			res_sign;
+	int			res_weight;
+	int		   *div;
+	int			qdigit;
+	int			carry;
+	int			maxdiv;
+	int			newdig;
+	NumericDigit *res_digits;
+	double		fdividend,
+				fdivisor,
+				fdivisorinverse,
+				fquotient;
+	int			qi;
+	int			i;
+
+	/* copy these values into local vars for speed in inner loop */
+	int			var1ndigits = var1->ndigits;
+	int			var2ndigits = var2->ndigits;
+	NumericDigit *var1digits = var1->digits;
+	NumericDigit *var2digits = var2->digits;
+
+	/*
+	 * First of all division by zero check; we must not be handed an
+	 * unnormalized divisor.
+	 */
+	if (var2ndigits == 0 || var2digits[0] == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_DIVISION_BY_ZERO),
+				 errmsg("division by zero")));
+
+	/*
+	 * If the divisor has just one or two digits, delegate to div_var_int(),
+	 * which uses fast short division.
+	 *
+	 * Similarly, on platforms with 128-bit integer support, delegate to
+	 * div_var_int64() for divisors with three or four digits.
+	 */
+	if (var2ndigits <= 2)
+	{
+		int			idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		if (var2ndigits == 2)
+		{
+			idivisor = idivisor * NBASE + var2->digits[1];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#ifdef HAVE_INT128
+	if (var2ndigits <= 4)
+	{
+		int64		idivisor;
+		int			idivisor_weight;
+
+		idivisor = var2->digits[0];
+		idivisor_weight = var2->weight;
+		for (i = 1; i < var2ndigits; i++)
+		{
+			idivisor = idivisor * NBASE + var2->digits[i];
+			idivisor_weight--;
+		}
+		if (var2->sign == NUMERIC_NEG)
+			idivisor = -idivisor;
+
+		div_var_int64(var1, idivisor, idivisor_weight, result, rscale, round);
+		return;
+	}
+#endif
+
+	/*
+	 * Otherwise, perform full long division.
+	 */
+
+	/* Result zero check */
+	if (var1ndigits == 0)
+	{
+		zero_var(result);
+		result->dscale = rscale;
+		return;
+	}
+
+	/*
+	 * Determine the result sign, weight and number of digits to calculate
+	 */
+	if (var1->sign == var2->sign)
+		res_sign = NUMERIC_POS;
+	else
+		res_sign = NUMERIC_NEG;
+	res_weight = var1->weight - var2->weight + 1;
+	/* The number of accurate result digits we need to produce: */
+	div_ndigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS;
+	/* Add guard digits for roundoff error */
+	div_ndigits += DIV_GUARD_DIGITS;
+	if (div_ndigits < DIV_GUARD_DIGITS)
+		div_ndigits = DIV_GUARD_DIGITS;
+
+	/*
+	 * We do the arithmetic in an array "div[]" of signed int's.  Since
+	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+	 * to avoid normalizing carries immediately.
+	 *
+	 * We start with div[] containing one zero digit followed by the
+	 * dividend's digits (plus appended zeroes to reach the desired precision
+	 * including guard digits).  Each step of the main loop computes an
+	 * (approximate) quotient digit and stores it into div[], removing one
+	 * position of dividend space.  A final pass of carry propagation takes
+	 * care of any mistaken quotient digits.
+	 *
+	 * Note that div[] doesn't necessarily contain all of the digits from the
+	 * dividend --- the desired precision plus guard digits might be less than
+	 * the dividend's precision.  This happens, for example, in the square
+	 * root algorithm, where we typically divide a 2N-digit number by an
+	 * N-digit number, and only require a result with N digits of precision.
+	 */
+	div = (int *) palloc0((div_ndigits + 1) * sizeof(int));
+	load_ndigits = Min(div_ndigits, var1ndigits);
+	for (i = 0; i < load_ndigits; i++)
+		div[i + 1] = var1digits[i];
+
+	/*
+	 * We estimate each quotient digit using floating-point arithmetic, taking
+	 * the first four digits of the (current) dividend and divisor.  This must
+	 * be float to avoid overflow.  The quotient digits will generally be off
+	 * by no more than one from the exact answer.
+	 */
+	fdivisor = (double) var2digits[0];
+	for (i = 1; i < 4; i++)
+	{
+		fdivisor *= NBASE;
+		if (i < var2ndigits)
+			fdivisor += (double) var2digits[i];
+	}
+	fdivisorinverse = 1.0 / fdivisor;
+
+	/*
+	 * maxdiv tracks the maximum possible absolute value of any div[] entry;
+	 * when this threatens to exceed INT_MAX, we take the time to propagate
+	 * carries.  Furthermore, we need to ensure that overflow doesn't occur
+	 * during the carry propagation passes either.  The carry values may have
+	 * an absolute value as high as INT_MAX/NBASE + 1, so really we must
+	 * normalize when digits threaten to exceed INT_MAX - INT_MAX/NBASE - 1.
+	 *
+	 * To avoid overflow in maxdiv itself, it represents the max absolute
+	 * value divided by NBASE-1, ie, at the top of the loop it is known that
+	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+	 *
+	 * Actually, though, that holds good only for div[] entries after div[qi];
+	 * the adjustment done at the bottom of the loop may cause div[qi + 1] to
+	 * exceed the maxdiv limit, so that div[qi] in the next iteration is
+	 * beyond the limit.  This does not cause problems, as explained below.
+	 */
+	maxdiv = 1;
+
+	/*
+	 * Outer loop computes next quotient digit, which will go into div[qi]
+	 */
+	for (qi = 0; qi < div_ndigits; qi++)
+	{
+		/* Approximate the current dividend value */
+		fdividend = (double) div[qi];
+		for (i = 1; i < 4; i++)
+		{
+			fdividend *= NBASE;
+			if (qi + i <= div_ndigits)
+				fdividend += (double) div[qi + i];
+		}
+		/* Compute the (approximate) quotient digit */
+		fquotient = fdividend * fdivisorinverse;
+		qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
+			(((int) fquotient) - 1);	/* truncate towards -infinity */
+
+		if (qdigit != 0)
+		{
+			/* Do we need to normalize now? */
+			maxdiv += abs(qdigit);
+			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
+			{
+				/*
+				 * Yes, do it.  Note that if var2ndigits is much smaller than
+				 * div_ndigits, we can save a significant amount of effort
+				 * here by noting that we only need to normalise those div[]
+				 * entries touched where prior iterations subtracted multiples
+				 * of the divisor.
+				 */
+				carry = 0;
+				for (i = Min(qi + var2ndigits - 2, div_ndigits); i > qi; i--)
+				{
+					newdig = div[i] + carry;
+					if (newdig < 0)
+					{
+						carry = -((-newdig - 1) / NBASE) - 1;
+						newdig -= carry * NBASE;
+					}
+					else if (newdig >= NBASE)
+					{
+						carry = newdig / NBASE;
+						newdig -= carry * NBASE;
+					}
+					else
+						carry = 0;
+					div[i] = newdig;
+				}
+				newdig = div[qi] + carry;
+				div[qi] = newdig;
+
+				/*
+				 * All the div[] digits except possibly div[qi] are now in the
+				 * range 0..NBASE-1.  We do not need to consider div[qi] in
+				 * the maxdiv value anymore, so we can reset maxdiv to 1.
+				 */
+				maxdiv = 1;
+
+				/*
+				 * Recompute the quotient digit since new info may have
+				 * propagated into the top four dividend digits
+				 */
+				fdividend = (double) div[qi];
+				for (i = 1; i < 4; i++)
+				{
+					fdividend *= NBASE;
+					if (qi + i <= div_ndigits)
+						fdividend += (double) div[qi + i];
+				}
+				/* Compute the (approximate) quotient digit */
+				fquotient = fdividend * fdivisorinverse;
+				qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
+					(((int) fquotient) - 1);	/* truncate towards -infinity */
+				maxdiv += abs(qdigit);
+			}
+
+			/*
+			 * Subtract off the appropriate multiple of the divisor.
+			 *
+			 * The digits beyond div[qi] cannot overflow, because we know they
+			 * will fall within the maxdiv limit.  As for div[qi] itself, note
+			 * that qdigit is approximately trunc(div[qi] / vardigits[0]),
+			 * which would make the new value simply div[qi] mod vardigits[0].
+			 * The lower-order terms in qdigit can change this result by not
+			 * more than about twice INT_MAX/NBASE, so overflow is impossible.
+			 *
+			 * This inner loop is the performance bottleneck for division, so
+			 * code it in the same way as the inner loop of mul_var() so that
+			 * it can be auto-vectorized.  We cast qdigit to NumericDigit
+			 * before multiplying to allow the compiler to generate more
+			 * efficient code (using 16-bit multiplication), which is safe
+			 * since we know that the quotient digit is off by at most one, so
+			 * there is no overflow risk.
+			 */
+			if (qdigit != 0)
+			{
+				int			istop = Min(var2ndigits, div_ndigits - qi + 1);
+				int		   *div_qi = &div[qi];
+
+				for (i = 0; i < istop; i++)
+					div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i];
+			}
+		}
+
+		/*
+		 * The dividend digit we are about to replace might still be nonzero.
+		 * Fold it into the next digit position.
+		 *
+		 * There is no risk of overflow here, although proving that requires
+		 * some care.  Much as with the argument for div[qi] not overflowing,
+		 * if we consider the first two terms in the numerator and denominator
+		 * of qdigit, we can see that the final value of div[qi + 1] will be
+		 * approximately a remainder mod (vardigits[0]*NBASE + vardigits[1]).
+		 * Accounting for the lower-order terms is a bit complicated but ends
+		 * up adding not much more than INT_MAX/NBASE to the possible range.
+		 * Thus, div[qi + 1] cannot overflow here, and in its role as div[qi]
+		 * in the next loop iteration, it can't be large enough to cause
+		 * overflow in the carry propagation step (if any), either.
+		 *
+		 * But having said that: div[qi] can be more than INT_MAX/NBASE, as
+		 * noted above, which means that the product div[qi] * NBASE *can*
+		 * overflow.  When that happens, adding it to div[qi + 1] will always
+		 * cause a canceling overflow so that the end result is correct.  We
+		 * could avoid the intermediate overflow by doing the multiplication
+		 * and addition in int64 arithmetic, but so far there appears no need.
+		 */
+		div[qi + 1] += div[qi] * NBASE;
+
+		div[qi] = qdigit;
+	}
+
+	/*
+	 * Approximate and store the last quotient digit (div[div_ndigits])
+	 */
+	fdividend = (double) div[qi];
+	for (i = 1; i < 4; i++)
+		fdividend *= NBASE;
+	fquotient = fdividend * fdivisorinverse;
+	qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
+		(((int) fquotient) - 1);	/* truncate towards -infinity */
+	div[qi] = qdigit;
+
+	/*
+	 * Because the quotient digits might be off by one, some of them might be
+	 * -1 or NBASE at this point.  The represented value is correct in a
+	 * mathematical sense, but it doesn't look right.  We do a final carry
+	 * propagation pass to normalize the digits, which we combine with storing
+	 * the result digits into the output.  Note that this is still done at
+	 * full precision w/guard digits.
+	 */
+	alloc_var(result, div_ndigits + 1);
+	res_digits = result->digits;
+	carry = 0;
+	for (i = div_ndigits; i >= 0; i--)
+	{
+		newdig = div[i] + carry;
+		if (newdig < 0)
+		{
+			carry = -((-newdig - 1) / NBASE) - 1;
+			newdig -= carry * NBASE;
+		}
+		else if (newdig >= NBASE)
+		{
+			carry = newdig / NBASE;
+			newdig -= carry * NBASE;
+		}
+		else
+			carry = 0;
+		res_digits[i] = newdig;
+	}
+	Assert(carry == 0);
+
+	pfree(div);
+
+	/*
+	 * Finally, round the result to the requested precision.
+	 */
+	result->weight = res_weight;
+	result->sign = res_sign;
+
+	/* Round to target rscale (and set result->dscale) */
+	if (round)
+		round_var(result, rscale);
+	else
+		trunc_var(result, rscale);
+
+	/* Strip leading and trailing zeroes */
+	strip_var(result);
+}
+
+
+Numeric
+numeric_div_knuth_opt_error(Numeric num1, Numeric num2, int rscale,
+							bool round, bool *have_error)
+{
+	NumericVar	arg1;
+	NumericVar	arg2;
+	NumericVar	result;
+	Numeric		res;
+
+	if (have_error)
+		*have_error = false;
+
+	/*
+	 * Handle NaN and infinities
+	 */
+	if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+	{
+		if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+			return make_result(&const_nan);
+		if (NUMERIC_IS_PINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_pinf);
+				case -1:
+					return make_result(&const_ninf);
+			}
+			Assert(false);
+		}
+		if (NUMERIC_IS_NINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* -Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_ninf);
+				case -1:
+					return make_result(&const_pinf);
+			}
+			Assert(false);
+		}
+		/* by here, num1 must be finite, so num2 is not */
+
+		/*
+		 * POSIX would have us return zero or minus zero if num1 is zero, and
+		 * otherwise throw an underflow error.  But the numeric type doesn't
+		 * really do underflow, so let's just return zero.
+		 */
+		return make_result(&const_zero);
+	}
+
+	/*
+	 * Unpack the arguments
+	 */
+	init_var_from_num(num1, &arg1);
+	init_var_from_num(num2, &arg2);
+
+	init_var(&result);
+
+	/*
+	 * If "have_error" is provided, check for division by zero here
+	 */
+	if (have_error && (arg2.ndigits == 0 || arg2.digits[0] == 0))
+	{
+		*have_error = true;
+		return NULL;
+	}
+
+	/*
+	 * Do the divide and return the result
+	 */
+	div_var_knuth(&arg1, &arg2, &result, rscale, round);
+
+	res = make_result_opt_error(&result, have_error);
+
+	free_var(&result);
+
+	return res;
+}
+
+
+Numeric
+numeric_div_fast_opt_error(Numeric num1, Numeric num2, int rscale,
+						   bool round, bool *have_error)
+{
+	NumericVar	arg1;
+	NumericVar	arg2;
+	NumericVar	result;
+	Numeric		res;
+
+	if (have_error)
+		*have_error = false;
+
+	/*
+	 * Handle NaN and infinities
+	 */
+	if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+	{
+		if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+			return make_result(&const_nan);
+		if (NUMERIC_IS_PINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_pinf);
+				case -1:
+					return make_result(&const_ninf);
+			}
+			Assert(false);
+		}
+		if (NUMERIC_IS_NINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* -Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_ninf);
+				case -1:
+					return make_result(&const_pinf);
+			}
+			Assert(false);
+		}
+		/* by here, num1 must be finite, so num2 is not */
+
+		/*
+		 * POSIX would have us return zero or minus zero if num1 is zero, and
+		 * otherwise throw an underflow error.  But the numeric type doesn't
+		 * really do underflow, so let's just return zero.
+		 */
+		return make_result(&const_zero);
+	}
+
+	/*
+	 * Unpack the arguments
+	 */
+	init_var_from_num(num1, &arg1);
+	init_var_from_num(num2, &arg2);
+
+	init_var(&result);
+
+	/*
+	 * If "have_error" is provided, check for division by zero here
+	 */
+	if (have_error && (arg2.ndigits == 0 || arg2.digits[0] == 0))
+	{
+		*have_error = true;
+		return NULL;
+	}
+
+	/*
+	 * Do the divide and return the result
+	 */
+	div_var_fast(&arg1, &arg2, &result, rscale, round);
+
+	res = make_result_opt_error(&result, have_error);
+
+	free_var(&result);
+
+	return res;
+}
+
+
+Numeric
+numeric_div_new_opt_error(Numeric num1, Numeric num2, int rscale,
+						  bool round, bool exact, bool *have_error)
+{
+	NumericVar	arg1;
+	NumericVar	arg2;
+	NumericVar	result;
+	Numeric		res;
+
+	if (have_error)
+		*have_error = false;
+
+	/*
+	 * Handle NaN and infinities
+	 */
+	if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+	{
+		if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+			return make_result(&const_nan);
+		if (NUMERIC_IS_PINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_pinf);
+				case -1:
+					return make_result(&const_ninf);
+			}
+			Assert(false);
+		}
+		if (NUMERIC_IS_NINF(num1))
+		{
+			if (NUMERIC_IS_SPECIAL(num2))
+				return make_result(&const_nan); /* -Inf / [-]Inf */
+			switch (numeric_sign_internal(num2))
+			{
+				case 0:
+					if (have_error)
+					{
+						*have_error = true;
+						return NULL;
+					}
+					ereport(ERROR,
+							(errcode(ERRCODE_DIVISION_BY_ZERO),
+							 errmsg("division by zero")));
+					break;
+				case 1:
+					return make_result(&const_ninf);
+				case -1:
+					return make_result(&const_pinf);
+			}
+			Assert(false);
+		}
+		/* by here, num1 must be finite, so num2 is not */
+
+		/*
+		 * POSIX would have us return zero or minus zero if num1 is zero, and
+		 * otherwise throw an underflow error.  But the numeric type doesn't
+		 * really do underflow, so let's just return zero.
+		 */
+		return make_result(&const_zero);
+	}
+
+	/*
+	 * Unpack the arguments
+	 */
+	init_var_from_num(num1, &arg1);
+	init_var_from_num(num2, &arg2);
+
+	init_var(&result);
+
+	/*
+	 * If "have_error" is provided, check for division by zero here
+	 */
+	if (have_error && (arg2.ndigits == 0 || arg2.digits[0] == 0))
+	{
+		*have_error = true;
+		return NULL;
+	}
+
+	/*
+	 * Do the divide and return the result
+	 */
+	div_var(&arg1, &arg2, &result, rscale, round, exact);
+
+	res = make_result_opt_error(&result, have_error);
+
+	free_var(&result);
+
+	return res;
+}
+
+
+/*
+ * numeric_div_knuth() -
+ *
+ *	Divide one numeric into another using the old Knuth algorithm
+ */
+Datum
+numeric_div_knuth(PG_FUNCTION_ARGS)
+{
+	Numeric		num1 = PG_GETARG_NUMERIC(0);
+	Numeric		num2 = PG_GETARG_NUMERIC(1);
+	int			rscale = PG_GETARG_INT32(2);
+	bool		round = PG_GETARG_BOOL(3);
+	Numeric		res;
+
+	res = numeric_div_knuth_opt_error(num1, num2, rscale, round, NULL);
+
+	PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_div_fast() -
+ *
+ *	Divide one numeric into another using the old "fast" algorithm
+ */
+Datum
+numeric_div_fast(PG_FUNCTION_ARGS)
+{
+	Numeric		num1 = PG_GETARG_NUMERIC(0);
+	Numeric		num2 = PG_GETARG_NUMERIC(1);
+	int			rscale = PG_GETARG_INT32(2);
+	bool		round = PG_GETARG_BOOL(3);
+	Numeric		res;
+
+	res = numeric_div_fast_opt_error(num1, num2, rscale, round, NULL);
+
+	PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_div_new() -
+ *
+ *	Divide one numeric into another using the new algorithm
+ */
+Datum
+numeric_div_new(PG_FUNCTION_ARGS)
+{
+	Numeric		num1 = PG_GETARG_NUMERIC(0);
+	Numeric		num2 = PG_GETARG_NUMERIC(1);
+	int			rscale = PG_GETARG_INT32(2);
+	bool		round = PG_GETARG_BOOL(3);
+	bool		exact = PG_GETARG_BOOL(4);
+	Numeric		res;
+
+	res = numeric_div_new_opt_error(num1, num2, rscale, round, exact, NULL);
+
+	PG_RETURN_NUMERIC(res);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4abc6d9526..b4c09f5f25 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4501,6 +4501,18 @@
 { oid => '1727',
   proname => 'numeric_div', prorettype => 'numeric',
   proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
+{ oid => '9090', descr => 'test numeric division using old algorithm by Knuth',
+  proname => 'div_knuth', prorettype => 'numeric',
+  proargnames => '{num1,num2,rscale,round}',
+  proargtypes => 'numeric numeric int4 bool', prosrc => 'numeric_div_knuth' },
+{ oid => '9091', descr => 'test numeric division using old div_var_fast algorithm',
+  proname => 'div_fast', prorettype => 'numeric',
+  proargnames => '{num1,num2,rscale,round}',
+  proargtypes => 'numeric numeric int4 bool', prosrc => 'numeric_div_fast' },
+{ oid => '9092', descr => 'test numeric division using new algorithm',
+  proname => 'div_new', prorettype => 'numeric',
+  proargnames => '{num1,num2,rscale,round,exact}',
+  proargtypes => 'numeric numeric int4 bool bool', prosrc => 'numeric_div_new' },
 { oid => '1728', descr => 'modulus',
   proname => 'mod', prorettype => 'numeric', proargtypes => 'numeric numeric',
   prosrc => 'numeric_mod' },
-- 
2.43.0

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Dean Rasheed (#7)

Re: Optimising numeric division

On Sat, Aug 24, 2024, at 14:10, Dean Rasheed wrote:

On Sat, 24 Aug 2024 at 08:26, Joel Jacobson <joel@compiler.org> wrote:

On Sat, Aug 24, 2024, at 01:35, Joel Jacobson wrote:

On Sat, Aug 24, 2024, at 00:00, Joel Jacobson wrote:

Since statistical tools that rely on normal distributions can't be used,
let's look at the individual measurements for (var1ndigits=3, var2ndigits=3)
since that seems to be the biggest slowdown on both CPUs,
and see if our level of surprise is affected.

Here is a more traditional benchmark,
which seems to also indicate (var1ndigits=3, var2ndigits=3) is a bit slower:

I tested just adding back div_var_int64, and it seems to help.

Thanks for testing.

There does appear to be quite a lot of variability between platforms
over whether or not div_var_int64() is a win for 3 and 4 digit
divisors. Since this patch is primarily about improving div_var()'s
long division algorithm, it's probably best for it to not touch that,
so I've put div_var_int64() back in for now. We could possibly
investigate whether it can be improved separately.

Looking at your other test results, they seem to confirm my previous
observation that exact mode is faster than approximate mode for
var2ndigits <= 12 or so, so I've added code to do that.

I also expanded on the comments for the quotient-correction code a bit.

Nice. LGTM.
I've successfully tested the new patch again on both Intel and AMD.

I've marked it as Ready for Committer.

Regards,
Joel

Dean Rasheed

dean.a.rasheed@gmail.com

over 1 year ago

In reply to: Joel Jacobson (#8)

Re: Optimising numeric division

On Sun, 25 Aug 2024 at 10:33, Joel Jacobson <joel@compiler.org> wrote:

Nice. LGTM.
I've successfully tested the new patch again on both Intel and AMD.

I've marked it as Ready for Committer.

[Finally getting back to this]

Thanks for the review and testing.
Committed.

Regards,
Dean