Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Hello hackers,
Attached patch introduces an optimization of mul_var() in numeric.c, targeting cases where the multiplicands consist of only one or two base-NBASE digits. Such small multiplicands can fit into an int64 and thus be computed directly, resulting in a significant performance improvement, between 26% and 34%, benchmarked on an Intel Core i9-14900K.
This optimization is similar to commit d1b307eef2, which also targeted one and two base-NBASE digit operands, but optimized div_var().
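For reference, a quick sanity check of the bound (assuming the stock NBASE = 10000, so a two-digit operand is at most NBASE^2 - 1 = 99999999):
```
SELECT 99999999::numeric * 99999999::numeric AS max_product;
-- 9999999800000001, well below 2^63 - 1 = 9223372036854775807
```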
Regards,
Joel
Attachments:
0001-optimize-numeric-mul_var-small-factors.patch
From 070bca416c0a78f4ce32a25c1225e820c820498a Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Mon, 1 Jul 2024 07:44:46 +0200
Subject: [PATCH] Optimize numeric multiplication for up to two base-NBASE
digit operands
---
src/backend/utils/adt/numeric.c | 52 +++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..d6c9017a35 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8747,6 +8747,58 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * If var1 and var2 are just one or two digits, their product will fit in
+ * an int64 and can be computed directly, which is significantly faster.
+ */
+ if (var2ndigits <= 2)
+ {
+ int64 product;
+
+ switch (var1ndigits)
+ {
+ case 1:
+ product = var1digits[0];
+ break;
+ case 2:
+ product = var1digits[0] * NBASE + var1digits[1];
+ break;
+ }
+
+ switch (var2ndigits)
+ {
+ case 1:
+ product *= var2digits[0];
+ break;
+ case 2:
+ product *= var2digits[0] * NBASE + var2digits[1];
+ break;
+ }
+
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ res_digits[i] = product % NBASE;
+ product /= NBASE;
+ }
+ Assert(product == 0);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
--
2.45.1
On Mon, Jul 1, 2024, at 08:04, Joel Jacobson wrote:
* 0001-optimize-numeric-mul_var-small-factors.patch
New version to silence a maybe-uninitialized error reported by cfbot.
/Joel
Attachments:
v2-0001-optimize-numeric-mul_var-small-factors.patch
From 9eb8198ee7687ac7c9b89f5f4ca3b6a0cf9517ea Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Mon, 1 Jul 2024 07:44:46 +0200
Subject: [PATCH] Optimize numeric multiplication for up to two base-NBASE
digit operands
---
src/backend/utils/adt/numeric.c | 52 +++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..304033d2cc 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8747,6 +8747,58 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * If var1 and var2 are just one or two digits, their product will fit in
+ * an int64 and can be computed directly, which is significantly faster.
+ */
+ if (var2ndigits <= 2)
+ {
+ int64 product = 0;
+
+ switch (var1ndigits)
+ {
+ case 1:
+ product = var1digits[0];
+ break;
+ case 2:
+ product = var1digits[0] * NBASE + var1digits[1];
+ break;
+ }
+
+ switch (var2ndigits)
+ {
+ case 1:
+ product *= var2digits[0];
+ break;
+ case 2:
+ product *= var2digits[0] * NBASE + var2digits[1];
+ break;
+ }
+
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ res_digits[i] = product % NBASE;
+ product /= NBASE;
+ }
+ Assert(product == 0);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
--
2.45.1
"Joel Jacobson" <joel@compiler.org> writes:
Hello hackers,
Attached patch introduces an optimization of mul_var() in numeric.c,
targeting cases where the multiplicands consist of only one or two
base-NBASE digits. Such small multiplicands can fit into an int64 and
thus be computed directly, resulting in a significant performance
improvement, between 26% and 34%, benchmarked on an Intel Core i9-14900K.
This optimization is similar to commit d1b307eef2, which also targeted
one and two base-NBASE digit operands, but optimized div_var().
div_var() also has an optimisation for 3- and 4-digit operands under
HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
mul_var() too?
Regards,
Joel
- ilmari
On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
div_var() also has an optimisation for 3- and 4-digit operands under
HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
mul_var() too?
I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:
```
#ifdef HAVE_INT128
/*
* If var1 and var2 are up to four digits, their product will fit in
 * an int128 and can be computed directly, which is significantly faster.
*/
if (var2ndigits <= 4)
{
int128 product = 0;
switch (var1ndigits)
{
case 1:
product = var1digits[0];
break;
case 2:
product = var1digits[0] * NBASE + var1digits[1];
break;
case 3:
product = ((int128) var1digits[0] * NBASE + var1digits[1])
* NBASE + var1digits[2];
break;
case 4:
product = (((int128) var1digits[0] * NBASE + var1digits[1])
* NBASE + var1digits[2]) * NBASE + var1digits[3];
break;
}
switch (var2ndigits)
{
case 1:
product *= var2digits[0];
break;
case 2:
product *= var2digits[0] * NBASE + var2digits[1];
break;
case 3:
product = ((int128) var2digits[0] * NBASE + var2digits[1])
* NBASE + var2digits[2];
break;
case 4:
product = (((int128) var2digits[0] * NBASE + var2digits[1])
* NBASE + var2digits[2]) * NBASE + var2digits[3];
break;
}
alloc_var(result, res_ndigits);
res_digits = result->digits;
for (i = res_ndigits - 1; i >= 0; i--)
{
res_digits[i] = product % NBASE;
product /= NBASE;
}
Assert(product == 0);
/*
* Finally, round the result to the requested precision.
*/
result->weight = res_weight;
result->sign = res_sign;
/* Round to target rscale (and set result->dscale) */
round_var(result, rscale);
/* Strip leading and trailing zeroes */
strip_var(result);
return;
}
#endif
```
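One untested idea: the int128 modulo and division by NBASE in the digit-store loop are expensive; once the remaining product fits in an int64, the loop could fall back to 64-bit arithmetic, along these lines:
```
for (i = res_ndigits - 1; i >= 0; i--)
{
    /* once the remainder fits in 64 bits, avoid 128-bit div/mod */
    if (product <= (int128) PG_INT64_MAX)
    {
        int64   product64 = (int64) product;

        for (; i >= 0; i--)
        {
            res_digits[i] = (NumericDigit) (product64 % NBASE);
            product64 /= NBASE;
        }
        product = 0;    /* fully consumed; keeps the later Assert happy */
        break;
    }
    res_digits[i] = (NumericDigit) (product % NBASE);
    product /= NBASE;
}
```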
Benchmark 1, testing 2 ndigits * 2 ndigits:
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_patched',
input_values := ARRAY[
'11112222',
'33334444'
],
min_time := 1000000,
timeout := '10 s'
);
Apple M3 Max:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
32.2 ns | 20.5 ns | 1.5700112246809388
(1 row)
Intel Core i9-14900K:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
30.2 ns | 21.4 ns | 1.4113042510107371
(1 row)
So 57% faster on the Apple M3 Max and 41% faster on the Intel Core i9-14900K.
Benchmark 2, testing 4 ndigits * 4 ndigits:
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_patched',
input_values := ARRAY[
'1111222233334444',
'5555666677778888'
],
min_time := 1000000,
timeout := '10 s'
);
Apple M3 Max:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+------------------------
41.9 ns | 51.3 ns | 0.81733655797170943614
(1 row)
Intel Core i9-14900K:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
40 ns | 38 ns | 1.0515610914706320
(1 row)
So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K.
/Joel
On Mon, Jul 1, 2024, at 15:11, Joel Jacobson wrote:
Not really sure why. Maybe the code I tried can be optimized further:
If anyone wants to experiment with the int128 version,
here is a patch that adds a separate numeric_mul_patched() function,
so it's easier to benchmark against the unmodified numeric_mul().
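For a quick sanity check that the patched function agrees with the built-in operator, something like this should return true:
```
SELECT numeric_mul_patched(11112222::numeric, 33334444::numeric)
     = 11112222::numeric * 33334444::numeric;
```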
/Joel
Attachments:
0001-Optimize-mul_var-for-var2ndigits-4.patch
From c443f44d1338a69a21632ba0cd3fb069b6cbbcf5 Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Mon, 1 Jul 2024 07:17:50 +0200
Subject: [PATCH] Optimize mul_var() for var2ndigits <= 4
---
src/backend/utils/adt/numeric.c | 457 ++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 3 +
src/include/utils/numeric.h | 2 +
3 files changed, 462 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..87980fa771 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,9 @@ static void sub_var(const NumericVar *var1, const NumericVar *var2,
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_patched(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -3115,6 +3118,130 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
}
+/*
+ * numeric_mul_patched() -
+ *
+ * This function multiplies two numeric values using the patched mul_var()
+ * algorithm, with its fast paths for small inputs. It's introduced to allow
+ * direct benchmark comparisons with the standard numeric_mul() function.
+ */
+Datum
+numeric_mul_patched(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ Numeric res;
+
+ res = numeric_mul_patched_opt_error(num1, num2, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_mul_patched_opt_error() -
+ *
+ * Internal version of numeric_mul_patched().
+ * If "*have_error" flag is provided, on error it's set to true, NULL returned.
+ * This is helpful when caller need to handle errors by itself.
+ */
+Numeric
+numeric_mul_patched_opt_error(Numeric num1, Numeric num2, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_patched(&arg1, &arg2, &result, arg1.dscale + arg2.dscale);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
/*
* numeric_div() -
*
@@ -8864,6 +8991,336 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
strip_var(result);
}
+/*
+ * mul_var_patched() -
+ *
+ * Patched version of mul_var(), introduced alongside the unchanged
+ * original to allow direct benchmark comparisons. It takes the new fast
+ * paths for small inputs and falls back to the full algorithm otherwise.
+ * This is a temporary measure until benchmark outcomes show whether it
+ * should replace mul_var().
+ */
+static void
+mul_var_patched(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0 || var2ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * If var1 and var2 are just one or two NBASE digits, their product will
+ * fit in an int64 and can be computed directly, which is significantly faster.
+ */
+ if (var2ndigits <= 2)
+ {
+ int64 product = 0;
+ switch (var1ndigits)
+ {
+ case 1:
+ product = var1digits[0];
+ break;
+ case 2:
+ product = var1digits[0] * NBASE + var1digits[1];
+ break;
+ }
+
+ switch (var2ndigits)
+ {
+ case 1:
+ product *= var2digits[0];
+ break;
+ case 2:
+ product *= var2digits[0] * NBASE + var2digits[1];
+ break;
+ }
+
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ res_digits[i] = product % NBASE;
+ product /= NBASE;
+ }
+ Assert(product == 0);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+#ifdef HAVE_INT128
+ /*
+ * If var1 and var2 are up to four digits, their product will fit in
+ * an int128 and can be computed directly, which is significantly faster.
+ */
+ if (var2ndigits <= 4)
+ {
+ int128 product = 0;
+
+ switch (var1ndigits)
+ {
+ case 1:
+ product = var1digits[0];
+ break;
+ case 2:
+ product = var1digits[0] * NBASE + var1digits[1];
+ break;
+ case 3:
+ product = ((int128) var1digits[0] * NBASE + var1digits[1])
+ * NBASE + var1digits[2];
+ break;
+ case 4:
+ product = (((int128) var1digits[0] * NBASE + var1digits[1])
+ * NBASE + var1digits[2]) * NBASE + var1digits[3];
+ break;
+ }
+
+ switch (var2ndigits)
+ {
+ case 1:
+ product *= var2digits[0];
+ break;
+ case 2:
+ product *= var2digits[0] * NBASE + var2digits[1];
+ break;
+ case 3:
+ product = ((int128) var2digits[0] * NBASE + var2digits[1])
+ * NBASE + var2digits[2];
+ break;
+ case 4:
+ product = (((int128) var2digits[0] * NBASE + var2digits[1])
+ * NBASE + var2digits[2]) * NBASE + var2digits[3];
+ break;
+ }
+
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ res_digits[i] = product % NBASE;
+ product /= NBASE;
+ }
+ Assert(product == 0);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+#endif
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+}
+
/*
* div_var() -
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4..15dde0f3c3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4465,6 +4465,9 @@
{ oid => '1726',
proname => 'numeric_mul', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_mul' },
+{ oid => '6347',
+ proname => 'numeric_mul_patched', prorettype => 'numeric',
+ proargtypes => 'numeric numeric', prosrc => 'numeric_mul_patched' },
{ oid => '1727',
proname => 'numeric_div', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
diff --git a/src/include/utils/numeric.h b/src/include/utils/numeric.h
index 43c75c436f..9036c9db50 100644
--- a/src/include/utils/numeric.h
+++ b/src/include/utils/numeric.h
@@ -97,6 +97,8 @@ extern Numeric numeric_sub_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mul_opt_error(Numeric num1, Numeric num2,
bool *have_error);
+extern Numeric numeric_mul_patched_opt_error(Numeric num1, Numeric num2,
+ bool *have_error);
extern Numeric numeric_div_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mod_opt_error(Numeric num1, Numeric num2,
--
2.45.1
On Mon, Jul 1, 2024, at 15:14, Joel Jacobson wrote:
* 0001-Optimize-mul_var-for-var2ndigits-4.patch
Found a typo, fixed in new version.
The int128 version is still slower, though;
I wonder if there is something that can be done to speed it up further.
Below is a more realistic benchmark than just microbenchmarking mul_var();
it doesn't test the int128 version, but the code path for up to 2 NBASE digits:
```
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
SELECT SUM(num1*num2) FROM bench_mul_var;
Time: 8331.953 ms (00:08.332)
Time: 7415.241 ms (00:07.415)
Time: 7298.296 ms (00:07.298)
Time: 7314.754 ms (00:07.315)
Time: 7289.560 ms (00:07.290)
SELECT SUM(numeric_mul_patched(num1,num2)) FROM bench_mul_var;
Time: 6403.426 ms (00:06.403)
Time: 6401.797 ms (00:06.402)
Time: 6366.136 ms (00:06.366)
Time: 6376.049 ms (00:06.376)
Time: 6317.282 ms (00:06.317)
```
Benchmarked on an Intel Core i9-14900K machine.
/Joel
Attachments:
v2-0001-Optimize-mul_var-for-var2ndigits-4.patch
From b075e18c2ae94f2ada20572c05ff35730d2c5968 Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Mon, 1 Jul 2024 07:17:50 +0200
Subject: [PATCH] Optimize mul_var() for var2ndigits <= 4
---
src/backend/utils/adt/numeric.c | 457 ++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 3 +
src/include/utils/numeric.h | 2 +
3 files changed, 462 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..b74e03eefa 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,9 @@ static void sub_var(const NumericVar *var1, const NumericVar *var2,
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_patched(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -3115,6 +3118,130 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
}
+/*
+ * numeric_mul_patched() -
+ *
+ * This function multiplies two numeric values using the patched mul_var()
+ * algorithm, with its fast paths for small inputs. It's introduced to allow
+ * direct benchmark comparisons with the standard numeric_mul() function.
+ */
+Datum
+numeric_mul_patched(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ Numeric res;
+
+ res = numeric_mul_patched_opt_error(num1, num2, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_mul_patched_opt_error() -
+ *
+ * Internal version of numeric_mul_patched().
+ * If "*have_error" flag is provided, on error it's set to true, NULL returned.
+ * This is helpful when caller need to handle errors by itself.
+ */
+Numeric
+numeric_mul_patched_opt_error(Numeric num1, Numeric num2, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_patched(&arg1, &arg2, &result, arg1.dscale + arg2.dscale);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
/*
* numeric_div() -
*
@@ -8864,6 +8991,336 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
strip_var(result);
}
+/*
+ * mul_var_patched() -
+ *
+ * Patched version of mul_var(), introduced alongside the unchanged
+ * original to allow direct benchmark comparisons. It takes the new fast
+ * paths for small inputs and falls back to the full algorithm otherwise.
+ * This is a temporary measure until benchmark outcomes show whether it
+ * should replace mul_var().
+ */
+static void
+mul_var_patched(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0 || var2ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * If var1 and var2 are just one or two NBASE digits, their product will
+ * fit in an int64 and can be computed directly, which is significantly faster.
+ */
+ if (var2ndigits <= 2)
+ {
+ int64 product = 0;
+ switch (var1ndigits)
+ {
+ case 1:
+ product = var1digits[0];
+ break;
+ case 2:
+ product = var1digits[0] * NBASE + var1digits[1];
+ break;
+ }
+
+ switch (var2ndigits)
+ {
+ case 1:
+ product *= var2digits[0];
+ break;
+ case 2:
+ product *= var2digits[0] * NBASE + var2digits[1];
+ break;
+ }
+
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ res_digits[i] = product % NBASE;
+ product /= NBASE;
+ }
+ Assert(product == 0);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+#ifdef HAVE_INT128
+ /*
+ * If var1 and var2 are up to four digits, their product will fit in
+ * an int128 and can be computed directly, which is significantly faster.
+ */
+ if (var2ndigits <= 4)
+ {
+ int128 product = 0;
+
+ switch (var1ndigits)
+ {
+ case 1:
+ product = var1digits[0];
+ break;
+ case 2:
+ product = var1digits[0] * NBASE + var1digits[1];
+ break;
+ case 3:
+ product = ((int128) var1digits[0] * NBASE + var1digits[1])
+ * NBASE + var1digits[2];
+ break;
+ case 4:
+ product = (((int128) var1digits[0] * NBASE + var1digits[1])
+ * NBASE + var1digits[2]) * NBASE + var1digits[3];
+ break;
+ }
+
+ switch (var2ndigits)
+ {
+ case 1:
+ product *= var2digits[0];
+ break;
+ case 2:
+ product *= var2digits[0] * NBASE + var2digits[1];
+ break;
+ case 3:
+ product *= ((int128) var2digits[0] * NBASE + var2digits[1])
+ * NBASE + var2digits[2];
+ break;
+ case 4:
+ product *= (((int128) var2digits[0] * NBASE + var2digits[1])
+ * NBASE + var2digits[2]) * NBASE + var2digits[3];
+ break;
+ }
+
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ res_digits[i] = product % NBASE;
+ product /= NBASE;
+ }
+ Assert(product == 0);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+#endif
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+}
+
/*
* div_var() -
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4..15dde0f3c3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4465,6 +4465,9 @@
{ oid => '1726',
proname => 'numeric_mul', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_mul' },
+{ oid => '6347',
+ proname => 'numeric_mul_patched', prorettype => 'numeric',
+ proargtypes => 'numeric numeric', prosrc => 'numeric_mul_patched' },
{ oid => '1727',
proname => 'numeric_div', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
diff --git a/src/include/utils/numeric.h b/src/include/utils/numeric.h
index 43c75c436f..9036c9db50 100644
--- a/src/include/utils/numeric.h
+++ b/src/include/utils/numeric.h
@@ -97,6 +97,8 @@ extern Numeric numeric_sub_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mul_opt_error(Numeric num1, Numeric num2,
bool *have_error);
+extern Numeric numeric_mul_patched_opt_error(Numeric num1, Numeric num2,
+ bool *have_error);
extern Numeric numeric_div_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mod_opt_error(Numeric num1, Numeric num2,
--
2.45.1
On Mon, 1 Jul 2024 at 20:56, Joel Jacobson <joel@compiler.org> wrote:
Below is a more realistic benchmark
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
SELECT SUM(num1*num2) FROM bench_mul_var;
I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.
Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 4762.990 ms (00:04.763)
Time: 4332.166 ms (00:04.332)
Time: 4276.211 ms (00:04.276)
Time: 4247.321 ms (00:04.247)
Time: 4166.738 ms (00:04.167)
SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
Time: 4398.812 ms (00:04.399)
Time: 3672.668 ms (00:03.673)
Time: 3650.227 ms (00:03.650)
Time: 3611.420 ms (00:03.611)
Time: 3534.218 ms (00:03.534)
SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
Time: 3350.596 ms (00:03.351)
Time: 3336.224 ms (00:03.336)
Time: 3335.599 ms (00:03.336)
Time: 3336.990 ms (00:03.337)
Time: 3351.453 ms (00:03.351)
(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)
The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.
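As a rough bound (my arithmetic): each result digit accumulates at most var1ndigits partial products, each at most (NBASE - 1)^2 = 99980001, plus a carry that is much smaller, so even var1ndigits = 4 stays well below INT_MAX:
```
SELECT 4 * 99980001 + 99980001 AS generous_worst_case,
       2147483647 AS int_max;
-- 499900005 vs 2147483647, so 32-bit accumulation is safe
```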
Regards,
Dean
Attachments:
optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..adbfd5c
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8748,6 +8748,74 @@ mul_var(const NumericVar *var1, const Nu
}
/*
+ * Simplified fast-path computation, if var1 has just one or two digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 2)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+
+ for (i = res_ndigits - 4; i > 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
+ /*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
* to avoid normalizing carries immediately.
On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.
Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:
...
(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)
Strange. I just did `./configure` with a --prefix.
Compiler settings on my Intel Core i9-14900K machine:
$ pg_config | grep -E '^(CC|CFLAGS|CPPFLAGS|LDFLAGS)'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/home/joel/pg-dev/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.
Regards,
Dean
Attachments:
* optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Really nice!
I've benchmarked your patch on my three machines with great results.
I added a setseed() step to make the benchmarks reproducible.
It shouldn't matter much, since it should statistically average out, but I thought why not.
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
SELECT setseed(0.12345);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
\timing
/*
* Apple M3 Max
*/
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 3622.342 ms (00:03.622)
Time: 3029.786 ms (00:03.030)
Time: 3046.497 ms (00:03.046)
Time: 3035.910 ms (00:03.036)
Time: 3034.073 ms (00:03.034)
SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2484.685 ms (00:02.485)
Time: 2478.341 ms (00:02.478)
Time: 2494.397 ms (00:02.494)
Time: 2470.987 ms (00:02.471)
Time: 2490.215 ms (00:02.490)
/*
* Intel Core i9-14900K
*/
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 2555.569 ms (00:02.556)
Time: 2523.145 ms (00:02.523)
Time: 2518.671 ms (00:02.519)
Time: 2514.501 ms (00:02.515)
Time: 2516.919 ms (00:02.517)
SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2246.441 ms (00:02.246)
Time: 2243.900 ms (00:02.244)
Time: 2245.350 ms (00:02.245)
Time: 2245.080 ms (00:02.245)
Time: 2247.856 ms (00:02.248)
/*
* AMD Ryzen 9 7950X3D
*/
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 3037.497 ms (00:03.037)
Time: 3010.037 ms (00:03.010)
Time: 3000.956 ms (00:03.001)
Time: 2989.424 ms (00:02.989)
Time: 2984.911 ms (00:02.985)
SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2645.530 ms (00:02.646)
Time: 2640.472 ms (00:02.640)
Time: 2638.613 ms (00:02.639)
Time: 2637.889 ms (00:02.638)
Time: 2638.054 ms (00:02.638)
/Joel
On Tue, 2 Jul 2024 at 08:50, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
Attachments:
* optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
That doesn't noticeably affect the performance though. Update attached.
I've benchmarked your patch on my three machines with great results.
I added a setseed() step, to make the benchmarks reproducible,
shouldn't matter much since it should statistically average out, but I thought why not.
Nice. The results on the Apple M3 Max look particularly impressive.
I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.
Regards,
Dean
Attachments:
v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..f77ad8b
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8748,6 +8748,76 @@ mul_var(const NumericVar *var1, const Nu
}
/*
+ * Simplified fast-path computation, if var1 has just one or two digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 2)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+
+ for (i = res_ndigits - 4; i > 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
+ /*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
* to avoid normalizing carries immediately.
On Tue, Jul 2, 2024, at 10:22, Dean Rasheed wrote:
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
That doesn't noticeably affect the performance though. Update attached.
Nice catch. Could we somehow add a test that exercises mul_var()
with rscale less than the full result, to catch bugs like this one
and others?
I created a new benchmark that specifically tests different var1ndigits.
I've only run it on the Apple M3 Max so far. More to come.
\timing
SELECT setseed(0.12345);
CREATE TABLE bench_mul_var_var1ndigits_1 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_1 (var1, var2)
SELECT random(0::numeric,9999::numeric), random(10000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_2 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_2 (var1, var2)
SELECT random(10000000::numeric,99999999::numeric), random(100000000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_3 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_3 (var1, var2)
SELECT random(100000000000::numeric,999999999999::numeric), random(1000000000000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_4 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_4 (var1, var2)
SELECT random(1000000000000000::numeric,9999999999999999::numeric), random(10000000000000000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
/*
* Apple M3 Max
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 2986.952 ms (00:02.987)
Time: 2991.765 ms (00:02.992)
Time: 2987.253 ms (00:02.987)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2874.841 ms (00:02.875)
Time: 2883.070 ms (00:02.883)
Time: 2899.973 ms (00:02.900)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3459.556 ms (00:03.460)
Time: 3304.983 ms (00:03.305)
Time: 3299.728 ms (00:03.300)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3053.140 ms (00:03.053)
Time: 3065.227 ms (00:03.065)
Time: 3069.995 ms (00:03.070)
/*
* Just for completeness, also testing var1ndigits 3 and 4,
 * although no change is expected since those cases are not yet implemented.
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3809.005 ms (00:03.809)
Time: 3438.260 ms (00:03.438)
Time: 3453.920 ms (00:03.454)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3437.592 ms (00:03.438)
Time: 3457.586 ms (00:03.458)
Time: 3474.344 ms (00:03.474)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4133.193 ms (00:04.133)
Time: 3554.477 ms (00:03.554)
Time: 3560.855 ms (00:03.561)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3508.540 ms (00:03.509)
Time: 3566.721 ms (00:03.567)
Time: 3524.083 ms (00:03.524)
I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.
I think so too. I'm working on var1ndigits=3 now.
/Joel
On Tue, Jul 2, 2024, at 11:05, Joel Jacobson wrote:
On Tue, Jul 2, 2024, at 10:22, Dean Rasheed wrote:
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
Just to be able to verify mul_var() is working as expected when
rscale is less than the full result, I've added a numeric_mul_patched()
function that takes a third rscale_adjustment parameter:
I've tried to get a different result with and without the fix,
but I'm failing to hit the bug.
Can you think of an example that should trigger the bug?
Example usage:
SELECT numeric_mul_patched(9999.999,2,0);
var1ndigits=1 var2ndigits=2 rscale=3
numeric_mul_patched
---------------------
19999.998
(1 row)
SELECT numeric_mul_patched(9999.999,2,1);
var1ndigits=1 var2ndigits=2 rscale=4
numeric_mul_patched
---------------------
19999.9980
(1 row)
SELECT numeric_mul_patched(9999.999,2,-1);
var1ndigits=1 var2ndigits=2 rscale=2
numeric_mul_patched
---------------------
20000.00
(1 row)
SELECT numeric_mul_patched(9999.999,2,-2);
var1ndigits=1 var2ndigits=2 rscale=1
numeric_mul_patched
---------------------
20000.0
(1 row)
/Joel
On Tue, 2 Jul 2024 at 11:23, Joel Jacobson <joel@compiler.org> wrote:
Just to be able to verify mul_var() is working as expected when
rscale is less than the full result, I've added a numeric_mul_patched()
function that takes a third rscale_adjustment parameter:
Yeah, we could expose such a function, and maybe it would be generally
useful, not just for testing, but that's really a separate proposal.
In general, mul_var() with reduced rscale doesn't guarantee correctly
rounded results though, which might make it less generally acceptable.
For this patch though, the aim is just to ensure the results are the
same as before.
I've tried to get a different result with and without the fix,
but I'm failing to hit the bug.
Can you think of an example that should trigger the bug?
9999.0001 * 5000.9999_9999 with rscale = 0 triggers it (returned
50004999 instead of 50005000).
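With your numeric_mul_patched() test function (assuming its third argument is simply added to the sum of the input dscales), that corresponds to:
```
-- dscales are 4 and 8, so rscale_adjustment = -12 requests rscale = 0
SELECT numeric_mul_patched(9999.0001, 5000.99999999, -12);
-- without the fix: 50004999; with it: 50005000
```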
Regards,
Dean
On Tue, Jul 2, 2024, at 13:44, Dean Rasheed wrote:
Can you think of an example that should trigger the bug?
9999.0001 * 5000.9999_9999 with rscale = 0 triggers it (returned
50004999 instead of 50005000).
Thanks, helpful.
Attached patch adds the var1ndigits=3 case.
Benchmark:
/*
* Apple M3 Max
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3809.005 ms (00:03.809)
Time: 3438.260 ms (00:03.438)
Time: 3453.920 ms (00:03.454)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3230.017 ms (00:03.230)
Time: 3244.094 ms (00:03.244)
Time: 3246.370 ms (00:03.246)
/*
* Intel Core i9-14900K
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4082.759 ms (00:04.083)
Time: 4077.372 ms (00:04.077)
Time: 4080.830 ms (00:04.081)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3989.021 ms (00:03.989)
Time: 3978.263 ms (00:03.978)
Time: 3955.331 ms (00:03.955)
/*
* AMD Ryzen 9 7950X3D
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3791.271 ms (00:03.791)
Time: 3760.138 ms (00:03.760)
Time: 3758.773 ms (00:03.759)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3400.824 ms (00:03.401)
Time: 3373.714 ms (00:03.374)
Time: 3345.505 ms (00:03.346)
Regards,
Joel
Attachments:
v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchapplication/octet-stream; name="=?UTF-8?Q?v3-optimize-numeric-mul=5Fvar-small-var1-arbitrary-var2.patch?="Download
From ece9611d080f0c048cef6e413f4a5e51683ec177 Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Tue, 2 Jul 2024 17:54:25 +0200
Subject: [PATCH] numeric: Simplified fast-path computation for mul_var()
---
src/backend/utils/adt/numeric.c | 100 ++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..0fb217f6e9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8747,6 +8747,106 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * Simplified fast-path computation, if var1 has just one, two or three digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 3)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 1; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+
+ case 3:
+ newdig = (int) var1digits[2] * var2digits[res_ndigits - 5];
+ if (res_ndigits - 4 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 2; i--)
+ {
+ newdig = carry;
+ if (i < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[i];
+ if (i - 1 >= 0 && i - 1 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[i - 1];
+ if (i - 2 >= 0 && i - 2 < var2ndigits)
+ newdig += (int) var1digits[2] * var2digits[i - 2];
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = carry;
+ if (var2ndigits > 1)
+ newdig += (int) var1digits[0] * var2digits[1];
+ if (var2ndigits > 0)
+ newdig += (int) var1digits[1] * var2digits[0];
+ res_digits[2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
--
2.45.1
On Tue, Jul 2, 2024, at 18:20, Joel Jacobson wrote:
* v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Hmm, v3 contains a bug which I haven't been able to solve yet.
Reporting now to avoid anyone wasting time reviewing it, since it's buggy.
The attached patch is how I tested and found the bug.
It contains a file test-mul-var.sql, which tests mul_var()
with varying rscale, using the SQL-callable numeric_mul_patched(),
whose third argument is the rscale_adjustment.
Out of 2481600 random tests, all passed except 4:
SELECT
result = numeric_mul_patched(var1,var2,rscale_adjustment),
COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
?column? | count
----------+---------
f | 4
t | 2481596
(2 rows)
SELECT
var1,
var2,
var1*var2 AS full_resolution,
rscale_adjustment,
result AS expected,
numeric_mul_patched(var1,var2,rscale_adjustment) AS actual
FROM test_numeric_mul_patched
WHERE result <> numeric_mul_patched(var1,var2,rscale_adjustment);
var1 | var2 | full_resolution | rscale_adjustment | expected | actual
-------------------+----------------+---------------------------+-------------------+----------+--------
676.797214075 | 0.308068877759 | 208.500158210502929257925 | -21 | 209 | 208
555.07029 | 0.381033094735 | 211.50015039415392315 | -17 | 212 | 211
0.476637718921 | 66.088276 | 31.500165120061470196 | -18 | 32 | 31
0.060569165063082 | 998.85933 | 60.50007563356949425506 | -20 | 61 | 60
(4 rows)
Trying to wrap my head around what could cause this.
It's rounding down instead of up, and these cases all end with decimal .500XXXX.
Regards,
Joel
Attachments:
numeric_mul_patched.txttext/plain; name=numeric_mul_patched.txtDownload
From 074aeb223ab496f23c2075eabd35e6e76241d1d8 Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Mon, 1 Jul 2024 07:17:50 +0200
Subject: [PATCH] Add SQL-callable numeric_mul_patched() to bench Simplified
fast-path computation
---
src/backend/utils/adt/numeric.c | 438 ++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.dat | 3 +
src/include/utils/numeric.h | 2 +
test-mul-var.sql | 48 ++++
4 files changed, 491 insertions(+)
create mode 100644 test-mul-var.sql
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..8f5d553f15 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,9 @@ static void sub_var(const NumericVar *var1, const NumericVar *var2,
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_patched(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -3115,6 +3118,131 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
}
+/*
+ * numeric_mul_patched() -
+ *
+ * This function multiplies two numeric values using the patched algorithm,
+ * designed for efficient handling of large numbers. It's introduced to allow
+ * direct benchmark comparisons with the standard numeric_mul() function.
+ */
+Datum
+numeric_mul_patched(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ int32 rscale_adjustment = PG_GETARG_INT32(2);
+ Numeric res;
+
+ res = numeric_mul_patched_opt_error(num1, num2, rscale_adjustment, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_mul_patched_opt_error() -
+ *
+ * Internal version of numeric_mul_patched().
+ * If "*have_error" flag is provided, on error it's set to true, NULL returned.
+ * This is helpful when caller need to handle errors by itself.
+ */
+Numeric
+numeric_mul_patched_opt_error(Numeric num1, Numeric num2, int32 rscale_adjustment, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_patched(&arg1, &arg2, &result, arg1.dscale + arg2.dscale + rscale_adjustment);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
/*
* numeric_div() -
*
@@ -8864,6 +8992,316 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
strip_var(result);
}
+/*
+ * mul_var_patched() -
+ *
+ * Implements patched multiplication for large numbers, introduced
+ * alongside the unchanged original mul_var(). This function is part of
+ * an optimization effort, allowing direct benchmark comparisons with
+ * mul_var(). It selects full or half patched based on input size.
+ * This is a temporary measure before considering its replacement of
+ * mul_var() based on benchmark outcomes.
+ */
+static void
+mul_var_patched(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0 || var2ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Simplified fast-path computation, if var1 has just one, two or three digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 3)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 1; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+
+ case 3:
+ newdig = (int) var1digits[2] * var2digits[res_ndigits - 5];
+ if (res_ndigits - 4 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 2; i--)
+ {
+ newdig = carry;
+ if (i < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[i];
+ if (i - 1 >= 0 && i - 1 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[i - 1];
+ if (i - 2 >= 0 && i - 2 < var2ndigits)
+ newdig += (int) var1digits[2] * var2digits[i - 2];
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = carry;
+ if (var2ndigits > 1)
+ newdig += (int) var1digits[0] * var2digits[1];
+ if (var2ndigits > 0)
+ newdig += (int) var1digits[1] * var2digits[0];
+ res_digits[2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+}
+
/*
* div_var() -
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d4ac578ae6..5b3024cb6d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4465,6 +4465,9 @@
{ oid => '1726',
proname => 'numeric_mul', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_mul' },
+{ oid => '6347',
+ proname => 'numeric_mul_patched', prorettype => 'numeric',
+ proargtypes => 'numeric numeric int4', prosrc => 'numeric_mul_patched' },
{ oid => '1727',
proname => 'numeric_div', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
diff --git a/src/include/utils/numeric.h b/src/include/utils/numeric.h
index 43c75c436f..454a56da9a 100644
--- a/src/include/utils/numeric.h
+++ b/src/include/utils/numeric.h
@@ -97,6 +97,8 @@ extern Numeric numeric_sub_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mul_opt_error(Numeric num1, Numeric num2,
bool *have_error);
+extern Numeric numeric_mul_patched_opt_error(Numeric num1, Numeric num2,
+ int32 rscale_adjustment, bool *have_error);
extern Numeric numeric_div_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mod_opt_error(Numeric num1, Numeric num2,
diff --git a/test-mul-var.sql b/test-mul-var.sql
new file mode 100644
index 0000000000..ee7e3855bc
--- /dev/null
+++ b/test-mul-var.sql
@@ -0,0 +1,48 @@
+CREATE TABLE test_numeric_mul_patched (
+ var1 numeric,
+ var2 numeric,
+ rscale_adjustment int,
+ result numeric
+);
+
+DO $$
+DECLARE
+var1 numeric;
+var2 numeric;
+BEGIN
+ FOR i IN 1..100 LOOP
+ RAISE NOTICE '%', i;
+ FOR var1ndigits IN 1..4 LOOP
+ FOR var2ndigits IN 1..4 LOOP
+ FOR var1dscale IN 0..(var1ndigits*4) LOOP
+ FOR var2dscale IN 0..(var2ndigits*4) LOOP
+ FOR rscale_adjustment IN 0..(var1dscale+var2dscale) LOOP
+ var1 := round(random(
+ format('1%s',repeat('0',(var1ndigits-1)*4-1))::numeric,
+ format('%s',repeat('9',var1ndigits*4))::numeric
+ ) / 10::numeric^var1dscale, var1dscale);
+ var2 := round(random(
+ format('1%s',repeat('0',(var2ndigits-1)*4-1))::numeric,
+ format('%s',repeat('9',var2ndigits*4))::numeric
+ ) / 10::numeric^var2dscale, var2dscale);
+ INSERT INTO test_numeric_mul_patched
+ (var1, var2, rscale_adjustment)
+ VALUES
+ (var1, var2, -rscale_adjustment);
+ END LOOP;
+ END LOOP;
+ END LOOP;
+ END LOOP;
+ END LOOP;
+ END LOOP;
+END $$;
+
+-- First, set result with a numeric_mul_patched() version where
+-- the Simplified fast-path computation code has been commented out.
+UPDATE test_numeric_mul_patched SET result = numeric_mul_patched(var1, var2, rscale_adjustment);
+
+-- Then, recompile with the Simplified fast-path computation code,
+-- and check if any differences can be found:
+SELECT *, numeric_mul_patched(var1,var2,rscale_adjustment)
+FROM test_numeric_mul_patched
+WHERE result IS DISTINCT FROM numeric_mul_patched(var1,var2,rscale_adjustment);
--
2.45.1
On Tue, Jul 2, 2024, at 20:53, Joel Jacobson wrote:
Trying to wrap my head around what could cause this.
It's rounding down instead of up, and these cases all end with decimal .500XXXX.
Interesting, I actually think there is a bug in the normal mul_var() code.
Found a case that rounds down, when it should round up:
Calling mul_var() with:
var1=51.2945442386599
var2=0.828548712212
rscale=0
returns 42, but I think it should return 43,
since 51.2945442386599*0.828548712212=42.5000285724431241296446988
But maybe this is expected and OK, having to do with MUL_GUARD_DIGITS?
/Joel
On Tue, Jul 2, 2024, at 21:55, Joel Jacobson wrote:
On Tue, Jul 2, 2024, at 20:53, Joel Jacobson wrote:
Trying to wrap my head around what could cause this.
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
if rscale is low enough,
and then we would try to access a negative array index in var2digits.
Fix:
case 2:
if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
else
newdig = 0;
Another problem in the case 2 code:
if (res_ndigits - 3 < var2ndigits)
newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
Fix:
if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
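The first failing row from the test run above replays this directly
(via the numeric_mul_patched() test function; rscale = 9 + 12 - 21 = 0):
```
-- After mul_var()'s operand swap, var1 = 0.308068877759 has three NBASE
-- digits, so this exercises the case 3 code. The exact product is
-- 208.500158210502929257925; expected 209, while the buggy v3 returned 208.
SELECT numeric_mul_patched(676.797214075, 0.308068877759, -21);
```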
New version attached that fixes both the case 2 and case 3 code.
Regards,
Joel
Attachments:
v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchapplication/octet-stream; name="=?UTF-8?Q?v4-optimize-numeric-mul=5Fvar-small-var1-arbitrary-var2.patch?="Download
From a7d421d5fcd4f6b367f266c3b005c82abff03d7c Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Tue, 2 Jul 2024 17:54:25 +0200
Subject: [PATCH] numeric: Simplified fast-path computation for mul_var()
---
src/backend/utils/adt/numeric.c | 106 ++++++++++++++++++++++++++++++++
1 file changed, 106 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..e3bc29b626 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8747,6 +8747,112 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * Simplified fast-path computation, if var1 has just one, two or three digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 3)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ else
+ newdig = 0;
+ if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 1; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+
+ case 3:
+ if (res_ndigits - 5 >= 0 && res_ndigits - 5 < var2ndigits)
+ newdig = (int) var1digits[2] * var2digits[res_ndigits - 5];
+ else
+ newdig = 0;
+ if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 2; i--)
+ {
+ newdig = carry;
+ if (i < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[i];
+ if (i - 1 >= 0 && i - 1 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[i - 1];
+ if (i - 2 >= 0 && i - 2 < var2ndigits)
+ newdig += (int) var1digits[2] * var2digits[i - 2];
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = carry;
+ if (var2ndigits > 1)
+ newdig += (int) var1digits[0] * var2digits[1];
+ if (var2ndigits > 0)
+ newdig += (int) var1digits[1] * var2digits[0];
+ res_digits[2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
--
2.45.1
On Tue, Jul 2, 2024, at 22:10, Joel Jacobson wrote:
* v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Instead of these boundary checks, maybe it would be cleaner to just
skip this optimization if there are too few res_ndigits?
/Joel
On Tue, 2 Jul 2024 at 20:55, Joel Jacobson <joel@compiler.org> wrote:
Interesting, I actually think there is a bug in the normal mul_var() code.
Found a case that rounds down, when it should round up:
Calling mul_var() with:
var1=51.2945442386599
var2=0.828548712212
rscale=0
returns 42, but I think it should return 43,
since 51.2945442386599*0.828548712212=42.5000285724431241296446988
But maybe this is expected and OK, having to do with MUL_GUARD_DIGITS?
No, that's not a bug. It's to be expected. The point of using only
MUL_GUARD_DIGITS extra digits is to get the correctly rounded result
*most of the time*, while saving a lot of effort by not computing the
full result.
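Note that the * operator at the SQL level is unaffected, since
numeric_mul() requests the exact rscale and rounds afterwards; for example:
```
-- Full-precision multiply, then round: yields the correctly rounded 43,
-- even though mul_var() called directly with rscale = 0 returns 42.
SELECT round(51.2945442386599 * 0.828548712212, 0);
```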
The callers of mul_var() that ask for reduced rscale results are the
transcendental functions like ln_var() and exp_var(), which are
working to some limited precision intended to compute the final result
reasonably accurately to a particular rscale.
Regards,
Dean
On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.
I've been hacking on this a bit and trying to tidy it up. Firstly, I
moved it to a separate function, because it was starting to look messy
having so much extra code in mul_var(). Then I added a bunch more
comments to explain what's going on, and the limits of the various
variables. Note that most of the boundary checks are actually
unnecessary -- in particular all the ones in or after the main loop,
provided you pull out the first 2 result digits from the main loop in
the 3-digit case. That does seem to work very well, but...
I wasn't entirely happy with how messy that code is getting, so I
tried a different approach. Similar to div_var_int(), I tried writing
a mul_var_int() function instead. This can be used for 1 and 2 digit
factors, and we could add a similar mul_var_int64() function on
platforms with 128-bit integers. The code looks quite a lot neater, so
it's probably less likely to contain bugs (though I have just written
it in a hurry, so it might still have bugs). In testing, it seemed to
give a decent speedup, but perhaps a little less than before. But
that's to be balanced against having more maintainable code, and also
a function that might be useful elsewhere in numeric.c.
Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.
Regards,
Dean
Attachments:
v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchtext/x-patch; charset=US-ASCII; name=v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..81600b3
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,8 @@ static void sub_var(const NumericVar *va
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -8707,7 +8709,7 @@ mul_var(const NumericVar *var1, const Nu
var1digits = var1->digits;
var2digits = var2->digits;
- if (var1ndigits == 0 || var2ndigits == 0)
+ if (var1ndigits == 0)
{
/* one or both inputs is zero; so is result */
zero_var(result);
@@ -8715,6 +8717,16 @@ mul_var(const NumericVar *var1, const Nu
return;
}
+ /*
+ * If var1 has 3 digits or fewer, delegate to mul_var_small() which uses a
+ * faster short multiplication algorithm.
+ */
+ if (var1ndigits <= 3)
+ {
+ mul_var_small(var1, var2, result, rscale);
+ return;
+ }
+
/* Determine result sign and (maximum possible) weight */
if (var1->sign == var2->sign)
res_sign = NUMERIC_POS;
@@ -8858,6 +8870,188 @@ mul_var(const NumericVar *var1, const Nu
result->sign = res_sign;
/* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+/*
+ * mul_var_small() -
+ *
+ * This has the same API as mul_var, but it assumes that var1 has no more
+ * than 3 digits and var2 has at least as many digits as var1. For variables
+ * satisfying these conditions, the product can be computed more quickly than
+ * the general algorithm used in mul_var.
+ */
+static void
+mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int var1ndigits = var1->ndigits;
+ int var2ndigits = var2->ndigits;
+ NumericDigit *var1digits = var1->digits;
+ NumericDigit *var2digits = var2->digits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ int maxdigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ int carry;
+ int term;
+
+ /* Check preconditions */
+ Assert(var1ndigits <= 3);
+ Assert(var2ndigits >= var1ndigits);
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /* Determine the number of result digits to compute - see mul_var() */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits in reverse, in one pass, propagating the
+ * carry up as we go.
+ *
+ * This computes res_digits[res_ndigits - 2], ... res_digits[0] by summing
+ * the products var1digits[i1] * var2digits[i2] for which i1 + i2 + 1 is
+ * the result index.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ /* ---------
+ * 1-digit case:
+ * var1ndigits = 1
+ * var2ndigits >= 1
+ * 3 <= res_ndigits <= var2ndigits + 2
+ * ----------
+ */
+ carry = 0;
+ for (int i = res_ndigits - 3; i >= 0; i--)
+ {
+ term = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ /* ---------
+ * 2-digit case:
+ * var1ndigits = 2
+ * var2ndigits >= 2
+ * 3 <= res_ndigits <= var2ndigits + 3
+ * ----------
+ */
+ /* last result digit and carry */
+ term = 0;
+ if (res_ndigits - 3 < var2ndigits)
+ term += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ if (res_ndigits > 3)
+ term += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first two */
+ for (int i = res_ndigits - 4; i >= 1; i--)
+ {
+ term = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first two digits */
+ term = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 3:
+ /* ---------
+ * 3-digit case:
+ * var1ndigits = 3
+ * var2ndigits >= 3
+ * 3 <= res_ndigits <= var2ndigits + 4
+ * ----------
+ */
+ /* last result digit and carry */
+ term = 0;
+ if (res_ndigits - 3 < var2ndigits)
+ term += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ if (res_ndigits > 3 && res_ndigits - 4 < var2ndigits)
+ term += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits > 4)
+ term += (int) var1digits[2] * var2digits[res_ndigits - 5];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* penultimate result digit */
+ term = carry;
+ if (res_ndigits > 3 && res_ndigits - 4 < var2ndigits)
+ term += (int) var1digits[0] * var2digits[res_ndigits - 4];
+ if (res_ndigits > 4)
+ term += (int) var1digits[1] * var2digits[res_ndigits - 5];
+ if (res_ndigits > 5)
+ term += (int) var1digits[2] * var2digits[res_ndigits - 6];
+ res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first three */
+ for (int i = res_ndigits - 5; i >= 2; i--)
+ {
+ term = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] +
+ (int) var1digits[2] * var2digits[i - 2] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first three digits */
+ term = (int) var1digits[0] * var2digits[1] +
+ (int) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ term = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
round_var(result, rscale);
/* Strip leading and trailing zeroes */
v5-add-mul_var_int.patchtext/x-patch; charset=US-ASCII; name=v5-add-mul_var_int.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..9e50ea7
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,8 @@ static void sub_var(const NumericVar *va
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_int(const NumericVar *var, int ival, int ival_weight,
+ NumericVar *result, int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -8707,7 +8709,7 @@ mul_var(const NumericVar *var1, const Nu
var1digits = var1->digits;
var2digits = var2->digits;
- if (var1ndigits == 0 || var2ndigits == 0)
+ if (var1ndigits == 0)
{
/* one or both inputs is zero; so is result */
zero_var(result);
@@ -8715,6 +8717,31 @@ mul_var(const NumericVar *var1, const Nu
return;
}
+ /*
+ * If var1 has just one or two digits, delegate to mul_var_int(), which
+ * uses a faster direct multiplication algorithm.
+ *
+ * TODO: Similarly, on platforms with 128-bit integers ...
+ */
+ if (var1ndigits <= 2)
+ {
+ int ifactor;
+ int ifactor_weight;
+
+ ifactor = var1->digits[0];
+ ifactor_weight = var1->weight;
+ if (var1ndigits == 2)
+ {
+ ifactor = ifactor * NBASE + var1->digits[1];
+ ifactor_weight--;
+ }
+ if (var1->sign == NUMERIC_NEG)
+ ifactor = -ifactor;
+
+ mul_var_int(var2, ifactor, ifactor_weight, result, rscale);
+ return;
+ }
+
/* Determine result sign and (maximum possible) weight */
if (var1->sign == var2->sign)
res_sign = NUMERIC_POS;
@@ -8857,6 +8884,123 @@ mul_var(const NumericVar *var1, const Nu
result->weight = res_weight;
result->sign = res_sign;
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+/*
+ * mul_var_int() -
+ *
+ * Multiply a numeric variable by a 32-bit integer with the specified weight.
+ * The product var * ival * NBASE^ival_weight is stored in result.
+ */
+static void
+mul_var_int(const NumericVar *var, int ival, int ival_weight,
+ NumericVar *result, int rscale)
+{
+ NumericDigit *var_digits = var->digits;
+ int var_ndigits = var->ndigits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ int maxdigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint32 factor;
+ uint32 carry;
+
+ if (ival == 0 || var_ndigits == 0)
+ {
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Determine the result sign, (maximum possible) weight and number of
+ * digits to calculate. The weight figured here is correct if the emitted
+ * product has no leading zero digits; otherwise strip_var() will fix
+ * things up.
+ */
+ if (var->sign == NUMERIC_POS)
+ res_sign = ival > 0 ? NUMERIC_POS : NUMERIC_NEG;
+ else
+ res_sign = ival > 0 ? NUMERIC_NEG : NUMERIC_POS;
+ res_weight = var->weight + ival_weight + 3;
+ /* The number of accurate result digits we need to produce: */
+ res_ndigits = var_ndigits + 3;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ res_buf = digitbuf_alloc(res_ndigits + 1);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Now compute the product digits by procssing the input digits in reverse
+ * and propagating the carry up as we go.
+ *
+ * In this algorithm, the carry from one digit to the next is at most
+ * factor - 1, and product is at most factor * NBASE - 1, and so it needs
+ * to be a 64-bit integer if this exceeds UINT_MAX.
+ */
+ factor = abs(ival);
+ carry = 0;
+
+ if (factor <= UINT_MAX / NBASE)
+ {
+ /* product cannot overflow 32 bits */
+ uint32 product;
+
+ for (int i = res_ndigits - 4; i >= 0; i--)
+ {
+ product = factor * var_digits[i] + carry;
+ res_digits[i + 3] = (NumericDigit) (product % NBASE);
+ carry = product / NBASE;
+ }
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[1] = (NumericDigit) (carry % NBASE);
+ res_digits[0] = (NumericDigit) (carry / NBASE);
+ }
+ else
+ {
+ /* product may exceed 32 bits */
+ uint64 product;
+
+ for (int i = res_ndigits - 4; i >= 0; i--)
+ {
+ product = (uint64) factor * var_digits[i] + carry;
+ res_digits[i + 3] = (NumericDigit) (product % NBASE);
+ carry = (uint32) (product / NBASE);
+ }
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[1] = (NumericDigit) (carry % NBASE);
+ res_digits[0] = (NumericDigit) (carry / NBASE);
+ }
+
+ /* Store the product in result */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight;
+ result->sign = res_sign;
+
/* Round to target rscale (and set result->dscale) */
round_var(result, rscale);
On Wed, 3 Jul 2024 at 08:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.
I've been hacking on this a bit and trying to tidy it up. Firstly, I
moved it to a separate function, because it was starting to look messy
having so much extra code in mul_var(). Then I added a bunch more
comments to explain what's going on, and the limits of the various
variables. Note that most of the boundary checks are actually
unnecessary -- in particular all the ones in or after the main loop,
provided you pull out the first 2 result digits from the main loop in
the 3-digit case. That does seem to work very well, but...
I wasn't entirely happy with how messy that code is getting, so I
tried a different approach. Similar to div_var_int(), I tried writing
a mul_var_int() function instead. This can be used for 1 and 2 digit
factors, and we could add a similar mul_var_int64() function on
platforms with 128-bit integers. The code looks quite a lot neater, so
it's probably less likely to contain bugs (though I have just written
it in a hurry, so it might still have bugs). In testing, it seemed to
give a decent speedup, but perhaps a little less than before. But
that's to be balanced against having more maintainable code, and also
a function that might be useful elsewhere in numeric.c.
Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.
I liked v5-add-mul_var_int.patch better.
I think that *var_digits can be const too.
+ const NumericDigit *var_digits = var->digits;
Typo in the comments:
- by procssing
+ by processing
best regards,
Ranier Vilela
On Mon, Jul 1, 2024 at 6:19 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 4762.990 ms (00:04.763)
Time: 4332.166 ms (00:04.332)
Time: 4276.211 ms (00:04.276)
Time: 4247.321 ms (00:04.247)
Time: 4166.738 ms (00:04.167)
SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
Time: 4398.812 ms (00:04.399)
Time: 3672.668 ms (00:03.673)
Time: 3650.227 ms (00:03.650)
Time: 3611.420 ms (00:03.611)
Time: 3534.218 ms (00:03.534)
SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
Time: 3350.596 ms (00:03.351)
Time: 3336.224 ms (00:03.336)
Time: 3335.599 ms (00:03.336)
Time: 3336.990 ms (00:03.337)
Time: 3351.453 ms (00:03.351)
I don't have any particular technical insight on this topic, but I
just want to mention that I'm excited about the work. Numeric
performance can be painfully slow, and these seem like quite
significant speedups that will affect lots of real-world cases.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jul 3, 2024, at 13:17, Dean Rasheed wrote:
On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.
Hmm, I don't see how the case 2 code can be correct?
If, like you say, res_ndigits can't be less than 3, that means it can be 3, right?
And if res_ndigits=3 then `var2digits[res_ndigits - 4]` would try to access `var2digits[-1]`.
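A concrete sketch of such a case (assuming the numeric_mul_patched() test
function from earlier): with var1 = var2 = 1.0001, both operands have two
NBASE digits, res_weight = 0 + 0 + 2 = 2, and rscale = 4 + 4 - 20 = -12,
so (rscale + DEC_DIGITS - 1) / DEC_DIGITS = -9 / 4 = -2 (C truncation),
maxdigits = 2 + 1 - 2 + 2 = 3, and res_ndigits = Min(5, 3) = 3:
```
-- Hypothetical probe: res_ndigits = 3, so the v3 case 2 code would read
-- var2digits[-1] (undefined behavior, so the output is unpredictable).
SELECT numeric_mul_patched(1.0001, 1.0001, -20);
```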
I've been hacking on this a bit and trying to tidy it up. Firstly, I
moved it to a separate function, because it was starting to look messy
having so much extra code in mul_var(). Then I added a bunch more
comments to explain what's going on, and the limits of the various
variables. Note that most of the boundary checks are actually
unnecessary -- in particular all the ones in or after the main loop,
provided you pull out the first 2 result digits from the main loop in
the 3-digit case. That does seem to work very well, but...
Nice, I was starting to feel a bit uncomfortable with the level of increased complexity.
I wasn't entirely happy with how messy that code is getting, so I
tried a different approach. Similar to div_var_int(), I tried writing
a mul_var_int() function instead. This can be used for 1 and 2 digit
factors, and we could add a similar mul_var_int64() function on
platforms with 128-bit integers. The code looks quite a lot neater, so
it's probably less likely to contain bugs (though I have just written
it in a hurry, so it might still have bugs). In testing, it seemed to
give a decent speedup, but perhaps a little less than before. But
that's to be balanced against having more maintainable code, and also
a function that might be useful elsewhere in numeric.c.
Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.
I've tested both patches, and they produce the same output given the
same input as HEAD, when rscale is unmodified (full precision).
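For instance, a quick full-precision agreement check, using the
SQL-callable helpers from the combined test patch attached below
(example values; rscale_adjustment = 0 requests full precision):
```
-- All four implementations must agree with each other (and with the
-- * operator) when no rscale reduction is requested.
SELECT numeric_mul_head(123.4567, 0.987654321, 0) AS head,
       numeric_mul_patch_int(123.4567, 0.987654321, 0) AS mul_var_int,
       numeric_mul_patch_small(123.4567, 0.987654321, 0) AS mul_var_small,
       numeric_mul_patch_inline(123.4567, 0.987654321, 0) AS mul_var_inlined;
```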
However, for a reduced rscale, there are some differences:
mul_var_small() seems more resilient to rscale reductions than mul_var_int().
The previous version we worked on is labeled "mul_var inlined" in the output below.
```
CREATE TABLE test_numeric_mul_patched (
var1 numeric,
var2 numeric,
rscale_adjustment int,
result numeric
);
DO $$
DECLARE
var1 numeric;
var2 numeric;
BEGIN
FOR i IN 1..1000 LOOP
RAISE NOTICE '%', i;
FOR var1ndigits IN 1..4 LOOP
FOR var2ndigits IN 1..4 LOOP
FOR var1dscale IN 0..(var1ndigits*4) LOOP
FOR var2dscale IN 0..(var2ndigits*4) LOOP
FOR rscale_adjustment IN 0..(var1dscale+var2dscale) LOOP
var1 := round(random(
format('1%s',repeat('0',(var1ndigits-1)*4-1))::numeric,
format('%s',repeat('9',var1ndigits*4))::numeric
) / 10::numeric^var1dscale, var1dscale);
var2 := round(random(
format('1%s',repeat('0',(var2ndigits-1)*4-1))::numeric,
format('%s',repeat('9',var2ndigits*4))::numeric
) / 10::numeric^var2dscale, var2dscale);
INSERT INTO test_numeric_mul_patched
(var1, var2, rscale_adjustment)
VALUES
(var1, var2, -rscale_adjustment);
END LOOP;
END LOOP;
END LOOP;
END LOOP;
END LOOP;
END LOOP;
END $$;
UPDATE test_numeric_mul_patched SET result = numeric_mul_head(var1, var2, rscale_adjustment);
SELECT
rscale_adjustment,
COUNT(*),
COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_int(var1,var2,rscale_adjustment)) AS "mul_var_int",
COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_small(var1,var2,rscale_adjustment)) AS "mul_var_small",
COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_inline(var1,var2,rscale_adjustment)) AS "mul_var inlined"
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
rscale_adjustment | count | mul_var_int | mul_var_small | mul_var inlined
-------------------+---------+-------------+---------------+-----------------
-32 | 1000 | 0 | 0 | 0
-31 | 3000 | 0 | 0 | 0
-30 | 6000 | 0 | 0 | 0
-29 | 10000 | 0 | 0 | 0
-28 | 17000 | 0 | 0 | 0
-27 | 27000 | 0 | 0 | 0
-26 | 40000 | 0 | 1 | 0
-25 | 56000 | 1 | 11 | 0
-24 | 78000 | 316 | 119 | 1
-23 | 106000 | 498 | 1696 | 0
-22 | 140000 | 531 | 2480 | 1
-21 | 180000 | 591 | 3145 | 0
-20 | 230000 | 1956 | 5309 | 1
-19 | 290000 | 2189 | 5032 | 0
-18 | 360000 | 2314 | 4868 | 0
-17 | 440000 | 2503 | 4544 | 1
-16 | 533000 | 5201 | 3633 | 0
-15 | 631000 | 5621 | 3006 | 0
-14 | 734000 | 5907 | 2631 | 0
-13 | 842000 | 6268 | 2204 | 0
-12 | 957000 | 9558 | 778 | 0
-11 | 1071000 | 10597 | 489 | 0
-10 | 1184000 | 10765 | 193 | 0
-9 | 1296000 | 9452 | 0 | 0
-8 | 1408000 | 1142 | 0 | 0
-7 | 1512000 | 391 | 0 | 0
-6 | 1608000 | 235 | 0 | 0
-5 | 1696000 | 0 | 0 | 0
-4 | 1776000 | 0 | 0 | 0
-3 | 1840000 | 0 | 0 | 0
-2 | 1888000 | 0 | 0 | 0
-1 | 1920000 | 0 | 0 | 0
0 | 1936000 | 0 | 0 | 0
(33 rows)
SELECT
result - numeric_mul_patch_int(var1,var2,rscale_adjustment),
COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
?column? | count
----------------+----------
0 | 24739964
0.000000000001 | 2170
0.00000000001 | 234
0.0000000001 | 18
0.000000001 | 4
0.00000001 | 8927
0.0000001 | 882
0.000001 | 90
0.00001 | 6
0.0001 | 21963
0.001 | 2174
0.01 | 214
0.1 | 18
1 | 39336
(14 rows)
SELECT
result - numeric_mul_patch_small(var1,var2,rscale_adjustment),
COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
?column? | count
-------------------+----------
-1 | 1233
-0.01 | 9
-0.001 | 73
-0.0001 | 647
-0.000001 | 2
-0.0000001 | 9
-0.00000001 | 116
0.000000000000000 | 24775861
0.00000001 | 1035
0.00000002 | 2
0.0000001 | 96
0.000001 | 9
0.0001 | 8771
0.0002 | 3
0.001 | 952
0.01 | 69
0.1 | 10
1 | 27098
2 | 5
(19 rows)
SELECT
result - numeric_mul_patch_inline(var1,var2,rscale_adjustment),
COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
?column? | count
----------+----------
-1 | 4
0 | 24815996
(2 rows)
```
I found these two interesting to look closer at:
```
0.00000002 | 2
0.0002 | 3
SELECT
*,
numeric_mul_patch_small(var1,var2,rscale_adjustment)
FROM test_numeric_mul_patched
WHERE result - numeric_mul_patch_small(var1,var2,rscale_adjustment) IN (0.00000002, 0.0002);
var1 | var2 | rscale_adjustment | result | numeric_mul_patch_small
-------------------+----------------+-------------------+------------+-------------------------
8952.12658563 | 0.902315486665 | -16 | 8077.6425 | 8077.6423
0.881715409579 | 0.843165739371 | -16 | 0.74343223 | 0.74343221
0.905322758954 | 0.756905996850 | -16 | 0.68524423 | 0.68524421
8464.043170546608 | 0.518100129611 | -20 | 4385.2219 | 4385.2217
5253.006296984449 | 0.989308019355 | -20 | 5196.8413 | 5196.8411
(5 rows)
```
What can be said about mul_var()'s contract with regards to rscale?
It's the number of decimal digits requested by the caller, and if not
requesting full precision, then the decimal digits might not be accurate,
but can something be said about how far off they can be?
The mul_var_int() patch only produces a difference that is exactly
1 less than the exact result, at the last non-zero decimal digit.
Could the difference be more than 1 at the last non-zero digit,
like in the five cases found above?
It would be nice if we could define mul_var()'s contract with regards to
rscale, in terms of what precision can be expected in the result.
Attaching the hacked-together version with all the patches, used to do the testing above.
Regards,
Joel
Attachments:
test-mul-var-versions.patch.txttext/plain; name=test-mul-var-versions.patch.txtDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..63de5cd994 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,23 @@ static void sub_var(const NumericVar *var1, const NumericVar *var2,
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_head(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_patch_inline(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_patch_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_patch_int(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale);
+
+static void mul_var_int(const NumericVar *var, int ival, int ival_weight,
+ NumericVar *result, int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -3115,6 +3132,448 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
}
+Datum
+numeric_mul_head(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ int32 rscale_adjustment = PG_GETARG_INT32(2);
+ Numeric res;
+
+ res = numeric_mul_head_opt_error(num1, num2, rscale_adjustment, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+
+Datum
+numeric_mul_patch_inline(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ int32 rscale_adjustment = PG_GETARG_INT32(2);
+ Numeric res;
+
+ res = numeric_mul_patch_inline_opt_error(num1, num2, rscale_adjustment, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+Datum
+numeric_mul_patch_small(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ int32 rscale_adjustment = PG_GETARG_INT32(2);
+ Numeric res;
+
+ res = numeric_mul_patch_small_opt_error(num1, num2, rscale_adjustment, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+Datum
+numeric_mul_patch_int(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ int32 rscale_adjustment = PG_GETARG_INT32(2);
+ Numeric res;
+
+ res = numeric_mul_patch_int_opt_error(num1, num2, rscale_adjustment, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+
+Numeric
+numeric_mul_patch_inline_opt_error(Numeric num1, Numeric num2, int32 rscale_adjustment, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_patch_inline(&arg1, &arg2, &result, arg1.dscale + arg2.dscale + rscale_adjustment);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
+Numeric
+numeric_mul_head_opt_error(Numeric num1, Numeric num2, int32 rscale_adjustment, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_head(&arg1, &arg2, &result, arg1.dscale + arg2.dscale + rscale_adjustment);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
+Numeric
+numeric_mul_patch_small_opt_error(Numeric num1, Numeric num2, int32 rscale_adjustment, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_patch_small(&arg1, &arg2, &result, arg1.dscale + arg2.dscale + rscale_adjustment);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
+Numeric
+numeric_mul_patch_int_opt_error(Numeric num1, Numeric num2, int32 rscale_adjustment, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_patch_int(&arg1, &arg2, &result, arg1.dscale + arg2.dscale + rscale_adjustment);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
/*
* numeric_div() -
*
@@ -8864,6 +9323,1245 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
strip_var(result);
}
+static void
+mul_var_head(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
+ int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0 || var2ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+static void
+mul_var_patch_inline(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0 || var2ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Simplified fast-path computation, if var1 has three or fewer digits.
+ * This is significantly faster, since it avoids allocating a separate
+ * digit array, making multiple passes over var2, and having separate
+ * carry-propagation passes.
+ */
+ if (var1ndigits <= 3)
+ {
+ NumericDigit *res_buf;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits directly, in one pass, propagating the
+ * carry up as we go.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ carry = 0;
+ for (i = res_ndigits - 3; i >= 0; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
+ newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+ else
+ newdig = 0;
+ if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 1; i--)
+ {
+ newdig = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+
+ case 3:
+ if (res_ndigits - 5 >= 0 && res_ndigits - 5 < var2ndigits)
+ newdig = (int) var1digits[2] * var2digits[res_ndigits - 5];
+ else
+ newdig = 0;
+ if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ for (i = res_ndigits - 4; i >= 2; i--)
+ {
+ newdig = carry;
+ if (i < var2ndigits)
+ newdig += (int) var1digits[0] * var2digits[i];
+ if (i - 1 >= 0 && i - 1 < var2ndigits)
+ newdig += (int) var1digits[1] * var2digits[i - 1];
+ if (i - 2 >= 0 && i - 2 < var2ndigits)
+ newdig += (int) var1digits[2] * var2digits[i - 2];
+ res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ }
+ newdig = carry;
+ if (var2ndigits > 1)
+ newdig += (int) var1digits[0] * var2digits[1];
+ if (var2ndigits > 0)
+ newdig += (int) var1digits[1] * var2digits[0];
+ res_digits[2] = (NumericDigit) (newdig % NBASE);
+ carry = newdig / NBASE;
+ newdig = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (newdig % NBASE);
+ res_digits[0] = (NumericDigit) (newdig / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+ }
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ {
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+}
+
+
+static void
+mul_var_patch_int(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
+ int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * If var1 has just one or two digits, delegate to mul_var_int(), which
+ * uses a faster direct multiplication algorithm.
+ *
+ * TODO: Similarly, on platforms with 128-bit integers ...
+ */
+ if (var1ndigits <= 2)
+ {
+ int ifactor;
+ int ifactor_weight;
+
+ ifactor = var1->digits[0];
+ ifactor_weight = var1->weight;
+ if (var1ndigits == 2)
+ {
+ ifactor = ifactor * NBASE + var1->digits[1];
+ ifactor_weight--;
+ }
+ if (var1->sign == NUMERIC_NEG)
+ ifactor = -ifactor;
+
+ mul_var_int(var2, ifactor, ifactor_weight, result, rscale);
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+/*
+ * mul_var_int() -
+ *
+ * Multiply a numeric variable by a 32-bit integer with the specified weight.
+ * The product var * ival * NBASE^ival_weight is stored in result.
+ */
+static void
+mul_var_int(const NumericVar *var, int ival, int ival_weight,
+ NumericVar *result, int rscale)
+{
+ NumericDigit *var_digits = var->digits;
+ int var_ndigits = var->ndigits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ int maxdigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint32 factor;
+ uint32 carry;
+
+ if (ival == 0 || var_ndigits == 0)
+ {
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Determine the result sign, (maximum possible) weight and number of
+ * digits to calculate. The weight figured here is correct if the emitted
+ * product has no leading zero digits; otherwise strip_var() will fix
+ * things up.
+ */
+ if (var->sign == NUMERIC_POS)
+ res_sign = ival > 0 ? NUMERIC_POS : NUMERIC_NEG;
+ else
+ res_sign = ival > 0 ? NUMERIC_NEG : NUMERIC_POS;
+ res_weight = var->weight + ival_weight + 3;
+ /* The number of accurate result digits we need to produce: */
+ res_ndigits = var_ndigits + 3;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ res_buf = digitbuf_alloc(res_ndigits + 1);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Now compute the product digits by processing the input digits in reverse
+ * and propagating the carry up as we go.
+ *
+ * In this algorithm, the carry from one digit to the next is at most
+ * factor - 1, and product is at most factor * NBASE - 1, and so it needs
+ * to be a 64-bit integer if this exceeds UINT_MAX.
+ */
+ factor = abs(ival);
+ carry = 0;
+
+ if (factor <= UINT_MAX / NBASE)
+ {
+ /* product cannot overflow 32 bits */
+ uint32 product;
+
+ for (int i = res_ndigits - 4; i >= 0; i--)
+ {
+ product = factor * var_digits[i] + carry;
+ res_digits[i + 3] = (NumericDigit) (product % NBASE);
+ carry = product / NBASE;
+ }
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[1] = (NumericDigit) (carry % NBASE);
+ res_digits[0] = (NumericDigit) (carry / NBASE);
+ }
+ else
+ {
+ /* product may exceed 32 bits */
+ uint64 product;
+
+ for (int i = res_ndigits - 4; i >= 0; i--)
+ {
+ product = (uint64) factor * var_digits[i] + carry;
+ res_digits[i + 3] = (NumericDigit) (product % NBASE);
+ carry = (uint32) (product / NBASE);
+ }
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[1] = (NumericDigit) (carry % NBASE);
+ res_digits[0] = (NumericDigit) (carry / NBASE);
+ }
+
+ /* Store the product in result */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+static void
+mul_var_patch_small(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
+ int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * If var1 has 3 digits or fewer, delegate to mul_var_small() which uses a
+ * faster short multiplication algorithm.
+ */
+ if (var1ndigits <= 3)
+ {
+ mul_var_small(var1, var2, result, rscale);
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+/*
+ * mul_var_small() -
+ *
+ * This has the same API as mul_var, but it assumes that var1 has no more
+ * than 3 digits and var2 has at least as many digits as var1. For variables
+ * satisfying these conditions, the product can be computed more quickly than
+ * the general algorithm used in mul_var.
+ */
+static void
+mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int var1ndigits = var1->ndigits;
+ int var2ndigits = var2->ndigits;
+ NumericDigit *var1digits = var1->digits;
+ NumericDigit *var2digits = var2->digits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ int maxdigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ int carry;
+ int term;
+
+ /* Check preconditions */
+ Assert(var1ndigits <= 3);
+ Assert(var2ndigits >= var1ndigits);
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /* Determine the number of result digits to compute - see mul_var() */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits in reverse, in one pass, propagating the
+ * carry up as we go.
+ *
+ * This computes res_digits[res_ndigits - 2], ... res_digits[0] by summing
+ * the products var1digits[i1] * var2digits[i2] for which i1 + i2 + 1 is
+ * the result index.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ /* ---------
+ * 1-digit case:
+ * var1ndigits = 1
+ * var2ndigits >= 1
+ * 3 <= res_ndigits <= var2ndigits + 2
+ * ----------
+ */
+ carry = 0;
+ for (int i = res_ndigits - 3; i >= 0; i--)
+ {
+ term = (int) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ /* ---------
+ * 2-digit case:
+ * var1ndigits = 2
+ * var2ndigits >= 2
+ * 3 <= res_ndigits <= var2ndigits + 3
+ * ----------
+ */
+ /* last result digit and carry */
+ term = 0;
+ if (res_ndigits - 3 < var2ndigits)
+ term += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ if (res_ndigits > 3)
+ term += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first two */
+ for (int i = res_ndigits - 4; i >= 1; i--)
+ {
+ term = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first two digits */
+ term = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 3:
+ /* ---------
+ * 3-digit case:
+ * var1ndigits = 3
+ * var2ndigits >= 3
+ * 3 <= res_ndigits <= var2ndigits + 4
+ * ----------
+ */
+ /* last result digit and carry */
+ term = 0;
+ if (res_ndigits - 3 < var2ndigits)
+ term += (int) var1digits[0] * var2digits[res_ndigits - 3];
+ if (res_ndigits > 3 && res_ndigits - 4 < var2ndigits)
+ term += (int) var1digits[1] * var2digits[res_ndigits - 4];
+ if (res_ndigits > 4)
+ term += (int) var1digits[2] * var2digits[res_ndigits - 5];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* penultimate result digit */
+ term = carry;
+ if (res_ndigits > 3 && res_ndigits - 4 < var2ndigits)
+ term += (int) var1digits[0] * var2digits[res_ndigits - 4];
+ if (res_ndigits > 4)
+ term += (int) var1digits[1] * var2digits[res_ndigits - 5];
+ if (res_ndigits > 5)
+ term += (int) var1digits[2] * var2digits[res_ndigits - 6];
+ res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first three */
+ for (int i = res_ndigits - 5; i >= 2; i--)
+ {
+ term = (int) var1digits[0] * var2digits[i] +
+ (int) var1digits[1] * var2digits[i - 1] +
+ (int) var1digits[2] * var2digits[i - 2] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first three digits */
+ term = (int) var1digits[0] * var2digits[1] +
+ (int) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ term = (int) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
/*
* div_var() -
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d4ac578ae6..c85ed20a99 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4465,6 +4465,18 @@
{ oid => '1726',
proname => 'numeric_mul', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_mul' },
+{ oid => '6347',
+ proname => 'numeric_mul_head', prorettype => 'numeric',
+ proargtypes => 'numeric numeric int4', prosrc => 'numeric_mul_head' },
+{ oid => '6348',
+ proname => 'numeric_mul_patch_inline', prorettype => 'numeric',
+ proargtypes => 'numeric numeric int4', prosrc => 'numeric_mul_patch_inline' },
+{ oid => '6349',
+ proname => 'numeric_mul_patch_int', prorettype => 'numeric',
+ proargtypes => 'numeric numeric int4', prosrc => 'numeric_mul_patch_int' },
+{ oid => '6350',
+ proname => 'numeric_mul_patch_small', prorettype => 'numeric',
+ proargtypes => 'numeric numeric int4', prosrc => 'numeric_mul_patch_small' },
{ oid => '1727',
proname => 'numeric_div', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
diff --git a/src/include/utils/numeric.h b/src/include/utils/numeric.h
index 43c75c436f..2bc400c741 100644
--- a/src/include/utils/numeric.h
+++ b/src/include/utils/numeric.h
@@ -97,6 +97,14 @@ extern Numeric numeric_sub_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mul_opt_error(Numeric num1, Numeric num2,
bool *have_error);
+extern Numeric numeric_mul_head_opt_error(Numeric num1, Numeric num2,
+ int32 rscale_adjustment, bool *have_error);
+extern Numeric numeric_mul_patch_inline_opt_error(Numeric num1, Numeric num2,
+ int32 rscale_adjustment, bool *have_error);
+extern Numeric numeric_mul_patch_int_opt_error(Numeric num1, Numeric num2,
+ int32 rscale_adjustment, bool *have_error);
+extern Numeric numeric_mul_patch_small_opt_error(Numeric num1, Numeric num2,
+ int32 rscale_adjustment, bool *have_error);
extern Numeric numeric_div_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mod_opt_error(Numeric num1, Numeric num2,
diff --git a/test-mul-var.sql b/test-mul-var.sql
new file mode 100644
index 0000000000..cf7c55a089
--- /dev/null
+++ b/test-mul-var.sql
@@ -0,0 +1,50 @@
+CREATE TABLE test_numeric_mul_patched (
+ var1 numeric,
+ var2 numeric,
+ rscale_adjustment int,
+ result numeric
+);
+
+DO $$
+DECLARE
+var1 numeric;
+var2 numeric;
+BEGIN
+ FOR i IN 1..1000 LOOP
+ RAISE NOTICE '%', i;
+ FOR var1ndigits IN 1..4 LOOP
+ FOR var2ndigits IN 1..4 LOOP
+ FOR var1dscale IN 0..(var1ndigits*4) LOOP
+ FOR var2dscale IN 0..(var2ndigits*4) LOOP
+ FOR rscale_adjustment IN 0..(var1dscale+var2dscale) LOOP
+ var1 := round(random(
+ format('1%s',repeat('0',(var1ndigits-1)*4-1))::numeric,
+ format('%s',repeat('9',var1ndigits*4))::numeric
+ ) / 10::numeric^var1dscale, var1dscale);
+ var2 := round(random(
+ format('1%s',repeat('0',(var2ndigits-1)*4-1))::numeric,
+ format('%s',repeat('9',var2ndigits*4))::numeric
+ ) / 10::numeric^var2dscale, var2dscale);
+ INSERT INTO test_numeric_mul_patched
+ (var1, var2, rscale_adjustment)
+ VALUES
+ (var1, var2, -rscale_adjustment);
+ END LOOP;
+ END LOOP;
+ END LOOP;
+ END LOOP;
+ END LOOP;
+ END LOOP;
+END $$;
+
+UPDATE test_numeric_mul_patched SET result = numeric_mul_head(var1, var2, rscale_adjustment);
+
+SELECT
+ rscale_adjustment,
+ COUNT(*),
+ COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_int(var1,var2,rscale_adjustment)) AS "mul_var_int",
+ COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_small(var1,var2,rscale_adjustment)) AS "mul_var_small",
+ COUNT(*) FILTER (WHERE result IS DISTINCT FROM numeric_mul_patch_inline(var1,var2,rscale_adjustment)) AS "mul_var inlined"
+FROM test_numeric_mul_patched
+GROUP BY 1
+ORDER BY 1;
On Wed, 3 Jul 2024 at 14:49, Joel Jacobson <joel@compiler.org> wrote:
Hmm, I don't see how the case 2 code can be correct?
If, like you say, res_ndigits can't be less than 3, that means it can be 3, right?
And if res_ndigits=3 then `var2digits[res_ndigits - 4]` would try to access `var2digits[-1]`.
Ah yes, I think I was looking at a newer version of the code where I'd
already fixed that bug. Unless you think there are still bugs in any
of the boundary checks, which is entirely possible.
I've tested both patches, and they produce the same output given the
same input as HEAD, when rscale is unmodified (full precision).
However, for a reduced rscale, there are some differences:
mul_var_small() seems more resilient to rscale reductions than mul_var_int().
Ah, I can see what's going on. It's perhaps best illustrated with a
simple example. Suppose you are multiplying a 4-digit integer x by a
2-digit integer y (both with dscale=0). Then the terms of the full
product computed by mul_var() or mul_var_small() would look something
like this:
x0 x1 x2 x3
* y0 y1
---------------------------------
x0*y0 x1*y0 x2*y0 x3*y0
x0*y1 x1*y1 x2*y1 x3*y1
In the reduced-rscale case, it might perform a truncated computation,
computing just the first 3 columns (say), and discarding the last two
columns. Therefore it would skip the 3 rightmost digit products.
However, in mul_var_int(), y0 and y1 have been combined into a single
integer equal to y0*NBASE+y1, and the terms of the full product are
computed as follows:
x0*(y0*NBASE+y1) x1*(y0*NBASE+y1) x2*(y0*NBASE+y1) x3*(y0*NBASE+y1)
In the full product, that gives the same result, but if you follow the
same rule in the reduced-rscale case, skipping the last two terms, it
would actually discard 4 digit products, making it less accurate.
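To make that concrete, here's a self-contained toy (not patch code; NBASE shrunk to 10 so everything fits in plain integers, digits stored most-significant-first) that applies both truncation rules to the worst-case 4x2-digit product:
```
#include <stdio.h>

#define B 10	/* toy NBASE, shrunk from 10000 for readability */

int
main(void)
{
	int		x[4] = {9, 9, 9, 9};	/* x0..x3, x0 most significant: 9999 */
	int		y[2] = {9, 9};			/* y0..y1: 99 */
	long	full = 0, schoolbook = 0, combined = 0;

	for (int i1 = 0; i1 < 4; i1++)
		for (int i2 = 0; i2 < 2; i2++)
		{
			long	term = (long) x[i1] * y[i2];
			int		col = i1 + i2;	/* result column, 0 = leftmost */
			long	weight = 1;

			for (int w = 0; w < 4 - col; w++)
				weight *= B;

			full += term * weight;
			if (col <= 2)		/* mul_var()-style: keep first 3 columns */
				schoolbook += term * weight;
			if (i1 <= 1)		/* mul_var_int()-style: keep first 2 terms */
				combined += term * weight;
		}

	printf("full       = %ld\n", full);			/* 989901 */
	printf("schoolbook = %ld\n", schoolbook);	/* 988200: 3 products dropped */
	printf("combined   = %ld\n", combined);		/* 980100: 4 products dropped */
	return 0;
}
```
Both rules drop the two rightmost columns, but the combined-integer rule also drops x2*y0, which lands in a kept column; that is the extra discarded digit product.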
That could be avoided by increasing maxdigits by 1 in mul_var_int() so
it would always be at least as accurate as it was before, but that
might not really be necessary. However, if we implemented
mul_var_int64() in the same way, it would be discarding much
higher-order digit products, and so we probably would have to increase
maxdigits to get sufficiently accurate results. But there's an even
bigger problem: the results would be different between platforms that
did and didn't have 128-bit integers, which I really don't like. We
could avoid that by not using it in reduced-rscale cases, but that
would involve another test condition.
By contrast, mul_var_small() is intended to replicate the arithmetic
in mul_var() exactly (but in a different order) for all rscales. So if
you've found any cases where they give different results, that's a
bug.
In light of that, it might be that mul_var_small() is the better
option, rather than mul_var_int(), but it'd be interesting to see how
they compare in terms of performance first.
What can be said about mul_var()'s contract with regards to rscale?
It's the number of decimal digits requested by the caller, and if not
requesting full precision, then the decimal digits might not be accurate,
but can something be said about how far off they can be?
I wouldn't expect it to ever be off by more than 1, given that
MUL_GUARD_DIGITS = 2, which corresponds to 8 decimal digits, and the
number of digits in the smaller input (and hence the number of digit
products in each column) is limited to something like 16,000 NBASE
digits.
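A rough back-of-envelope version of that bound, under those assumptions (at most n ~ 16,000 digit products per column, NBASE = 10000, MUL_GUARD_DIGITS = 2):
```
omitted tail      <  ~n * NBASE^2              (in units of the first omitted column)
1 ulp at rscale   =  NBASE^(MUL_GUARD_DIGITS + 1)   (same units)

error at the rscale digit  ~  n * NBASE^2 / NBASE^3
                           =  n / NBASE
                           <= 16000 / 10000  =  1.6
```
So, if this sketch is right, an error of 2 or more at the last digit shouldn't be reachable.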
Regards,
Dean
On Wed, Jul 3, 2024, at 15:48, Joel Jacobson wrote:
On Wed, Jul 3, 2024, at 13:17, Dean Rasheed wrote:
On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:case 2:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];The problem here is that res_ndigits could become less than 4,
Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.Hmm, I don't see how the case 2 code can be correct?
If, like you say, res_ndigits can't be less than 3, that means it can
be 3, right?
And if res_ndigits=3 then `var2digits[res_ndigits - 4]` would try to
access `var2digits[-1]`.
Here is an example of how to trigger the bug:
```
case 2:
if (res_ndigits - 4 < 0)
{
printf("var1=%s\n",get_str_from_var(var1));
printf("var2=%s\n",get_str_from_var(var2));
printf("rscale=%d\n", rscale);
printf("res_ndigits - 4 < 0 => var2digits[%d]=%d\n", res_ndigits - 4, var2digits[res_ndigits - 4]);
}
```
Running through my tests, I hit lots of cases, including:
var1=0.10968501
var2=0.903728177113
rscale=0
res_ndigits - 4 < 0 => var2digits[-1]=-31105
All of the spotted cases had rscale=0.
If we know that mul_var() will never be called with rscale=0 when dealing with decimal inputs, perhaps we should enforce this with an Assert(), to prevent the otherwise possible out-of-bounds access (negative indexing) and provide early detection?
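A sketch of what such an early-detection check might look like, placed at the top of mul_var() (hypothetical; it encodes the assumption above rather than any documented contract):
```
/* Hypothetical guard: callers must not request rscale = 0 when either
 * input carries fractional digits (assumption, not a documented rule) */
Assert(rscale > 0 || (var1->dscale == 0 && var2->dscale == 0));
```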
Regards,
Joel
On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
Ah yes, I think I was looking at a newer version of the code where I'd
already fixed that bug. Unless you think there are still bugs in any
of the boundary checks, which is entirely possible.
Ah, that explains it.
And no, I can't find any other bugs in the boundary checks.
Ah, I can see what's going on. It's perhaps best illustrated with a
simple example. Suppose you are multiplying a 4-digit integer x by a
2-digit integer y (both with dscale=0). Then the terms of the full
product computed by mul_var() or mul_var_small() would look something
like this:

          x0    x1    x2    x3
    *                 y0    y1
    --------------------------------------
          x0*y0 x1*y0 x2*y0 x3*y0
                x0*y1 x1*y1 x2*y1 x3*y1

In the reduced-rscale case, it might perform a truncated computation,
computing just the first 3 columns (say), and discarding the last two
columns. Therefore it would skip the 3 rightmost digit products.
However, in mul_var_int(), y0 and y1 have been combined into a single
integer equal to y0*NBASE+y1, and the terms of the full product are
computed as follows:

    x0*(y0*NBASE+y1)  x1*(y0*NBASE+y1)  x2*(y0*NBASE+y1)  x3*(y0*NBASE+y1)

In the full product, that gives the same result, but if you follow the
same rule in the reduced-rscale case, skipping the last two terms, it
would actually discard 4 digit products, making it less accurate.
That could be avoided by increasing maxdigits by 1 in mul_var_int() so
it would always be at least as accurate as it was before, but that
might not really be necessary. However, if we implemented
mul_var_int64() in the same way, it would be discarding much
higher-order digit products, and so we probably would have to increase
maxdigits to get sufficiently accurate results. But there's an even
bigger problem: the results would be different between platforms that
did and didn't have 128-bit integers, which I really don't like. We
could avoid that by not using it in reduced-rscale cases, but that
would involve another test condition.
By contrast, mul_var_small() is intended to replicate the arithmetic
in mul_var() exactly (but in a different order) for all rscales. So if
you've found any cases where they give different results, that's a
bug.
In light of that, it might be that mul_var_small() is the better
option, rather than mul_var_int(), but it'd be interesting to see how
they compare in terms of performance first.
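To make that concrete, here is a small self-contained sketch (my own
illustration with made-up digit values, not code from any of the patches)
of the column sums in the 4-by-2 example above, showing which digit
products each truncation strategy discards:

```c
#include <stdio.h>

#define NBASE 10000

int
main(void)
{
	/* hypothetical digit arrays, most significant digit first */
	int			x[4] = {1, 2, 3, 4};
	int			y[2] = {5, 6};
	long long	colsum[5] = {0};

	/* column i1 + i2 collects all digit products of equal weight */
	for (int i1 = 0; i1 < 4; i1++)
		for (int i2 = 0; i2 < 2; i2++)
			colsum[i1 + i2] += (long long) x[i1] * y[i2];

	/*
	 * Keeping only the first 3 columns (mul_var-style truncation) drops
	 * x2*y1, x3*y0 and x3*y1 -- 3 digit products.  Dropping the x2 and
	 * x3 terms of x[i] * (y0*NBASE + y1) (mul_var_int-style truncation)
	 * drops x2*y0 as well -- 4 digit products -- hence the need for a
	 * larger maxdigits to match the accuracy.
	 */
	for (int c = 0; c < 5; c++)
		printf("column %d: %lld\n", c, colsum[c]);

	return 0;
}
```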
Thanks for explaining, very helpful.
I agree on your reasoning about the pros and cons.
Not sure yet which version I prefer. Let's see how it evolves.
I've done some benchmarks.
Haven't tested Intel and AMD yet, but this is what I get on my Apple M3 Max:
--
-- varndigits=1
--
-- HEAD
SELECT SUM(numeric_mul_head(var1,var2,0)) FROM bench_mul_var_var1ndigits_1;
Time: 2976.896 ms (00:02.977)
Time: 2984.759 ms (00:02.985)
Time: 2970.364 ms (00:02.970)
-- mul_var_int() patch
SELECT SUM(numeric_mul_patch_int(var1,var2,0)) FROM bench_mul_var_var1ndigits_1;
Time: 2790.227 ms (00:02.790)
Time: 2786.338 ms (00:02.786)
Time: 2784.957 ms (00:02.785)
-- mul_var_small() patch
SELECT SUM(numeric_mul_patch_small(var1,var2,0)) FROM bench_mul_var_var1ndigits_1;
Time: 2770.211 ms (00:02.770)
Time: 2760.685 ms (00:02.761)
Time: 2773.221 ms (00:02.773)
--
-- varndigits=2
--
-- HEAD
SELECT SUM(numeric_mul_head(var1,var2,0)) FROM bench_mul_var_var1ndigits_2;
Time: 3353.258 ms (00:03.353)
Time: 3273.055 ms (00:03.273)
Time: 3266.392 ms (00:03.266)
-- mul_var_int() patch
SELECT SUM(numeric_mul_patch_int(var1,var2,0)) FROM bench_mul_var_var1ndigits_2;
Time: 2694.169 ms (00:02.694)
Time: 2687.935 ms (00:02.688)
Time: 2692.398 ms (00:02.692)
-- mul_var_small() patch
SELECT SUM(numeric_mul_patch_small(var1,var2,0)) FROM bench_mul_var_var1ndigits_2;
Time: 2997.685 ms (00:02.998)
Time: 2984.418 ms (00:02.984)
Time: 2986.976 ms (00:02.987)
--
-- varndigits=3
--
-- HEAD
SELECT SUM(numeric_mul_head(var1,var2,0)) FROM bench_mul_var_var1ndigits_3;
Time: 3471.391 ms (00:03.471)
Time: 3384.114 ms (00:03.384)
Time: 3387.031 ms (00:03.387)
-- mul_var_int() patch
SELECT SUM(numeric_mul_patch_int(var1,var2,0)) FROM bench_mul_var_var1ndigits_3;
Time: 3384.428 ms (00:03.384)
Time: 3398.044 ms (00:03.398)
Time: 3393.727 ms (00:03.394)
-- mul_var_small() patch
SELECT SUM(numeric_mul_patch_small(var1,var2,0)) FROM bench_mul_var_var1ndigits_3;
Time: 3100.567 ms (00:03.101)
Time: 3114.225 ms (00:03.114)
Time: 3116.137 ms (00:03.116)
Interesting: mul_var_small() seems to be the winner for var1ndigits=3,
mul_var_int() the winner for var1ndigits=2,
and they're about the same for var1ndigits=1.
What can be said about mul_var()'s contract with regards to rscale?
It's the number of decimal digits requested by the caller, and if not
requesting full precision, then the decimal digits might not be accurate,
but can something be said about how far off they can be?
I wouldn't expect it to ever be off by more than 1, given that
MUL_GUARD_DIGITS = 2, which corresponds to 8 decimal digits, and the
number of digits in the smaller input (and hence the number of digit
products in each column) is limited to something like 16,000 NBASE
digits.
OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
are unexpected?
Regards,
Joel
On Wed, Jul 3, 2024, at 22:27, Joel Jacobson wrote:
On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
I wouldn't expect it to ever be off by more than 1, given that
MUL_GUARD_DIGITS = 2, which corresponds to 8 decimal digits, and the
number of digits in the smaller input (and hence the number of digit
products in each column) is limited to something like 16,000 NBASE
digits.
OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
are unexpected?
Sorry, I meant off by 2 for the mul_var_small() patch, these cases that I found:
var1 | var2 | rscale_adjustment | result | numeric_mul_patch_small
-------------------+----------------+-------------------+------------+-------------------------
8952.12658563 | 0.902315486665 | -16 | 8077.6425 | 8077.6423
0.881715409579 | 0.843165739371 | -16 | 0.74343223 | 0.74343221
0.905322758954 | 0.756905996850 | -16 | 0.68524423 | 0.68524421
8464.043170546608 | 0.518100129611 | -20 | 4385.2219 | 4385.2217
5253.006296984449 | 0.989308019355 | -20 | 5196.8413 | 5196.8411
(5 rows)
Regards,
Joel
On Wed, Jul 3, 2024, at 13:17, Dean Rasheed wrote:
Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.
Regards,
Dean
Attachments:
* v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
* v5-add-mul_var_int.patch
I've now benchmarked the patches on all my machines,
see bench_mul_var.sql for details.
Summary of benchmark results:
cpu | var1ndigits | winner
----------------------+-------------+-------------------------------------------------------------
AMD Ryzen 9 7950X3D | 1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
AMD Ryzen 9 7950X3D | 2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
AMD Ryzen 9 7950X3D | 3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Apple M3 Max | 1 | v5-add-mul_var_int.patch
Apple M3 Max | 2 | v5-add-mul_var_int.patch
Apple M3 Max | 3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Intel Core i9-14900K | 1 | v5-add-mul_var_int.patch
Intel Core i9-14900K | 2 | v5-add-mul_var_int.patch
Intel Core i9-14900K | 3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
(9 rows)
Performance ratio against HEAD per CPU and var1ndigits:
cpu | var1ndigits | version | performance_ratio
----------------------+-------------+-------------------------------------------------------------+-------------------
AMD Ryzen 9 7950X3D | 1 | HEAD | 1.00
AMD Ryzen 9 7950X3D | 1 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.11
AMD Ryzen 9 7950X3D | 1 | v5-add-mul_var_int.patch | 1.07
AMD Ryzen 9 7950X3D | 1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.12
AMD Ryzen 9 7950X3D | 2 | HEAD | 1.00
AMD Ryzen 9 7950X3D | 2 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.10
AMD Ryzen 9 7950X3D | 2 | v5-add-mul_var_int.patch | 1.11
AMD Ryzen 9 7950X3D | 2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.13
AMD Ryzen 9 7950X3D | 3 | HEAD | 1.00
AMD Ryzen 9 7950X3D | 3 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.10
AMD Ryzen 9 7950X3D | 3 | v5-add-mul_var_int.patch | 0.98
AMD Ryzen 9 7950X3D | 3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.15
Apple M3 Max | 1 | HEAD | 1.00
Apple M3 Max | 1 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.07
Apple M3 Max | 1 | v5-add-mul_var_int.patch | 1.08
Apple M3 Max | 1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.07
Apple M3 Max | 2 | HEAD | 1.00
Apple M3 Max | 2 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.09
Apple M3 Max | 2 | v5-add-mul_var_int.patch | 1.21
Apple M3 Max | 2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.10
Apple M3 Max | 3 | HEAD | 1.00
Apple M3 Max | 3 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.09
Apple M3 Max | 3 | v5-add-mul_var_int.patch | 0.99
Apple M3 Max | 3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.09
Intel Core i9-14900K | 1 | HEAD | 1.00
Intel Core i9-14900K | 1 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.05
Intel Core i9-14900K | 1 | v5-add-mul_var_int.patch | 1.07
Intel Core i9-14900K | 1 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.06
Intel Core i9-14900K | 2 | HEAD | 1.00
Intel Core i9-14900K | 2 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.06
Intel Core i9-14900K | 2 | v5-add-mul_var_int.patch | 1.08
Intel Core i9-14900K | 2 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.06
Intel Core i9-14900K | 3 | HEAD | 1.00
Intel Core i9-14900K | 3 | v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.04
Intel Core i9-14900K | 3 | v5-add-mul_var_int.patch | 1.00
Intel Core i9-14900K | 3 | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch | 1.04
(36 rows)
The queries to produce the above are in bench_csv_queries.txt
/Joel
On Thu, Jul 4, 2024, at 09:38, Joel Jacobson wrote:
Summary of benchmark results:
cpu                  | var1ndigits | winner
----------------------+-------------+-------------------------------------------------------------
...
AMD Ryzen 9 7950X3D  | 3           | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
...
Apple M3 Max         | 3           | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
...
Intel Core i9-14900K | 3           | v5-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Since v5-add-mul_var_int.patch only implements (var1ndigits <= 2)
it can't possibly win the var1ndigits=3 competition.
/Joel
On Wed, 3 Jul 2024 at 21:45, Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
I wouldn't expect it to ever be off by more than 1
OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
are unexpected?
Sorry, I meant off by 2 for the mul_var_small() patch, these cases that I found:
Yeah, so that was another bug in mul_var_small(). If rscale is made
small enough, the result index for the digits computed before the main
loop overlaps the ones after, so it would overwrite digits already
computed.
Of course, that's fairly easy to fix, but at this point I think the
better solution is to only use mul_var_small() when an exact product
is requested. We would have to do that for mul_var_int() anyway,
because of its accuracy issues discussed earlier. I think this is a
reasonable thing to do because only functions like ln_var() and
exp_var() will ask mul_var() for a reduced-rscale result, and those
functions are likely to be dominated by computations involving larger
numbers, for which this patch wouldn't help anyway. Also those
functions are probably less widely used.
If we make that decision, a lot of the complexity in mul_var_small()
goes away, including all the conditional array accesses, making it
simpler and more efficient. v6 patch attached.
I also updated the mul_var_int() patch so that it is also only invoked
when an exact product is requested, and I noticed a couple of other
minor optimisations that could be made. Then I decided to try
implementing mul_var_int64(). This gives a pretty decent speedup for
3-digit inputs, but unfortunately it is much slower for 4-digit inputs
(for which most values will go through the 128-bit code path). I'm
attaching that too, just for information, but it's clearly not going
to be acceptable as-is.
Running your benchmark queries, I got these results:
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1;
Time: 4520.874 ms (00:04.521) -- HEAD
Time: 3937.536 ms (00:03.938) -- v5-mul_var_int.patch
Time: 3919.495 ms (00:03.919) -- v5-mul_var_small.patch
Time: 3916.964 ms (00:03.917) -- v6-mul_var_int64.patch
Time: 3811.118 ms (00:03.811) -- v6-mul_var_small.patch
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2;
Time: 4762.528 ms (00:04.763) -- HEAD
Time: 4075.546 ms (00:04.076) -- v5-mul_var_int.patch
Time: 4055.180 ms (00:04.055) -- v5-mul_var_small.patch
Time: 4037.866 ms (00:04.038) -- v6-mul_var_int64.patch
Time: 4018.488 ms (00:04.018) -- v6-mul_var_small.patch
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3;
Time: 5387.514 ms (00:05.388) -- HEAD
Time: 5350.736 ms (00:05.351) -- v5-mul_var_int.patch
Time: 4648.449 ms (00:04.648) -- v5-mul_var_small.patch
Time: 4655.204 ms (00:04.655) -- v6-mul_var_int64.patch
Time: 4645.962 ms (00:04.646) -- v6-mul_var_small.patch
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
Time: 5617.150 ms (00:05.617) -- HEAD
Time: 5505.913 ms (00:05.506) -- v5-mul_var_int.patch
Time: 5486.441 ms (00:05.486) -- v5-mul_var_small.patch
Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch
Time: 5598.909 ms (00:05.599) -- v6-mul_var_small.patch
So v6-mul_var_int64 improves on v5-mul_var_int in the 3-digit case,
but is terrible in the 4-digit case. None of the other patches touch
the 4-digit case, but it might be interesting to try mul_var_small()
with 4 digits.
Regards,
Dean
Attachments:
v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchtext/x-patch; charset=US-ASCII; name=v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..bae07f2
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,8 @@ static void sub_var(const NumericVar *va
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -8707,7 +8709,7 @@ mul_var(const NumericVar *var1, const Nu
var1digits = var1->digits;
var2digits = var2->digits;
- if (var1ndigits == 0 || var2ndigits == 0)
+ if (var1ndigits == 0)
{
/* one or both inputs is zero; so is result */
zero_var(result);
@@ -8715,6 +8717,17 @@ mul_var(const NumericVar *var1, const Nu
return;
}
+ /*
+ * If var1 has 3 digits or fewer, and we are computing the exact result,
+ * with no rounding, delegate to mul_var_small() which uses a faster short
+ * multiplication algorithm.
+ */
+ if (var1ndigits <= 3 && rscale == var1->dscale + var2->dscale)
+ {
+ mul_var_small(var1, var2, result, rscale);
+ return;
+ }
+
/* Determine result sign and (maximum possible) weight */
if (var1->sign == var2->sign)
res_sign = NUMERIC_POS;
@@ -8858,6 +8871,168 @@ mul_var(const NumericVar *var1, const Nu
result->sign = res_sign;
/* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+/*
+ * mul_var_small() -
+ *
+ * This has the same API as mul_var, but it assumes that var1 has no more
+ * than 3 digits and var2 has at least as many digits as var1. For variables
+ * satisfying these conditions, the product can be computed more quickly than
+ * the general algorithm used in mul_var.
+ */
+static void
+mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int var1ndigits = var1->ndigits;
+ int var2ndigits = var2->ndigits;
+ NumericDigit *var1digits = var1->digits;
+ NumericDigit *var2digits = var2->digits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint32 carry;
+ uint32 term;
+
+ /* Check preconditions */
+ Assert(var1ndigits <= 3);
+ Assert(var2ndigits >= var1ndigits);
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /* Determine the number of result digits to compute - cf. mul_var() */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits in reverse, in one pass, propagating the
+ * carry up as we go.
+ *
+ * This computes res_digits[res_ndigits - 2], ... res_digits[0] by summing
+ * the products var1digits[i1] * var2digits[i2] for which i1 + i2 + 1 is
+ * the result index.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ /* ---------
+ * 1-digit case:
+ * var1ndigits = 1
+ * var2ndigits >= 1
+ * res_ndigits = var2ndigits + 2
+ * ----------
+ */
+ carry = 0;
+ for (int i = res_ndigits - 3; i >= 0; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ /* ---------
+ * 2-digit case:
+ * var1ndigits = 2
+ * var2ndigits >= 2
+ * res_ndigits = var2ndigits + 3
+ * ----------
+ */
+ /* last result digit and carry */
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 4];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first two */
+ for (int i = res_ndigits - 4; i >= 1; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first two digits */
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 3:
+ /* ---------
+ * 3-digit case:
+ * var1ndigits = 3
+ * var2ndigits >= 3
+ * res_ndigits = var2ndigits + 4
+ * ----------
+ */
+ /* last two result digits */
+ term = (uint32) var1digits[2] * var2digits[res_ndigits - 5];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 5] +
+ (uint32) var1digits[2] * var2digits[res_ndigits - 6] + carry;
+ res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first three */
+ for (int i = res_ndigits - 5; i >= 2; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] +
+ (uint32) var1digits[2] * var2digits[i - 2] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first three digits */
+ term = (uint32) var1digits[0] * var2digits[1] +
+ (uint32) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
round_var(result, rscale);
/* Strip leading and trailing zeroes */
v6-add-mul_var_int64.patchtext/x-patch; charset=US-ASCII; name=v6-add-mul_var_int64.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..96456d8
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,12 @@ static void sub_var(const NumericVar *va
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_int(const NumericVar *var, int ival, int ival_weight,
+ NumericVar *result, int rscale);
+#ifdef HAVE_INT128
+static void mul_var_int64(const NumericVar *var, int64 ival, int ival_weight,
+ NumericVar *result, int rscale);
+#endif
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -8707,7 +8713,7 @@ mul_var(const NumericVar *var1, const Nu
var1digits = var1->digits;
var2digits = var2->digits;
- if (var1ndigits == 0 || var2ndigits == 0)
+ if (var1ndigits == 0)
{
/* one or both inputs is zero; so is result */
zero_var(result);
@@ -8715,6 +8721,54 @@ mul_var(const NumericVar *var1, const Nu
return;
}
+ /*
+ * If var1 has just one or two digits, and we are computing the exact
+ * result, with no rounding, delegate to mul_var_int(), which uses a
+ * faster direct multiplication algorithm.
+ *
+ * Similarly, on platforms with 128-bit integer support, delegate to
+ * mul_var_int64() if var1 has three or four digits, and we are computing
+ * the exact result, with no rounding.
+ */
+ if (var1ndigits <= 2 && rscale == var1->dscale + var2->dscale)
+ {
+ int ifactor;
+ int ifactor_weight;
+
+ ifactor = var1->digits[0];
+ ifactor_weight = var1->weight;
+ if (var1ndigits == 2)
+ {
+ ifactor = ifactor * NBASE + var1->digits[1];
+ ifactor_weight--;
+ }
+ if (var1->sign == NUMERIC_NEG)
+ ifactor = -ifactor;
+
+ mul_var_int(var2, ifactor, ifactor_weight, result, rscale);
+ return;
+ }
+#ifdef HAVE_INT128
+ if (var1ndigits <= 4 && rscale == var1->dscale + var2->dscale)
+ {
+ int64 ifactor;
+ int ifactor_weight;
+
+ ifactor = var1->digits[0];
+ ifactor_weight = var1->weight;
+ for (i = 1; i < var1ndigits; i++)
+ {
+ ifactor = ifactor * NBASE + var1->digits[i];
+ ifactor_weight--;
+ }
+ if (var1->sign == NUMERIC_NEG)
+ ifactor = -ifactor;
+
+ mul_var_int64(var2, ifactor, ifactor_weight, result, rscale);
+ return;
+ }
+#endif
+
/* Determine result sign and (maximum possible) weight */
if (var1->sign == var2->sign)
res_sign = NUMERIC_POS;
@@ -8866,6 +8920,233 @@ mul_var(const NumericVar *var1, const Nu
/*
+ * mul_var_int() -
+ *
+ * Multiply a numeric variable by a 32-bit integer with the specified weight.
+ * The product var * ival * NBASE^ival_weight is stored in result.
+ */
+static void
+mul_var_int(const NumericVar *var, int ival, int ival_weight,
+ NumericVar *result, int rscale)
+{
+ NumericDigit *var_digits = var->digits;
+ int var_ndigits = var->ndigits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint32 factor;
+ uint32 carry;
+
+ if (ival == 0 || var_ndigits == 0)
+ {
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Determine the result sign, (maximum possible) weight and number of
+ * digits to calculate. The weight figured here is correct if the emitted
+ * product has no leading zero digits; otherwise strip_var() will fix
+ * things up.
+ */
+ if (var->sign == NUMERIC_POS)
+ res_sign = ival > 0 ? NUMERIC_POS : NUMERIC_NEG;
+ else
+ res_sign = ival > 0 ? NUMERIC_NEG : NUMERIC_POS;
+ res_weight = var->weight + ival_weight + 3;
+ /* The number of accurate result digits we need to produce: */
+ res_ndigits = var_ndigits + 3;
+
+ res_buf = digitbuf_alloc(res_ndigits + 1);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Now compute the product digits by processing the input digits in
+ * reverse and propagating the carry up as we go.
+ *
+ * In this algorithm, the carry from one digit to the next is at most
+ * factor - 1, and product is at most factor * NBASE - 1, and so it needs
+ * to be a 64-bit integer if this exceeds UINT_MAX.
+ */
+ factor = abs(ival);
+ carry = 0;
+
+ if (factor <= UINT_MAX / NBASE)
+ {
+ /* product cannot overflow 32 bits */
+ uint32 product;
+
+ for (int i = res_ndigits - 4; i >= 0; i--)
+ {
+ product = factor * var_digits[i] + carry;
+ res_digits[i + 3] = (NumericDigit) (product % NBASE);
+ carry = product / NBASE;
+ }
+ /* note: carry < UINT_MAX / NBASE in this branch */
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ Assert(carry < NBASE);
+ res_digits[1] = (NumericDigit) carry;
+ res_digits[0] = 0;
+ }
+ else
+ {
+ /* product may exceed 32 bits */
+ uint64 product;
+
+ for (int i = res_ndigits - 4; i >= 0; i--)
+ {
+ product = (uint64) factor * var_digits[i] + carry;
+ res_digits[i + 3] = (NumericDigit) (product % NBASE);
+ carry = (uint32) (product / NBASE);
+ }
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[1] = (NumericDigit) (carry % NBASE);
+ res_digits[0] = (NumericDigit) (carry / NBASE);
+ }
+
+ /* Store the product in result */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+#ifdef HAVE_INT128
+/*
+ * mul_var_int64() -
+ *
+ * Multiply a numeric variable by a 64-bit integer with the specified weight.
+ * The product var * ival * NBASE^ival_weight is stored in result.
+ *
+ * This duplicates the logic in mul_var_int(), so any changes made there
+ * should be made here too.
+ */
+static void
+mul_var_int64(const NumericVar *var, int64 ival, int ival_weight,
+ NumericVar *result, int rscale)
+{
+ NumericDigit *var_digits = var->digits;
+ int var_ndigits = var->ndigits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint64 factor;
+ uint64 carry;
+
+ if (ival == 0 || var_ndigits == 0)
+ {
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Determine the result sign, (maximum possible) weight and number of
+ * digits to calculate. The weight figured here is correct if the emitted
+ * product has no leading zero digits; otherwise strip_var() will fix
+ * things up.
+ */
+ if (var->sign == NUMERIC_POS)
+ res_sign = ival > 0 ? NUMERIC_POS : NUMERIC_NEG;
+ else
+ res_sign = ival > 0 ? NUMERIC_NEG : NUMERIC_POS;
+ res_weight = var->weight + ival_weight + 5;
+ /* The number of accurate result digits we need to produce: */
+ res_ndigits = var_ndigits + 5;
+
+ res_buf = digitbuf_alloc(res_ndigits + 1);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Now compute the product digits by processing the input digits in
+ * reverse and propagating the carry up as we go.
+ *
+ * In this algorithm, the carry from one digit to the next is at most
+ * factor - 1, and product is at most factor * NBASE - 1, and so it needs
+ * to be a 128-bit integer if this exceeds PG_UINT64_MAX.
+ */
+ factor = i64abs(ival);
+ carry = 0;
+
+ if (factor <= PG_UINT64_MAX / NBASE)
+ {
+ /* product cannot overflow 64 bits */
+ uint64 product;
+
+ for (int i = res_ndigits - 6; i >= 0; i--)
+ {
+ product = factor * var_digits[i] + carry;
+ res_digits[i + 5] = (NumericDigit) (product % NBASE);
+ carry = product / NBASE;
+ }
+ /* note: carry < PG_UINT64_MAX / NBASE in this branch */
+ res_digits[4] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[3] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ Assert(carry < NBASE);
+ res_digits[1] = (NumericDigit) carry;
+ res_digits[0] = 0;
+ }
+ else
+ {
+ /* product may exceed 64 bits */
+ uint128 product;
+
+ for (int i = res_ndigits - 6; i >= 0; i--)
+ {
+ product = (uint128) factor * var_digits[i] + carry;
+ res_digits[i + 5] = (NumericDigit) (product % NBASE);
+ carry = (uint64) (product / NBASE);
+ }
+ res_digits[4] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[3] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[2] = (NumericDigit) (carry % NBASE);
+ carry = carry / NBASE;
+ res_digits[1] = (NumericDigit) (carry % NBASE);
+ res_digits[0] = (NumericDigit) (carry / NBASE);
+ }
+
+ /* Store the product in result */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+#endif
+
+
+/*
* div_var() -
*
* Division on variable level. Quotient of var1 / var2 is stored in result.
On Thu, Jul 4, 2024, at 20:43, Dean Rasheed wrote:
On Wed, 3 Jul 2024 at 21:45, Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jul 3, 2024, at 20:57, Dean Rasheed wrote:
I wouldn't expect it to ever be off by more than 1
OK, so then the cases I found where it was off by 2 for the mul_var_int() patch
are unexpected?
Sorry, I meant off by 2 for the mul_var_small() patch, these cases that I found:
Yeah, so that was another bug in mul_var_small(). If rscale is made
small enough, the result index for the digits computed before the main
loop overlaps the ones after, so it would overwrite digits already
computed.
Of course, that's fairly easy to fix, but at this point I think the
better solution is to only use mul_var_small() when an exact product
is requested. We would have to do that for mul_var_int() anyway,
because of its accuracy issues discussed earlier. I think this is a
reasonable thing to do because only functions like ln_var() and
exp_var() will ask mul_var() for a reduced-rscale result, and those
functions are likely to be dominated by computations involving larger
numbers, for which this patch wouldn't help anyway. Also those
functions are probably less widely used.
If we make that decision, a lot of the complexity in mul_var_small()
goes away, including all the conditional array accesses, making it
simpler and more efficient. v6 patch attached.
Nice, I think that looks like the better trade-off.
I also updated the mul_var_int() patch so that it is also only invoked
when an exact product is requested, and I noticed a couple of other
minor optimisations that could be made.
Looks nice.
Then I decided to try
implementing mul_var_int64(). This gives a pretty decent speedup for
3-digit inputs, but unfortunately it is much slower for 4-digit inputs
(for which most values will go through the 128-bit code path). I'm
attaching that too, just for information, but it's clearly not going
to be acceptable as-is.Running your benchmark queries, I got these results:
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1;
Time: 4520.874 ms (00:04.521) -- HEAD
Time: 3937.536 ms (00:03.938) -- v5-mul_var_int.patch
Time: 3919.495 ms (00:03.919) -- v5-mul_var_small.patch
Time: 3916.964 ms (00:03.917) -- v6-mul_var_int64.patch
Time: 3811.118 ms (00:03.811) -- v6-mul_var_small.patch
My benchmarks also indicate v6-mul_var_small.patch (v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch)
is the winner on all my three CPUs, for var1ndigits=1:
-- Apple M3 Max
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3046.668 ms (00:03.047)
Time: 3053.327 ms (00:03.053)
Time: 3045.517 ms (00:03.046)
Time: 3049.626 ms (00:03.050)
Time: 3075.101 ms (00:03.075)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 2781.536 ms (00:02.782)
Time: 2781.324 ms (00:02.781)
Time: 2781.301 ms (00:02.781)
Time: 2786.524 ms (00:02.787)
Time: 2784.494 ms (00:02.784)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2711.167 ms (00:02.711)
Time: 2723.647 ms (00:02.724)
Time: 2706.372 ms (00:02.706)
Time: 2708.883 ms (00:02.709)
Time: 2704.621 ms (00:02.705)
-- Intel Core i9-14900K
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3496.714 ms (00:03.497)
Time: 3278.909 ms (00:03.279)
Time: 3278.631 ms (00:03.279)
Time: 3277.658 ms (00:03.278)
Time: 3276.121 ms (00:03.276)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 3080.751 ms (00:03.081)
Time: 3078.118 ms (00:03.078)
Time: 3079.932 ms (00:03.080)
Time: 3080.668 ms (00:03.081)
Time: 3080.697 ms (00:03.081)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3043.635 ms (00:03.044)
Time: 3043.606 ms (00:03.044)
Time: 3041.391 ms (00:03.041)
Time: 3041.997 ms (00:03.042)
Time: 3045.464 ms (00:03.045)
-- AMD Ryzen 9 7950X3D
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3421.307 ms (00:03.421)
Time: 3400.935 ms (00:03.401)
Time: 3359.839 ms (00:03.360)
Time: 3374.246 ms (00:03.374)
Time: 3375.085 ms (00:03.375)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-add-mul_var_int64.patch
Time: 3002.214 ms (00:03.002)
Time: 3016.042 ms (00:03.016)
Time: 3010.541 ms (00:03.011)
Time: 3009.204 ms (00:03.009)
Time: 3002.088 ms (00:03.002)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2959.319 ms (00:02.959)
Time: 2957.694 ms (00:02.958)
Time: 2971.559 ms (00:02.972)
Time: 2958.788 ms (00:02.959)
Time: 2958.812 ms (00:02.959)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2;
Time: 4762.528 ms (00:04.763) -- HEAD
Time: 4075.546 ms (00:04.076) -- v5-mul_var_int.patch
Time: 4055.180 ms (00:04.055) -- v5-mul_var_small.patch
Time: 4037.866 ms (00:04.038) -- v6-mul_var_int64.patch
Time: 4018.488 ms (00:04.018) -- v6-mul_var_small.patch
I get mixed results for var1ndigits=2:
Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch.
Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
-- Apple M3 Max
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3369.724 ms (00:03.370)
Time: 3340.977 ms (00:03.341)
Time: 3331.407 ms (00:03.331)
Time: 3333.304 ms (00:03.333)
Time: 3332.136 ms (00:03.332)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 2768.108 ms (00:02.768)
Time: 2736.996 ms (00:02.737)
Time: 2730.217 ms (00:02.730)
Time: 2743.556 ms (00:02.744)
Time: 2746.725 ms (00:02.747)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2895.200 ms (00:02.895)
Time: 2894.823 ms (00:02.895)
Time: 2899.475 ms (00:02.899)
Time: 2895.385 ms (00:02.895)
Time: 2898.647 ms (00:02.899)
-- Intel Core i9-14900K
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3385.836 ms (00:03.386)
Time: 3367.739 ms (00:03.368)
Time: 3363.321 ms (00:03.363)
Time: 3365.433 ms (00:03.365)
Time: 3365.301 ms (00:03.365)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 3086.253 ms (00:03.086)
Time: 3085.272 ms (00:03.085)
Time: 3085.769 ms (00:03.086)
Time: 3089.128 ms (00:03.089)
Time: 3086.735 ms (00:03.087)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3053.775 ms (00:03.054)
Time: 3058.392 ms (00:03.058)
Time: 3068.113 ms (00:03.068)
Time: 3057.333 ms (00:03.057)
Time: 3057.218 ms (00:03.057)
-- AMD Ryzen 9 7950X3D
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3619.441 ms (00:03.619)
Time: 3609.553 ms (00:03.610)
Time: 3574.277 ms (00:03.574)
Time: 3578.031 ms (00:03.578)
Time: 3558.545 ms (00:03.559)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-add-mul_var_int64.patch
Time: 3061.858 ms (00:03.062)
Time: 3082.174 ms (00:03.082)
Time: 3081.093 ms (00:03.081)
Time: 3093.610 ms (00:03.094)
Time: 3064.507 ms (00:03.065)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3100.287 ms (00:03.100)
Time: 3100.264 ms (00:03.100)
Time: 3097.207 ms (00:03.097)
Time: 3101.477 ms (00:03.101)
Time: 3098.035 ms (00:03.098)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3;
Time: 5387.514 ms (00:05.388) -- HEAD
Time: 5350.736 ms (00:05.351) -- v5-mul_var_int.patch
Time: 4648.449 ms (00:04.648) -- v5-mul_var_small.patch
Time: 4655.204 ms (00:04.655) -- v6-mul_var_int64.patch
Time: 4645.962 ms (00:04.646) -- v6-mul_var_small.patch
I get mixed results for var1ndigits=3:
Winner on Apple M3 Max and AMD Ryzen 9 7950X3D is v6-add-mul_var_int64.patch.
Winner on Intel Core i9-14900K is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Same winners as for var1ndigits=2.
-- Apple M3 Max
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3466.932 ms (00:03.467)
Time: 3447.001 ms (00:03.447)
Time: 3457.259 ms (00:03.457)
Time: 3445.938 ms (00:03.446)
Time: 3443.310 ms (00:03.443)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 2988.444 ms (00:02.988)
Time: 2976.036 ms (00:02.976)
Time: 2982.756 ms (00:02.983)
Time: 2986.436 ms (00:02.986)
Time: 2973.457 ms (00:02.973)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3026.666 ms (00:03.027)
Time: 3024.458 ms (00:03.024)
Time: 3022.976 ms (00:03.023)
Time: 3029.526 ms (00:03.030)
Time: 3021.543 ms (00:03.022)
-- Intel Core i9-14900K
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4510.078 ms (00:04.510)
Time: 4222.501 ms (00:04.223)
Time: 4179.509 ms (00:04.180)
Time: 4179.307 ms (00:04.179)
Time: 4183.026 ms (00:04.183)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 3811.866 ms (00:03.812)
Time: 3816.664 ms (00:03.817)
Time: 3811.695 ms (00:03.812)
Time: 3811.674 ms (00:03.812)
Time: 3812.265 ms (00:03.812)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 4095.053 ms (00:04.095)
Time: 3896.002 ms (00:03.896)
Time: 3888.999 ms (00:03.889)
Time: 3889.346 ms (00:03.889)
Time: 3889.017 ms (00:03.889)
-- AMD Ryzen 9 7950X3D
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3946.255 ms (00:03.946)
Time: 3896.110 ms (00:03.896)
Time: 3877.470 ms (00:03.877)
Time: 3854.402 ms (00:03.854)
Time: 3901.218 ms (00:03.901)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-add-mul_var_int64.patch
Time: 3393.572 ms (00:03.394)
Time: 3395.401 ms (00:03.395)
Time: 3395.199 ms (00:03.395)
Time: 3418.555 ms (00:03.419)
Time: 3399.619 ms (00:03.400)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3308.791 ms (00:03.309)
Time: 3309.316 ms (00:03.309)
Time: 3318.238 ms (00:03.318)
Time: 3296.061 ms (00:03.296)
Time: 3320.282 ms (00:03.320)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
Time: 5617.150 ms (00:05.617) -- HEAD
Time: 5505.913 ms (00:05.506) -- v5-mul_var_int.patch
Time: 5486.441 ms (00:05.486) -- v5-mul_var_small.patch
Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch
Time: 5598.909 ms (00:05.599) -- v6-mul_var_small.patch
So v6-mul_var_int64 improves on v5-mul_var_int in the 3-digit case,
but is terrible in the 4-digit case. None of the other patches touch
the 4-digit case, but it might be interesting to try mul_var_small()
with 4 digits.
Interesting that you got such bad bench results for v6-mul_var_int64.patch
for var1ndigits=4; that patch is actually the winner on AMD Ryzen 9 7950X3D.
On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
On Apple M3 Max, HEAD is the winner.
-- Apple M3 Max
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3618.530 ms (00:03.619)
Time: 3595.239 ms (00:03.595)
Time: 3600.412 ms (00:03.600)
Time: 3607.500 ms (00:03.607)
Time: 3604.122 ms (00:03.604)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 4498.993 ms (00:04.499)
Time: 4527.302 ms (00:04.527)
Time: 4493.613 ms (00:04.494)
Time: 4482.194 ms (00:04.482)
Time: 4493.660 ms (00:04.494)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3628.177 ms (00:03.628)
Time: 3613.975 ms (00:03.614)
Time: 3612.213 ms (00:03.612)
Time: 3614.026 ms (00:03.614)
Time: 3622.959 ms (00:03.623)
-- Intel Core i9-14900K
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4130.810 ms (00:04.131)
Time: 3836.462 ms (00:03.836)
Time: 3810.604 ms (00:03.811)
Time: 3805.443 ms (00:03.805)
Time: 3803.055 ms (00:03.803)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3952.862 ms (00:03.953)
Time: 3792.272 ms (00:03.792)
Time: 3793.995 ms (00:03.794)
Time: 3790.531 ms (00:03.791)
Time: 3793.647 ms (00:03.794)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3762.828 ms (00:03.763)
Time: 3754.869 ms (00:03.755)
Time: 3756.041 ms (00:03.756)
Time: 3758.719 ms (00:03.759)
Time: 3754.894 ms (00:03.755)
-- AMD Ryzen 9 7950X3D
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4075.988 ms (00:04.076)
Time: 4067.702 ms (00:04.068)
Time: 4035.191 ms (00:04.035)
Time: 4022.411 ms (00:04.022)
Time: 4016.475 ms (00:04.016)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3830.021 ms (00:03.830)
Time: 3837.947 ms (00:03.838)
Time: 3834.894 ms (00:03.835)
Time: 3832.755 ms (00:03.833)
Time: 3834.512 ms (00:03.835)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 4031.128 ms (00:04.031)
Time: 4001.590 ms (00:04.002)
Time: 4031.212 ms (00:04.031)
Time: 4035.941 ms (00:04.036)
Time: 4031.218 ms (00:04.031)
Regards,
Joel
On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel@compiler.org> wrote:
Interesting that you got such bad bench results for v6-mul_var_int64.patch
for var1ndigits=4; that patch is actually the winner on AMD Ryzen 9 7950X3D.
Interesting.
On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
That must be random noise, since
v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
invoke mul_var_small() for 4-digit inputs.
On Apple M3 Max, HEAD is the winner.
Importantly, mul_var_int64() is around 1.25x slower there, and it was
even worse on my machine.
Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
this gives a nice speedup:
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
Time: 5617.150 ms (00:05.617) -- HEAD
Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch
Time: 4750.212 ms (00:04.750) -- v7-mul_var_small.patch
The other advantage, of course, is that it doesn't require 128-bit
integer support.
Regards,
Dean
Attachments:
v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchtext/x-patch; charset=US-ASCII; name=v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..b9497e1
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,8 @@ static void sub_var(const NumericVar *va
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -8707,7 +8709,7 @@ mul_var(const NumericVar *var1, const Nu
var1digits = var1->digits;
var2digits = var2->digits;
- if (var1ndigits == 0 || var2ndigits == 0)
+ if (var1ndigits == 0)
{
/* one or both inputs is zero; so is result */
zero_var(result);
@@ -8715,6 +8717,17 @@ mul_var(const NumericVar *var1, const Nu
return;
}
+ /*
+ * If var1 has 4 digits or fewer, and we are computing the exact result,
+ * with no rounding, delegate to mul_var_small() which uses a faster short
+ * multiplication algorithm.
+ */
+ if (var1ndigits <= 4 && rscale == var1->dscale + var2->dscale)
+ {
+ mul_var_small(var1, var2, result, rscale);
+ return;
+ }
+
/* Determine result sign and (maximum possible) weight */
if (var1->sign == var2->sign)
res_sign = NUMERIC_POS;
@@ -8858,6 +8871,221 @@ mul_var(const NumericVar *var1, const Nu
result->sign = res_sign;
/* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+}
+
+
+/*
+ * mul_var_small() -
+ *
+ * This has the same API as mul_var, but it assumes that var1 has no more
+ * than 4 digits and var2 has at least as many digits as var1. For variables
+ * satisfying these conditions, the product can be computed more quickly than
+ * the general algorithm used in mul_var.
+ */
+static void
+mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int var1ndigits = var1->ndigits;
+ int var2ndigits = var2->ndigits;
+ NumericDigit *var1digits = var1->digits;
+ NumericDigit *var2digits = var2->digits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint32 carry;
+ uint32 term;
+
+ /* Check preconditions */
+ Assert(var1ndigits <= 4);
+ Assert(var2ndigits >= var1ndigits);
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /* Determine the number of result digits to compute - cf. mul_var() */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits in reverse, in one pass, propagating the
+ * carry up as we go.
+ *
+ * This computes res_digits[res_ndigits - 2], ... res_digits[0] by summing
+ * the products var1digits[i1] * var2digits[i2] for which i1 + i2 + 1 is
+ * the result index.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ /* ---------
+ * 1-digit case:
+ * var1ndigits = 1
+ * var2ndigits >= 1
+ * res_ndigits = var2ndigits + 2
+ * ----------
+ */
+ carry = 0;
+ for (int i = res_ndigits - 3; i >= 0; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ /* ---------
+ * 2-digit case:
+ * var1ndigits = 2
+ * var2ndigits >= 2
+ * res_ndigits = var2ndigits + 3
+ * ----------
+ */
+ /* last result digit and carry */
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 4];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first two */
+ for (int i = res_ndigits - 4; i >= 1; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first two digits */
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 3:
+ /* ---------
+ * 3-digit case:
+ * var1ndigits = 3
+ * var2ndigits >= 3
+ * res_ndigits = var2ndigits + 4
+ * ----------
+ */
+ /* last two result digits */
+ term = (uint32) var1digits[2] * var2digits[res_ndigits - 5];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 5] +
+ (uint32) var1digits[2] * var2digits[res_ndigits - 6] + carry;
+ res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first three */
+ for (int i = res_ndigits - 5; i >= 2; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] +
+ (uint32) var1digits[2] * var2digits[i - 2] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first three digits */
+ term = (uint32) var1digits[0] * var2digits[1] +
+ (uint32) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 4:
+ /* ---------
+ * 4-digit case:
+ * var1ndigits = 4
+ * var2ndigits >= 4
+ * res_ndigits = var2ndigits + 5
+ * ----------
+ */
+ /* last three result digits */
+ term = (uint32) var1digits[3] * var2digits[res_ndigits - 6];
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[2] * var2digits[res_ndigits - 6] +
+ (uint32) var1digits[3] * var2digits[res_ndigits - 7] + carry;
+ res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 6] +
+ (uint32) var1digits[2] * var2digits[res_ndigits - 7] +
+ (uint32) var1digits[3] * var2digits[res_ndigits - 8] + carry;
+ res_digits[res_ndigits - 4] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first four */
+ for (int i = res_ndigits - 6; i >= 3; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] +
+ (uint32) var1digits[2] * var2digits[i - 2] +
+ (uint32) var1digits[3] * var2digits[i - 3] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first four digits */
+ term = (uint32) var1digits[0] * var2digits[2] +
+ (uint32) var1digits[1] * var2digits[1] +
+ (uint32) var1digits[2] * var2digits[0] + carry;
+ res_digits[3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[0] * var2digits[1] +
+ (uint32) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+ }
+
+ /* Store the product in result (minus extra rounding digit) */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits - 1;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight - 1;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
round_var(result, rscale);
/* Strip leading and trailing zeroes */
On Fri, Jul 5, 2024, at 17:41, Dean Rasheed wrote:
On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel@compiler.org> wrote:
Interesting that you got such bad bench results for v6-mul_var_int64.patch
for var1ndigits=4; that patch is actually the winner on AMD Ryzen 9 7950X3D.
Interesting.
I remeasured just to be sure, and yes, it was the winner among the previous patches,
but the new v7 beats it.
On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
That must be random noise, since
v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
invoke mul_var_small() for 4-digit inputs.
Yes, something was off with the HEAD measurements for that one,
I remeasured and then got almost identical results (as expected)
between HEAD and v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
for 4-digit inputs.
On Apple M3 Max, HEAD is the winner.
Importantly, mul_var_int64() is around 1.25x slower there, and it was
even worse on my machine.
Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
this gives a nice speedup:
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
Time: 5617.150 ms (00:05.617) -- HEAD
Time: 8203.081 ms (00:08.203) -- v6-mul_var_int64.patch
Time: 4750.212 ms (00:04.750) -- v7-mul_var_small.patch
The other advantage, of course, is that it doesn't require 128-bit
integer support.
Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:
-- Apple M3 Max
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3574.865 ms (00:03.575)
Time: 3573.678 ms (00:03.574)
Time: 3576.953 ms (00:03.577)
Time: 3580.536 ms (00:03.581)
Time: 3589.007 ms (00:03.589)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3110.171 ms (00:03.110)
Time: 3098.558 ms (00:03.099)
Time: 3105.873 ms (00:03.106)
Time: 3104.484 ms (00:03.104)
Time: 3109.035 ms (00:03.109)
-- Intel Core i9-14900K
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3751.767 ms (00:03.752)
Time: 3745.916 ms (00:03.746)
Time: 3742.542 ms (00:03.743)
Time: 3746.139 ms (00:03.746)
Time: 3745.493 ms (00:03.745)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3747.640 ms (00:03.748)
Time: 3747.231 ms (00:03.747)
Time: 3747.965 ms (00:03.748)
Time: 3748.309 ms (00:03.748)
Time: 3746.498 ms (00:03.746)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3417.924 ms (00:03.418)
Time: 3417.088 ms (00:03.417)
Time: 3415.708 ms (00:03.416)
Time: 3415.453 ms (00:03.415)
Time: 3419.566 ms (00:03.420)
-- AMD Ryzen 9 7950X3D
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3970.131 ms (00:03.970)
Time: 3924.335 ms (00:03.924)
Time: 3927.863 ms (00:03.928)
Time: 3924.761 ms (00:03.925)
Time: 3926.290 ms (00:03.926)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3874.769 ms (00:03.875)
Time: 3858.071 ms (00:03.858)
Time: 3836.698 ms (00:03.837)
Time: 3871.388 ms (00:03.871)
Time: 3844.907 ms (00:03.845)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3397.846 ms (00:03.398)
Time: 3398.050 ms (00:03.398)
Time: 3395.279 ms (00:03.395)
Time: 3393.285 ms (00:03.393)
Time: 3402.570 ms (00:03.403)
Code-wise I think it's now very nice and clear, with just enough comments.
Also nice to see that the var1ndigits=4 case isn't much more complex
than var1ndigits=3, since it follows the same pattern.
Regards,
Joel
On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote:
Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:
I thought it would be interesting to also measure the isolated effect
on just numeric_mul() without the query overhead.
I also included var1ndigits=5, var2ndigits=5, which should be unaffected,
just to get a sense of the noise level.
SELECT timeit.h('numeric_mul',array['9999','9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999','9999_9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999_9999','9999_9999_9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999_9999_9999','9999_9999_9999_9999'],2,min_time:='1 s'::interval);
SELECT timeit.h('numeric_mul',array['9999_9999_9999_9999_9999','9999_9999_9999_9999_9999'],2,min_time:='1 s'::interval);
CPU | var1ndigits | var2ndigits | HEAD | v7 | HEAD/v7
---------------------+-------------+-------------+-------+-------+---------
Apple M3 Max | 1 | 1 | 28 ns | 18 ns | 1.56
Apple M3 Max | 2 | 2 | 32 ns | 18 ns | 1.78
Apple M3 Max | 3 | 3 | 38 ns | 21 ns | 1.81
Apple M3 Max | 4 | 4 | 42 ns | 24 ns | 1.75
Intel Core i9-14900K | 1 | 1 | 25 ns | 20 ns | 1.25
Intel Core i9-14900K | 2 | 2 | 28 ns | 20 ns | 1.40
Intel Core i9-14900K | 3 | 3 | 33 ns | 24 ns | 1.38
Intel Core i9-14900K | 4 | 4 | 37 ns | 25 ns | 1.48
AMD Ryzen 9 7950X3D | 1 | 1 | 37 ns | 29 ns | 1.28
AMD Ryzen 9 7950X3D | 2 | 2 | 43 ns | 31 ns | 1.39
AMD Ryzen 9 7950X3D | 3 | 3 | 50 ns | 37 ns | 1.35
AMD Ryzen 9 7950X3D | 4 | 4 | 55 ns | 39 ns | 1.41
Impressive speed-up, between 25% and 81%.
Regards,
Joel
On Fri, 5 Jul 2024 at 18:37, Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote:
Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:
I thought it would be interesting to also measure the isolated effect
on just numeric_mul() without the query overhead.
Impressive speed-up, between 25% and 81%.
Cool. I think we should go with the mul_var_small() patch then, since
it's more generally applicable.
I also did some testing with much larger var2 values, and saw similar
speed-ups. One high-level function that benefits from that is
factorial(), which accepts inputs up to 32177, and so uses both the
1-digit and 2-digit code with very large var2 values. I doubt anyone
actually uses it with such large inputs, but it's interesting
nonetheless:
SELECT factorial(32177);
Time: 923.117 ms -- HEAD
Time: 534.375 ms -- mul_var_small() patch
I did one more round of (mostly cosmetic) copy-editing. Aside from
improving some of the comments, it occurred to me that there's no need
to pass rscale to mul_var_small(), or for it to call round_var(),
since it's always computing the exact result. That shaves off a few
more cycles.
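(For reference, the dispatch test in mul_var() in the attached patch is

    if (var1ndigits <= 4 && rscale == var1->dscale + var2->dscale)
    {
        mul_var_small(var1, var2, result);
        return;
    }

so the fast path is only entered when the requested rscale is exactly the
full product scale, and no rounding can ever be needed.)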
Additionally, I didn't like how res_weight and res_ndigits were being
set 1 higher than they needed to be. That makes sense in mul_var()
because it may round the result, causing a non-zero carry to propagate
into the next digit up, but it's just confusing in mul_var_small(). So
I've reduced those by 1, which makes the code look much more logical. To be
clear, this doesn't change how many digits we're calculating. But now
res_ndigits is actually the number of digits being calculated, whereas
before, res_ndigits was 1 larger and we were calculating res_ndigits -
1 digits, which was confusing.
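(A quick sanity check on the digit count: the product of an m-digit and an
n-digit value is strictly less than NBASE^(m+n), since for example
(NBASE^2 - 1)^2 = NBASE^4 - 2*NBASE^2 + 1 < NBASE^4, so
var1ndigits + var2ndigits digits always suffice when no rounding digit is
required.)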
I think this is good to go, so unless there are any further comments,
I plan to commit it soon.
Possible future work would be to try extending it to larger var1
values. I have a feeling that might work quite well for 5 or 6 digits,
but at some point, we'll start seeing diminishing returns, and the
code bloat won't be worth it.
Regards,
Dean
Attachments:
v8-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchtext/x-patch; charset=US-ASCII; name=v8-optimize-numeric-mul_var-small-var1-arbitrary-var2.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..b556861
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -551,6 +551,8 @@ static void sub_var(const NumericVar *va
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -8707,7 +8709,7 @@ mul_var(const NumericVar *var1, const Nu
var1digits = var1->digits;
var2digits = var2->digits;
- if (var1ndigits == 0 || var2ndigits == 0)
+ if (var1ndigits == 0)
{
/* one or both inputs is zero; so is result */
zero_var(result);
@@ -8715,6 +8717,16 @@ mul_var(const NumericVar *var1, const Nu
return;
}
+ /*
+ * If var1 has 1-4 digits and the exact result was requested, delegate to
+ * mul_var_small() which uses a faster direct multiplication algorithm.
+ */
+ if (var1ndigits <= 4 && rscale == var1->dscale + var2->dscale)
+ {
+ mul_var_small(var1, var2, result);
+ return;
+ }
+
/* Determine result sign and (maximum possible) weight */
if (var1->sign == var2->sign)
res_sign = NUMERIC_POS;
@@ -8862,6 +8874,212 @@ mul_var(const NumericVar *var1, const Nu
/* Strip leading and trailing zeroes */
strip_var(result);
+}
+
+
+/*
+ * mul_var_small() -
+ *
+ * Special-case multiplication function used when var1 has 1-4 digits, var2
+ * has at least as many digits as var1, and the exact product var1 * var2 is
+ * requested.
+ */
+static void
+mul_var_small(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result)
+{
+ int var1ndigits = var1->ndigits;
+ int var2ndigits = var2->ndigits;
+ NumericDigit *var1digits = var1->digits;
+ NumericDigit *var2digits = var2->digits;
+ int res_sign;
+ int res_weight;
+ int res_ndigits;
+ NumericDigit *res_buf;
+ NumericDigit *res_digits;
+ uint32 carry;
+ uint32 term;
+
+ /* Check preconditions */
+ Assert(var1ndigits >= 1);
+ Assert(var1ndigits <= 4);
+ Assert(var2ndigits >= var1ndigits);
+
+ /*
+ * Determine the result sign, weight, and number of digits to calculate.
+ * The weight figured here is correct if the product has no leading zero
+ * digits; otherwise strip_var() will fix things up. Note that, unlike
+ * mul_var(), we do not need to allocate an extra output digit, because we
+ * are not rounding here.
+ */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 1;
+ res_ndigits = var1ndigits + var2ndigits;
+
+ /* Allocate result digit array */
+ res_buf = digitbuf_alloc(res_ndigits + 1);
+ res_buf[0] = 0; /* spare digit for later rounding */
+ res_digits = res_buf + 1;
+
+ /*
+ * Compute the result digits in reverse, in one pass, propagating the
+ * carry up as we go. The i'th result digit consists of the sum of the
+ * products var1digits[i1] * var2digits[i2] for which i = i1 + i2 + 1.
+ */
+ switch (var1ndigits)
+ {
+ case 1:
+ /* ---------
+ * 1-digit case:
+ * var1ndigits = 1
+ * var2ndigits >= 1
+ * res_ndigits = var2ndigits + 1
+ * ----------
+ */
+ carry = 0;
+ for (int i = res_ndigits - 2; i >= 0; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+ res_digits[0] = (NumericDigit) carry;
+ break;
+
+ case 2:
+ /* ---------
+ * 2-digit case:
+ * var1ndigits = 2
+ * var2ndigits >= 2
+ * res_ndigits = var2ndigits + 2
+ * ----------
+ */
+ /* last result digit and carry */
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 3];
+ res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first two */
+ for (int i = res_ndigits - 3; i >= 1; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first two digits */
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 3:
+ /* ---------
+ * 3-digit case:
+ * var1ndigits = 3
+ * var2ndigits >= 3
+ * res_ndigits = var2ndigits + 3
+ * ----------
+ */
+ /* last two result digits */
+ term = (uint32) var1digits[2] * var2digits[res_ndigits - 4];
+ res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 4] +
+ (uint32) var1digits[2] * var2digits[res_ndigits - 5] + carry;
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first three */
+ for (int i = res_ndigits - 4; i >= 2; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] +
+ (uint32) var1digits[2] * var2digits[i - 2] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first three digits */
+ term = (uint32) var1digits[0] * var2digits[1] +
+ (uint32) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+
+ case 4:
+ /* ---------
+ * 4-digit case:
+ * var1ndigits = 4
+ * var2ndigits >= 4
+ * res_ndigits = var2ndigits + 4
+ * ----------
+ */
+ /* last three result digits */
+ term = (uint32) var1digits[3] * var2digits[res_ndigits - 5];
+ res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[2] * var2digits[res_ndigits - 5] +
+ (uint32) var1digits[3] * var2digits[res_ndigits - 6] + carry;
+ res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[1] * var2digits[res_ndigits - 5] +
+ (uint32) var1digits[2] * var2digits[res_ndigits - 6] +
+ (uint32) var1digits[3] * var2digits[res_ndigits - 7] + carry;
+ res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ /* remaining digits, except for the first four */
+ for (int i = res_ndigits - 5; i >= 3; i--)
+ {
+ term = (uint32) var1digits[0] * var2digits[i] +
+ (uint32) var1digits[1] * var2digits[i - 1] +
+ (uint32) var1digits[2] * var2digits[i - 2] +
+ (uint32) var1digits[3] * var2digits[i - 3] + carry;
+ res_digits[i + 1] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+ }
+
+ /* first four digits */
+ term = (uint32) var1digits[0] * var2digits[2] +
+ (uint32) var1digits[1] * var2digits[1] +
+ (uint32) var1digits[2] * var2digits[0] + carry;
+ res_digits[3] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[0] * var2digits[1] +
+ (uint32) var1digits[1] * var2digits[0] + carry;
+ res_digits[2] = (NumericDigit) (term % NBASE);
+ carry = term / NBASE;
+
+ term = (uint32) var1digits[0] * var2digits[0] + carry;
+ res_digits[1] = (NumericDigit) (term % NBASE);
+ res_digits[0] = (NumericDigit) (term / NBASE);
+ break;
+ }
+
+ /* Store the product in result */
+ digitbuf_free(result->buf);
+ result->ndigits = res_ndigits;
+ result->buf = res_buf;
+ result->digits = res_digits;
+ result->weight = res_weight;
+ result->sign = res_sign;
+ result->dscale = var1->dscale + var2->dscale;
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
}
On Sat, Jul 6, 2024, at 11:34, Dean Rasheed wrote:
On Fri, 5 Jul 2024 at 18:37, Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jul 5, 2024, at 18:42, Joel Jacobson wrote:
Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:
I thought it would be interesting to also measure the isolated effect
on just numeric_mul() without the query overhead.
Impressive speed-up, between 25% - 81%.
Cool. I think we should go with the mul_var_small() patch then, since
it's more generally applicable.
I agree.
I also did some testing with much larger var2 values, and saw similar
speed-ups. One high-level function that benefits from that is
factorial(), which accepts inputs up to 32177, and so uses both the
1-digit and 2-digit code with very large var2 values. I doubt anyone
actually uses it with such large inputs, but it's interesting
nonetheless:
SELECT factorial(32177);
Time: 923.117 ms -- HEAD
Time: 534.375 ms -- mul_var_small() patch
Nice!
I did one more round of (mostly cosmetic) copy-editing. Aside from
improving some of the comments, it occurred to me that there's no need
to pass rscale to mul_var_small(), or for it to call round_var(),
since it's always computing the exact result. That shaves off a few
more cycles.
Nice, also cleaner.
Additionally, I didn't like how res_weight and res_ndigits were being
set 1 higher than they needed to be. That makes sense in mul_var()
because it may round the result, causing a non-zero carry to propagate
into the next digit up, but it's just confusing in mul_var_small(). So
I've reduced those by 1, which makes the code look much more logical. To be
clear, this doesn't change how many digits we're calculating. But now
res_ndigits is actually the number of digits being calculated, whereas
before, res_ndigits was 1 larger and we were calculating res_ndigits -
1 digits, which was confusing.
Nice, much cleaner.
I think this is good to go, so unless there are any further comments,
I plan to commit it soon.
LGTM.
Benchmark, only on Apple M3 Max:
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 3042.157 ms (00:03.042)
Time: 3027.711 ms (00:03.028)
Time: 3078.215 ms (00:03.078)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v8
Time: 2700.676 ms (00:02.701)
Time: 2713.594 ms (00:02.714)
Time: 2704.139 ms (00:02.704)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 4506.064 ms (00:04.506)
Time: 3316.204 ms (00:03.316)
Time: 3321.086 ms (00:03.321)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v8
Time: 2904.786 ms (00:02.905)
Time: 2921.996 ms (00:02.922)
Time: 2919.269 ms (00:02.919)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4636.051 ms (00:04.636)
Time: 3439.951 ms (00:03.440)
Time: 3471.245 ms (00:03.471)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v8
Time: 3034.364 ms (00:03.034)
Time: 3025.351 ms (00:03.025)
Time: 3075.024 ms (00:03.075)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4978.086 ms (00:04.978)
Time: 3580.283 ms (00:03.580)
Time: 3582.719 ms (00:03.583)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v8
Time: 3147.352 ms (00:03.147)
Time: 3135.903 ms (00:03.136)
Time: 3172.491 ms (00:03.172)
Regards,
Joel
On Sat, 6 Jul 2024 at 12:17, Joel Jacobson <joel@compiler.org> wrote:
I think this is good to go, so unless there are any further comments,
I plan to commit it soon.
LGTM.
OK, I have committed this.
At the last minute, I changed the name of the new function to
mul_var_short() because "short" is probably a better term to use in
this context (we already use it in a preceding comment). "Small" is
potentially misleading, because the numbers themselves could be
numerically very large.
Regards,
Dean
On Tue, 9 Jul 2024 at 10:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
OK, I have committed this.
Before considering the other patches to optimise for larger inputs, I
think it's worth optimising the existing mul_var() code as much as
possible.
One thing I noticed while testing the earlier patches on this thread
was that they were significantly faster if they used unsigned integers
rather than signed integers. I think the reason is that operations
like "x / 10000" and "x % 10000" use fewer CPU instructions (on every
platform, according to godbolt.org) if x is unsigned.
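For example, a minimal pair of functions to feed to a compiler explorer
(my own illustration, not part of the patch):

#include <stdint.h>

/* unsigned: compiles to a multiply-and-shift sequence */
uint32_t quot_u(uint32_t x) { return x / 10000; }
uint32_t rem_u(uint32_t x)  { return x % 10000; }

/* signed: the same, plus extra sign-correction instructions */
int32_t  quot_s(int32_t x)  { return x / 10000; }
int32_t  rem_s(int32_t x)   { return x % 10000; }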
In addition, this reduces the number of times the digit array needs to
be renormalised, which seems to be the biggest factor.
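Concretely, the threshold at which maxdig forces a normalisation pass goes
from (INT_MAX - INT_MAX/NBASE) / (NBASE - 1), roughly 214,748, up to
(PG_UINT32_MAX - PG_UINT32_MAX/NBASE) / (NBASE - 1), roughly 429,496, so
for long inputs the carry-propagation passes run about half as often.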
Another small optimisation that seems to be just about worthwhile is
to pull the first digit of var1 out of the main loop, so that its
contributions can be set directly in dig[], rather than being added to
it. This allows palloc() to be used to allocate dig[], rather than
palloc0(), and only requires the upper part of dig[] to be initialised
to zeros, rather than all of it.
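In patch terms, the key point is that the peeled-off first iteration
assigns into dig[] instead of accumulating, so only the low part of the
array needs explicit zeroing (condensed from the attached patch):

dig = (uint32 *) palloc(res_ndigits * sizeof(uint32));
...
memset(dig, 0, sizeof(uint32) * (i1 + 2));
for (i2 = 0; i2 < i2limit; i2++)
    dig_i1_2[i2] = var1digit * var2digits[i2];    /* '=', not '+=' */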
Together, these seem to give a decent speed-up:
NBASE digits | HEAD rate | patch rate
--------------+---------------+---------------
5 | 5.8797105e+06 | 6.047134e+06
12 | 4.140017e+06 | 4.3429845e+06
25 | 2.5711072e+06 | 2.7530615e+06
50 | 1.0367389e+06 | 1.3370771e+06
100 | 367924.1 | 462437.38
250 | 77231.32 | 104593.95
2500 | 881.48694 | 1197.4739
15000 | 25.064987 | 32.78391
The largest gains are above around 50 NBASE digits, where the time
spent renormalising dig[] becomes significant.
Regards,
Dean
Attachments:
optimise-mul_var.patchtext/x-patch; charset=US-ASCII; name=optimise-mul_var.patchDownload
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index f6e20cf..2a2e22d
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8690,18 +8690,21 @@ mul_var(const NumericVar *var1, const Nu
int res_sign;
int res_weight;
int maxdigits;
- int *dig;
- int carry;
+ uint32 *dig,
+ *dig_i1_2;
+ uint32 carry;
int maxdig;
- int newdig;
+ uint32 newdig;
int var1ndigits;
int var2ndigits;
NumericDigit *var1digits;
NumericDigit *var2digits;
NumericDigit *res_digits;
+ NumericDigit var1digit;
int i,
i1,
- i2;
+ i2,
+ i2limit;
/*
* Arrange for var1 to be the shorter of the two numbers. This improves
@@ -8775,23 +8778,22 @@ mul_var(const NumericVar *var1, const Nu
}
/*
- * We do the arithmetic in an array "dig[]" of signed int's. Since
- * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
- * to avoid normalizing carries immediately.
+ * We do the arithmetic in an array "dig[]" of unsigned 32-bit integers.
+ * Since PG_UINT32_MAX is noticeably larger than NBASE*NBASE, this gives
+ * us headroom to avoid normalizing carries immediately.
*
* maxdig tracks the maximum possible value of any dig[] entry; when this
- * threatens to exceed INT_MAX, we take the time to propagate carries.
- * Furthermore, we need to ensure that overflow doesn't occur during the
- * carry propagation passes either. The carry values could be as much as
- * INT_MAX/NBASE, so really we must normalize when digits threaten to
- * exceed INT_MAX - INT_MAX/NBASE.
+ * threatens to exceed PG_UINT32_MAX, we take the time to propagate
+ * carries. Furthermore, we need to ensure that overflow doesn't occur
+ * during the carry propagation passes either. The carry values could be
+ * as much as PG_UINT32_MAX/NBASE, so really we must normalize when digits
+ * threaten to exceed PG_UINT32_MAX - PG_UINT32_MAX/NBASE.
*
* To avoid overflow in maxdig itself, it actually represents the max
* possible value divided by NBASE-1, ie, at the top of the loop it is
* known that no dig[] entry exceeds maxdig * (NBASE-1).
*/
- dig = (int *) palloc0(res_ndigits * sizeof(int));
- maxdig = 0;
+ dig = (uint32 *) palloc(res_ndigits * sizeof(uint32));
/*
* The least significant digits of var1 should be ignored if they don't
@@ -8801,17 +8803,37 @@ mul_var(const NumericVar *var1, const Nu
* Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
* i1+i2+2 of the accumulator array, so we need only consider digits of
* var1 for which i1 <= res_ndigits - 3.
+ *
+ * We process the least significant digit of var1 outside the main loop,
+ * since the results can be applied directly to dig[], rather than being
+ * added to it.
*/
- for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ i1 = Min(var1ndigits - 1, res_ndigits - 3);
+ var1digit = var1digits[i1];
+ maxdig = var1digit;
+
+ /*
+ * Multiply by the least significant applicable digit of var1 -- see
+ * comments below. Any changes made there should be reflected here.
+ */
+ i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ dig_i1_2 = &dig[i1 + 2];
+
+ memset(dig, 0, sizeof(uint32) * (i1 + 2));
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] = var1digit * var2digits[i2];
+
+ /* Add contributions from remaining digits of var1 */
+ for (i1 = i1 - 1; i1 >= 0; i1--)
{
- NumericDigit var1digit = var1digits[i1];
+ var1digit = var1digits[i1];
if (var1digit == 0)
continue;
/* Time to normalize? */
maxdig += var1digit;
- if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ if (maxdig > (PG_UINT32_MAX - PG_UINT32_MAX / NBASE) / (NBASE - 1))
{
/* Yes, do it */
carry = 0;
@@ -8845,13 +8867,11 @@ mul_var(const NumericVar *var1, const Nu
* Since we aren't propagating carries in this loop, the order does
* not matter.
*/
- {
- int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
- int *dig_i1_2 = &dig[i1 + 2];
+ i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ dig_i1_2 = &dig[i1 + 2];
- for (i2 = 0; i2 < i2limit; i2++)
- dig_i1_2[i2] += var1digit * var2digits[i2];
- }
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
}
/*
On Tue, Jul 9, 2024, at 14:01, Dean Rasheed wrote:
Before considering the other patches to optimise for larger inputs, I
think it's worth optimising the existing mul_var() code as much as
possible.
One thing I noticed while testing the earlier patches on this thread
was that they were significantly faster if they used unsigned integers
rather than signed integers. I think the reason is that operations
like "x / 10000" and "x % 10000" use fewer CPU instructions (on every
platform, according to godbolt.org) if x is unsigned.
In addition, this reduces the number of times the digit array needs to
be renormalised, which seems to be the biggest factor.
Another small optimisation that seems to be just about worthwhile is
to pull the first digit of var1 out of the main loop, so that its
contributions can be set directly in dig[], rather than being added to
it. This allows palloc() to be used to allocate dig[], rather than
palloc0(), and only requires the upper part of dig[] to be initialised
to zeros, rather than all of it.
Nice, really smart!
Together, these seem to give a decent speed-up:
NBASE digits | HEAD rate | patch rate
--------------+---------------+---------------
5 | 5.8797105e+06 | 6.047134e+06
12 | 4.140017e+06 | 4.3429845e+06
25 | 2.5711072e+06 | 2.7530615e+06
50 | 1.0367389e+06 | 1.3370771e+06
100 | 367924.1 | 462437.38
250 | 77231.32 | 104593.95
2500 | 881.48694 | 1197.4739
15000 | 25.064987 | 32.78391
The largest gains are above around 50 NBASE digits, where the time
spent renormalising dig[] becomes significant.
I added some more ndigits test cases:
/*
* Intel Core i9-14900K
*/
NBASE digits | HEAD rate | patch rate | relative difference
--------------+----------------+----------------+---------------------
1 | 4.7846890e+07 | 4.7846890e+07 | 0.00%
2 | 4.9504950e+07 | 4.7393365e+07 | -4.27%
3 | 4.0816327e+07 | 4.0983607e+07 | 0.41%
4 | 4.1152263e+07 | 3.9370079e+07 | -4.33%
5 | 2.2573363e+07 | 2.1978022e+07 | -2.64%
6 | 2.1739130e+07 | 1.9646365e+07 | -9.63%
7 | 1.6393443e+07 | 1.6339869e+07 | -0.33%
8 | 1.6863406e+07 | 1.6778523e+07 | -0.50%
9 | 1.5105740e+07 | 1.6420361e+07 | 8.70%
10 | 1.3315579e+07 | 1.5527950e+07 | 16.61%
11 | 1.2360939e+07 | 1.4124294e+07 | 14.27%
12 | 1.1764706e+07 | 1.2836970e+07 | 9.11%
13 | 1.0060362e+07 | 1.1820331e+07 | 17.49%
14 | 9.0909091e+06 | 1.0000000e+07 | 10.00%
15 | 7.6923077e+06 | 8.0000000e+06 | 4.00%
16 | 9.0909091e+06 | 9.4339623e+06 | 3.77%
17 | 7.2992701e+06 | 9.0909091e+06 | 24.55%
18 | 7.0921986e+06 | 7.8125000e+06 | 10.16%
19 | 6.5789474e+06 | 6.6666667e+06 | 1.33%
20 | 6.2500000e+06 | 6.5789474e+06 | 5.26%
21 | 5.8479532e+06 | 6.1728395e+06 | 5.56%
22 | 5.5555556e+06 | 5.9880240e+06 | 7.78%
24 | 5.2631579e+06 | 5.8823529e+06 | 11.76%
25 | 5.2083333e+06 | 5.5555556e+06 | 6.67%
26 | 4.7619048e+06 | 5.2631579e+06 | 10.53%
27 | 4.5045045e+06 | 5.2083333e+06 | 15.63%
28 | 4.4247788e+06 | 4.7619048e+06 | 7.62%
29 | 4.1666667e+06 | 4.5454545e+06 | 9.09%
30 | 4.0000000e+06 | 4.3478261e+06 | 8.70%
31 | 3.4482759e+06 | 4.0000000e+06 | 16.00%
32 | 3.9840637e+06 | 4.2016807e+06 | 5.46%
50 | 2.0964361e+06 | 2.6595745e+06 | 26.86%
100 | 666666.67 | 869565.22 | 30.43%
250 | 142653.35 | 171526.59 | 20.24%
2500 | 1642.04 | 2197.80 | 33.85%
15000 | 41.67 | 52.63 | 26.32%
(36 rows)
/*
* AMD Ryzen 9 7950X3D
*/
NBASE digits | HEAD rate | patch rate | relative difference
--------------+----------------+----------------+---------------------
1 | 3.6900369e+07 | 3.8022814e+07 | 3.04%
2 | 3.4602076e+07 | 3.5714286e+07 | 3.21%
3 | 2.8011204e+07 | 2.7777778e+07 | -0.83%
4 | 2.7932961e+07 | 2.8328612e+07 | 1.42%
5 | 1.6420361e+07 | 1.7123288e+07 | 4.28%
6 | 1.4705882e+07 | 1.5313936e+07 | 4.13%
7 | 1.3192612e+07 | 1.3888889e+07 | 5.28%
8 | 1.2121212e+07 | 1.2919897e+07 | 6.59%
9 | 1.1235955e+07 | 1.2135922e+07 | 8.01%
10 | 1.0000000e+07 | 1.1312217e+07 | 13.12%
11 | 9.0909091e+06 | 1.0000000e+07 | 10.00%
12 | 8.1967213e+06 | 8.4033613e+06 | 2.52%
13 | 7.2463768e+06 | 7.7519380e+06 | 6.98%
14 | 6.7567568e+06 | 7.1428571e+06 | 5.71%
15 | 5.5555556e+06 | 5.8823529e+06 | 5.88%
16 | 6.3291139e+06 | 5.7803468e+06 | -8.67%
17 | 5.8823529e+06 | 5.9880240e+06 | 1.80%
18 | 5.5555556e+06 | 5.7142857e+06 | 2.86%
19 | 5.2356021e+06 | 5.6179775e+06 | 7.30%
20 | 4.9019608e+06 | 5.1020408e+06 | 4.08%
21 | 4.5454545e+06 | 4.8543689e+06 | 6.80%
22 | 4.1841004e+06 | 4.5871560e+06 | 9.63%
24 | 4.4642857e+06 | 4.4052863e+06 | -1.32%
25 | 4.1666667e+06 | 4.2194093e+06 | 1.27%
26 | 4.0000000e+06 | 3.9525692e+06 | -1.19%
27 | 3.8461538e+06 | 3.8022814e+06 | -1.14%
28 | 3.9062500e+06 | 3.8759690e+06 | -0.78%
29 | 3.7878788e+06 | 3.8022814e+06 | 0.38%
30 | 3.3898305e+06 | 3.7174721e+06 | 9.67%
31 | 2.7472527e+06 | 2.8571429e+06 | 4.00%
32 | 3.0395137e+06 | 3.1446541e+06 | 3.46%
50 | 1.7094017e+06 | 2.0576132e+06 | 20.37%
100 | 518134.72 | 609756.10 | 17.68%
250 | 108577.63 | 136612.02 | 25.82%
2500 | 1264.22 | 1610.31 | 27.38%
15000 | 34.48 | 43.48 | 26.09%
(36 rows)
/*
* Apple M3 Max
*/
NBASE digits | HEAD rate | patch rate | relative difference
--------------+----------------+----------------+---------------------
1 | 4.9504950e+07 | 4.9504950e+07 | 0.00%
2 | 6.1349693e+07 | 5.9171598e+07 | -3.55%
3 | 5.2631579e+07 | 5.2083333e+07 | -1.04%
4 | 4.5248869e+07 | 4.5248869e+07 | 0.00%
5 | 2.1978022e+07 | 2.2727273e+07 | 3.41%
6 | 1.9920319e+07 | 2.0876827e+07 | 4.80%
7 | 1.7182131e+07 | 1.8018018e+07 | 4.86%
8 | 1.5822785e+07 | 1.6051364e+07 | 1.44%
9 | 1.3368984e+07 | 1.3333333e+07 | -0.27%
10 | 1.1709602e+07 | 1.1627907e+07 | -0.70%
11 | 1.0020040e+07 | 1.0526316e+07 | 5.05%
12 | 9.0909091e+06 | 9.0909091e+06 | 0.00%
13 | 8.2644628e+06 | 8.2644628e+06 | 0.00%
14 | 7.6923077e+06 | 7.6335878e+06 | -0.76%
15 | 7.1428571e+06 | 7.0921986e+06 | -0.71%
16 | 6.6225166e+06 | 6.5789474e+06 | -0.66%
17 | 5.9880240e+06 | 6.2111801e+06 | 3.73%
18 | 5.7803468e+06 | 5.5865922e+06 | -3.35%
19 | 5.2631579e+06 | 5.2356021e+06 | -0.52%
20 | 4.6296296e+06 | 4.8543689e+06 | 4.85%
21 | 4.4444444e+06 | 4.3859649e+06 | -1.32%
22 | 4.2016807e+06 | 4.0485830e+06 | -3.64%
24 | 3.7453184e+06 | 3.5714286e+06 | -4.64%
25 | 3.4843206e+06 | 3.4013605e+06 | -2.38%
26 | 3.2786885e+06 | 3.2786885e+06 | 0.00%
27 | 3.0674847e+06 | 3.1055901e+06 | 1.24%
28 | 2.8818444e+06 | 2.9069767e+06 | 0.87%
29 | 2.7322404e+06 | 2.7700831e+06 | 1.39%
30 | 2.5839793e+06 | 2.6246719e+06 | 1.57%
31 | 2.5062657e+06 | 2.4630542e+06 | -1.72%
32 | 4.5871560e+06 | 4.6082949e+06 | 0.46%
50 | 1.7513135e+06 | 1.9880716e+06 | 13.52%
100 | 714285.71 | 833333.33 | 16.67%
250 | 124223.60 | 149925.04 | 20.69%
2500 | 1440.92 | 1760.56 | 22.18%
15000 | 39.53 | 48.08 | 21.63%
(36 rows)
Regards,
Joel
On Tue, Jul 9, 2024, at 16:11, Joel Jacobson wrote:
I added some more ndigits test cases:
Oops, please ignore the previous benchmark;
I had forgotten to commit between the measurements,
so they all ran in the same db txn,
which caused a lot of noise at low ndigits.
New benchmark:
/*
* Intel Core i9-14900K
*/
NBASE digits | HEAD rate | patch rate | relative difference
--------------+----------------+----------------+---------------------
1 | 5.0251256e+07 | 5.2631579e+07 | 4.74%
2 | 4.8543689e+07 | 4.9751244e+07 | 2.49%
3 | 4.1493776e+07 | 4.3478261e+07 | 4.78%
4 | 4.1493776e+07 | 4.0816327e+07 | -1.63%
5 | 2.2371365e+07 | 2.3364486e+07 | 4.44%
6 | 2.1008403e+07 | 2.1186441e+07 | 0.85%
7 | 1.7152659e+07 | 1.6233766e+07 | -5.36%
8 | 1.7123288e+07 | 1.8450185e+07 | 7.75%
9 | 1.5290520e+07 | 1.7271157e+07 | 12.95%
10 | 1.3351135e+07 | 1.5384615e+07 | 15.23%
11 | 1.2453300e+07 | 1.4164306e+07 | 13.74%
12 | 1.1655012e+07 | 1.2936611e+07 | 11.00%
13 | 1.0373444e+07 | 1.1904762e+07 | 14.76%
14 | 9.0909091e+06 | 1.0162602e+07 | 11.79%
15 | 7.7519380e+06 | 8.1300813e+06 | 4.88%
16 | 9.0909091e+06 | 9.8039216e+06 | 7.84%
17 | 7.5757576e+06 | 9.0909091e+06 | 20.00%
18 | 7.2463768e+06 | 8.2644628e+06 | 14.05%
19 | 6.6225166e+06 | 7.5757576e+06 | 14.39%
20 | 6.4516129e+06 | 7.0422535e+06 | 9.15%
21 | 6.0606061e+06 | 6.5789474e+06 | 8.55%
22 | 5.7142857e+06 | 6.2500000e+06 | 9.38%
24 | 5.4054054e+06 | 6.0240964e+06 | 11.45%
25 | 5.2356021e+06 | 5.8139535e+06 | 11.05%
26 | 5.0251256e+06 | 5.8139535e+06 | 15.70%
27 | 4.7393365e+06 | 5.7142857e+06 | 20.57%
28 | 4.6082949e+06 | 5.2083333e+06 | 13.02%
29 | 4.3478261e+06 | 4.9504950e+06 | 13.86%
30 | 4.0816327e+06 | 4.6728972e+06 | 14.49%
31 | 3.4843206e+06 | 3.9682540e+06 | 13.89%
32 | 4.0000000e+06 | 4.1666667e+06 | 4.17%
50 | 2.1097046e+06 | 2.8571429e+06 | 35.43%
100 | 680272.11 | 909090.91 | 33.64%
250 | 141643.06 | 174216.03 | 23.00%
2500 | 1626.02 | 2188.18 | 34.57%
15000 | 41.67 | 52.63 | 26.32%
(36 rows)
/*
* AMD Ryzen 9 7950X3D
*/
NBASE digits | HEAD rate | patch rate | relative difference
--------------+----------------+----------------+---------------------
1 | 3.7037037e+07 | 3.8910506e+07 | 5.06%
2 | 3.5587189e+07 | 3.5971223e+07 | 1.08%
3 | 3.0581040e+07 | 2.9239766e+07 | -4.39%
4 | 2.7322404e+07 | 3.0303030e+07 | 10.91%
5 | 1.8050542e+07 | 1.9011407e+07 | 5.32%
6 | 1.5974441e+07 | 1.6233766e+07 | 1.62%
7 | 1.3106160e+07 | 1.3071895e+07 | -0.26%
8 | 1.2285012e+07 | 1.3106160e+07 | 6.68%
9 | 1.1534025e+07 | 1.2269939e+07 | 6.38%
10 | 1.1135857e+07 | 1.1507480e+07 | 3.34%
11 | 9.7943193e+06 | 1.0976948e+07 | 12.07%
12 | 9.5238095e+06 | 1.0256410e+07 | 7.69%
13 | 8.6206897e+06 | 8.7719298e+06 | 1.75%
14 | 7.3529412e+06 | 8.1967213e+06 | 11.48%
15 | 6.2893082e+06 | 6.7114094e+06 | 6.71%
16 | 7.2463768e+06 | 7.0422535e+06 | -2.82%
17 | 6.2893082e+06 | 7.2463768e+06 | 15.22%
18 | 6.3694268e+06 | 7.4626866e+06 | 17.16%
19 | 5.6818182e+06 | 6.6225166e+06 | 16.56%
20 | 5.2083333e+06 | 6.1728395e+06 | 18.52%
21 | 5.0251256e+06 | 5.7471264e+06 | 14.37%
22 | 4.5248869e+06 | 5.1282051e+06 | 13.33%
24 | 4.9261084e+06 | 5.1020408e+06 | 3.57%
25 | 4.6511628e+06 | 4.9504950e+06 | 6.44%
26 | 4.2553191e+06 | 4.6082949e+06 | 8.29%
27 | 3.9682540e+06 | 4.2918455e+06 | 8.15%
28 | 3.8910506e+06 | 4.1322314e+06 | 6.20%
29 | 3.8167939e+06 | 3.7593985e+06 | -1.50%
30 | 3.5842294e+06 | 3.6101083e+06 | 0.72%
31 | 3.1948882e+06 | 3.1645570e+06 | -0.95%
32 | 3.4722222e+06 | 3.7174721e+06 | 7.06%
50 | 1.6474465e+06 | 2.1691974e+06 | 31.67%
100 | 555555.56 | 653594.77 | 17.65%
250 | 109409.19 | 140449.44 | 28.37%
2500 | 1236.09 | 1555.21 | 25.82%
15000 | 34.48 | 43.48 | 26.09%
(36 rows)
/*
* Apple M3 Max
*/
NBASE digits | HEAD rate | patch rate | relative difference
--------------+----------------+----------------+---------------------
1 | 4.7169811e+07 | 4.7619048e+07 | 0.95%
2 | 6.0240964e+07 | 5.8479532e+07 | -2.92%
3 | 5.2083333e+07 | 5.3191489e+07 | 2.13%
4 | 4.5871560e+07 | 4.6948357e+07 | 2.35%
5 | 2.2075055e+07 | 2.3529412e+07 | 6.59%
6 | 2.0080321e+07 | 2.1505376e+07 | 7.10%
7 | 1.7301038e+07 | 1.8975332e+07 | 9.68%
8 | 1.6025641e+07 | 1.6556291e+07 | 3.31%
9 | 1.3245033e+07 | 1.3717421e+07 | 3.57%
10 | 1.1709602e+07 | 1.2315271e+07 | 5.17%
11 | 1.0000000e+07 | 1.0989011e+07 | 9.89%
12 | 9.0909091e+06 | 9.7276265e+06 | 7.00%
13 | 8.3333333e+06 | 9.0090090e+06 | 8.11%
14 | 7.6923077e+06 | 8.0645161e+06 | 4.84%
15 | 7.0921986e+06 | 7.5187970e+06 | 6.02%
16 | 6.6666667e+06 | 7.0921986e+06 | 6.38%
17 | 6.2111801e+06 | 6.3694268e+06 | 2.55%
18 | 5.7803468e+06 | 5.9523810e+06 | 2.98%
19 | 5.2910053e+06 | 5.4347826e+06 | 2.72%
20 | 4.7846890e+06 | 5.0505051e+06 | 5.56%
21 | 4.5454545e+06 | 4.6728972e+06 | 2.80%
22 | 4.2372881e+06 | 4.3859649e+06 | 3.51%
24 | 3.7174721e+06 | 3.8759690e+06 | 4.26%
25 | 3.4722222e+06 | 3.6231884e+06 | 4.35%
26 | 3.2894737e+06 | 3.3898305e+06 | 3.05%
27 | 3.0674847e+06 | 3.1847134e+06 | 3.82%
28 | 2.9239766e+06 | 3.0120482e+06 | 3.01%
29 | 2.7548209e+06 | 2.8901734e+06 | 4.91%
30 | 2.6041667e+06 | 2.7322404e+06 | 4.92%
31 | 2.5000000e+06 | 2.5773196e+06 | 3.09%
32 | 4.6082949e+06 | 4.7846890e+06 | 3.83%
50 | 1.7241379e+06 | 2.0703934e+06 | 20.08%
100 | 719424.46 | 869565.22 | 20.87%
250 | 124688.28 | 157977.88 | 26.70%
2500 | 1455.60 | 1811.59 | 24.46%
15000 | 40.00 | 50.00 | 25.00%
(36 rows)
Regards,
Joel
On Tue, Jul 9, 2024, at 14:01, Dean Rasheed wrote:
One thing I noticed while testing the earlier patches on this thread
was that they were significantly faster if they used unsigned integers
rather than signed integers. I think the reason is that operations
like "x / 10000" and "x % 10000" use fewer CPU instructions (on every
platform, according to godbolt.org) if x is unsigned.
In addition, this reduces the number of times the digit array needs to
be renormalised, which seems to be the biggest factor.
Another small optimisation that seems to be just about worthwhile is
to pull the first digit of var1 out of the main loop, so that its
contributions can be set directly in dig[], rather than being added to
it. This allows palloc() to be used to allocate dig[], rather than
palloc0(), and only requires the upper part of dig[] to be initialised
to zeros, rather than all of it.
Together, these seem to give a decent speed-up:
..
Attachments:
* optimise-mul_var.patch
I've reviewed the patch now.
The code is straightforward, and the comments are easy to understand.
LGTM.
Regards,
Joel