Optimize numeric.c mul_var() using the Karatsuba algorithm
Hi,
This patch introduces the Karatsuba algorithm to speed up multiplication
operations in numeric.c, where the operands have many digits.
It is implemented via a new conditional in mul_var() that determines whether
the factors are large enough to justify its use. This decision is non-trivial
due to the recursive nature of the algorithm: the benefit depends on both the
size and the ratio of the factors. Moreover, the optimal threshold varies
across different architectures.
This benefits all users of mul_var() in numeric.c, such as:
numeric_mul()
numeric_lcm()
numeric_fac()
int64_div_fast_to_numeric()
mod_var()
div_mod_var()
sqrt_var()
exp_var()
ln_var()
power_var()
The macro KARATSUBA_CONDITION(var1ndigits, var2ndigits) is responsible for this
decision. It is deliberately conservative, to prevent performance regressions on
the tested architectures while maximizing potential gains and maintaining
simplicity.
Patches:
1. mul_var-karatsuba.patch
Modifies mul_var() to use the Karatsuba functions
for multiplying larger numeric factors.
2. mul_var-karatsuba-benchmark.patch
Introduces numeric_mul_karatsuba() and mul_var_karatsuba()
alongside the existing numeric_mul() and mul_var() functions.
This enables benchmark comparisons between the original multiplication method
and the Karatsuba-optimized version.
Some benchmark numbers, tested on Intel Core i9-14900K:
Helper function to generate a numeric with a given number of ndigits,
using the new random(min numeric, max numeric) function:
CREATE OR REPLACE FUNCTION random_ndigits(ndigits INT) RETURNS NUMERIC AS $$
SELECT random(
('1000'||repeat('0000',ndigits-1))::numeric,
(repeat('9999',ndigits))::numeric
)
$$ LANGUAGE sql;
Benchmark equal factor sizes, 16384 x 16384 ndigits:
SELECT random_ndigits(16384) * random_ndigits(16384) > 0;
Time: 33.990 ms
Time: 33.961 ms
Time: 34.183 ms
SELECT numeric_mul_karatsuba(random_ndigits(16384), random_ndigits(16384)) > 0;
Time: 17.621 ms
Time: 17.209 ms
Time: 16.444 ms
Benchmark equal factor sizes, 8192 x 8192 ndigits:
SELECT random_ndigits(8192) * random_ndigits(8192) > 0;
Time: 12.568 ms
Time: 12.563 ms
Time: 12.701 ms
SELECT numeric_mul_karatsuba(random_ndigits(8192), random_ndigits(8192)) > 0;
Time: 9.919 ms
Time: 9.929 ms
Time: 9.659 ms
For smaller factor sizes, \timing doesn't provide enough precision.
The measurements below were made using my pg-timeit extension:
Benchmark equal factor sizes, 1024 x 1024 ndigits:
SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
100 µs
SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
73 µs
Benchmark equal factor sizes, 512 x 512 ndigits:
SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
27 µs
SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
23 µs
Benchmark unequal factor sizes, 2048 x 16384 ndigits:
SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
3.6 ms
SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
2.7 ms
The KARATSUBA_CONDITION was determined through benchmarking on the following architectures:
- Intel Core i9-14900K (desktop)
- AMD Ryzen 9 7950X3D (desktop)
- Apple M1 Max (laptop)
- AWS EC2 m7g.4xlarge (cloud server, AWS Graviton3 CPU)
- AWS EC2 m7i.4xlarge (cloud server, Intel Xeon 4th Gen, Sapphire Rapids)
The images depicting the benchmark plots are rather large, so I've refrained
from including them as attachments. Instead, I've provided URLs to
the benchmarks for direct access:
https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png
This image shows the best possible performance ratio per architecture,
derived using dynamic programming. The black line segments show the manually
crafted threshold function, implemented in both patches as the
KARATSUBA_CONDITION macro; it aims to avoid performance regressions while
still capturing the beneficial regions with a relatively simple rule.
https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/benchmark.png
This plot displays the actual performance ratio per architecture,
measured after applying the mul_var-karatsuba-benchmark.patch.
The performance_ratio in both plots uses a rainbow color scale, where blue
at 1.0 means no change, and the maximum at 4.0 means that the Karatsuba
version was four times faster than the traditional mul_var() on that
architecture. To make performance regressions easier to spot, ratios below
1.0 use a separate magenta scale, going from pure magenta just below 1.0 to
dark at 0.0. I picked magenta for this purpose since it's not part of the
rainbow colors.
/Joel
Attachments:
mul_var-karatsuba.patch
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..08ad102901 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -459,6 +459,30 @@ static const NumericVar const_ninf =
static const int round_powers[4] = {0, 1000, 100, 10};
#endif
+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737
+
+#define KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var1ndigits) > (KARATSUBA_SLOPE) * (var2ndigits) + KARATSUBA_INTERCEPT)
+
+#define KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN1 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN2)
+
+#define KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN2 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN1)
+
+#define KARATSUBA_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
+ (KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits)))
/* ----------
* Local functions
@@ -548,6 +572,14 @@ static void add_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result);
static void sub_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result);
+inline static void split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high);
+static void mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
@@ -8659,6 +8691,219 @@ sub_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
}
+/*
+ * split_var_at() -
+ *
+ * Split a NumericVar into two parts at a specified position.
+ */
+inline static void
+split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high)
+{
+ int high_ndigits = var->ndigits - split_point;
+ int low_ndigits = split_point;
+
+ init_var(high);
+ init_var(low);
+
+ high->ndigits = high_ndigits;
+ high->digits = var->digits;
+ high->buf = NULL;
+ high->weight = var->weight - low_ndigits;
+ high->sign = var->sign;
+ high->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
+
+ low->ndigits = low_ndigits;
+ low->digits = var->digits + high_ndigits;
+ low->buf = NULL;
+ low->weight = var->weight - high_ndigits;
+ low->sign = var->sign;
+ low->dscale = var->dscale;
+}
+
+
+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The algorithm normally starts by checking whether either input is
+ * smaller than NBASE, the base case for the recursion, and if so,
+ * falls back to traditional multiplication.
+ *
+ * That part is handled by the caller in our code, so when this function
+ * is called, we know that var1 and var2 are large enough for Karatsuba
+ * to be used. We also know that var1 is shorter than or equal in length to var2,
+ * which has been arranged by the caller by swapping them if necessary.
+ *
+ * The algorithm then proceeds by splitting var1 and var2 into
+ * two high and low parts, at half the length of the longer input:
+ *
+ * m = max(size_NBASE(var1), size_NBASE(var2))
+ * m2 = floor(m / 2)
+ *
+ * high1, low1 = split_var_at(var1, m2)
+ * high2, low2 = split_var_at(var2, m2)
+ *
+ * z0 = (low1 * low2)
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * z2 = (high1 * high2)
+ *
+ * return (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ */
+static void
+mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high1, low1;
+ NumericVar high2, low2;
+ NumericVar z0, z1, z2;
+ NumericVar temp1, temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&low1);
+ init_var(&low2);
+ init_var(&high1);
+ init_var(&high2);
+ init_var(&z0);
+ init_var(&z1);
+ init_var(&z2);
+ init_var(&temp1);
+ init_var(&temp2);
+
+ split_var_at(var1, m2, &low1, &high1);
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var(&low1, &low2, &z0, low1.dscale + low2.dscale);
+
+ add_var(&low1, &high1, &temp1);
+ add_var(&low2, &high2, &temp2);
+ mul_var(&temp1, &temp2, &z1, temp1.dscale + temp2.dscale);
+
+ mul_var(&high1, &high2, &z2, high1.dscale + high2.dscale);
+
+ set_var_from_var(&z2, &temp1);
+ temp1.weight += m2 * 2;
+
+ sub_var(&z1, &z2, &z1);
+ sub_var(&z1, &z0, &temp2);
+ temp2.weight += m2;
+
+ add_var(&temp1, &temp2, &temp2);
+ add_var(&temp2, &z0, result);
+
+ free_var(&low1);
+ free_var(&low2);
+ free_var(&high1);
+ free_var(&high2);
+ free_var(&z0);
+ free_var(&z1);
+ free_var(&z2);
+ free_var(&temp1);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
+/*
+ * mul_var_karatsuba_half() -
+ *
+ * Karatsuba Multiplication for factors with significant length disparity.
+ *
+ * The Half-Karatsuba Multiplication Algorithm is a specialized case of
+ * the normal Karatsuba multiplication algorithm, designed for the scenario
+ * where var2 has at least twice as many base digits as var1.
+ *
+ * In this case var2 (the longer input) is split into high2 and low2
+ * at m2 (half the length of var2), while var1 (the shorter input)
+ * is used directly without splitting.
+ *
+ * The algorithm then proceeds as follows:
+ *
+ * 1. Compute the product z0 = var1 * low2.
+ * 2. Compute the product temp2 = var1 * high2.
+ * 3. Adjust the weight of temp2 by adding m2 (* NBASE ^ m2)
+ * 4. Add temp2 and z0 to obtain the final result.
+ *
+ * Proof:
+ *
+ * The algorithm can be derived from the original Karatsuba algorithm by
+ * simplifying the formula when the shorter factor var1 is not split into
+ * high and low parts, as shown below.
+ *
+ * Original Karatsuba formula:
+ *
+ * result = (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ *
+ * Substitutions:
+ *
+ * low1 = var1
+ * high1 = 0
+ *
+ * Applying substitutions:
+ *
+ * z0 = (low1 * low2)
+ * = (var1 * low2)
+ *
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * = ((var1 + 0) * (low2 + high2))
+ * = (var1 * low2) + (var1 * high2)
+ *
+ * z2 = (high1 * high2)
+ * = (0 * high2)
+ * = 0
+ *
+ * Simplified using the above substitutions:
+ *
+ * result = (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ * = (0 * NBASE ^ (m2 × 2)) + ((z1 - 0 - z0) * NBASE ^ m2) + z0
+ * = ((z1 - z0) * NBASE ^ m2) + z0
+ * = (var1 * high2) * NBASE ^ m2 + z0
+ */
+static void
+mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high2, low2;
+ NumericVar z0;
+ NumericVar temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&high2);
+ init_var(&low2);
+ init_var(&z0);
+ init_var(&temp2);
+
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var(var1, &low2, &z0, var1->dscale + low2.dscale);
+ mul_var(var1, &high2, &temp2, var1->dscale + high2.dscale);
+ temp2.weight += m2;
+ add_var(&temp2, &z0, result);
+
+ free_var(&high2);
+ free_var(&low2);
+ free_var(&z0);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
/*
* mul_var() -
*
@@ -8747,6 +8992,18 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * Use the Karatsuba algorithm for sufficiently large factors.
+ */
+ if (KARATSUBA_CONDITION(var1ndigits, var2ndigits))
+ {
+ if (var1ndigits * 2 > var2ndigits)
+ mul_var_karatsuba_full(var1, var2, result, rscale);
+ else
+ mul_var_karatsuba_half(var1, var2, result, rscale);
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
mul_var-karatsuba-benchmark.patch
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..c40b062c9a 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -459,6 +459,30 @@ static const NumericVar const_ninf =
static const int round_powers[4] = {0, 1000, 100, 10};
#endif
+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737
+
+#define KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var1ndigits) > (KARATSUBA_SLOPE) * (var2ndigits) + KARATSUBA_INTERCEPT)
+
+#define KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN1 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN2)
+
+#define KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN2 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN1)
+
+#define KARATSUBA_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
+ (KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits)))
/* ----------
* Local functions
@@ -551,6 +575,17 @@ static void sub_var(const NumericVar *var1, const NumericVar *var2,
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
+static void mul_var_karatsuba(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+inline static void split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high);
+static void mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void div_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale, bool round);
@@ -3115,6 +3150,130 @@ numeric_mul_opt_error(Numeric num1, Numeric num2, bool *have_error)
}
+/*
+ * numeric_mul_karatsuba() -
+ *
+ * This function multiplies two numeric values using the Karatsuba algorithm,
+ * designed for efficient handling of large numbers. It's introduced to allow
+ * direct benchmark comparisons with the standard numeric_mul() function.
+ */
+Datum
+numeric_mul_karatsuba(PG_FUNCTION_ARGS)
+{
+ Numeric num1 = PG_GETARG_NUMERIC(0);
+ Numeric num2 = PG_GETARG_NUMERIC(1);
+ Numeric res;
+
+ res = numeric_mul_karatsuba_opt_error(num1, num2, NULL);
+
+ PG_RETURN_NUMERIC(res);
+}
+
+
+/*
+ * numeric_mul_karatsuba_opt_error() -
+ *
+ * Internal version of numeric_mul_karatsuba().
+ * If the "have_error" flag is provided, it is set to true on error and NULL
+ * is returned. This is helpful when the caller needs to handle errors itself.
+ */
+Numeric
+numeric_mul_karatsuba_opt_error(Numeric num1, Numeric num2, bool *have_error)
+{
+ NumericVar arg1;
+ NumericVar arg2;
+ NumericVar result;
+ Numeric res;
+
+ /*
+ * Handle NaN and infinities
+ */
+ if (NUMERIC_IS_SPECIAL(num1) || NUMERIC_IS_SPECIAL(num2))
+ {
+ if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
+ return make_result(&const_nan);
+ if (NUMERIC_IS_PINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* Inf * 0 */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ if (NUMERIC_IS_NINF(num1))
+ {
+ switch (numeric_sign_internal(num2))
+ {
+ case 0:
+ return make_result(&const_nan); /* -Inf * 0 */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+ /* by here, num1 must be finite, so num2 is not */
+ if (NUMERIC_IS_PINF(num2))
+ {
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * Inf */
+ case 1:
+ return make_result(&const_pinf);
+ case -1:
+ return make_result(&const_ninf);
+ }
+ Assert(false);
+ }
+ Assert(NUMERIC_IS_NINF(num2));
+ switch (numeric_sign_internal(num1))
+ {
+ case 0:
+ return make_result(&const_nan); /* 0 * -Inf */
+ case 1:
+ return make_result(&const_ninf);
+ case -1:
+ return make_result(&const_pinf);
+ }
+ Assert(false);
+ }
+
+ /*
+ * Unpack the values, let mul_var() compute the result and return it.
+ * Unlike add_var() and sub_var(), mul_var() will round its result. In the
+ * case of numeric_mul(), which is invoked for the * operator on numerics,
+ * we request exact representation for the product (rscale = sum(dscale of
+ * arg1, dscale of arg2)). If the exact result has more digits after the
+ * decimal point than can be stored in a numeric, we round it. Rounding
+ * after computing the exact result ensures that the final result is
+ * correctly rounded (rounding in mul_var() using a truncated product
+ * would not guarantee this).
+ */
+ init_var_from_num(num1, &arg1);
+ init_var_from_num(num2, &arg2);
+
+ init_var(&result);
+
+ mul_var_karatsuba(&arg1, &arg2, &result, arg1.dscale + arg2.dscale);
+
+ if (result.dscale > NUMERIC_DSCALE_MAX)
+ round_var(&result, NUMERIC_DSCALE_MAX);
+
+ res = make_result_opt_error(&result, have_error);
+
+ free_var(&result);
+
+ return res;
+}
+
+
/*
* numeric_div() -
*
@@ -8659,6 +8818,37 @@ sub_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
}
+/*
+ * split_var_at() -
+ *
+ * Split a NumericVar into two parts at a specified position.
+ */
+inline static void
+split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high)
+{
+ int high_ndigits = var->ndigits - split_point;
+ int low_ndigits = split_point;
+
+ init_var(high);
+ init_var(low);
+
+ high->ndigits = high_ndigits;
+ high->digits = var->digits;
+ high->buf = NULL;
+ high->weight = var->weight - low_ndigits;
+ high->sign = var->sign;
+ high->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
+
+ low->ndigits = low_ndigits;
+ low->digits = var->digits + high_ndigits;
+ low->buf = NULL;
+ low->weight = var->weight - high_ndigits;
+ low->sign = var->sign;
+ low->dscale = var->dscale;
+}
+
+
/*
* mul_var() -
*
@@ -8865,6 +9055,411 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
}
+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The algorithm normally starts by checking whether either input is
+ * smaller than NBASE, the base case for the recursion, and if so,
+ * falls back to traditional multiplication.
+ *
+ * That part is handled by the caller in our code, so when this function
+ * is called, we know that var1 and var2 are large enough for Karatsuba
+ * to be used. We also know that var1 is shorter or of equal length as var2,
+ * which has been arranged by the caller by swapping them if necessary.
+ *
+ * The algorithm then proceeds by splitting var1 and var2 into
+ * two high and low parts, at half the length of the longer input:
+ *
+ * m = max(size_NBASE(var1), size_NBASE(var2))
+ * m2 = floor(m / 2)
+ *
+ * high1, low1 = split_var_at(var1, m2)
+ * high2, low2 = split_var_at(var2, m2)
+ *
+ * z0 = (low1 * low2)
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * z2 = (high1 * high2)
+ *
+ * return (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ */
+static void
+mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high1, low1;
+ NumericVar high2, low2;
+ NumericVar z0, z1, z2;
+ NumericVar temp1, temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&low1);
+ init_var(&low2);
+ init_var(&high1);
+ init_var(&high2);
+ init_var(&z0);
+ init_var(&z1);
+ init_var(&z2);
+ init_var(&temp1);
+ init_var(&temp2);
+
+ split_var_at(var1, m2, &low1, &high1);
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var_karatsuba(&low1, &low2, &z0, low1.dscale + low2.dscale);
+
+ add_var(&low1, &high1, &temp1);
+ add_var(&low2, &high2, &temp2);
+ mul_var_karatsuba(&temp1, &temp2, &z1, temp1.dscale + temp2.dscale);
+
+ mul_var_karatsuba(&high1, &high2, &z2, high1.dscale + high2.dscale);
+
+ set_var_from_var(&z2, &temp1);
+ temp1.weight += m2 * 2;
+
+ sub_var(&z1, &z2, &z1);
+ sub_var(&z1, &z0, &temp2);
+ temp2.weight += m2;
+
+ add_var(&temp1, &temp2, &temp2);
+ add_var(&temp2, &z0, result);
+
+ free_var(&low1);
+ free_var(&low2);
+ free_var(&high1);
+ free_var(&high2);
+ free_var(&z0);
+ free_var(&z1);
+ free_var(&z2);
+ free_var(&temp1);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
+/*
+ * mul_var_karatsuba_half() -
+ *
+ * Karatsuba Multiplication for factors with significant length disparity.
+ *
+ * The Half-Karatsuba Multiplication Algorithm is a specialized case of
+ * the normal Karatsuba multiplication algorithm, designed for the scenario
+ * where var2 has at least twice as many base digits as var1.
+ *
+ * In this case var2 (the longer input) is split into high2 and low2
+ * at m2 (half the length of var2), while var1 (the shorter input)
+ * is used directly without splitting.
+ *
+ * The algorithm then proceeds as follows:
+ *
+ * 1. Compute the product z0 = var1 * low2.
+ * 2. Compute the product temp2 = var1 * high2.
+ * 3. Adjust the weight of temp2 by adding m2 (* NBASE ^ m2)
+ * 4. Add temp2 and z0 to obtain the final result.
+ *
+ * Proof:
+ *
+ * The algorithm can be derived from the original Karatsuba algorithm by
+ * simplifying the formula when the shorter factor var1 is not split into
+ * high and low parts, as shown below.
+ *
+ * Original Karatsuba formula:
+ *
+ * result = (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ *
+ * Substitutions:
+ *
+ * low1 = var1
+ * high1 = 0
+ *
+ * Applying substitutions:
+ *
+ * z0 = (low1 * low2)
+ * = (var1 * low2)
+ *
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * = ((var1 + 0) * (low2 + high2))
+ * = (var1 * low2) + (var1 * high2)
+ *
+ * z2 = (high1 * high2)
+ * = (0 * high2)
+ * = 0
+ *
+ * Simplified using the above substitutions:
+ *
+ * result = (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ * = (0 * NBASE ^ (m2 × 2)) + ((z1 - 0 - z0) * NBASE ^ m2) + z0
+ * = ((z1 - z0) * NBASE ^ m2) + z0
+ * = (var1 * high2) * NBASE ^ m2 + z0
+ */
+static void
+mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high2, low2;
+ NumericVar z0;
+ NumericVar temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&high2);
+ init_var(&low2);
+ init_var(&z0);
+ init_var(&temp2);
+
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var_karatsuba(var1, &low2, &z0, var1->dscale + low2.dscale);
+ mul_var_karatsuba(var1, &high2, &temp2, var1->dscale + high2.dscale);
+ temp2.weight += m2;
+ add_var(&temp2, &z0, result);
+
+ free_var(&high2);
+ free_var(&low2);
+ free_var(&z0);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
+/*
+ * mul_var_karatsuba() -
+ *
+ * Implements Karatsuba multiplication for large numbers, introduced
+ * alongside the unchanged original mul_var(). This function is part of
+ * an optimization effort, allowing direct benchmark comparisons with
+ * mul_var(). It selects full or half Karatsuba based on input size.
+ * This is a temporary measure before considering its replacement of
+ * mul_var() based on benchmark outcomes.
+ */
+static void
+mul_var_karatsuba(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ int res_ndigits;
+ int res_sign;
+ int res_weight;
+ int maxdigits;
+ int *dig;
+ int carry;
+ int maxdig;
+ int newdig;
+ int var1ndigits;
+ int var2ndigits;
+ NumericDigit *var1digits;
+ NumericDigit *var2digits;
+ NumericDigit *res_digits;
+ int i,
+ i1,
+ i2;
+
+ /*
+ * Arrange for var1 to be the shorter of the two numbers. This improves
+ * performance because the inner multiplication loop is much simpler than
+ * the outer loop, so it's better to have a smaller number of iterations
+ * of the outer loop. This also reduces the number of times that the
+ * accumulator array needs to be normalized.
+ */
+ if (var1->ndigits > var2->ndigits)
+ {
+ const NumericVar *tmp = var1;
+
+ var1 = var2;
+ var2 = tmp;
+ }
+
+ /* copy these values into local vars for speed in inner loop */
+ var1ndigits = var1->ndigits;
+ var2ndigits = var2->ndigits;
+ var1digits = var1->digits;
+ var2digits = var2->digits;
+
+ if (var1ndigits == 0 || var2ndigits == 0)
+ {
+ /* one or both inputs is zero; so is result */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /* Determine result sign and (maximum possible) weight */
+ if (var1->sign == var2->sign)
+ res_sign = NUMERIC_POS;
+ else
+ res_sign = NUMERIC_NEG;
+ res_weight = var1->weight + var2->weight + 2;
+
+ /*
+ * Determine the number of result digits to compute. If the exact result
+ * would have more than rscale fractional digits, truncate the computation
+ * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
+ * would only contribute to the right of that. (This will give the exact
+ * rounded-to-rscale answer unless carries out of the ignored positions
+ * would have propagated through more than MUL_GUARD_DIGITS digits.)
+ *
+ * Note: an exact computation could not produce more than var1ndigits +
+ * var2ndigits digits, but we allocate one extra output digit in case
+ * rscale-driven rounding produces a carry out of the highest exact digit.
+ */
+ res_ndigits = var1ndigits + var2ndigits + 1;
+ maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+ MUL_GUARD_DIGITS;
+ res_ndigits = Min(res_ndigits, maxdigits);
+
+ if (res_ndigits < 3)
+ {
+ /* All input digits will be ignored; so result is zero */
+ zero_var(result);
+ result->dscale = rscale;
+ return;
+ }
+
+ /*
+ * Use the Karatsuba algorithm for sufficiently large factors.
+ */
+ if (KARATSUBA_CONDITION(var1ndigits, var2ndigits))
+ {
+ if (var1ndigits * 2 > var2ndigits)
+ mul_var_karatsuba_full(var1, var2, result, rscale);
+ else
+ mul_var_karatsuba_half(var1, var2, result, rscale);
+ return;
+ }
+
+ /*
+ * We do the arithmetic in an array "dig[]" of signed int's. Since
+ * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
+ * to avoid normalizing carries immediately.
+ *
+ * maxdig tracks the maximum possible value of any dig[] entry; when this
+ * threatens to exceed INT_MAX, we take the time to propagate carries.
+ * Furthermore, we need to ensure that overflow doesn't occur during the
+ * carry propagation passes either. The carry values could be as much as
+ * INT_MAX/NBASE, so really we must normalize when digits threaten to
+ * exceed INT_MAX - INT_MAX/NBASE.
+ *
+ * To avoid overflow in maxdig itself, it actually represents the max
+ * possible value divided by NBASE-1, ie, at the top of the loop it is
+ * known that no dig[] entry exceeds maxdig * (NBASE-1).
+ */
+ dig = (int *) palloc0(res_ndigits * sizeof(int));
+ maxdig = 0;
+
+ /*
+ * The least significant digits of var1 should be ignored if they don't
+ * contribute directly to the first res_ndigits digits of the result that
+ * we are computing.
+ *
+ * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
+ * i1+i2+2 of the accumulator array, so we need only consider digits of
+ * var1 for which i1 <= res_ndigits - 3.
+ */
+ for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
+ {
+ NumericDigit var1digit = var1digits[i1];
+
+ if (var1digit == 0)
+ continue;
+
+ /* Time to normalize? */
+ maxdig += var1digit;
+ if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+ {
+ /* Yes, do it */
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ dig[i] = newdig;
+ }
+ Assert(carry == 0);
+ /* Reset maxdig to indicate new worst-case */
+ maxdig = 1 + var1digit;
+ }
+
+ /*
+ * Add the appropriate multiple of var2 into the accumulator.
+ *
+ * As above, digits of var2 can be ignored if they don't contribute,
+ * so we only include digits for which i1+i2+2 < res_ndigits.
+ *
+ * This inner loop is the performance bottleneck for multiplication,
+ * so we want to keep it simple enough so that it can be
+ * auto-vectorized. Accordingly, process the digits left-to-right
+ * even though schoolbook multiplication would suggest right-to-left.
+ * Since we aren't propagating carries in this loop, the order does
+ * not matter.
+ */
+ {
+ int i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+ int *dig_i1_2 = &dig[i1 + 2];
+
+ for (i2 = 0; i2 < i2limit; i2++)
+ dig_i1_2[i2] += var1digit * var2digits[i2];
+ }
+ }
+
+ /*
+ * Now we do a final carry propagation pass to normalize the result, which
+ * we combine with storing the result digits into the output. Note that
+ * this is still done at full precision w/guard digits.
+ */
+ alloc_var(result, res_ndigits);
+ res_digits = result->digits;
+ carry = 0;
+ for (i = res_ndigits - 1; i >= 0; i--)
+ {
+ newdig = dig[i] + carry;
+ if (newdig >= NBASE)
+ {
+ carry = newdig / NBASE;
+ newdig -= carry * NBASE;
+ }
+ else
+ carry = 0;
+ res_digits[i] = newdig;
+ }
+ Assert(carry == 0);
+
+ pfree(dig);
+
+ /*
+ * Finally, round the result to the requested precision.
+ */
+ result->weight = res_weight;
+ result->sign = res_sign;
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+}
+
+
/*
* div_var() -
*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 153d816a05..cab6fb8238 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -4465,6 +4465,9 @@
{ oid => '1726',
proname => 'numeric_mul', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_mul' },
+{ oid => '6312',
+ proname => 'numeric_mul_karatsuba', prorettype => 'numeric',
+ proargtypes => 'numeric numeric', prosrc => 'numeric_mul_karatsuba' },
{ oid => '1727',
proname => 'numeric_div', prorettype => 'numeric',
proargtypes => 'numeric numeric', prosrc => 'numeric_div' },
diff --git a/src/include/utils/numeric.h b/src/include/utils/numeric.h
index 43c75c436f..2b214a7700 100644
--- a/src/include/utils/numeric.h
+++ b/src/include/utils/numeric.h
@@ -97,6 +97,8 @@ extern Numeric numeric_sub_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mul_opt_error(Numeric num1, Numeric num2,
bool *have_error);
+extern Numeric numeric_mul_karatsuba_opt_error(Numeric num1, Numeric num2,
+ bool *have_error);
extern Numeric numeric_div_opt_error(Numeric num1, Numeric num2,
bool *have_error);
extern Numeric numeric_mod_opt_error(Numeric num1, Numeric num2,
Hi Joel, thanks for posting this. Although I have only a cursory
familiarity with fast multiplication algorithms, I'd like to try and
give it a review. To start with, can you help me understand the choice
of this algorithm versus others like Toom? If this looks correct on a
closer view I'll propose it for inclusion. Along the way though I'd like
to have it explicitly called out whether this is superior in general to
other choices, better for more realistic use cases, simpler, clearer to
license or something similar. It would be nice for future discussions to
have some context around whether it would make sense to have conditions
to choose other algorithms as well, or if this one is generally the best
for what Postgres users are usually doing.
Continuing with code review in any case. Interested to hear more.
Regards,
Aaron Altman
On Tue, Jun 11, 2024, at 19:16, Aaron Altman wrote:
Hi Joel, thanks for posting this. Although I have only a cursory
familiarity with fast multiplication algorithms, I'd like to try and
give it a review. To start with, can you help me understand the choice
of this algorithm versus others like Toom? If this looks correct on a
closer view I'll propose it for inclusion. Along the way though I'd like
to have it explicitly called out whether this is superior in general to
other choices, better for more realistic use cases, simpler, clearer to
license or something similar. It would be nice for future discussions to
have some context around whether it would make sense to have conditions
to choose other algorithms as well, or if this one is generally the best
for what Postgres users are usually doing. Continuing with code review in any case. Interested to hear more.
Hi Aaron, thanks for looking at this!
The choice of the best algorithm depends on the factor sizes:
the larger the factors, the more complicated an algorithm can be "afforded".
List of fast multiplication algorithms,
ordered by factor sizes they are suitable for:
- Long multiplication, aka "schoolbook" multiplication.
- Karatsuba
- Toom-3
- Schönhage–Strassen algorithm (Fast Fourier Transform)
The Toom-3 algorithm can be modified to split the smaller and larger factors
into different numbers of parts. The notation used on Wikipedia is e.g. Toom-2.5,
which I think means splitting the larger factor into three parts and the smaller
into two, while GMP uses Toom32 to mean the same thing.
Personally, I think GMP's notation is easier to understand, as the number of
parts can be derived directly from the name.
I experimented with implementing Toom-3 as well, but there was only a marginal
win at very large factor sizes, and since numeric's max ndigits
(number of base-digits) is capped at 32768, I didn't think it was worth
the considerable added complexity.
The Karatsuba algorithm is the next step up in the hierarchy of fast
multiplication algorithms, and all other bigint libs I've looked at implement
Karatsuba, even if they also implement Toom-3, since Karatsuba is faster than
Toom-3 for factors that are sufficiently small, but at the same time
sufficiently large for Karatsuba to beat schoolbook.
I was initially surprised by the quite large threshold, where Karatsuba started
to be a win over schoolbook.
I think the explanation for why mul_var() stays fast up to remarkably high
factor sizes could be a combination of several things, such as:
- mul_var() is already heavily optimized, with clever tricks,
such as deferred carry propagation.
- numeric uses NBASE=10000, while other bigint libs usually use a power of two.
In the Karatsuba implementation, I tried to keep the KARATSUBA_CONDITION()
quite simple, but it's still way more complex than what most bigint libs use;
they usually just check whether the smaller factor is below some threshold
and, if so, use schoolbook. For instance, this is what Rust's num-bigint does:
if x.len() <= 32 {
// Long multiplication
} else if x.len() * 2 <= y.len() {
// Half-Karatsuba, for factors with significant length disparity
} else if x.len() <= 256 {
// Karatsuba multiplication
} else {
// Toom-3 multiplication
}
Source: https://github.com/rust-num/num-bigint/blob/master/src/biguint/multiplication.rs#L101
Side note: When working on Karatsuba in mul_var(), I looked at some other bigint
implementations, to try to understand their threshold functions.
I noticed that Rust's num-bigint didn't optimise for factors with significant
length disparity, so I contributed a patch based on my "Half-Karatsuba" idea,
which I got when working on mul_var(); it has now been merged:
https://github.com/rust-num/num-bigint/commit/06b61c8138ad8a9959ac54d9773d0a9ebe25b346
In mul_var(), if we don't like the complexity of KARATSUBA_CONDITION(),
we could go for a more traditional threshold approach, i.e. just checking
the smaller factor size. However, I believe that would be at the expense
of missing out on some performance gains.
I've tried quite hard to find the best KARATSUBA_CONDITION(), but I found it to
be a really hard problem: the differences between CPU architectures, combined
with wanting a simple expression, mean there is no obvious perfect threshold
function; there will always be a trade-off.
I eventually stopped trying to improve it and settled on the version in the
patch, leaving it up to the community to give feedback on how much complexity
in the threshold function is warranted. If we just want to check the smaller
factor size, like Rust does, then it's super simple, and the threshold can
easily be found by testing different values.
It's when both factor sizes are inputs to the threshold function that it
gets complicated.
/Joel
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
This applies cleanly to master, builds and passes regression tests in my Windows/Cygwin environment.
I also read through comments and confirmed for myself that the assumption about the caller ensuring var1 is shorter is done already in unchanged code from mul_var. Frees correspond to inits. The "Karatsuba condition" reasoning for deciding whether a number is big enough to use this algorithm appears to match what Joel has stated in this thread.
The arithmetic appears to match what's been described in the comments. I have *not* confirmed that with any detailed review of the Karatsuba algorithm from outside sources, other implementations like the Rust one referenced here, or anything similar. I'm hoping that the regression tests give sufficient coverage that if the arithmetic was incorrect there would be obvious failures. If additional coverage was needed, cases falling immediately on either side of the limiting conditions used in the patch would probably be useful. From the limited precedent I've exposed myself to, that doesn't seem to be required here, but I'm open to contrary input from other reviewers. In the meantime, I'm marking this approved.
Thanks for the detailed background and comments, Joel!
The new status of this patch is: Ready for Committer
On Fri, Jun 14, 2024, at 03:07, Aaron Altman wrote:
Thanks for the detailed background and comments, Joel!
The new status of this patch is: Ready for Committer
Thanks for reviewing.
Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
/Joel
Attachments:
0001-numeric-mul_var-karatsuba.patch
From 283eafe9b6f9d7556e1fe7de7085a06bb5b3fdc7 Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Sun, 23 Jun 2024 08:54:54 +0200
Subject: [PATCH] numeric.c mul_var(): Implement Karatsuba Algorithm
This speeds up multiplications for very large factors.
---
src/backend/utils/adt/numeric.c | 257 ++++++++++++++++++++++++++++++++
1 file changed, 257 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..0d8ace9c97 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -459,6 +459,30 @@ static const NumericVar const_ninf =
static const int round_powers[4] = {0, 1000, 100, 10};
#endif
+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737
+
+#define KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var1ndigits) > (KARATSUBA_SLOPE) * (var2ndigits) + KARATSUBA_INTERCEPT)
+
+#define KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN1 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN2)
+
+#define KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN2 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN1)
+
+#define KARATSUBA_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
+ (KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits)))
/* ----------
* Local functions
@@ -548,6 +572,14 @@ static void add_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result);
static void sub_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result);
+inline static void split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high);
+static void mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
@@ -8659,6 +8691,219 @@ sub_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
}
+/*
+ * split_var_at() -
+ *
+ * Split a NumericVar into two parts at a specified position.
+ */
+inline static void
+split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high)
+{
+ int high_ndigits = var->ndigits - split_point;
+ int low_ndigits = split_point;
+
+ init_var(high);
+ init_var(low);
+
+ high->ndigits = high_ndigits;
+ high->digits = var->digits;
+ high->buf = NULL;
+ high->weight = var->weight - low_ndigits;
+ high->sign = var->sign;
+ high->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
+
+ low->ndigits = low_ndigits;
+ low->digits = var->digits + high_ndigits;
+ low->buf = NULL;
+ low->weight = var->weight - high_ndigits;
+ low->sign = var->sign;
+ low->dscale = var->dscale;
+}
+
+
+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The algorithm normally starts by checking if any of the inputs
+ * are smaller than the NBASE, the base case for the recursion,
+ * and if so, fall back to traditional multiplication.
+ *
+ * That part is handled by the caller in our code, so when this function
+ * is called, we know that var1 and var2 are large enough for Karatsuba
+ * to be used. We also know that var1 is shorter or of equal length as var2,
+ * which has been arranged by the caller by swapping them if necessary.
+ *
+ * The algorithm then proceeds by splitting var1 and var2 into
+ * two high and low parts, at half the length of the longer input:
+ *
+ * m = max(size_NBASE(var1), size_NBASE(var2))
+ * m2 = floor(m / 2)
+ *
+ * high1, low1 = split_var_at(var1, m2)
+ * high2, low2 = split_var_at(var2, m2)
+ *
+ * z0 = (low1 * low2)
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * z2 = (high1 * high2)
+ *
+ * return (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ */
+static void
+mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high1, low1;
+ NumericVar high2, low2;
+ NumericVar z0, z1, z2;
+ NumericVar temp1, temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&low1);
+ init_var(&low2);
+ init_var(&high1);
+ init_var(&high2);
+ init_var(&z0);
+ init_var(&z1);
+ init_var(&z2);
+ init_var(&temp1);
+ init_var(&temp2);
+
+ split_var_at(var1, m2, &low1, &high1);
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var(&low1, &low2, &z0, low1.dscale + low2.dscale);
+
+ add_var(&low1, &high1, &temp1);
+ add_var(&low2, &high2, &temp2);
+ mul_var(&temp1, &temp2, &z1, temp1.dscale + temp2.dscale);
+
+ mul_var(&high1, &high2, &z2, high1.dscale + high2.dscale);
+
+ set_var_from_var(&z2, &temp1);
+ temp1.weight += m2 * 2;
+
+ sub_var(&z1, &z2, &z1);
+ sub_var(&z1, &z0, &temp2);
+ temp2.weight += m2;
+
+ add_var(&temp1, &temp2, &temp2);
+ add_var(&temp2, &z0, result);
+
+ free_var(&low1);
+ free_var(&low2);
+ free_var(&high1);
+ free_var(&high2);
+ free_var(&z0);
+ free_var(&z1);
+ free_var(&z2);
+ free_var(&temp1);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
+/*
+ * mul_var_karatsuba_half() -
+ *
+ * Karatsuba Multiplication for factors with significant length disparity.
+ *
+ * The Half-Karatsuba Multiplication Algorithm is a specialized case of
+ * the normal Karatsuba multiplication algorithm, designed for the scenario
+ * where var2 has at least twice as many base digits as var1.
+ *
+ * In this case var2 (the longer input) is split into high2 and low2
+ * at m2 (half the length of var2), and var1 (the shorter input)
+ * is used directly without splitting.
+ *
+ * The algorithm then proceeds as follows:
+ *
+ * 1. Compute the product z0 = var1 * low2.
+ * 2. Compute the product temp2 = var1 * high2.
+ * 3. Adjust the weight of temp2 by adding m2 (* NBASE ^ m2)
+ * 4. Add temp2 and z0 to obtain the final result.
+ *
+ * Proof:
+ *
+ * The algorithm can be derived from the original Karatsuba algorithm by
+ * simplifying the formula when the shorter factor var1 is not split into
+ * high and low parts, as shown below.
+ *
+ * Original Karatsuba formula:
+ *
+ * result = (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ *
+ * Substitutions:
+ *
+ * low1 = var1
+ * high1 = 0
+ *
+ * Applying substitutions:
+ *
+ * z0 = (low1 * low2)
+ * = (var1 * low2)
+ *
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * = ((var1 + 0) * (low2 + high2))
+ * = (var1 * low2) + (var1 * high2)
+ *
+ * z2 = (high1 * high2)
+ * = (0 * high2)
+ * = 0
+ *
+ * Simplified using the above substitutions:
+ *
+ * result = (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ * = (0 * NBASE ^ (m2 × 2)) + ((z1 - 0 - z0) * NBASE ^ m2) + z0
+ * = ((z1 - z0) * NBASE ^ m2) + z0
+ * = (var1 * high2) * NBASE ^ m2 + z0
+ */
+static void
+mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high2, low2;
+ NumericVar z0;
+ NumericVar temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&high2);
+ init_var(&low2);
+ init_var(&z0);
+ init_var(&temp2);
+
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var(var1, &low2, &z0, var1->dscale + low2.dscale);
+ mul_var(var1, &high2, &temp2, var1->dscale + high2.dscale);
+ temp2.weight += m2;
+ add_var(&temp2, &z0, result);
+
+ free_var(&high2);
+ free_var(&low2);
+ free_var(&z0);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
/*
* mul_var() -
*
@@ -8747,6 +8992,18 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * Use the Karatsuba algorithm for sufficiently large factors.
+ */
+ if (KARATSUBA_CONDITION(var1ndigits, var2ndigits))
+ {
+ if (var1ndigits * 2 > var2ndigits)
+ mul_var_karatsuba_full(var1, var2, result, rscale);
+ else
+ mul_var_karatsuba_half(var1, var2, result, rscale);
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
--
2.45.1
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
It's one of these areas where Dean Rasheed would be a good match for a
review, so adding him in CC. He has been doing a lot of stuff in this
area over the years.
+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737
These numbers feel magic, and there are no explanations behind these
choices so it is hard to know whether these are good or not, or if
there are potentially "better" choices. I'd suggest to explain why
these variables are here as well as the basics of the method in this
area of the code, with the function doing the operation pointing at
that, so that all the explanations are in a single place. Okay, these
are thresholds based on the number of digits to decide if the normal
or Karatsuba's method should be used, but grouping all the
explanations in a single place is simpler.
I may have missed something, but did you do some benchmark when the
thresholds are at their limit, where we would fall back to the
calculation method of HEAD? I guess that the answer to my question of
"Is HEAD performing better across these thresholds?" is clearly "no"
based on what I read at [1] and the threshold numbers chosen, still
asking.
[1]: https://en.wikipedia.org/wiki/Karatsuba_algorithm
--
Michael
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
Something to watch out for is that not all callers of mul_var() want
an exact result. Several internal callers request an approximate
result by passing it an rscale value less than the sum of the input
dscales. The schoolbook algorithm handles that by computing up to
rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
the new Karatsuba code always computes the full result and then
rounds. That impacts the performance of various functions, for
example:
select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;
Time: 1790.825 ms (00:01.791) [HEAD]
Time: 2161.244 ms (00:02.161) [with patch]
Looking at mul_var_karatsuba_half(), I don't really like the approach
it takes. The whole correctness proof using the Karatsuba formula
seems to somewhat miss the point that this function isn't actually
implementing the Karatsuba algorithm, it is implementing the
schoolbook algorithm in two steps, by splitting the longer input into
two pieces. But why just split it into two pieces? That will just lead
to a lot of unnecessary recursion for very unbalanced inputs. Instead,
why not split the longer input into N roughly equal sized pieces, each
around the same length as the shorter input, multiplying and adding
them at the appropriate offsets? As an example, given inputs with
var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke
mul_var_karatsuba_half(), which will then recursively invoke mul_var()
twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer
satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook
algorithm on each half, which stands no chance of being any faster. On
the other hand, if it divided var2 into 10 chunks of length 1000, it
would invoke the Karatsuba algorithm on each chunk, which would at
least stand a chance of being faster.
Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
make a lot of sense. For inputs with var1ndigits between 128 and 2000,
and var2ndigits > 9000, this condition will pass and it will
recursively break up the longer input into smaller and smaller pieces
until eventually that condition no longer passes, but none of the
other conditions in KARATSUBA_CONDITION() will pass either, so it'll
just invoke the schoolbook algorithm on each piece, which is bound to
be slower once all the overheads are taken into account. For example,
given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION()
will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse
with var1ndigits = 200 and var2ndigits = 15000, and then again with
var1ndigits = 200 and var2ndigits = 7500, at which point
KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
implemented as it is, that is bound to happen, because each half will
end up having var2ndigits between 4500 and 9000, which fails
KARATSUBA_CONDITION() if var1ndigits < 2000. If
mul_var_karatsuba_half() was replaced by something that recursed with
more balanced chunks, then it might make more sense, though allowing
values of var1ndigits down to 128 doesn't make sense, since the
Karatsuba algorithm will never be invoked for inputs shorter than 384.
Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and
var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied,
so these tests could be simplified, eliminating some of those magic
constants.
However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things. The last time I
looked at the Java source code, for example, they had separate
thresholds for 32-bit and 64-bit platforms, and even that's probably
too crude. Some numeric libraries tune the thresholds for a large
number of different platforms, but that takes a lot of effort. I think
a better approach would be to have a configurable threshold. Ideally,
this would be just one number, with all other numbers being derived
from it, possibly using some simple heuristic to reduce the effective
threshold for more balanced inputs, for which the Karatsuba algorithm
is more efficient.
Having a configurable threshold would allow people to tune for best
performance on their own platforms, and also it would make it easier
to write tests that hit the new code. As it stands, it's not obvious
how much of the new code is being hit by the existing tests.
Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:
digits | rate (HEAD) | rate (patch) | change
--------+---------------+---------------+--------
10 | 6.060014e+06 | 6.0189365e+06 | -0.7%
100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%
1000 | 88640.37 | 90504.82 | +2.1%
1500 | 39885.23 | 41041.504 | +2.9%
1600 | 36355.24 | 33368.28 | -8.2%
2000 | 23308.582 | 23105.932 | -0.9%
3000 | 10765.185 | 11360.11 | +5.5%
4000 | 6118.2554 | 6645.4116 | +8.6%
5000 | 3928.4985 | 4639.914 | +18.1%
10000 | 1003.80164 | 1431.9335 | +42.7%
20000 | 255.46135 | 456.23462 | +78.6%
30000 | 110.69313 | 226.53398 | +104.7%
40000 | 62.29333 | 148.12916 | +137.8%
50000 | 39.867493 | 95.16788 | +138.7%
60000 | 27.7672 | 74.01282 | +166.5%
The Karatsuba algorithm kicks in at 384*4 = 1536 decimal digits, so
presumably the variations below that are just noise, but this does
seem to suggest that KARATSUBA_BASE_LIMIT = 384 is too low for me, and
I'd probably want it to be something like 500-700.
There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs. That's not currently the case when computing approximate
results with a reduced rscale. That's fixable, but it's not obvious
whether or not the Karatsuba algorithm can actually be made beneficial
when computing such approximate results.
There's a wider question as to how many people use such big numeric
values -- i.e., how many people are actually going to benefit from
this? I don't have a good feel for that.
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs.
That seems like an impossible standard to meet. What we'd probably
have to do is enable Karatsuba only when mul_var is being asked
for an exact (full-precision) result. This'd complicate the check
condition further, possibly reaching the point where there is a
visible drag on performance in the non-Karatsuba case.
Another possible source of drag: if mul_var is now recursive,
does it need a stack depth check? If we can prove that the
number of recursion levels is no more than O(log(N)) in the
number of input digits, it's probably safe to skip that ...
but I see no such proof here. (In general I find this patch
seriously undercommented.)
There's a wider question as to how many people use such big numeric
values -- i.e., how many people are actually going to benefit from
this? I don't have a good feel for that.
I have heard of people doing calculations on bignum integers in
Postgres, but they are very few and far between --- usually that
sort of task is better done in another language. (No matter how
fast we make mul_var, the general overhead of SQL expressions in
general and type numeric in particular means it's probably not
the right tool for heavy-duty bignum arithmetic.)
There is definitely an argument to be made that this proposal is
not worth the development effort and ongoing maintenance effort
we'd have to sink into it.
regards, tom lane
On Sat, 29 Jun 2024 at 16:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs.
That seems like an impossible standard to meet. What we'd probably
have to do is enable Karatsuba only when mul_var is being asked
for an exact (full-precision) result.
Yeah, using Karatsuba for approximate products is probably a bit too
ambitious. I think it'd be reasonably straightforward to have it
produce the same results for all rscale values. You'd just have to
decide on the required rscale for each sub-product, based on where
it's being added, truncating at various points, and being sure to only
round once at the very end. The problem is that it'd end up computing
a larger fraction of the full product than the schoolbook algorithm
would have done, so the threshold for using Karatsuba would have to be
higher (probably quite a lot higher) and figuring out how that varied
with the requested rscale would be hard.
So using Karatsuba only in the full-precision case seems like a
reasonable restriction. That'd still benefit some other functions like
sqrt() and therefore ln() and pow() to some extent. However,...
There is definitely an argument to be made that this proposal is
not worth the development effort and ongoing maintenance effort
we'd have to sink into it.
I'm leaning more towards this opinion, especially since I think the
patch needs a lot more work to be committable.
Regards,
Dean
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
Attached, rebased version of the patch that implements the Karatsuba algorithm in numeric.c's mul_var().
Something to watch out for is that not all callers of mul_var() want
an exact result. Several internal callers request an approximate
result by passing it an rscale value less than the sum of the input
dscales. The schoolbook algorithm handles that by computing up to
rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
the new Karatsuba code always computes the full result and then
rounds. That impacts the performance of various functions, for
example:
select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;
Time: 1790.825 ms (00:01.791) [HEAD]
Time: 2161.244 ms (00:02.161) [with patch]
Oops. Thanks for spotting this and clarifying.
I read Tom's reply and note that we should only do Karatsuba when
an exact (full-precision) result is requested.
Looking at mul_var_karatsuba_half(), I don't really like the approach
it takes. The whole correctness proof using the Karatsuba formula
seems to somewhat miss the point that this function isn't actually
implementing the Karatsuba algorithm, it is implementing the
schoolbook algorithm in two steps, by splitting the longer input into
two pieces.
The surprising realization here is that there are actually (var1ndigits, var2ndigits)
combinations where *only* doing mul_var_karatsuba_half() recursively
all the way down to schoolbook *is* a performance win,
even though we don't do any mul_var_karatsuba_full().
mul_var_karatsuba_half() *is* actually implementing the exact
Karatsuba formula; it's just taking a shortcut, exploiting the fact,
known in advance, that splitting `var1` at `m2` would result in
`high1` being zero. This allows the provably correct substitutions
to be made, which avoids the meaningless computations.
But why just split it into two pieces? That will just lead
to a lot of unnecessary recursion for very unbalanced inputs. Instead,
why not split the longer input into N roughly equal sized pieces, each
around the same length as the shorter input, multiplying and adding
them at the appropriate offsets?
The approach you're describing is implemented by e.g. CPython
and is called "lopsided" in their code base. It has somewhat different
performance characteristics compared to the recursive Half-Karatsuba
approach.
What I didn't like about lopsided is the degenerate case where the
last chunk is much shorter than var1. For example, if we pretend
we would be doing Karatsuba all the way down to ndigits = 2,
and consider var1ndigits = 3 and var2ndigits = 10,
then lopsided would do
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=1
whereas Half-Karatsuba would do
var1ndigits=3 var2ndigits=5
var1ndigits=3 var2ndigits=5
You can of course find contrary examples where lopsided is better than
Half-Karatsuba; neither seems substantially better than the other.
My measurements indicated that Half-Karatsuba was the marginal winner
overall on the architectures I tested, but the two were very similar,
with almost the same number of "wins" and "losses" across different
(var1ndigits, var2ndigits) combinations.
Note that even with lopsided, there will still be recursion, since Karatsuba
is a recursive algorithm, so to satisfy Tom Lane's request about
proving the recursion is limited, we will still need to prove the same
thing for lopsided+Karatsuba.
Here is some old code from my experiments, if we want to evaluate lopsided:
```
static void slice_var(const NumericVar *var, int start, int length,
					  NumericVar *slice);
static void mul_var_lopsided(const NumericVar *var1, const NumericVar *var2,
							 NumericVar *result);

/*
 * slice_var() -
 *
 *	Extract a slice of a NumericVar starting at a specified position
 *	and with a specified length.
 */
static void
slice_var(const NumericVar *var, int start, int length,
		  NumericVar *slice)
{
	Assert(start >= 0);
	Assert(start + length <= var->ndigits);

	init_var(slice);

	slice->ndigits = length;
	slice->digits = var->digits + start;
	slice->buf = NULL;
	slice->weight = var->weight - var->ndigits + length;
	slice->sign = var->sign;
	slice->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
}

/*
 * mul_var_lopsided() -
 *
 *	Lopsided multiplication for unequal-length factors.
 *
 *	This function handles the case where var1 has significantly fewer digits
 *	than var2. In such a scenario, splitting var1 for a balanced
 *	multiplication algorithm would be inefficient, as the high part would
 *	be zero.
 *
 *	To overcome this inefficiency, the function divides var2 into a series
 *	of slices, each containing the same number of digits as var1, and
 *	multiplies var1 with each slice one at a time. As a result, the
 *	recursive call to mul_var() will have balanced inputs, which improves
 *	the performance of divide-and-conquer algorithms such as Karatsuba.
 */
static void
mul_var_lopsided(const NumericVar *var1, const NumericVar *var2,
				 NumericVar *result)
{
	int			var1ndigits = var1->ndigits;
	int			var2ndigits = var2->ndigits;
	int			processed = 0;
	int			remaining = var2ndigits;
	int			length;
	NumericVar	slice;
	NumericVar	product;
	NumericVar	sum;

	Assert(var1ndigits <= var2ndigits);
	Assert(var1ndigits > MUL_SMALL);
	Assert(var1ndigits * 2 <= var2ndigits);

	init_var(&slice);
	init_var(&product);
	init_var(&sum);

	while (remaining > 0)
	{
		length = Min(remaining, var1ndigits);
		slice_var(var2, var2ndigits - processed - length, length, &slice);
		mul_var(var1, &slice, &product, var1->dscale + slice.dscale);
		product.weight += processed;
		add_var(&sum, &product, &sum);
		remaining -= length;
		processed += length;
	}

	set_var_from_var(&sum, result);

	free_var(&slice);
	free_var(&product);
	free_var(&sum);
}
```
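For what it's worth, the accumulation scheme in the loop above can be mirrored with plain integers. This is a hypothetical sketch, not taken from the patch, with `mul_lopsided()` standing in for the NumericVar version: the longer factor is consumed from its least-significant end in chunks of at most `chunk_digits` base-B digits, and each partial product is shifted by the number of digits already processed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Integer analogue of mul_var_lopsided(): multiply a by b (in base B) by
 * slicing b into chunks of at most chunk_digits base-B digits, starting
 * at the least significant end, and accumulating each partial product
 * shifted by the number of digits already processed.
 */
static uint64_t
mul_lopsided(uint64_t a, uint64_t b, uint64_t B, int chunk_digits)
{
	uint64_t	chunk_base = 1;
	uint64_t	sum = 0;
	uint64_t	shift = 1;		/* B^processed */

	for (int i = 0; i < chunk_digits; i++)
		chunk_base *= B;

	while (b > 0)
	{
		uint64_t	slice = b % chunk_base;

		sum += a * slice * shift;
		b /= chunk_base;
		shift *= chunk_base;
	}
	return sum;
}
```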
As an example, given inputs with
var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke
mul_var_karatsuba_half(), which will then recursively invoke mul_var()
twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer
satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook
algorithm on each half, which stands no chance of being any faster. On
the other hand, if it divided var2 into 10 chunks of length 1000, it
would invoke the Karatsuba algorithm on each chunk, which would at
least stand a chance of being faster.
Interesting example!
Indeed only mul_var_karatsuba_half() will be called with the inputs:
var1ndigits=1000 var2ndigits=10000
var1ndigits=1000 var2ndigits=5000
It will never call mul_var_karatsuba_full().
Surprisingly, this still gives a 13% speed-up on an Intel Core i9-14900K.
This performance gain comes from the splitting of the larger factor.
Here is how I benchmarked using pg-timeit [1] and the
mul_var-karatsuba-benchmark.patch [2] from my original post:
To test the patch, you have to edit pg_proc.dat for
numeric_mul_karatsuba and give it a new unique oid.
```
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b - 1 AS execution_time_difference
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_karatsuba',
input_values := ARRAY[
random_ndigits(1000)::TEXT,
random_ndigits(10000)::TEXT
],
min_time := 1000000,
timeout := '10 s'
);
-[ RECORD 1 ]-------------+-------------------
execution_time_a | 976 µs
execution_time_b | 864 µs
execution_time_difference | 0.1294294200936600
```
The KARATSUBA_CONDITION tries to capture the interesting region of
performance gains, as observed from measurements [3], in an expression
that is not too complex.
In the image [3], the purple-to-black area shows performance regressions,
and the rainbow colors show performance gains.
The black colored line segment is the KARATSUBA_CONDITION,
which tries to capture the three performance gain regions,
as defined by:
KARATSUBA_LOW_RANGE_CONDITION
KARATSUBA_MIDDLE_RANGE_CONDITION
KARATSUBA_HIGH_RANGE_CONDITION
Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
make a lot of sense. For inputs with var1ndigits between 128 and 2000,
and var2ndigits > 9000, this condition will pass and it will
recursively break up the longer input into smaller and smaller pieces
until eventually that condition no longer passes, but none of the
other conditions in KARATSUBA_CONDITION() will pass either, so it'll
just invoke the schoolbook algorithm on each piece, which is bound to
be slower once all the overheads are taken into account. For example,
given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION()
will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse
with var1ndigits = 200 and var2ndigits = 15000, and then again with
var1ndigits = 200 and var2ndigits = 7500, at which point
KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
implemented as it is, that is bound to happen, because each half will
end up having var2ndigits between 4500 and 9000, which fails
KARATSUBA_CONDITION() if var1ndigits < 2000. If
mul_var_karatsuba_half() was replaced by something that recursed with
more balanced chunks, then it might make more sense, though allowing
values of var1ndigits down to 128 doesn't make sense, since the
Karatsuba algorithm will never be invoked for inputs shorter than 384.
As explained above, mul_var_karatsuba_half() is not meaningless,
even in cases where we never reach mul_var_karatsuba_full().
Regarding the last comment on 128 and 384, I think you're reading the
conditions wrong; note that 128 is for var1ndigits while 384 is for var2ndigits:
+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
...
+ ((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
...
+ (var1ndigits) > KARATSUBA_VAR1_MIN1)
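For reference, the threshold macros can be lifted out of the patch into a standalone form and checked against the example sizes discussed in this thread. The macros below are copied verbatim from the patch; the `karatsuba_condition()` wrapper is only for testing:

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold constants and conditions, copied from the patch. */
#define KARATSUBA_BASE_LIMIT 384
#define KARATSUBA_VAR1_MIN1 128
#define KARATSUBA_VAR1_MIN2 2000
#define KARATSUBA_VAR2_MIN1 2500
#define KARATSUBA_VAR2_MIN2 9000
#define KARATSUBA_SLOPE 0.764
#define KARATSUBA_INTERCEPT 90.737

#define KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) \
	((var1ndigits) > (KARATSUBA_SLOPE) * (var2ndigits) + KARATSUBA_INTERCEPT)

#define KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) \
	((var2ndigits) > KARATSUBA_VAR2_MIN1 && \
	 (var1ndigits) > KARATSUBA_VAR1_MIN2)

#define KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits) \
	((var2ndigits) > KARATSUBA_VAR2_MIN2 && \
	 (var1ndigits) > KARATSUBA_VAR1_MIN1)

#define KARATSUBA_CONDITION(var1ndigits, var2ndigits) \
	((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
	 (KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) || \
	  KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) || \
	  KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits)))

/* Test wrapper, not part of the patch. */
static bool
karatsuba_condition(int var1ndigits, int var2ndigits)
{
	return KARATSUBA_CONDITION(var1ndigits, var2ndigits);
}
```

This confirms the recursion traces above: (1000, 10000) passes via the high-range condition, while (1000, 5000) fails all three; (200, 30000) and (200, 15000) pass, and (200, 7500) fails.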
Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and
var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied,
so these tests could be simplified, eliminating some of those magic
constants.
Yes, I realized that myself too, but chose to keep the start and end
boundaries for each of the three ranges. Since it's just boolean logic,
with constants, I think the compiler should be smart enough to optimize
away the redundancy, but it might be better to keep the redundant condition
as a comment instead of actual code; I have no strong opinion on what's best.
However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things. The last time I
looked at the Java source code, for example, they had separate
thresholds for 32-bit and 64-bit platforms, and even that's probably
too crude. Some numeric libraries tune the thresholds for a large
number of different platforms, but that takes a lot of effort. I think
a better approach would be to have a configurable threshold. Ideally,
this would be just one number, with all other numbers being derived
from it, possibly using some simple heuristic to reduce the effective
threshold for more balanced inputs, for which the Karatsuba algorithm
is more efficient.
Having a configurable threshold would allow people to tune for best
performance on their own platforms, and also it would make it easier
to write tests that hit the new code. As it stands, it's not obvious
how much of the new code is being hit by the existing tests.
Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:
digits | rate (HEAD) | rate (patch) | change
--------+---------------+---------------+--------
10 | 6.060014e+06 | 6.0189365e+06 | -0.7%
100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%
1000 | 88640.37 | 90504.82 | +2.1%
1500 | 39885.23 | 41041.504 | +2.9%
1600 | 36355.24 | 33368.28 | -8.2%
2000 | 23308.582 | 23105.932 | -0.9%
3000 | 10765.185 | 11360.11 | +5.5%
4000 | 6118.2554 | 6645.4116 | +8.6%
5000 | 3928.4985 | 4639.914 | +18.1%
10000 | 1003.80164 | 1431.9335 | +42.7%
20000 | 255.46135 | 456.23462 | +78.6%
30000 | 110.69313 | 226.53398 | +104.7%
40000 | 62.29333 | 148.12916 | +137.8%
50000 | 39.867493 | 95.16788 | +138.7%
60000 | 27.7672 | 74.01282 | +166.5%
The Karatsuba algorithm kicks in at 384*4 = 1536 decimal digits, so
presumably the variations below that are just noise, but this does
seem to suggest that KARATSUBA_BASE_LIMIT = 384 is too low for me, and
I'd probably want it to be something like 500-700.
I've tried hard to reduce the magic in these constants
by benchmarking on numerous architectures, and picking them
manually by making a balanced judgement about what complexity
could be acceptable for the threshold function,
and which performance gains are important to try to capture.
I think this approach is actually less magical than the hard-coded
single value constants I've seen in many other numeric libraries,
where it's not clear at all what the full two dimensional performance
image looks like.
I considered if my initial post on this should propose a patch with a
simple threshold function, that just checks if var1ndigits is larger
than some constant, like many other numeric libraries do.
However, I decided I should at least try to do something smarter,
since it seemed possible.
There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs. That's not currently the case when computing approximate
results with a reduced rscale. That's fixable, but it's not obvious
whether or not the Karatsuba algorithm can actually be made beneficial
when computing such approximate results.
I read Tom's reply on this part, and understand we can only do Karatsuba
if full rscale is desired.
There's a wider question as to how many people use such big numeric
values -- i.e., how many people are actually going to benefit from
this? I don't have a good feel for that.
Personally, I started working on this because I wanted to use numeric
to implement the ECDSA verify algorithm in PL/pgSQL, to avoid
a dependency on a C extension that I discovered could segfault.
Unfortunately, Karatsuba didn't help much for this particular case,
since the ECDSA factors are not big enough.
I ended up implementing ECDSA verify as a pgrx extension instead.
However, I can imagine other crypto algorithms might require larger factors,
as well as other scientific research use cases, such as astronomy and physics,
that could desire storage of numeric values of very high precision.
Toom-3 is probably overkill, since we "only" support up to 32768 base digits,
but I think we should at least consider optimizing for the full range of
values that numeric supports.
Regards,
Joel
[1]: https://github.com/joelonsql/pg-timeit
[2]: /messages/by-id/attachment/159528/mul_var-karatsuba-benchmark.patch
[3]: https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png
On 2024-Jun-30, Joel Jacobson wrote:
However, I can imagine other crypto algorithms might require larger factors,
as well as other scientific research use cases, such as astronomy and physics,
that could desire storage of numeric values of very high precision.
FWIW I was talking to some people with astronomical databases, and for
reasons that seem pretty relatable they prefer to do all numerical
analysis outside of the database, and keep the database server only for
the data storage and retrieval parts. It seems a really hard sell that
they would try to do that analysis in the database, because the
paralellizability aspect is completely different for the analysis part
than for the database part -- I mean, they can easily add more servers
to do photography analysis, but trying to add database servers (to cope
with additional load from doing the analysis there) is a much harder
sell.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
There's another complication though (if the threshold is made
configurable): the various numeric functions that use mul_var() are
immutable, which means that the results from the Karatsuba algorithm
must match those from the schoolbook algorithm exactly, for all
inputs.
That seems like an impossible standard to meet. What we'd probably
have to do is enable Karatsuba only when mul_var is being asked
for an exact (full-precision) result. This'd complicate the check
condition further, possibly reaching the point where there is a
visible drag on performance in the non-Karatsuba case.
OK, noted that we should only do Karatsuba when mul_var is
being asked for an exact (full-precision) result.
Another possible source of drag: if mul_var is now recursive,
does it need a stack depth check? If we can prove that the
number of recursion levels is no more than O(log(N)) in the
number of input digits, it's probably safe to skip that ...
but I see no such proof here.
(In general I find this patch seriously undercommented.)
Yes, the #define's around KARATSUBA_CONDITION need
to be documented, but I felt this region of the patch might evolve
and needs some discussion first, to determine whether this level of
complexity is acceptable.
However, I think the comments above split_var_at(),
mul_var_karatsuba_full() and mul_var_karatsuba_half()
are quite good already, what do you think?
There's a wider question as to how many people use such big numeric
values -- i.e., how many people are actually going to benefit from
this? I don't have a good feel for that.
I have heard of people doing calculations on bignum integers in
Postgres, but they are very few and far between --- usually that
sort of task is better done in another language. (No matter how
fast we make mul_var, the general overhead of SQL expressions in
general and type numeric in particular means it's probably not
the right tool for heavy-duty bignum arithmetic.)
There is definitely an argument to be made that this proposal is
not worth the development effort and ongoing maintenance effort
we'd have to sink into it.
It's a good question. I'm not sure what I think; maybe the status quo is best,
but it feels like something could be done at least.
The patch basically consists of three parts:
- Threshold function KARATSUBA_CONDITION
This is a bit unorthodox, since it's more ambitious than many popular numeric
libraries. I've only found GMP to be even more complex.
- mul_var_karatsuba_half()
Since it's provably correct using simple school-grade substitutions,
I don't think this function is unorthodox, and it shouldn't need to change,
unless we change the definition of NumericVar in the future.
- mul_var_karatsuba()
This follows the canonical pseudo-code at Wikipedia for the Karatsuba
algorithm precisely, so nothing unorthodox here either.
So, I think the KARATSUBA_CONDITION requires more development and maintenance
effort than the rest of the patch, since it's based on measurements
on numerous architectures, which will be different in the future.
The rest of the patch, split_var_at(), mul_var_karatsuba_full(), and
mul_var_karatsuba_half(), should require much less effort to maintain,
since they should remain the same
even when we need to support new architectures.
I'm eager to hear your thoughts after you've also read my other reply moments ago to Dean.
Regards,
Joel
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things.
...
Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:
digits | rate (HEAD) | rate (patch) | change
--------+---------------+---------------+--------
10 | 6.060014e+06 | 6.0189365e+06 | -0.7%
100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%
Does the PostgreSQL community these days have access to some kind
of performance farm, covering some/all of the supported hardware architectures?
Personally, I have three machines:
MacBook Pro M3 Max
Intel Core i9-14900K
AMD Ryzen 9 7950X3D
In addition I usually spin up a few AWS instances of different types,
but this is scary, because one time I forgot to turn them off for a week,
which was quite costly.
Would be much nicer with a performance farm!
If one exists, please let me know and no need to read the rest of this email.
Otherwise:
Imagine if we could send a patch to a separate mailing list,
and the system would auto-detect what catalog functions are affected,
and automatically generate a performance report, showing the delta per platform.
Binary functions, like numeric_mul(), should generate an image where the two
axes would be the size of the inputs, and the color of each pixel should show
the performance gain/loss, whereas unary functions like sqrt() should have
the size of the input as the x-axis and performance gain/loss as the y-axis.
How to test each catalog function would of course need to be designed
manually, but maybe the detection of affected functions could be
automated, if accepting some false positives/negatives, i.e. benchmarking
too many or too few catalog functions, given a certain patch.
Catalog functions are just a tiny part of PostgreSQL, so there should
of course be other tests as well to cover other things, but since they are
simple to test predictably, maybe it could be a good start for the project,
even if it's far from the most important thing to benchmark.
I found an old performance farm topic from 2012, but it seems the discussion
just stopped for some reason that is not clear to me.
Regards,
Joel
"Joel Jacobson" <joel@compiler.org> writes:
On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
(In general I find this patch seriously undercommented.)
However, I think the comments above split_var_at(),
mul_var_karatsuba_full() and mul_var_karatsuba_half()
are quite good already, what do you think?
Not remarkably so. For starters, heaven help the reader who has
no idea what "the Karatsuba algorithm" refers to. Nor is there
any mention of why (or when) it's better than the traditional
algorithm. You could at least do people the courtesy of providing
a link to the wikipedia article that you're assuming they've
memorized.
There's also a discussion to be had about why Karatsuba is
a better choice than other divide-and-conquer multiplication
methods. Why not Toom-Cook, for example, which the aforesaid
wikipedia page says is faster yet? I suppose you concluded
that the extra complexity is unwarranted, but this is the
sort of thing I'd expect to see explained in the comments.
regards, tom lane
On Sun, Jun 30, 2024, at 17:44, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
(In general I find this patch seriously undercommented.)
However, I think the comments above split_var_at(),
mul_var_karatsuba_full() and mul_var_karatsuba_half()
are quite good already, what do you think?
Not remarkably so. For starters, heaven help the reader who has
no idea what "the Karatsuba algorithm" refers to. Nor is there
any mention of why (or when) it's better than the traditional
algorithm. You could at least do people the courtesy of providing
a link to the wikipedia article that you're assuming they've
memorized.
Thanks for guidance. New patch attached.
I've added as an introduction to Karatsuba:
+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The Karatsuba algorithm is a divide-and-conquer algorithm that reduces
+ * the complexity of large number multiplication. It splits each number
+ * into two halves and performs three multiplications on the parts,
+ * rather than four as in the traditional method. This results in
+ * a significant performance improvement for sufficiently large numbers.
...
+ * For more details on the Karatsuba algorithm, see the Wikipedia article:
+ * https://en.wikipedia.org/wiki/Karatsuba_algorithm
There's also a discussion to be had about why Karatsuba is
a better choice than other divide-and-conquer multiplication
methods. Why not Toom-Cook, for example, which the aforesaid
wikipedia page says is faster yet? I suppose you concluded
that the extra complexity is unwarranted, but this is the
sort of thing I'd expect to see explained in the comments.
I've added this to the end of the comment on mul_var_karatsuba_full():
+ * The Karatsuba algorithm is preferred over other divide-and-conquer methods
+ * like Toom-Cook for this implementation due to its balance of complexity and
+ * performance gains given Numeric's constraints.
+ *
+ * For Toom-Cook to be worth the added complexity, the factors would need to
+ * be much larger than supported by Numeric, making Karatsuba a more
+ * appropriate choice.
Also, I added this comment on the #define's at the beginning:
+/*
+ * Constants used to determine when the Karatsuba algorithm should be used
+ * for multiplication. These thresholds were determined empirically through
+ * benchmarking across various architectures, aiming to avoid performance
+ * regressions while capturing potential gains. The choice of these values
+ * involves trade-offs and balances simplicity and performance.
+ */
As well as this comment on KARATSUBA_CONDITION:
+/*
+ * KARATSUBA_CONDITION() -
+ *
+ * This macro determines if the Karatsuba algorithm should be applied
+ * based on the number of digits in the multiplicands. It checks if
+ * the number of digits in the larger multiplicand exceeds a base limit
+ * and if the sizes of the multiplicands fall within specific ranges
+ * where Karatsuba multiplication is usually beneficial.
+ *
+ * The conditions encapsulated by KARATSUBA_CONDITION are:
+ * 1. The larger multiplicand has more digits than the base limit.
+ * 2. The sizes of the multiplicands fall within low, middle, or high range
+ * conditions which were identified as performance beneficial regions during
+ * benchmarks.
+ *
+ * The macro ensures that the algorithm is applied only when it is likely
+ * to provide performance benefits, considering the size and ratio of the
+ * factors.
+ */
What do you think?
Regards,
Joel
Attachments:
v2-0001-numeric-mul_var-karatsuba.patch
From 430c40c1fd3969c544a642fe62db925ba1a4d0f9 Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Sun, 30 Jun 2024 20:11:32 +0200
Subject: [PATCH] numeric.c mul_var(): Implement Karatsuba Algorithm
---
src/backend/utils/adt/numeric.c | 301 ++++++++++++++++++++++++++++++++
1 file changed, 301 insertions(+)
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..d9983a9535 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -459,6 +459,56 @@ static const NumericVar const_ninf =
static const int round_powers[4] = {0, 1000, 100, 10};
#endif
+/*
+ * Constants used to determine when the Karatsuba algorithm should be used
+ * for multiplication. These thresholds were determined empirically through
+ * benchmarking across various architectures, aiming to avoid performance
+ * regressions while capturing potential gains. The choice of these values
+ * involves trade-offs and balances simplicity and performance.
+ */
+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737
+
+#define KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var1ndigits) > (KARATSUBA_SLOPE) * (var2ndigits) + KARATSUBA_INTERCEPT)
+
+#define KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN1 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN2)
+
+#define KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) > KARATSUBA_VAR2_MIN2 && \
+ (var1ndigits) > KARATSUBA_VAR1_MIN1)
+
+/*
+ * KARATSUBA_CONDITION() -
+ *
+ * This macro determines if the Karatsuba algorithm should be applied
+ * based on the number of digits in the multiplicands. It checks if
+ * the number of digits in the larger multiplicand exceeds a base limit
+ * and if the sizes of the multiplicands fall within specific ranges
+ * where Karatsuba multiplication is usually beneficial.
+ *
+ * The conditions encapsulated by KARATSUBA_CONDITION are:
+ * 1. The larger multiplicand has more digits than the base limit.
+ * 2. The sizes of the multiplicands fall within low, middle, or high range
+ * conditions which were identified as performance beneficial regions during
+ * benchmarks.
+ *
+ * The macro ensures that the algorithm is applied only when it is likely
+ * to provide performance benefits, considering the size and ratio of the
+ * factors.
+ */
+#define KARATSUBA_CONDITION(var1ndigits, var2ndigits) \
+ ((var2ndigits) >= KARATSUBA_BASE_LIMIT && \
+ (KARATSUBA_LOW_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_MIDDLE_RANGE_CONDITION(var1ndigits, var2ndigits) || \
+ KARATSUBA_HIGH_RANGE_CONDITION(var1ndigits, var2ndigits)))
/* ----------
* Local functions
@@ -548,6 +598,14 @@ static void add_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result);
static void sub_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result);
+inline static void split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high);
+static void mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
+static void mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result,
+ int rscale);
static void mul_var(const NumericVar *var1, const NumericVar *var2,
NumericVar *result,
int rscale);
@@ -8659,6 +8717,237 @@ sub_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
}
+/*
+ * split_var_at() -
+ *
+ * Split a NumericVar into two parts at a specified position.
+ */
+inline static void
+split_var_at(const NumericVar *var, int split_point,
+ NumericVar *low, NumericVar *high)
+{
+ int high_ndigits = var->ndigits - split_point;
+ int low_ndigits = split_point;
+
+ init_var(high);
+ init_var(low);
+
+ high->ndigits = high_ndigits;
+ high->digits = var->digits;
+ high->buf = NULL;
+ high->weight = var->weight - low_ndigits;
+ high->sign = var->sign;
+ high->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
+
+ low->ndigits = low_ndigits;
+ low->digits = var->digits + high_ndigits;
+ low->buf = NULL;
+ low->weight = var->weight - high_ndigits;
+ low->sign = var->sign;
+ low->dscale = var->dscale;
+}
+
+
+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The Karatsuba algorithm is a divide-and-conquer algorithm that reduces
+ * the complexity of large number multiplication. It splits each number
+ * into two halves and performs three multiplications on the parts,
+ * rather than four as in the traditional method. This results in
+ * a significant performance improvement for sufficiently large numbers.
+ *
+ * The algorithm normally starts by checking whether either of the inputs
+ * is smaller than NBASE, the base case for the recursion,
+ * and, if so, falling back to traditional multiplication.
+ *
+ * That part is handled by the caller in our code, so when this function
+ * is called, we know that var1 and var2 are large enough for Karatsuba
+ * to be used. We also know that var1 is shorter or of equal length as var2,
+ * which has been arranged by the caller by swapping them if necessary.
+ *
+ * The algorithm then proceeds by splitting var1 and var2 into
+ * two high and low parts, at half the length of the longer input:
+ *
+ * m = max(size_NBASE(var1), size_NBASE(var2))
+ * m2 = floor(m / 2)
+ *
+ * high1, low1 = split_var_at(var1, m2)
+ * high2, low2 = split_var_at(var2, m2)
+ *
+ * z0 = (low1 * low2)
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * z2 = (high1 * high2)
+ *
+ * return (z2 * NBASE ^ (m2 × 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ *
+ * For more details on the Karatsuba algorithm, see the Wikipedia article:
+ * https://en.wikipedia.org/wiki/Karatsuba_algorithm
+ *
+ * The Karatsuba algorithm is preferred over other divide-and-conquer methods
+ * like Toom-Cook for this implementation due to its balance of complexity and
+ * performance gains given Numeric's constraints.
+ *
+ * For Toom-Cook to be worth the added complexity, the factors would need to
+ * be much larger than supported by Numeric, making Karatsuba a more
+ * appropriate choice.
+ *
+ */
+static void
+mul_var_karatsuba_full(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high1, low1;
+ NumericVar high2, low2;
+ NumericVar z0, z1, z2;
+ NumericVar temp1, temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&low1);
+ init_var(&low2);
+ init_var(&high1);
+ init_var(&high2);
+ init_var(&z0);
+ init_var(&z1);
+ init_var(&z2);
+ init_var(&temp1);
+ init_var(&temp2);
+
+ split_var_at(var1, m2, &low1, &high1);
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var(&low1, &low2, &z0, low1.dscale + low2.dscale);
+
+ add_var(&low1, &high1, &temp1);
+ add_var(&low2, &high2, &temp2);
+ mul_var(&temp1, &temp2, &z1, temp1.dscale + temp2.dscale);
+
+ mul_var(&high1, &high2, &z2, high1.dscale + high2.dscale);
+
+ set_var_from_var(&z2, &temp1);
+ temp1.weight += m2 * 2;
+
+ sub_var(&z1, &z2, &z1);
+ sub_var(&z1, &z0, &temp2);
+ temp2.weight += m2;
+
+ add_var(&temp1, &temp2, &temp2);
+ add_var(&temp2, &z0, result);
+
+ free_var(&low1);
+ free_var(&low2);
+ free_var(&high1);
+ free_var(&high2);
+ free_var(&z0);
+ free_var(&z1);
+ free_var(&z2);
+ free_var(&temp1);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
+/*
+ * mul_var_karatsuba_half() -
+ *
+ * Karatsuba Multiplication for factors with significant length disparity.
+ *
+ * The Half-Karatsuba Multiplication Algorithm is a specialized case of
+ * the normal Karatsuba multiplication algorithm, designed for the scenario
+ * where var2 has at least twice as many base digits as var1.
+ *
+ * In this case var2 (the longer input) is split into high2 and low2
+ * at m2 (half the length of var2), and var1 (the shorter input)
+ * is used directly without splitting.
+ *
+ * The algorithm then proceeds as follows:
+ *
+ * 1. Compute the product z0 = var1 * low2.
+ * 2. Compute the product temp2 = var1 * high2.
+ * 3. Adjust the weight of temp2 by adding m2 (i.e. multiply by NBASE ^ m2).
+ * 4. Add temp2 and z0 to obtain the final result.
+ *
+ * Proof:
+ *
+ * The algorithm can be derived from the original Karatsuba algorithm by
+ * simplifying the formula when the shorter factor var1 is not split into
+ * high and low parts, as shown below.
+ *
+ * Original Karatsuba formula:
+ *
+ * result = (z2 * NBASE ^ (m2 * 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ *
+ * Substitutions:
+ *
+ * low1 = var1
+ * high1 = 0
+ *
+ * Applying substitutions:
+ *
+ * z0 = (low1 * low2)
+ * = (var1 * low2)
+ *
+ * z1 = ((low1 + high1) * (low2 + high2))
+ * = ((var1 + 0) * (low2 + high2))
+ * = (var1 * low2) + (var1 * high2)
+ *
+ * z2 = (high1 * high2)
+ * = (0 * high2)
+ * = 0
+ *
+ * Simplified using the above substitutions:
+ *
+ * result = (z2 * NBASE ^ (m2 * 2)) + ((z1 - z2 - z0) * NBASE ^ m2) + z0
+ * = (0 * NBASE ^ (m2 * 2)) + ((z1 - 0 - z0) * NBASE ^ m2) + z0
+ * = ((z1 - z0) * NBASE ^ m2) + z0
+ * = (var1 * high2) * NBASE ^ m2 + z0
+ */
+static void
+mul_var_karatsuba_half(const NumericVar *var1, const NumericVar *var2,
+ NumericVar *result, int rscale)
+{
+ NumericVar high2, low2;
+ NumericVar z0;
+ NumericVar temp2;
+ int m2 = var2->ndigits / 2;
+
+ init_var(&high2);
+ init_var(&low2);
+ init_var(&z0);
+ init_var(&temp2);
+
+ split_var_at(var2, m2, &low2, &high2);
+
+ mul_var(var1, &low2, &z0, var1->dscale + low2.dscale);
+ mul_var(var1, &high2, &temp2, var1->dscale + high2.dscale);
+ temp2.weight += m2;
+ add_var(&temp2, &z0, result);
+
+ free_var(&high2);
+ free_var(&low2);
+ free_var(&z0);
+ free_var(&temp2);
+
+ /* Round to target rscale (and set result->dscale) */
+ round_var(result, rscale);
+
+ /* Strip leading and trailing zeroes */
+ strip_var(result);
+
+ return;
+}
+
+
/*
* mul_var() -
*
@@ -8747,6 +9036,18 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
return;
}
+ /*
+ * Use the Karatsuba algorithm for sufficiently large factors.
+ */
+ if (KARATSUBA_CONDITION(var1ndigits, var2ndigits))
+ {
+ if (var1ndigits * 2 > var2ndigits)
+ mul_var_karatsuba_full(var1, var2, result, rscale);
+ else
+ mul_var_karatsuba_half(var1, var2, result, rscale);
+ return;
+ }
+
/*
* We do the arithmetic in an array "dig[]" of signed int's. Since
* INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
--
2.45.1
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson <joel@compiler.org> wrote:
The surprising realization here is that there are actually (var1ndigits, var2ndigits)
combinations where *only* doing mul_var_karatsuba_half() recursively
all the way down to schoolbook *is* a performance win,
even though we don't do any mul_var_karatsuba_full().
Indeed only mul_var_karatsuba_half() will be called with the inputs:
var1ndigits=1000 var2ndigits=10000
var1ndigits=1000 var2ndigits=5000
It will never call mul_var_karatsuba_full().
Surprisingly, this still gives a 13% speed-up on an Intel Core i9-14900K.
Hmm, I don't see any gains in that example. Logically, it is doing
slightly more work splitting up var2 and adding partial products.
However, it's possible that what you're seeing is a memory access
problem where, if var2 is too large, it won't fit in cache close to
the CPU, and since the schoolbook algorithm traverses var2 multiple
times, it ends up being quicker to split up var2. Since I have a
slower CPU, I'm more likely to be CPU-limited, while you might be
memory-limited.
This makes me think that this is always going to be very
hardware-dependent, and we shouldn't presume what will work best on
the user's hardware.
However, if a 1000x1000 ndigit product is known to be faster using
Karatsuba on some particular hardware (possibly nearly all hardware),
then why wouldn't it make sense to do the above as 10 invocations of
Karatsuba?
Regards,
Dean
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson <joel@compiler.org> wrote:
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
But why just split it into two pieces? That will just lead
to a lot of unnecessary recursion for very unbalanced inputs. Instead,
why not split the longer input into N roughly equal sized pieces, each
around the same length as the shorter input, multiplying and adding
them at the appropriate offsets?
The approach you're describing is implemented by e.g. CPython
and is called "lopsided" in their code base. It has some different
performance characteristics, compared to the recursive Half-Karatsuba
approach.
What I didn't like about lopsided is the degenerate case where the
last chunk is much shorter than var1. For example, if we pretend
we would be doing Karatsuba all the way down to ndigits 2,
and think about the example var1ndigits = 3 and var2ndigits = 10,
then lopsided would do
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=1
whereas Half-Karatsuba would do
var1ndigits=3 var2ndigits=5
var1ndigits=3 var2ndigits=5
You can find contrary examples too, of course, where lopsided
is better than Half-Karatsuba, none of them seem substantially better
than the other.
Actually, that wasn't quite what I was thinking of. The approach I've
seen (I can't remember where) is to split var2 into roughly
equal-sized chunks as follows:
If var2ndigits >= 2 * var1ndigits, split it into a number of chunks, where
nchunks = var2ndigits / var1ndigits
using integer division with truncation. Then call mul_var_chunks()
(say), passing it nchunks, to perform the multiplication using that
many chunks.
mul_var_chunks() would use nchunks to decide on the chunk size according to
chunk_size = var2ndigits / nchunks
again using integer division with truncation. That has a remainder in
the range [0, nchunks), which is divided up between the chunks so that
some of them end up being chunk_size + 1 digits. I.e., something like:
chunk_remainder = var2ndigits - chunk_size * nchunks
for (int start = 0, end = chunk_size;
start < var2ndigits;
start = end, end = start + chunk_size)
{
/* Distribute remainder over the first few chunks */
if (chunk_remainder > 0)
{
end++;
chunk_remainder--;
}
/* Process chunk between "start" and "end" */
}
What this does is adapt the shape of the chunks to the region where
the Karatsuba algorithm is most efficient. For example, suppose
var1ndigits = 1000. Then:
For 2000x1000 to 2999x1000 digit products, nchunks will be 2 and
chunk_size will be at most (2999/2)+1 = 1500.
For 3000x1000 to 3999x1000 digit products, nchunks will be 3 and
chunk_size will be at most (3999/3)+1 = 1334.
And so on.
The result is that all the chunks will fall into that region where
var2ndigits / var1ndigits is between 1 and 1.5, for which
KARATSUBA_LOW_RANGE_CONDITION() will almost certainly pass, and
Karatsuba will operate efficiently all the way down to
KARATSUBA_BASE_LIMIT.
Regards,
Dean