[PATCH] avoid double scanning in function byteain

Started by Steven Niu, 10 months ago, 16 messages

niushiji@gmail.com

10 months ago

1 attachment(s)

Hi,

The byteain function converts a string input into a bytea type.
The original implementation processes two input formats:
a hex format (starting with \x) and a traditional escaped format.
For the escaped format, the function scans the input string twice
— once to calculate the exact size of the output and allocate memory,
and again to fill the allocated memory with the parsed data.
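
For readers without the source at hand, the two-scan scheme described above can be sketched in standalone C roughly as follows. This is only an illustration under stated assumptions: it uses malloc() instead of palloc(), returns NULL instead of raising an error, and the names `is_octal_escape`, `escaped_length` and `decode_two_pass` are hypothetical, not taken from varlena.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: true if p points at a valid \ooo escape (000..377). */
static int is_octal_escape(const char *p)
{
    return p[0] == '\\' &&
           p[1] >= '0' && p[1] <= '3' &&
           p[2] >= '0' && p[2] <= '7' &&
           p[3] >= '0' && p[3] <= '7';
}

/* Scan 1: validate the input and count output bytes; -1 on bad syntax. */
static long escaped_length(const char *in)
{
    long n = 0;

    assert(in != NULL);
    while (*in != '\0')
    {
        if (in[0] != '\\')
            in += 1;                /* ordinary byte */
        else if (is_octal_escape(in))
            in += 4;                /* \ooo */
        else if (in[1] == '\\')
            in += 2;                /* doubled backslash */
        else
            return -1;              /* lone or malformed backslash */
        n++;
    }
    return n;
}

/* Scan 2: fill an exactly sized buffer. Returns NULL on bad input or OOM. */
static unsigned char *decode_two_pass(const char *in, size_t *out_len)
{
    long n = escaped_length(in);
    unsigned char *out, *rp;

    if (n < 0)
        return NULL;
    out = malloc(n > 0 ? (size_t) n : 1);
    if (out == NULL)
        return NULL;
    rp = out;
    while (*in != '\0')
    {
        if (in[0] != '\\')
            *rp++ = (unsigned char) *in++;
        else if (in[1] == '\\')
        {
            *rp++ = '\\';
            in += 2;
        }
        else
        {
            /* scan 1 already guaranteed that this is a \ooo escape */
            *rp++ = (unsigned char) (((in[1] - '0') << 6) |
                                     ((in[2] - '0') << 3) |
                                     (in[3] - '0'));
            in += 4;
        }
    }
    *out_len = (size_t) n;
    return out;
}
```

The first scan is what lets the second one allocate exactly the number of bytes it needs.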

This double scanning can be inefficient, especially for large inputs.
So I optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Please help review it and share your valuable comments.

Thanks,
Steven Niu
https://www.highgo.com/

Attachments:

0001-PATCH-Optimize-function-byteain-to-avoid-double-scan.patch (text/plain; charset=UTF-8)

From db0352fb7fa463bd7a02f73f29760d1400cef402 Mon Sep 17 00:00:00 2001
From: Steven Niu <niushiji@highgo.com>
Date: Wed, 26 Mar 2025 14:43:43 +0800
Subject: [PATCH] Optimize function byteain() to avoid double scanning

Optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Author: Steven Niu <niushiji@gmail.com>
---
 src/backend/utils/adt/varlena.c | 66 +++++++++++----------------------
 1 file changed, 22 insertions(+), 44 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 95631eb2099..de422cafbd5 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -291,7 +291,6 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
  *		ereport(ERROR, ...) if bad form.
  *
  *		BUGS:
- *				The input is scanned twice.
  *				The error checking of input is minimal.
  */
 Datum
@@ -302,6 +301,7 @@ byteain(PG_FUNCTION_ARGS)
 	char	   *tp;
 	char	   *rp;
 	int			bc;
+	size_t	   input_len;
 	bytea	   *result;
 
 	/* Recognize hex input */
@@ -318,45 +318,28 @@ byteain(PG_FUNCTION_ARGS)
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Else, it's the traditional escaped style */
-	for (bc = 0, tp = inputText; *tp != '\0'; bc++)
-	{
-		if (tp[0] != '\\')
-			tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
-			tp += 4;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-			tp += 2;
-		else
-		{
-			/*
-			 * one backslash, not followed by another or ### valid octal
-			 */
-			ereturn(escontext, (Datum) 0,
-					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-					 errmsg("invalid input syntax for type %s", "bytea")));
-		}
-	}
-
-	bc += VARHDRSZ;
-
-	result = (bytea *) palloc(bc);
-	SET_VARSIZE(result, bc);
-
-	tp = inputText;
+	/* Handle traditional escaped style in a single pass */
+	input_len = strlen(inputText);
+	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
 	rp = VARDATA(result);
+	tp = inputText;
+
 	while (*tp != '\0')
 	{
 		if (tp[0] != '\\')
+		{
 			*rp++ = *tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
+			continue;
+		}
+
+		if (tp[1] == '\\')
+		{
+			*rp++ = '\\';
+			tp += 2;
+		}
+		else if ((tp[1] >= '0' && tp[1] <= '3') && 
+			 (tp[2] >= '0' && tp[2] <= '7') && 
+			 (tp[3] >= '0' && tp[3] <= '7'))
 		{
 			bc = VAL(tp[1]);
 			bc <<= 3;
@@ -366,23 +349,18 @@ byteain(PG_FUNCTION_ARGS)
 
 			tp += 4;
 		}
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-		{
-			*rp++ = '\\';
-			tp += 2;
-		}
 		else
 		{
-			/*
-			 * We should never get here. The first pass should not allow it.
-			 */
+			/* Invalid escape sequence: report error */
 			ereturn(escontext, (Datum) 0,
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid input syntax for type %s", "bytea")));
 		}
 	}
 
+	/* Set the actual size of the bytea */
+	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+
 	PG_RETURN_BYTEA_P(result);
 }
 
-- 
2.43.0

Kirill Reshke

reshkekirill@gmail.com

10 months ago

In reply to: Steven Niu (#1)

Re: [PATCH] avoid double scanning in function byteain

On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:

> Hi,

Hi!

> This double scanning can be inefficient, especially for large inputs.
> So I optimized the function to eliminate the need for two scans,
> while preserving correctness and efficiency.

While it is generally true that processing the input once rather than
twice is faster, could we have a simple benchmark here, just to get an
idea of how valuable this patch is?

Patch:

+ /* Handle traditional escaped style in a single pass */
+ input_len = strlen(inputText);
+ result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
rp = VARDATA(result);
+ tp = inputText;
+
while (*tp != '\0')

Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
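
To make the question concrete, here is a standalone sketch of the shape the patch takes, assuming malloc() in place of palloc() and a NULL return in place of ereturn(); `decode_single_pass` is a hypothetical name. The input is still read twice. The first read is strlen(), which is typically a tight, well-optimized loop, and the second does validation and decoding together. Whether that is measurably faster than the old code would still need a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * strlen() is an upper bound on the output size, because every escape
 * sequence is longer than the single byte it decodes to.
 * Returns NULL on invalid input or allocation failure.
 */
static unsigned char *decode_single_pass(const char *in, size_t *out_len)
{
    size_t max_len;
    unsigned char *out, *rp;

    assert(in != NULL && out_len != NULL);

    max_len = strlen(in);           /* read 1: length only */
    out = malloc(max_len + 1);      /* +1 so empty input still allocates */
    if (out == NULL)
        return NULL;
    rp = out;

    while (*in != '\0')             /* read 2: validate and decode */
    {
        if (in[0] != '\\')
            *rp++ = (unsigned char) *in++;
        else if (in[1] == '\\')
        {
            *rp++ = '\\';
            in += 2;
        }
        else if (in[1] >= '0' && in[1] <= '3' &&
                 in[2] >= '0' && in[2] <= '7' &&
                 in[3] >= '0' && in[3] <= '7')
        {
            *rp++ = (unsigned char) (((in[1] - '0') << 6) |
                                     ((in[2] - '0') << 3) |
                                     (in[3] - '0'));
            in += 4;
        }
        else
        {
            free(out);              /* invalid escape sequence */
            return NULL;
        }
    }

    *out_len = (size_t) (rp - out); /* actual size, known only at the end */
    return out;
}
```

The trade-off is memory: for input made entirely of octal escapes, the buffer is four times larger than the decoded result.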

[0]: https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45

--
Best regards,
Kirill Reshke

Steven Niu

niushiji@gmail.com

10 months ago

In reply to: Kirill Reshke (#2)

Re: [PATCH] avoid double scanning in function byteain

On 2025/3/26 16:37, Kirill Reshke wrote:

> On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>
>> Hi,
>
> Hi!
>
>> This double scanning can be inefficient, especially for large inputs.
>> So I optimized the function to eliminate the need for two scans,
>> while preserving correctness and efficiency.
>
> While it is generally true that processing the input once rather than
> twice is faster, could we have a simple benchmark here, just to get an
> idea of how valuable this patch is?
>
> Patch:
> + /* Handle traditional escaped style in a single pass */
> + input_len = strlen(inputText);
> + result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
> rp = VARDATA(result);
> + tp = inputText;
> +
> while (*tp != '\0')
>
> Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
>
> [0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45

Hi, Kirill,

Your deep insight surprised me!

Yes, you are correct that strlen() itself performs a loop over the input,
so the performance difference may not be that noticeable.

However, there are some other reasons that drove me to make this change.

1. The author of the original code left the comment "BUGS: The input is
scanned twice." You can find this line in my patch. It indicates work
left to be done.

2. If I were the author of this function, I would not be satisfied with
myself for using two loops to do something that can actually be done
with one loop. I prefer an approach that does not add more burden for
readers.

3. The while (*tp != '\0') loop contains some unnecessary code, and I
changed that.

Thanks,
Steven

Stepan Neretin

slpmcf@gmail.com

8 months ago

In reply to: Steven Niu (#3)

2 attachment(s)

Re: [PATCH] avoid double scanning in function byteain

Hi hackers,

This is a revised version (v2) of the patch that optimizes the `byteain()`
function.

The original implementation handled escaped input by scanning the string
twice — first to determine the output size and again to fill in the bytea.
This patch eliminates the double scan by using a single-pass approach with
`StringInfo`, simplifying the logic and improving maintainability.
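
As a rough standalone illustration of the growable-buffer idea: `ByteBuf`, `bytebuf_append` and `decode_growing` below are hypothetical stand-ins written with malloc()/realloc(). They are not PostgreSQL's StringInfo API, and the sketch returns -1 where the real function would call ereturn().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a growable byte buffer. */
typedef struct ByteBuf
{
    unsigned char *data;
    size_t len;
    size_t cap;
} ByteBuf;

/* Append one byte, doubling the capacity when full. Returns -1 on OOM. */
static int bytebuf_append(ByteBuf *b, unsigned char c)
{
    if (b->len == b->cap)
    {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        unsigned char *new_data = realloc(b->data, new_cap);

        if (new_data == NULL)
            return -1;
        b->data = new_data;
        b->cap = new_cap;
    }
    b->data[b->len++] = c;
    return 0;
}

/*
 * Decode escaped input in one pass, with no strlen() and no size pre-scan.
 * Returns 0 on success (caller frees out->data), -1 on bad input or OOM.
 */
static int decode_growing(const char *in, ByteBuf *out)
{
    ByteBuf buf = {NULL, 0, 0};

    assert(in != NULL && out != NULL);
    while (*in != '\0')
    {
        unsigned char c;

        if (in[0] != '\\')
        {
            c = (unsigned char) *in;
            in += 1;
        }
        else if (in[1] == '\\')
        {
            c = '\\';
            in += 2;
        }
        else if (in[1] >= '0' && in[1] <= '3' &&
                 in[2] >= '0' && in[2] <= '7' &&
                 in[3] >= '0' && in[3] <= '7')
        {
            c = (unsigned char) (((in[1] - '0') << 6) |
                                 ((in[2] - '0') << 3) |
                                 (in[3] - '0'));
            in += 4;
        }
        else
        {
            free(buf.data);         /* invalid escape sequence */
            return -1;
        }

        if (bytebuf_append(&buf, c) != 0)
        {
            free(buf.data);
            return -1;
        }
    }
    *out = buf;
    return 0;
}
```

The input is read exactly once here, but the cost moves elsewhere: the buffer may be reallocated several times as it grows, and the attached patch additionally copies `buf.data` into a freshly allocated `bytea` at the end.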

Changes since v1 (originally by Steven Niu):
- Use `StringInfo` instead of manual memory allocation.
- Remove redundant code and improve readability.
- Add regression tests for both hex and escaped formats.

This version addresses performance and clarity while ensuring compatibility
with existing behavior. The patch also reflects discussion on the original
version, including feedback from Kirill Reshke.

Looking forward to your review and comments.

Best regards,
Stepan Neretin

Attachments:

0001-Optimize-function-byteain-to-avoid-double-scanning.patch (text/x-patch; charset=US-ASCII)

From b589d728b54de071b8d4383a3a51de5f7c2e2293 Mon Sep 17 00:00:00 2001
From: Steven Niu <niushiji@highgo.com>
Date: Wed, 26 Mar 2025 14:43:43 +0800
Subject: [PATCH v2 1/2] Optimize function byteain() to avoid double scanning

Optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Author: Steven Niu <niushiji@gmail.com>
---
 src/backend/utils/adt/varlena.c | 66 +++++++++++----------------------
 1 file changed, 22 insertions(+), 44 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 3e4d5568bde..f1f1efba053 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -291,7 +291,6 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
  *		ereport(ERROR, ...) if bad form.
  *
  *		BUGS:
- *				The input is scanned twice.
  *				The error checking of input is minimal.
  */
 Datum
@@ -302,6 +301,7 @@ byteain(PG_FUNCTION_ARGS)
 	char	   *tp;
 	char	   *rp;
 	int			bc;
+	size_t	   input_len;
 	bytea	   *result;
 
 	/* Recognize hex input */
@@ -318,45 +318,28 @@ byteain(PG_FUNCTION_ARGS)
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Else, it's the traditional escaped style */
-	for (bc = 0, tp = inputText; *tp != '\0'; bc++)
-	{
-		if (tp[0] != '\\')
-			tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
-			tp += 4;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-			tp += 2;
-		else
-		{
-			/*
-			 * one backslash, not followed by another or ### valid octal
-			 */
-			ereturn(escontext, (Datum) 0,
-					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-					 errmsg("invalid input syntax for type %s", "bytea")));
-		}
-	}
-
-	bc += VARHDRSZ;
-
-	result = (bytea *) palloc(bc);
-	SET_VARSIZE(result, bc);
-
-	tp = inputText;
+	/* Handle traditional escaped style in a single pass */
+	input_len = strlen(inputText);
+	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
 	rp = VARDATA(result);
+	tp = inputText;
+
 	while (*tp != '\0')
 	{
 		if (tp[0] != '\\')
+		{
 			*rp++ = *tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
+			continue;
+		}
+
+		if (tp[1] == '\\')
+		{
+			*rp++ = '\\';
+			tp += 2;
+		}
+		else if ((tp[1] >= '0' && tp[1] <= '3') && 
+			 (tp[2] >= '0' && tp[2] <= '7') && 
+			 (tp[3] >= '0' && tp[3] <= '7'))
 		{
 			bc = VAL(tp[1]);
 			bc <<= 3;
@@ -366,23 +349,18 @@ byteain(PG_FUNCTION_ARGS)
 
 			tp += 4;
 		}
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-		{
-			*rp++ = '\\';
-			tp += 2;
-		}
 		else
 		{
-			/*
-			 * We should never get here. The first pass should not allow it.
-			 */
+			/* Invalid escape sequence: report error */
 			ereturn(escontext, (Datum) 0,
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid input syntax for type %s", "bytea")));
 		}
 	}
 
+	/* Set the actual size of the bytea */
+	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+
 	PG_RETURN_BYTEA_P(result);
 }
 
-- 
2.43.0

0002-Refactor-byteain-to-avoid-double-scanning-and-simpli.patch (text/x-patch; charset=US-ASCII)

From e4bcfb637018737ae3e0c7761b3b62a49b7ae407 Mon Sep 17 00:00:00 2001
From: Stepan Neretin <s.neretin@postgrespro.ru>
Date: Fri, 9 May 2025 17:19:42 +0700
Subject: [PATCH v2 2/2] Refactor byteain() to avoid double scanning and
 simplify logic

This patch reworks the escaped input handling in byteain() by replacing
manual buffer management with a StringInfo-based single-pass parse.
It combines ideas from a previous proposal by Steven Niu with additional
improvements to structure and readability.

Also adds regression tests covering edge cases for both hex and escaped
formats.

Includes input from discussion with Kirill Reshke on pgsql-hackers.
---
 contrib/btree_gin/expected/bytea.out | 92 ++++++++++++++++++++++++++++
 contrib/btree_gin/sql/bytea.sql      | 37 +++++++++++
 src/backend/utils/adt/varlena.c      | 63 ++++++++-----------
 3 files changed, 155 insertions(+), 37 deletions(-)

diff --git a/contrib/btree_gin/expected/bytea.out b/contrib/btree_gin/expected/bytea.out
index b0ed7a53450..d4ad2878775 100644
--- a/contrib/btree_gin/expected/bytea.out
+++ b/contrib/btree_gin/expected/bytea.out
@@ -44,3 +44,95 @@ SELECT * FROM test_bytea WHERE i>'abc'::bytea ORDER BY i;
  xyz
 (2 rows)
 
+-- Simple ASCII strings
+SELECT encode(bytea(E'a'), 'hex');            -- 61
+ encode 
+--------
+ 61
+(1 row)
+
+SELECT encode(bytea(E'ab'), 'hex');           -- 6162
+ encode 
+--------
+ 6162
+(1 row)
+
+-- Octal escapes
+SELECT encode(bytea(E'\\000'), 'hex');        -- 00
+ encode 
+--------
+ 00
+(1 row)
+
+SELECT encode(bytea(E'\\001'), 'hex');        -- 01
+ encode 
+--------
+ 01
+(1 row)
+
+SELECT encode(bytea(E'\\001\\002\\003'), 'hex');  -- 010203
+ encode 
+--------
+ 010203
+(1 row)
+
+-- Mixed literal and escapes
+SELECT encode(bytea(E'a\\000b\\134c'), 'hex'); -- 6100625c63
+   encode   
+------------
+ 6100625c63
+(1 row)
+
+-- Backslash literal
+SELECT encode(bytea(E'\\\\'), 'hex');         -- 5c
+ encode 
+--------
+ 5c
+(1 row)
+
+-- Empty input
+SELECT encode(bytea(E''), 'hex');             -- (empty string)
+ encode 
+--------
+ 
+(1 row)
+
+-- Hex format
+SELECT encode(bytea(E'\\x6869'), 'escape');   -- hi
+ encode 
+--------
+ hi
+(1 row)
+
+-- ===== Invalid bytea input tests =====
+-- Invalid octal escapes (less than 3 digits or out of range)
+SELECT bytea(E'\\77');     -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\77');
+                     ^
+SELECT bytea(E'\\4');      -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\4');
+                     ^
+SELECT bytea(E'\\08');     -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\08');
+                     ^
+SELECT bytea(E'\\999');    -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\999');
+                     ^
+-- Invalid hex format
+SELECT bytea(E'\\x1');     -- ERROR
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: SELECT bytea(E'\\x1');
+                     ^
+SELECT bytea(E'\\xZZ');    -- ERROR
+ERROR:  invalid hexadecimal digit: "Z"
+LINE 1: SELECT bytea(E'\\xZZ');
+                     ^
+-- Incomplete escape sequence
+SELECT bytea(E'abc\\');    -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'abc\\');
+                     ^
diff --git a/contrib/btree_gin/sql/bytea.sql b/contrib/btree_gin/sql/bytea.sql
index 5f3eb11b169..cb8ee8eb2aa 100644
--- a/contrib/btree_gin/sql/bytea.sql
+++ b/contrib/btree_gin/sql/bytea.sql
@@ -15,3 +15,40 @@ SELECT * FROM test_bytea WHERE i<='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i>='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i>'abc'::bytea ORDER BY i;
+
+
+-- Simple ASCII strings
+SELECT encode(bytea(E'a'), 'hex');            -- 61
+SELECT encode(bytea(E'ab'), 'hex');           -- 6162
+
+-- Octal escapes
+SELECT encode(bytea(E'\\000'), 'hex');        -- 00
+SELECT encode(bytea(E'\\001'), 'hex');        -- 01
+SELECT encode(bytea(E'\\001\\002\\003'), 'hex');  -- 010203
+
+-- Mixed literal and escapes
+SELECT encode(bytea(E'a\\000b\\134c'), 'hex'); -- 6100625c63
+
+-- Backslash literal
+SELECT encode(bytea(E'\\\\'), 'hex');         -- 5c
+
+-- Empty input
+SELECT encode(bytea(E''), 'hex');             -- (empty string)
+
+-- Hex format
+SELECT encode(bytea(E'\\x6869'), 'escape');   -- hi
+
+-- ===== Invalid bytea input tests =====
+
+-- Invalid octal escapes (less than 3 digits or out of range)
+SELECT bytea(E'\\77');     -- ERROR
+SELECT bytea(E'\\4');      -- ERROR
+SELECT bytea(E'\\08');     -- ERROR
+SELECT bytea(E'\\999');    -- ERROR
+
+-- Invalid hex format
+SELECT bytea(E'\\x1');     -- ERROR
+SELECT bytea(E'\\xZZ');    -- ERROR
+
+-- Incomplete escape sequence
+SELECT bytea(E'abc\\');    -- ERROR
\ No newline at end of file
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index f1f1efba053..f84fd1dc644 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -296,70 +296,59 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
 Datum
 byteain(PG_FUNCTION_ARGS)
 {
-	char	   *inputText = PG_GETARG_CSTRING(0);
-	Node	   *escontext = fcinfo->context;
-	char	   *tp;
-	char	   *rp;
-	int			bc;
-	size_t	   input_len;
-	bytea	   *result;
+	char *inputText = PG_GETARG_CSTRING(0);
+	Node *escontext = fcinfo->context;
 
-	/* Recognize hex input */
+	/* Hex format */
 	if (inputText[0] == '\\' && inputText[1] == 'x')
 	{
-		size_t		len = strlen(inputText);
-
-		bc = (len - 2) / 2 + VARHDRSZ;	/* maximum possible length */
-		result = palloc(bc);
-		bc = hex_decode_safe(inputText + 2, len - 2, VARDATA(result),
-							 escontext);
-		SET_VARSIZE(result, bc + VARHDRSZ); /* actual length */
-
+		size_t len = strlen(inputText);
+		int bc = (len - 2) / 2 + VARHDRSZ;
+		bytea *result = palloc(bc);
+		bc = hex_decode_safe(inputText + 2, len - 2, VARDATA(result), escontext);
+		SET_VARSIZE(result, bc + VARHDRSZ);
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Handle traditional escaped style in a single pass */
-	input_len = strlen(inputText);
-	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
-	rp = VARDATA(result);
-	tp = inputText;
+	/* Escaped format */
+	StringInfoData buf;
+	initStringInfo(&buf);
+	char *tp = inputText;
 
-	while (*tp != '\0')
+	while (*tp)
 	{
-		if (tp[0] != '\\')
+		if (*tp != '\\')
 		{
-			*rp++ = *tp++;
+			appendStringInfoChar(&buf, *tp++);
 			continue;
 		}
 
 		if (tp[1] == '\\')
 		{
-			*rp++ = '\\';
+			appendStringInfoChar(&buf, '\\');
 			tp += 2;
 		}
-		else if ((tp[1] >= '0' && tp[1] <= '3') && 
-			 (tp[2] >= '0' && tp[2] <= '7') && 
-			 (tp[3] >= '0' && tp[3] <= '7'))
+		else if ((tp[1] >= '0' && tp[1] <= '3') &&
+				 (tp[2] >= '0' && tp[2] <= '7') &&
+				 (tp[3] >= '0' && tp[3] <= '7'))
 		{
-			bc = VAL(tp[1]);
-			bc <<= 3;
-			bc += VAL(tp[2]);
-			bc <<= 3;
-			*rp++ = bc + VAL(tp[3]);
-
+			int byte_val = VAL(tp[1]);
+			byte_val = (byte_val << 3) + VAL(tp[2]);
+			byte_val = (byte_val << 3) + VAL(tp[3]);
+			appendStringInfoChar(&buf, byte_val);
 			tp += 4;
 		}
 		else
 		{
-			/* Invalid escape sequence: report error */
 			ereturn(escontext, (Datum) 0,
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid input syntax for type %s", "bytea")));
 		}
 	}
 
-	/* Set the actual size of the bytea */
-	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+	bytea *result = palloc(buf.len + VARHDRSZ);
+	SET_VARSIZE(result, buf.len + VARHDRSZ);
+	memcpy(VARDATA(result), buf.data, buf.len);
 
 	PG_RETURN_BYTEA_P(result);
 }
-- 
2.43.0

Stepan Neretin

slpmcf@gmail.com

8 months ago

In reply to: Stepan Neretin (#4)

2 attachment(s)

Re: [PATCH] avoid double scanning in function byteain

Hi,

I noticed that the previous version of the patch was authored with an
incorrect email address due to a misconfigured git config.

I've corrected the author information in this v2 and made sure it's
consistent with my usual contributor identity. No other changes were
introduced apart from that and the updates discussed earlier.

Sorry for the confusion, and thanks for your understanding.

Best regards,

Stepan Neretin

Attachments:

0002-Refactor-byteain-to-avoid-double-scanning-and-simpli.patch (text/x-patch; charset=US-ASCII)

From 92c581fdd3081d8ac60ead2f2ffde823377efbbb Mon Sep 17 00:00:00 2001
From: Stepan Neretin <slpmcf@gmail.com>
Date: Fri, 9 May 2025 17:36:28 +0700
Subject: [PATCH v2 2/2] Refactor byteain() to avoid double scanning and
 simplify logic

This patch reworks the escaped input handling in byteain() by replacing
manual buffer management with a StringInfo-based single-pass parse.
It combines ideas from a previous proposal by Steven Niu with additional
improvements to structure and readability.

Also adds regression tests covering edge cases for both hex and escaped
formats.

Includes input from discussion with Kirill Reshke on pgsql-hackers.
---
 contrib/btree_gin/expected/bytea.out | 92 ++++++++++++++++++++++++++++
 contrib/btree_gin/sql/bytea.sql      | 37 +++++++++++
 src/backend/utils/adt/varlena.c      | 63 ++++++++-----------
 3 files changed, 155 insertions(+), 37 deletions(-)

diff --git a/contrib/btree_gin/expected/bytea.out b/contrib/btree_gin/expected/bytea.out
index b0ed7a53450..d4ad2878775 100644
--- a/contrib/btree_gin/expected/bytea.out
+++ b/contrib/btree_gin/expected/bytea.out
@@ -44,3 +44,95 @@ SELECT * FROM test_bytea WHERE i>'abc'::bytea ORDER BY i;
  xyz
 (2 rows)
 
+-- Simple ASCII strings
+SELECT encode(bytea(E'a'), 'hex');            -- 61
+ encode 
+--------
+ 61
+(1 row)
+
+SELECT encode(bytea(E'ab'), 'hex');           -- 6162
+ encode 
+--------
+ 6162
+(1 row)
+
+-- Octal escapes
+SELECT encode(bytea(E'\\000'), 'hex');        -- 00
+ encode 
+--------
+ 00
+(1 row)
+
+SELECT encode(bytea(E'\\001'), 'hex');        -- 01
+ encode 
+--------
+ 01
+(1 row)
+
+SELECT encode(bytea(E'\\001\\002\\003'), 'hex');  -- 010203
+ encode 
+--------
+ 010203
+(1 row)
+
+-- Mixed literal and escapes
+SELECT encode(bytea(E'a\\000b\\134c'), 'hex'); -- 6100625c63
+   encode   
+------------
+ 6100625c63
+(1 row)
+
+-- Backslash literal
+SELECT encode(bytea(E'\\\\'), 'hex');         -- 5c
+ encode 
+--------
+ 5c
+(1 row)
+
+-- Empty input
+SELECT encode(bytea(E''), 'hex');             -- (empty string)
+ encode 
+--------
+ 
+(1 row)
+
+-- Hex format
+SELECT encode(bytea(E'\\x6869'), 'escape');   -- hi
+ encode 
+--------
+ hi
+(1 row)
+
+-- ===== Invalid bytea input tests =====
+-- Invalid octal escapes (less than 3 digits or out of range)
+SELECT bytea(E'\\77');     -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\77');
+                     ^
+SELECT bytea(E'\\4');      -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\4');
+                     ^
+SELECT bytea(E'\\08');     -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\08');
+                     ^
+SELECT bytea(E'\\999');    -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\999');
+                     ^
+-- Invalid hex format
+SELECT bytea(E'\\x1');     -- ERROR
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: SELECT bytea(E'\\x1');
+                     ^
+SELECT bytea(E'\\xZZ');    -- ERROR
+ERROR:  invalid hexadecimal digit: "Z"
+LINE 1: SELECT bytea(E'\\xZZ');
+                     ^
+-- Incomplete escape sequence
+SELECT bytea(E'abc\\');    -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'abc\\');
+                     ^
diff --git a/contrib/btree_gin/sql/bytea.sql b/contrib/btree_gin/sql/bytea.sql
index 5f3eb11b169..cb8ee8eb2aa 100644
--- a/contrib/btree_gin/sql/bytea.sql
+++ b/contrib/btree_gin/sql/bytea.sql
@@ -15,3 +15,40 @@ SELECT * FROM test_bytea WHERE i<='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i>='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i>'abc'::bytea ORDER BY i;
+
+
+-- Simple ASCII strings
+SELECT encode(bytea(E'a'), 'hex');            -- 61
+SELECT encode(bytea(E'ab'), 'hex');           -- 6162
+
+-- Octal escapes
+SELECT encode(bytea(E'\\000'), 'hex');        -- 00
+SELECT encode(bytea(E'\\001'), 'hex');        -- 01
+SELECT encode(bytea(E'\\001\\002\\003'), 'hex');  -- 010203
+
+-- Mixed literal and escapes
+SELECT encode(bytea(E'a\\000b\\134c'), 'hex'); -- 6100625c63
+
+-- Backslash literal
+SELECT encode(bytea(E'\\\\'), 'hex');         -- 5c
+
+-- Empty input
+SELECT encode(bytea(E''), 'hex');             -- (empty string)
+
+-- Hex format
+SELECT encode(bytea(E'\\x6869'), 'escape');   -- hi
+
+-- ===== Invalid bytea input tests =====
+
+-- Invalid octal escapes (less than 3 digits or out of range)
+SELECT bytea(E'\\77');     -- ERROR
+SELECT bytea(E'\\4');      -- ERROR
+SELECT bytea(E'\\08');     -- ERROR
+SELECT bytea(E'\\999');    -- ERROR
+
+-- Invalid hex format
+SELECT bytea(E'\\x1');     -- ERROR
+SELECT bytea(E'\\xZZ');    -- ERROR
+
+-- Incomplete escape sequence
+SELECT bytea(E'abc\\');    -- ERROR
\ No newline at end of file
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index f1f1efba053..f84fd1dc644 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -296,70 +296,59 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
 Datum
 byteain(PG_FUNCTION_ARGS)
 {
-	char	   *inputText = PG_GETARG_CSTRING(0);
-	Node	   *escontext = fcinfo->context;
-	char	   *tp;
-	char	   *rp;
-	int			bc;
-	size_t	   input_len;
-	bytea	   *result;
+	char *inputText = PG_GETARG_CSTRING(0);
+	Node *escontext = fcinfo->context;
 
-	/* Recognize hex input */
+	/* Hex format */
 	if (inputText[0] == '\\' && inputText[1] == 'x')
 	{
-		size_t		len = strlen(inputText);
-
-		bc = (len - 2) / 2 + VARHDRSZ;	/* maximum possible length */
-		result = palloc(bc);
-		bc = hex_decode_safe(inputText + 2, len - 2, VARDATA(result),
-							 escontext);
-		SET_VARSIZE(result, bc + VARHDRSZ); /* actual length */
-
+		size_t len = strlen(inputText);
+		int bc = (len - 2) / 2 + VARHDRSZ;
+		bytea *result = palloc(bc);
+		bc = hex_decode_safe(inputText + 2, len - 2, VARDATA(result), escontext);
+		SET_VARSIZE(result, bc + VARHDRSZ);
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Handle traditional escaped style in a single pass */
-	input_len = strlen(inputText);
-	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
-	rp = VARDATA(result);
-	tp = inputText;
+	/* Escaped format */
+	StringInfoData buf;
+	initStringInfo(&buf);
+	char *tp = inputText;
 
-	while (*tp != '\0')
+	while (*tp)
 	{
-		if (tp[0] != '\\')
+		if (*tp != '\\')
 		{
-			*rp++ = *tp++;
+			appendStringInfoChar(&buf, *tp++);
 			continue;
 		}
 
 		if (tp[1] == '\\')
 		{
-			*rp++ = '\\';
+			appendStringInfoChar(&buf, '\\');
 			tp += 2;
 		}
-		else if ((tp[1] >= '0' && tp[1] <= '3') && 
-			 (tp[2] >= '0' && tp[2] <= '7') && 
-			 (tp[3] >= '0' && tp[3] <= '7'))
+		else if ((tp[1] >= '0' && tp[1] <= '3') &&
+				 (tp[2] >= '0' && tp[2] <= '7') &&
+				 (tp[3] >= '0' && tp[3] <= '7'))
 		{
-			bc = VAL(tp[1]);
-			bc <<= 3;
-			bc += VAL(tp[2]);
-			bc <<= 3;
-			*rp++ = bc + VAL(tp[3]);
-
+			int byte_val = VAL(tp[1]);
+			byte_val = (byte_val << 3) + VAL(tp[2]);
+			byte_val = (byte_val << 3) + VAL(tp[3]);
+			appendStringInfoChar(&buf, byte_val);
 			tp += 4;
 		}
 		else
 		{
-			/* Invalid escape sequence: report error */
 			ereturn(escontext, (Datum) 0,
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid input syntax for type %s", "bytea")));
 		}
 	}
 
-	/* Set the actual size of the bytea */
-	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+	bytea *result = palloc(buf.len + VARHDRSZ);
+	SET_VARSIZE(result, buf.len + VARHDRSZ);
+	memcpy(VARDATA(result), buf.data, buf.len);
 
 	PG_RETURN_BYTEA_P(result);
 }
-- 
2.43.0

0001-Optimize-function-byteain-to-avoid-double-scanning.patch (text/x-patch; charset=US-ASCII)

From b589d728b54de071b8d4383a3a51de5f7c2e2293 Mon Sep 17 00:00:00 2001
From: Steven Niu <niushiji@highgo.com>
Date: Wed, 26 Mar 2025 14:43:43 +0800
Subject: [PATCH v2 1/2] Optimize function byteain() to avoid double scanning

Optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Author: Steven Niu <niushiji@gmail.com>
---
 src/backend/utils/adt/varlena.c | 66 +++++++++++----------------------
 1 file changed, 22 insertions(+), 44 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 3e4d5568bde..f1f1efba053 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -291,7 +291,6 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
  *		ereport(ERROR, ...) if bad form.
  *
  *		BUGS:
- *				The input is scanned twice.
  *				The error checking of input is minimal.
  */
 Datum
@@ -302,6 +301,7 @@ byteain(PG_FUNCTION_ARGS)
 	char	   *tp;
 	char	   *rp;
 	int			bc;
+	size_t	   input_len;
 	bytea	   *result;
 
 	/* Recognize hex input */
@@ -318,45 +318,28 @@ byteain(PG_FUNCTION_ARGS)
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Else, it's the traditional escaped style */
-	for (bc = 0, tp = inputText; *tp != '\0'; bc++)
-	{
-		if (tp[0] != '\\')
-			tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
-			tp += 4;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-			tp += 2;
-		else
-		{
-			/*
-			 * one backslash, not followed by another or ### valid octal
-			 */
-			ereturn(escontext, (Datum) 0,
-					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-					 errmsg("invalid input syntax for type %s", "bytea")));
-		}
-	}
-
-	bc += VARHDRSZ;
-
-	result = (bytea *) palloc(bc);
-	SET_VARSIZE(result, bc);
-
-	tp = inputText;
+	/* Handle traditional escaped style in a single pass */
+	input_len = strlen(inputText);
+	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
 	rp = VARDATA(result);
+	tp = inputText;
+
 	while (*tp != '\0')
 	{
 		if (tp[0] != '\\')
+		{
 			*rp++ = *tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
+			continue;
+		}
+
+		if (tp[1] == '\\')
+		{
+			*rp++ = '\\';
+			tp += 2;
+		}
+		else if ((tp[1] >= '0' && tp[1] <= '3') && 
+			 (tp[2] >= '0' && tp[2] <= '7') && 
+			 (tp[3] >= '0' && tp[3] <= '7'))
 		{
 			bc = VAL(tp[1]);
 			bc <<= 3;
@@ -366,23 +349,18 @@ byteain(PG_FUNCTION_ARGS)
 
 			tp += 4;
 		}
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-		{
-			*rp++ = '\\';
-			tp += 2;
-		}
 		else
 		{
-			/*
-			 * We should never get here. The first pass should not allow it.
-			 */
+			/* Invalid escape sequence: report error */
 			ereturn(escontext, (Datum) 0,
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid input syntax for type %s", "bytea")));
 		}
 	}
 
+	/* Set the actual size of the bytea */
+	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+
 	PG_RETURN_BYTEA_P(result);
 }
 
-- 
2.43.0

Stepan Neretin

slpmcf@gmail.com

8 months ago

In reply to: Stepan Neretin (#5)

2 attachment(s)

Re: [PATCH] avoid double scanning in function byteain

On Fri, May 9, 2025 at 5:37 PM Stepan Neretin <slpmcf@gmail.com> wrote:

On Fri, May 9, 2025 at 5:24 PM Stepan Neretin <slpmcf@gmail.com> wrote:
On Wed, Mar 26, 2025 at 9:39 PM Steven Niu <niushiji@gmail.com> wrote:
On 2025/3/26 16:37, Kirill Reshke wrote:
On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:

Hi,

Hi!

This double scanning can be inefficient, especially for large inputs.
So I optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

While it is generally true that processing the input once rather than
twice is faster, could we have a simple benchmark here, just to get an
idea of how valuable this patch is?

Patch:
+ /* Handle traditional escaped style in a single pass */
+ input_len = strlen(inputText);
+ result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
rp = VARDATA(result);
+ tp = inputText;
+
while (*tp != '\0')
Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?

[0]
https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
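
To make the question concrete, here is a minimal standalone sketch of the single-pass approach from patch 0001, written in plain C rather than against the PostgreSQL API. The names decode_escaped and OCT_VAL are hypothetical, and malloc stands in for palloc. Note that strlen() is still one linear scan of the input; what goes away is the second run through the branchy escape-parsing loop, so any gain depends on how much cheaper strlen() is than that loop on a given platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Value of an octal digit character (hypothetical helper for this sketch). */
#define OCT_VAL(ch) ((ch) - '0')

/*
 * Decode escaped-format input into a freshly malloc'd buffer, parsing the
 * input only once. On success returns the buffer and stores its length in
 * *out_len; returns NULL on invalid input or allocation failure.
 */
static unsigned char *
decode_escaped(const char *in, size_t *out_len)
{
	unsigned char *out;
	unsigned char *rp;
	const char *tp = in;

	assert(in != NULL && out_len != NULL);

	/* Every escape shrinks the data, so strlen() is an upper bound. */
	out = malloc(strlen(in) + 1);	/* +1 so empty input still allocates */
	if (out == NULL)
		return NULL;
	rp = out;

	while (*tp != '\0')
	{
		if (tp[0] != '\\')
			*rp++ = (unsigned char) *tp++;
		else if (tp[1] == '\\')
		{
			*rp++ = '\\';
			tp += 2;
		}
		else if ((tp[1] >= '0' && tp[1] <= '3') &&
				 (tp[2] >= '0' && tp[2] <= '7') &&
				 (tp[3] >= '0' && tp[3] <= '7'))
		{
			*rp++ = (unsigned char) ((OCT_VAL(tp[1]) << 6) |
									 (OCT_VAL(tp[2]) << 3) |
									 OCT_VAL(tp[3]));
			tp += 4;
		}
		else
		{
			/* lone backslash or malformed octal escape */
			free(out);
			return NULL;
		}
	}

	*out_len = (size_t) (rp - out);
	return out;
}
```

The && chains short-circuit at the terminating NUL, so a trailing backslash or a truncated octal escape is rejected without reading past the end of the string.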

Hi, Kirill,

Your deep insight surprised me!

Yes, you are correct that strlen() itself loops over the input.
So the performance difference may not be that significant.

However, there are some other reasons that led me to make this change.

1. The author of the original code left the comment "BUGS: The input is
scanned twice."
You can find this line in my patch. It indicates work left to be done.

2. If I were the author of this function, I would not be satisfied with
having used two loops for something that can actually be done
with one.
I prefer an approach that does not add to the reader's burden.

3. The while (*tp != '\0') loop has some unnecessary code, which I
simplified.

Thanks,
Steven
Hi hackers,

This is a revised version (v2) of the patch that optimizes the
`byteain()` function.

The original implementation handled escaped input by scanning the string
twice — first to determine the output size and again to fill in the bytea.
This patch eliminates the double scan by using a single-pass approach with
`StringInfo`, simplifying the logic and improving maintainability.

Changes since v1 (originally by Steven Niu):
- Use `StringInfo` instead of manual memory allocation.
- Remove redundant code and improve readability.
- Add regression tests for both hex and escaped formats.

This version addresses performance and clarity while ensuring
compatibility with existing behavior. The patch also reflects discussion on
the original version, including feedback from Kirill Reshke.
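
For readers without the PostgreSQL sources at hand, the StringInfo-based variant described above can be approximated in plain C as follows. This is only a sketch under stated assumptions: GrowBuf, growbuf_append and decode_escaped_growing are hypothetical names, the doubling growth policy and the initial size of 16 are choices made for the example, and realloc stands in for PostgreSQL's memory management. The point it illustrates is that no size is computed up front, at the price of a capacity check on every appended byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; a stand-in for StringInfo in this sketch. */
typedef struct GrowBuf
{
	unsigned char *data;
	size_t		len;
	size_t		cap;
} GrowBuf;

/* Append one byte, doubling the capacity when full. Returns 0, or -1 on OOM. */
static int
growbuf_append(GrowBuf *b, unsigned char c)
{
	assert(b != NULL);
	if (b->len == b->cap)
	{
		size_t		newcap = (b->cap == 0) ? 16 : b->cap * 2;
		unsigned char *p = realloc(b->data, newcap);

		if (p == NULL)
			return -1;
		b->data = p;
		b->cap = newcap;
	}
	b->data[b->len++] = c;
	return 0;
}

/*
 * Decode escaped-format input into *b in one pass.
 * Returns 0 on success; on invalid input or OOM, frees the buffer and
 * returns -1.
 */
static int
decode_escaped_growing(const char *in, GrowBuf *b)
{
	const char *tp = in;

	assert(in != NULL && b != NULL);
	while (*tp != '\0')
	{
		int			byte_val;

		if (tp[0] != '\\')
		{
			byte_val = (unsigned char) *tp;
			tp += 1;
		}
		else if (tp[1] == '\\')
		{
			byte_val = '\\';
			tp += 2;
		}
		else if ((tp[1] >= '0' && tp[1] <= '3') &&
				 (tp[2] >= '0' && tp[2] <= '7') &&
				 (tp[3] >= '0' && tp[3] <= '7'))
		{
			byte_val = ((tp[1] - '0') << 6) |
				((tp[2] - '0') << 3) |
				(tp[3] - '0');
			tp += 4;
		}
		else
			byte_val = -1;		/* invalid escape sequence */

		if (byte_val < 0 || growbuf_append(b, (unsigned char) byte_val) != 0)
		{
			free(b->data);
			b->data = NULL;
			b->len = b->cap = 0;
			return -1;
		}
	}
	return 0;
}
```

A caller starts from a zeroed GrowBuf, calls decode_escaped_growing(), and frees data when done. In the actual patch the accumulated bytes are additionally copied into a separately allocated bytea at the end.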

Looking forward to your review and comments.

Best regards,
Stepan Neretin
Hi,

I noticed that the previous version of the patch was authored with an
incorrect email address due to a misconfigured git config.

I've corrected the author information in this v2 and made sure it's
consistent with my usual contributor identity. No other changes were
introduced apart from that and the updates discussed earlier.

Sorry for the confusion, and thanks for your understanding.

Best regards,

Stepan Neretin

Hi,

Sorry for the noise — I'm resending the patch because I noticed a compiler
warning related to mixed declarations and code, which I’ve now fixed.

Apologies for the oversight in the previous submission.

Regards,

Stepan Neretin

Attachments:

0001-Optimize-function-byteain-to-avoid-double-scanning.patch (text/x-patch; charset=US-ASCII)

From b589d728b54de071b8d4383a3a51de5f7c2e2293 Mon Sep 17 00:00:00 2001
From: Steven Niu <niushiji@highgo.com>
Date: Wed, 26 Mar 2025 14:43:43 +0800
Subject: [PATCH v2 1/2] Optimize function byteain() to avoid double scanning

Optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Author: Steven Niu <niushiji@gmail.com>
---
 src/backend/utils/adt/varlena.c | 66 +++++++++++----------------------
 1 file changed, 22 insertions(+), 44 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 3e4d5568bde..f1f1efba053 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -291,7 +291,6 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
  *		ereport(ERROR, ...) if bad form.
  *
  *		BUGS:
- *				The input is scanned twice.
  *				The error checking of input is minimal.
  */
 Datum
@@ -302,6 +301,7 @@ byteain(PG_FUNCTION_ARGS)
 	char	   *tp;
 	char	   *rp;
 	int			bc;
+	size_t	   input_len;
 	bytea	   *result;
 
 	/* Recognize hex input */
@@ -318,45 +318,28 @@ byteain(PG_FUNCTION_ARGS)
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Else, it's the traditional escaped style */
-	for (bc = 0, tp = inputText; *tp != '\0'; bc++)
-	{
-		if (tp[0] != '\\')
-			tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
-			tp += 4;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-			tp += 2;
-		else
-		{
-			/*
-			 * one backslash, not followed by another or ### valid octal
-			 */
-			ereturn(escontext, (Datum) 0,
-					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-					 errmsg("invalid input syntax for type %s", "bytea")));
-		}
-	}
-
-	bc += VARHDRSZ;
-
-	result = (bytea *) palloc(bc);
-	SET_VARSIZE(result, bc);
-
-	tp = inputText;
+	/* Handle traditional escaped style in a single pass */
+	input_len = strlen(inputText);
+	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
 	rp = VARDATA(result);
+	tp = inputText;
+
 	while (*tp != '\0')
 	{
 		if (tp[0] != '\\')
+		{
 			*rp++ = *tp++;
-		else if ((tp[0] == '\\') &&
-				 (tp[1] >= '0' && tp[1] <= '3') &&
-				 (tp[2] >= '0' && tp[2] <= '7') &&
-				 (tp[3] >= '0' && tp[3] <= '7'))
+			continue;
+		}
+
+		if (tp[1] == '\\')
+		{
+			*rp++ = '\\';
+			tp += 2;
+		}
+		else if ((tp[1] >= '0' && tp[1] <= '3') && 
+			 (tp[2] >= '0' && tp[2] <= '7') && 
+			 (tp[3] >= '0' && tp[3] <= '7'))
 		{
 			bc = VAL(tp[1]);
 			bc <<= 3;
@@ -366,23 +349,18 @@ byteain(PG_FUNCTION_ARGS)
 
 			tp += 4;
 		}
-		else if ((tp[0] == '\\') &&
-				 (tp[1] == '\\'))
-		{
-			*rp++ = '\\';
-			tp += 2;
-		}
 		else
 		{
-			/*
-			 * We should never get here. The first pass should not allow it.
-			 */
+			/* Invalid escape sequence: report error */
 			ereturn(escontext, (Datum) 0,
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid input syntax for type %s", "bytea")));
 		}
 	}
 
+	/* Set the actual size of the bytea */
+	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+
 	PG_RETURN_BYTEA_P(result);
 }
 
-- 
2.43.0

0002-Refactor-byteain-to-avoid-double-scanning-and-simpli.patch (text/x-patch; charset=US-ASCII)

From cd27078430955b545964837ec53b285bb7992e8f Mon Sep 17 00:00:00 2001
From: Stepan Neretin <slpmcf@gmail.com>
Date: Fri, 9 May 2025 17:36:28 +0700
Subject: [PATCH v2 2/2] Refactor byteain() to avoid double scanning and
 simplify logic

This patch reworks the escaped input handling in byteain() by replacing
manual buffer management with a StringInfo-based single-pass parse.
It combines ideas from a previous proposal by Steven Niu with additional
improvements to structure and readability.

Also adds regression tests covering edge cases for both hex and escaped
formats.

Includes input from discussion with Kirill Reshke on pgsql-hackers.
---
 contrib/btree_gin/expected/bytea.out |  92 ++++++++++++++++++++++++
 contrib/btree_gin/sql/bytea.sql      |  37 ++++++++++
 src/backend/utils/adt/varlena.c      | 100 +++++++++++++--------------
 3 files changed, 178 insertions(+), 51 deletions(-)

diff --git a/contrib/btree_gin/expected/bytea.out b/contrib/btree_gin/expected/bytea.out
index b0ed7a53450..d4ad2878775 100644
--- a/contrib/btree_gin/expected/bytea.out
+++ b/contrib/btree_gin/expected/bytea.out
@@ -44,3 +44,95 @@ SELECT * FROM test_bytea WHERE i>'abc'::bytea ORDER BY i;
  xyz
 (2 rows)
 
+-- Simple ASCII strings
+SELECT encode(bytea(E'a'), 'hex');            -- 61
+ encode 
+--------
+ 61
+(1 row)
+
+SELECT encode(bytea(E'ab'), 'hex');           -- 6162
+ encode 
+--------
+ 6162
+(1 row)
+
+-- Octal escapes
+SELECT encode(bytea(E'\\000'), 'hex');        -- 00
+ encode 
+--------
+ 00
+(1 row)
+
+SELECT encode(bytea(E'\\001'), 'hex');        -- 01
+ encode 
+--------
+ 01
+(1 row)
+
+SELECT encode(bytea(E'\\001\\002\\003'), 'hex');  -- 010203
+ encode 
+--------
+ 010203
+(1 row)
+
+-- Mixed literal and escapes
+SELECT encode(bytea(E'a\\000b\\134c'), 'hex'); -- 6100625c63
+   encode   
+------------
+ 6100625c63
+(1 row)
+
+-- Backslash literal
+SELECT encode(bytea(E'\\\\'), 'hex');         -- 5c
+ encode 
+--------
+ 5c
+(1 row)
+
+-- Empty input
+SELECT encode(bytea(E''), 'hex');             -- (empty string)
+ encode 
+--------
+ 
+(1 row)
+
+-- Hex format
+SELECT encode(bytea(E'\\x6869'), 'escape');   -- hi
+ encode 
+--------
+ hi
+(1 row)
+
+-- ===== Invalid bytea input tests =====
+-- Invalid octal escapes (less than 3 digits or out of range)
+SELECT bytea(E'\\77');     -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\77');
+                     ^
+SELECT bytea(E'\\4');      -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\4');
+                     ^
+SELECT bytea(E'\\08');     -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\08');
+                     ^
+SELECT bytea(E'\\999');    -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'\\999');
+                     ^
+-- Invalid hex format
+SELECT bytea(E'\\x1');     -- ERROR
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: SELECT bytea(E'\\x1');
+                     ^
+SELECT bytea(E'\\xZZ');    -- ERROR
+ERROR:  invalid hexadecimal digit: "Z"
+LINE 1: SELECT bytea(E'\\xZZ');
+                     ^
+-- Incomplete escape sequence
+SELECT bytea(E'abc\\');    -- ERROR
+ERROR:  invalid input syntax for type bytea
+LINE 1: SELECT bytea(E'abc\\');
+                     ^
diff --git a/contrib/btree_gin/sql/bytea.sql b/contrib/btree_gin/sql/bytea.sql
index 5f3eb11b169..cb8ee8eb2aa 100644
--- a/contrib/btree_gin/sql/bytea.sql
+++ b/contrib/btree_gin/sql/bytea.sql
@@ -15,3 +15,40 @@ SELECT * FROM test_bytea WHERE i<='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i>='abc'::bytea ORDER BY i;
 SELECT * FROM test_bytea WHERE i>'abc'::bytea ORDER BY i;
+
+
+-- Simple ASCII strings
+SELECT encode(bytea(E'a'), 'hex');            -- 61
+SELECT encode(bytea(E'ab'), 'hex');           -- 6162
+
+-- Octal escapes
+SELECT encode(bytea(E'\\000'), 'hex');        -- 00
+SELECT encode(bytea(E'\\001'), 'hex');        -- 01
+SELECT encode(bytea(E'\\001\\002\\003'), 'hex');  -- 010203
+
+-- Mixed literal and escapes
+SELECT encode(bytea(E'a\\000b\\134c'), 'hex'); -- 6100625c63
+
+-- Backslash literal
+SELECT encode(bytea(E'\\\\'), 'hex');         -- 5c
+
+-- Empty input
+SELECT encode(bytea(E''), 'hex');             -- (empty string)
+
+-- Hex format
+SELECT encode(bytea(E'\\x6869'), 'escape');   -- hi
+
+-- ===== Invalid bytea input tests =====
+
+-- Invalid octal escapes (less than 3 digits or out of range)
+SELECT bytea(E'\\77');     -- ERROR
+SELECT bytea(E'\\4');      -- ERROR
+SELECT bytea(E'\\08');     -- ERROR
+SELECT bytea(E'\\999');    -- ERROR
+
+-- Invalid hex format
+SELECT bytea(E'\\x1');     -- ERROR
+SELECT bytea(E'\\xZZ');    -- ERROR
+
+-- Incomplete escape sequence
+SELECT bytea(E'abc\\');    -- ERROR
\ No newline at end of file
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index f1f1efba053..517965445fe 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -296,72 +296,70 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
 Datum
 byteain(PG_FUNCTION_ARGS)
 {
-	char	   *inputText = PG_GETARG_CSTRING(0);
-	Node	   *escontext = fcinfo->context;
-	char	   *tp;
-	char	   *rp;
-	int			bc;
-	size_t	   input_len;
-	bytea	   *result;
+	char *inputText = PG_GETARG_CSTRING(0);
+	Node *escontext = fcinfo->context;
 
-	/* Recognize hex input */
+	/* Hex format */
 	if (inputText[0] == '\\' && inputText[1] == 'x')
 	{
-		size_t		len = strlen(inputText);
+		size_t len;
+		int bc;
+		bytea *result;
 
-		bc = (len - 2) / 2 + VARHDRSZ;	/* maximum possible length */
+		len = strlen(inputText);
+		bc = (len - 2) / 2 + VARHDRSZ;
 		result = palloc(bc);
-		bc = hex_decode_safe(inputText + 2, len - 2, VARDATA(result),
-							 escontext);
-		SET_VARSIZE(result, bc + VARHDRSZ); /* actual length */
-
+		bc = hex_decode_safe(inputText + 2, len - 2, VARDATA(result), escontext);
+		SET_VARSIZE(result, bc + VARHDRSZ);
 		PG_RETURN_BYTEA_P(result);
 	}
 
-	/* Handle traditional escaped style in a single pass */
-	input_len = strlen(inputText);
-	result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
-	rp = VARDATA(result);
-	tp = inputText;
-
-	while (*tp != '\0')
+	/* Escaped format */
 	{
-		if (tp[0] != '\\')
-		{
-			*rp++ = *tp++;
-			continue;
-		}
+		StringInfoData buf;
+		char *tp;
+		bytea *result;
 
-		if (tp[1] == '\\')
-		{
-			*rp++ = '\\';
-			tp += 2;
-		}
-		else if ((tp[1] >= '0' && tp[1] <= '3') && 
-			 (tp[2] >= '0' && tp[2] <= '7') && 
-			 (tp[3] >= '0' && tp[3] <= '7'))
-		{
-			bc = VAL(tp[1]);
-			bc <<= 3;
-			bc += VAL(tp[2]);
-			bc <<= 3;
-			*rp++ = bc + VAL(tp[3]);
+		initStringInfo(&buf);
+		tp = inputText;
 
-			tp += 4;
-		}
-		else
+		while (*tp)
 		{
-			/* Invalid escape sequence: report error */
-			ereturn(escontext, (Datum) 0,
-					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-					 errmsg("invalid input syntax for type %s", "bytea")));
+			if (*tp != '\\')
+			{
+				appendStringInfoChar(&buf, *tp++);
+				continue;
+			}
+
+			if (tp[1] == '\\')
+			{
+				appendStringInfoChar(&buf, '\\');
+				tp += 2;
+			}
+			else if ((tp[1] >= '0' && tp[1] <= '3') &&
+					 (tp[2] >= '0' && tp[2] <= '7') &&
+					 (tp[3] >= '0' && tp[3] <= '7'))
+			{
+				int byte_val = VAL(tp[1]);
+				byte_val = (byte_val << 3) + VAL(tp[2]);
+				byte_val = (byte_val << 3) + VAL(tp[3]);
+				appendStringInfoChar(&buf, byte_val);
+				tp += 4;
+			}
+			else
+			{
+				ereturn(escontext, (Datum) 0,
+						(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+						 errmsg("invalid input syntax for type %s", "bytea")));
+			}
 		}
-	}
 
-	/* Set the actual size of the bytea */
-	SET_VARSIZE(result, (rp - VARDATA(result)) + VARHDRSZ);
+		result = palloc(buf.len + VARHDRSZ);
+		SET_VARSIZE(result, buf.len + VARHDRSZ);
+		memcpy(VARDATA(result), buf.data, buf.len);
 
-	PG_RETURN_BYTEA_P(result);
+		PG_RETURN_BYTEA_P(result);
+	}
 }
 
 /*
-- 
2.43.0

Aleksander Alekseev

aleksander@timescale.com

8 months ago

In reply to: Stepan Neretin (#6)

Re: [PATCH] avoid double scanning in function byteain

Hi Stepan,

Sorry for the noise — I'm resending the patch because I noticed a compiler warning related to mixed declarations and code, which I’ve now fixed.

Apologies for the oversight in the previous submission.

Thanks for the patch.

As Kirill pointed out above, it would be nice if you could prove that
your implementation is actually faster. I think something like a
simple micro-benchmark will do.

--
Best regards,
Aleksander Alekseev

Stepan Neretin

slpmcf@gmail.com

8 months ago

In reply to: Aleksander Alekseev (#7)

Re: [PATCH] avoid double scanning in function byteain

On Fri, May 9, 2025 at 7:43 PM Aleksander Alekseev <aleksander@timescale.com>
wrote:

Hi Stepan,

Sorry for the noise - I'm resending the patch because I noticed a
compiler warning related to mixed declarations and code, which I’ve now
fixed.

Apologies for the oversight in the previous submission.

Thanks for the patch.

As Kirill pointed out above, it would be nice if you could prove that
your implementation is actually faster. I think something like a
simple micro-benchmark will do.

--
Best regards,
Aleksander Alekseev

Hi,

Thanks for the feedback.

I’ve done a simple micro-benchmark using PL/pgSQL with a large escaped
input string (\\123 repeated 100,000 times), converted to bytea in a loop:

DO $$
DECLARE
    start_time TIMESTAMP;
    end_time TIMESTAMP;
    i INTEGER;
    dummy BYTEA;
    input TEXT := repeat(E'\\123', 100000);
    elapsed_ms DOUBLE PRECISION;
BEGIN
    start_time := clock_timestamp();

    FOR i IN 1..1000 LOOP
        dummy := input::bytea;
    END LOOP;

    end_time := clock_timestamp();
    elapsed_ms := EXTRACT(EPOCH FROM end_time - start_time) * 1000;
    RAISE NOTICE 'Average time per call: % ms', elapsed_ms / 1000;
END
$$;

Here are the results from NOTICE output:

*Without patch:*

NOTICE: Average time per call: 0.49176600000000004 ms
NOTICE: Average time per call: 0.47658999999999996 ms

*With patch:*

NOTICE: Average time per call: 0.468231 ms
NOTICE: Average time per call: 0.463909 ms

The gain is small but consistent. Let me know if a more rigorous benchmark
would be useful.
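
The same measurement pattern (build one large input, call the function under test in a loop, divide the elapsed time by the iteration count) can be reproduced outside the server to isolate the parsing cost from PL/pgSQL and cast overhead. The following is a hedged sketch, not the benchmark used in this thread: repeat_str, workload_fn, count_backslashes and avg_ms_per_call are hypothetical names, count_backslashes is only a placeholder for whichever decoder is being timed, and clock() reports CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Build a string made of `unit` repeated `times` times; caller frees it. */
static char *
repeat_str(const char *unit, size_t times)
{
	size_t		ulen;
	char	   *s;

	assert(unit != NULL);
	ulen = strlen(unit);
	s = malloc(ulen * times + 1);
	if (s == NULL)
		return NULL;
	for (size_t i = 0; i < times; i++)
		memcpy(s + i * ulen, unit, ulen);
	s[ulen * times] = '\0';
	return s;
}

/* Signature of the function being timed (an assumption of this sketch). */
typedef size_t (*workload_fn) (const char *input);

/* Placeholder workload: counts backslashes. Swap in the real decoder. */
static size_t
count_backslashes(const char *input)
{
	size_t		n = 0;

	for (; *input != '\0'; input++)
		if (*input == '\\')
			n++;
	return n;
}

/*
 * Call fn(input) `iters` times and return the average CPU time per call in
 * milliseconds. The summed results are stored in *sink so that the compiler
 * cannot discard the calls as dead code.
 */
static double
avg_ms_per_call(workload_fn fn, const char *input, int iters, size_t *sink)
{
	clock_t		start;
	clock_t		end;
	size_t		total = 0;

	assert(fn != NULL && input != NULL && sink != NULL && iters > 0);
	start = clock();
	for (int i = 0; i < iters; i++)
		total += fn(input);
	end = clock();

	*sink = total;
	return ((double) (end - start) / CLOCKS_PER_SEC) * 1000.0 / iters;
}
```

To mirror the PL/pgSQL test, one would call repeat_str("\\123", 100000) once and pass the result to avg_ms_per_call() with 1000 iterations, once per decoder variant.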

Best regards,
Stepan Neretin

Peter Eisentraut

peter@eisentraut.org

8 months ago

In reply to: Stepan Neretin (#6)

Re: [PATCH] avoid double scanning in function byteain

The relationship between patch 0001 and 0002 is unclear to me. Are
these incremental or alternatives? The description doesn't make this clear.

Some of the changes in patch 0002 just appear to move code and comments
around without changing anything substantial. It's not clear why that
is done, as it's not related to what the patch claims it does.

The main tests for the bytea type input formats are in
src/test/regress/sql/strings.sql, so you should add any new tests there.
Maybe there are already enough tests there that you don't need any new
ones.

Overall, I would consider the bytea "escaped" format kind of
obsolescent. But if you want to make it a bit faster with little other
impact, why not.

#10

Tom Lane

tgl@sss.pgh.pa.us

6 months ago

In reply to: Peter Eisentraut (#9)

Re: [PATCH] avoid double scanning in function byteain

Peter Eisentraut <peter@eisentraut.org> writes:

The relationship between patch 0001 and 0002 is unclear to me. Are
these incremental or alternatives? The description doesn't make this clear.

It appears to me that 0002 is actually counterproductive. I cannot
see a reason to get a StringInfo involved here: it adds overhead
and removes no complexity worth noticing. If it were hard to get
a close-enough upper bound for the output length, then yeah a
StringInfo could be a good solution. But the "strlen(inputText)"
proposed in 0001 seems plenty good enough, especially since as you
say this is a somewhat obsolescent format. The fact that it would
often overallocate somewhat doesn't bother me --- and a StringInfo
would in most cases overallocate by a lot more.
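
To put rough numbers on the overallocation argument, the sketch below compares the two strategies for a given input and output length. It rests on assumptions that should be checked against the sources: the growable buffer is modelled as starting at 1024 bytes, doubling when full, and always keeping room for a trailing NUL, which is my understanding of StringInfo but is not taken from this thread. The function names are hypothetical, and the VARHDRSZ header is ignored on both sides.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest capacity reachable by doubling from `initial` that holds `needed`. */
static size_t
doubling_capacity(size_t needed, size_t initial)
{
	size_t		cap = initial;

	assert(initial > 0);
	while (cap < needed)
		cap *= 2;
	return cap;
}

/* Unused bytes when the output buffer is sized as strlen(input). */
static size_t
waste_strlen_bound(size_t input_len, size_t output_len)
{
	assert(output_len <= input_len);
	return input_len - output_len;
}

/* Unused bytes in a doubling buffer after appending output_len bytes. */
static size_t
waste_doubling(size_t output_len, size_t initial)
{
	/* +1 models the room kept for a trailing NUL (assumption) */
	return doubling_capacity(output_len + 1, initial) - output_len;
}
```

Under these assumptions a three-byte literal such as 'abc' wastes nothing with the strlen() bound but 1021 bytes in the growable buffer, which is the short-input case the paragraph above has in mind. For the benchmark input (400000 characters decoding to 100000 bytes) the picture flips, 300000 unused bytes against 31072, but the strlen() bound never wastes more than three quarters of the input length and needs neither reallocation nor a final copy.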

I'm inclined to accept 0001, reject 0002, and move on. This doesn't
seem like an area that's worth a huge amount of discussion.

The main tests for the bytea type input formats are in
src/test/regress/sql/strings.sql, so you should add any new tests there.
Maybe there are already enough tests there that you don't need any new
ones.

The code coverage report shows that byteain is covered except for the
path handling "\\". I'd be content to add one test query, or extend
some existing query, to make that branch get hit.

BTW, the patch needs rebasing because this code just got moved
to bytea.c.

regards, tom lane

#11

Tom Lane

tgl@sss.pgh.pa.us

6 months ago

In reply to: Tom Lane (#10)

Re: [PATCH] avoid double scanning in function byteain

I wrote:

I'm inclined to accept 0001, reject 0002, and move on. This doesn't
seem like an area that's worth a huge amount of discussion.

Done that way. I made a couple more cosmetic changes and added
test cases for the double-backslash code path (which hadn't been
covered in byteaout either, I see now).

BTW, in my hands the microbenchmark that Stepan suggested shows the
committed version to be about 40% faster than before for long input.
So apparently the StringInfo-ification suggested in 0002 gave back
just about all the performance gain from 0001.

regards, tom lane

#12

Stepan Neretin

slpmcf@gmail.com

6 months ago

In reply to: Tom Lane (#11)

Re: [PATCH] avoid double scanning in function byteain

On Sat, Jul 19, 2025 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I'm inclined to accept 0001, reject 0002, and move on. This doesn't
seem like an area that's worth a huge amount of discussion.

Done that way. I made a couple more cosmetic changes and added
test cases for the double-backslash code path (which hadn't been
covered in byteaout either, I see now).

BTW, in my hands the microbenchmark that Stepan suggested shows the
committed version to be about 40% faster than before for long input.
So apparently the StringInfo-ification suggested in 0002 gave back
just about all the performance gain from 0001.

regards, tom lane

Hi Tom,

Thanks a lot for reviewing and committing the change — much appreciated!

I agree with your rationale regarding patch 0001 vs 0002. It makes sense to
avoid the overhead of StringInfo in this context, especially given the
measurable performance benefit from the simpler approach.

One small thing: it seems the commit or diff with the final adjustments and
test additions wasn't attached or linked in the thread. Could you please
point me to the commit hash or reference? I’d love to take a look at the
final version.

Best regards,
*Stepan Neretin*

#13

David G. Johnston

david.g.johnston@gmail.com

6 months ago

In reply to: Stepan Neretin (#12)

Re: [PATCH] avoid double scanning in function byteain

On Sunday, July 27, 2025, Stepan Neretin <slpmcf@gmail.com> wrote:

One small thing: it seems the commit or diff with the final adjustments
and test additions wasn't attached or linked in the thread. Could you
please point me to the commit hash or reference? I’d love to take a look at
the final version.

/messages/by-id/E1ucruM-006yYH-2A@gemulon.postgresql.org

The pgsql-committers list is searchable. All commits get sent there.

https://www.postgresql.org/search/?m=1&q=&l=16&d=31&s=d

David J.

#14

Stepan Neretin

slpmcf@gmail.com

6 months ago

In reply to: David G. Johnston (#13)

Re: [PATCH] avoid double scanning in function byteain

On Mon, Jul 28, 2025 at 1:41 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

On Sunday, July 27, 2025, Stepan Neretin <slpmcf@gmail.com> wrote:

One small thing: it seems the commit or diff with the final adjustments
and test additions wasn't attached or linked in the thread. Could you
please point me to the commit hash or reference? I’d love to take a look at
the final version.

/messages/by-id/E1ucruM-006yYH-2A@gemulon.postgresql.org

The pgsql-committers list is searchable. All commits get sent there.

https://www.postgresql.org/search/?m=1&q=&l=16&d=31&s=d

David J.

Hi David,

Yes, I'm aware of the pgsql-committers archive and I did check there —
thank you for the reminder!

However, I couldn’t find the patch or final diff in either the
pgsql-committers message you linked or as an attachment in the original
thread.

Best regards,
Stepan

#15

David G. Johnston

david.g.johnston@gmail.com

6 months ago

In reply to: Stepan Neretin (#14)

On Sunday, July 27, 2025, Stepan Neretin <slpmcf@gmail.com> wrote:

However, I couldn’t find the patch or final diff in either the
pgsql-committers message you linked or as an attachment in the original
thread.

There is a gitweb link included in the Details section. Click that. Or
just read off the first 8 characters of the commit hash in that link and
plug it into your git client.

David J.

#16

Tom Lane

tgl@sss.pgh.pa.us

6 months ago

In reply to: Stepan Neretin (#14)

Re: [PATCH] avoid double scanning in function byteain

Stepan Neretin <slpmcf@gmail.com> writes:

However, I couldn’t find the patch or final diff in either the
pgsql-committers message you linked or as an attachment in the original
thread.

Commit is here:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3683af617

The "patch" link on such pages is good if you want a patch you can
apply locally rather than a colorized version.

The "details" link in the email that David pointed you to would also
have gotten you there.

regards, tom lane