[Proposal] Adding TRIM_SPACE option to COPY

Started by 河田達也about 2 months ago7 messages
#1河田達也
kawatatatsuya0913@gmail.com
1 attachment(s)

Hi,

I'd like to propose adding a new option, TRIM_SPACE, to the COPY command.

Other data warehouse systems such as Snowflake provide similar functionality
(TRIM_SPACE) to improve robustness when loading CSV data. PostgreSQL does
not currently have
such an option, although it would be consistent with other user-friendly
features already present in COPY (e.g., FORCE_NULL, FORCE_NOT_NULL,
ON_ERROR).

Proposed feature
----------------
Add a boolean option:

TRIM_SPACE = true | false
(default: false)

When enabled, COPY FROM with FORMAT text or csv will trim leading and
trailing
ASCII whitespace from each column value before NULL processing and type
conversion. This is applied only to FORMAT text and csv.(not binary)

Example usage:

COPY mytable
FROM '/tmp/data.csv'
WITH (FORMAT csv, HEADER true, TRIM_SPACE true);

This would transform(trim leading and trailing ASCII whitespace):
" AAA ", " BBB", "CCC "

into:
'AAA', 'BBB', 'CCC'
----------------

Thanks in advance for your comments.

Best regards,
Tatsuya Kawata

Attachments:

v1-0001-Adding-TRIM_SPACE-option-to-COPY.patchapplication/octet-stream; name=v1-0001-Adding-TRIM_SPACE-option-to-COPY.patchDownload
From d2d530680a895235c6edaadf48ae67a08c64dfe5 Mon Sep 17 00:00:00 2001
From: TatsuyaKawata <kawatatatsuya0913@gmail.com>
Date: Tue, 25 Nov 2025 01:00:42 +0900
Subject: [PATCH v1] Adding TRIM_SPACE option to COPY

---
 src/backend/commands/copy.c          | 13 +++++++
 src/backend/commands/copyfromparse.c | 56 ++++++++++++++++++++++++++--
 src/include/commands/copy.h          |  1 +
 3 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 28e878c3688..be1b3981ca0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -557,6 +557,7 @@ ProcessCopyOptions(ParseState *pstate,
 	bool		on_error_specified = false;
 	bool		log_verbosity_specified = false;
 	bool		reject_limit_specified = false;
+	bool		trim_space_specified = false;
 	ListCell   *option;
 
 	/* Support external use for option sanity checking */
@@ -730,6 +731,13 @@ ProcessCopyOptions(ParseState *pstate,
 			reject_limit_specified = true;
 			opts_out->reject_limit = defGetCopyRejectLimitOption(defel);
 		}
+		else if (strcmp(defel->defname, "trim_space") == 0)
+		{
+			if (trim_space_specified)
+				errorConflictingDefElem(defel, pstate);
+			trim_space_specified = true;
+			opts_out->trim_space = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -758,6 +766,11 @@ ProcessCopyOptions(ParseState *pstate,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("cannot specify %s in BINARY mode", "DEFAULT")));
 
+	if (opts_out->binary && opts_out->trim_space)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("cannot specify %s in BINARY mode", "TRIM_SPACE")));
+
 	/* Set defaults for omitted options */
 	if (!opts_out->delim)
 		opts_out->delim = opts_out->csv_mode ? "," : "\t";
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index a09e7fbace3..9417c1f9e21 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1798,6 +1798,30 @@ CopyReadAttributesText(CopyFromState cstate)
 
 				pg_verifymbstr(fld, output_ptr - fld, false);
 			}
+
+			/* Apply TRIM_SPACE if requested */
+			if (cstate->opts.trim_space)
+			{
+				char	   *fld = cstate->raw_fields[fieldno];
+				char	   *start = fld;
+				char	   *end = output_ptr;
+
+				/* Trim leading spaces */
+				while (start < end && *start == ' ')
+					start++;
+
+				/* Trim trailing spaces */
+				while (end > start && *(end - 1) == ' ')
+					end--;
+
+				/* Move trimmed string to the beginning if needed */
+				if (start > fld)
+				{
+					memmove(fld, start, end - start);
+				}
+
+				output_ptr = fld + (end - start);
+			}
 		}
 
 		/* Terminate attribute value in output area */
@@ -1965,9 +1989,6 @@ CopyReadAttributesCSV(CopyFromState cstate)
 		}
 endfield:
 
-		/* Terminate attribute value in output area */
-		*output_ptr++ = '\0';
-
 		/* Check whether raw input matched null marker */
 		input_len = end_ptr - start_ptr;
 		if (!saw_quote && input_len == cstate->opts.null_print_len &&
@@ -1999,6 +2020,35 @@ endfield:
 								   NameStr(att->attname))));
 			}
 		}
+		else
+		{
+			/* Apply TRIM_SPACE if requested */
+			if (cstate->opts.trim_space)
+			{
+				char	   *fld = cstate->raw_fields[fieldno];
+				char	   *start = fld;
+				char	   *end = output_ptr;
+
+				/* Trim leading spaces */
+				while (start < end && *start == ' ')
+					start++;
+
+				/* Trim trailing spaces */
+				while (end > start && *(end - 1) == ' ')
+					end--;
+
+				/* Move trimmed string to the beginning if needed */
+				if (start > fld)
+				{
+					memmove(fld, start, end - start);
+				}
+
+				output_ptr = fld + (end - start);
+			}
+		}
+
+		/* Terminate attribute value in output area */
+		*output_ptr++ = '\0';
 
 		fieldno++;
 		/* Done if we hit EOL instead of a delim */
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 541176e1980..1769a193789 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -85,6 +85,7 @@ typedef struct CopyFormatOptions
 	CopyLogVerbosityChoice log_verbosity;	/* verbosity of logged messages */
 	int64		reject_limit;	/* maximum tolerable number of errors */
 	List	   *convert_select; /* list of column names (can be NIL) */
+	bool		trim_space;		/* trim leading/trailing spaces? */
 } CopyFormatOptions;
 
 /* These are private in commands/copy[from|to].c */
-- 
2.34.1

#2daidewei1970@163.com
daidewei1970@163.com
In reply to: 河田達也 (#1)
Re: [Proposal] Adding TRIM_SPACE option to COPY

HI,

This change seems very useful, I participated in a related project before. After reviewing the patch,
I have a suggestion, would it be better to restrict the change to only "copy from"?

daidewei1970@163.com

From: 河田達也
Date: 2025-11-25 00:17
To: pgsql-hackers
Subject: [Proposal] Adding TRIM_SPACE option to COPY
Hi,

I'd like to propose adding a new option, TRIM_SPACE, to the COPY command.

Other data warehouse systems such as Snowflake provide similar functionality
(TRIM_SPACE) to improve robustness when loading CSV data. PostgreSQL does not currently have
such an option, although it would be consistent with other user-friendly
features already present in COPY (e.g., FORCE_NULL, FORCE_NOT_NULL, ON_ERROR).

Proposed feature
----------------
Add a boolean option:

TRIM_SPACE = true | false
(default: false)

When enabled, COPY FROM with FORMAT text or csv will trim leading and trailing
ASCII whitespace from each column value before NULL processing and type
conversion. This is applied only to FORMAT text and csv.(not binary)

Example usage:

COPY mytable
FROM '/tmp/data.csv'
WITH (FORMAT csv, HEADER true, TRIM_SPACE true);

This would transform(trim leading and trailing ASCII whitespace):
" AAA ", " BBB", "CCC "

into:
'AAA', 'BBB', 'CCC'
----------------

Thanks in advance for your comments.

Best regards,
Tatsuya Kawata

#3Fujii Masao
masao.fujii@gmail.com
In reply to: 河田達也 (#1)
Re: [Proposal] Adding TRIM_SPACE option to COPY

On Tue, Nov 25, 2025 at 1:18 AM 河田達也 <kawatatatsuya0913@gmail.com> wrote:

Hi,

I'd like to propose adding a new option, TRIM_SPACE, to the COPY command.

Other data warehouse systems such as Snowflake provide similar functionality
(TRIM_SPACE) to improve robustness when loading CSV data. PostgreSQL does not currently have
such an option, although it would be consistent with other user-friendly
features already present in COPY (e.g., FORCE_NULL, FORCE_NOT_NULL, ON_ERROR).

Proposed feature
----------------
Add a boolean option:

TRIM_SPACE = true | false
(default: false)

When enabled, COPY FROM with FORMAT text or csv will trim leading and trailing
ASCII whitespace from each column value before NULL processing and type
conversion. This is applied only to FORMAT text and csv.(not binary)

Example usage:

COPY mytable
FROM '/tmp/data.csv'
WITH (FORMAT csv, HEADER true, TRIM_SPACE true);

This would transform(trim leading and trailing ASCII whitespace):
" AAA ", " BBB", "CCC "

into:
'AAA', 'BBB', 'CCC'
----------------

Thanks in advance for your comments.

I like this idea in general.

Since the docs already notes that leading or trailing spaces in CSV values
can cause errors during loading, at [1]https://www.postgresql.org/docs/devel/sql-copy.html, we'd likely need to update
that section as well if we proceed with this feature.

One question: should TRIM_SPACE remove only the literal space character (' '),
or should it also trim other whitespace characters (e.g., tab, newline,
those recognized by isspace())? I'm not sure how other databases define
this behavior, so it would be good to clarify the intended scope.

Regards,

[1]: https://www.postgresql.org/docs/devel/sql-copy.html

--
Fujii Masao

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#3)
Re: [Proposal] Adding TRIM_SPACE option to COPY

Fujii Masao <masao.fujii@gmail.com> writes:

On Tue, Nov 25, 2025 at 1:18 AM 河田達也 <kawatatatsuya0913@gmail.com> wrote:

I'd like to propose adding a new option, TRIM_SPACE, to the COPY command.

I like this idea in general.

I'm kind of down on it, because it's inevitably going to add
processing overhead to every COPY operation whether the feature
is used or not. I don't find it likely to be sufficiently
useful to justify that universal cost.

COPY is not a general-purpose filter or ETL tool, and we try
to make it one at our peril.

regards, tom lane

#5河田達也
kawatatatsuya0913@gmail.com
In reply to: Tom Lane (#4)
Re: [Proposal] Adding TRIM_SPACE option to COPY

Hi all,

Thank you very much for the helpful feedback so far.
I would like to take a moment to discuss and confirm the overall direction
before preparing the next revision (v2).
My goal is to ensure we are aligned on the intended behavior, scope, and
design considerations for the proposed TRIM_SPACE option.

To summarize the discussion and the points I plan to address in the next
action:

I have a suggestion, would it be better to restrict the change to only

"copy from"?
1. Scope (COPY FROM only)
I agree that the option should apply only to COPY FROM.
I will adjust the patch accordingly.

Since the docs already notes that leading or trailing spaces in CSV values
can cause errors during loading, at [1], we'd likely need to update
that section as well if we proceed with this feature.

2. Documentation
I will update the CSV section mentioned earlier so that it reflects
the behavior when TRIM_SPACE is enabled.

One question: should TRIM_SPACE remove only the literal space character

(' '),

or should it also trim other whitespace characters (e.g., tab, newline,
those recognized by isspace())?

3. Characters to trim
I plan to trim only the ASCII space character (' ') to keep the behavior
simple and avoid ambiguity.
Support for additional whitespace characters could be considered later
if there is consensus.

I'm kind of down on it, because it's inevitably going to add
processing overhead to every COPY operation whether the feature
is used or not. I don't find it likely to be sufficiently
useful to justify that universal cost.

4. Performance / overhead concerns
Thank you for raising this point.
I fully agree that the feature must not introduce overhead when
TRIM_SPACE is disabled.

The trimming logic will be executed only when the option is explicitly
enabled,
so there will be just a single additional conditional check.
While this does technically add some processing overhead, it is expected
to be slight in practice for typical CSV loads and unlikely to be a concern.

COPY is not a general-purpose filter or ETL tool, and we try
to make it one at our peril.

My intention is not to expand COPY into an ETL tool, but rather to
provide a small convenience option similar to FORCE_NULL or ON_ERROR,
to help users avoid common issues caused by unintended leading or
trailing spaces in CSV/text files.

I look forward to your thoughts and any additional feedback before
preparing the next patch.

Thank you again for all the valuable comments.

Best regards,
Tatsuya Kawata

#6Fujii Masao
masao.fujii@gmail.com
In reply to: 河田達也 (#5)
Re: [Proposal] Adding TRIM_SPACE option to COPY

On Tue, Nov 25, 2025 at 10:48 PM 河田達也 <kawatatatsuya0913@gmail.com> wrote:

One question: should TRIM_SPACE remove only the literal space character (' '),
or should it also trim other whitespace characters (e.g., tab, newline,
those recognized by isspace())?

3. Characters to trim
I plan to trim only the ASCII space character (' ') to keep the behavior simple and avoid ambiguity.
Support for additional whitespace characters could be considered later if there is consensus.

Understood.

I'm kind of down on it, because it's inevitably going to add
processing overhead to every COPY operation whether the feature
is used or not. I don't find it likely to be sufficiently
useful to justify that universal cost.

4. Performance / overhead concerns
Thank you for raising this point.
I fully agree that the feature must not introduce overhead when TRIM_SPACE is disabled.

The trimming logic will be executed only when the option is explicitly enabled,
so there will be just a single additional conditional check.
While this does technically add some processing overhead, it is expected to be slight in practice for typical CSV loads and unlikely to be a concern.

I also agree that this feature probably won't add noticeable overhead to
COPY when it isn't used, but it would still be good to measure
the performance impact of the patch.

COPY is not a general-purpose filter or ETL tool, and we try
to make it one at our peril.

My intention is not to expand COPY into an ETL tool, but rather to
provide a small convenience option similar to FORCE_NULL or ON_ERROR,
to help users avoid common issues caused by unintended leading or
trailing spaces in CSV/text files.

I see Tom's point. There might be many possible features that could process
COPY input/output, and we can't reasonably add all of them to PostgreSQL core.
So the question is which ones belong in core. On second thought,
perhaps features that are difficult or impossible to implement outside the core
are the ones that can be considered for COPY itself. Otherwise, it might be
better to avoid expanding COPY unnecessarily. Anyway I'd like to hear more
opinons on this.

As for TRIM_SPACE, it seems possible to implement it as an external module
and call it via COPY PROGRAM. Is this true?

Regards,

--
Fujii Masao

#7河田達也
kawatatatsuya0913@gmail.com
In reply to: Fujii Masao (#6)
Re: [Proposal] Adding TRIM_SPACE option to COPY

Hi all,

Thank you for the feedback and for taking the time to review this patch.

## Performance Measurement Results

I also agree that this feature probably won't add noticeable overhead to
COPY when it isn't used, but it would still be good to measure
the performance impact of the patch.

I conducted simple performance measurements with 5 million rows to assess
the overhead
when TRIM_SPACE is not used:

**BEFORE patch (master branch):**
Time: 5565.584 ms (00:05.566)
Time: 5626.593 ms (00:05.627)

**AFTER patch (TRIM_SPACE disabled):**
Time: 5840.472 ms (00:05.840)
Time: 5523.806 ms (00:05.524)

The overhead appears to be approximately 1.5%, which seems to be within
an acceptable range for typical COPY operations.

## Regarding COPY PROGRAM as an Alternative

I understand Tom's concern about adding features to core that could be
implemented externally.

As for TRIM_SPACE, it seems possible to implement it as an external module
and call it via COPY PROGRAM. Is this true?

You're right that TRIM_SPACE can technically be
achieved using COPY PROGRAM with an external script. I'd like to share
some observations about this approach:

**Example implementation with COPY PROGRAM:**

```python
#!/usr/bin/env python3
import sys
import csv

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

for row in reader:
trimmed_row = [field.strip() for field in row]
writer.writerow(trimmed_row)
```

```sql
COPY users FROM PROGRAM 'python3 /tmp/trim_csv.py'
WITH (FORMAT csv, HEADER true);
```

**Some considerations:**

While the COPY PROGRAM approach works, it does have some practical
limitations:
it requires elevated privileges (`pg_execute_server_program`), may not be
available on all platforms (particularly Windows), and requires users to
handle parsing correctly. That said, I understand the concern about
where to draw the line for features in core, especially given that options
like FORCE_NULL follow a similar pattern.

## My Thoughts

So the question is which ones belong in core. On second thought,
perhaps features that are difficult or impossible to implement outside

the core

are the ones that can be considered for COPY itself. Otherwise, it might

be

better to avoid expanding COPY unnecessarily. Anyway I'd like to hear more
opinons on this.

I can see the validity of the concern about feature creep.
I'm open to the community's perspective on whether this
belongs in core.

If the general feeling is that this functionality should remain external,
I'm happy to withdraw the patch.

Thank you again for your time and valuable feedback. This discussion has
been
very educational for me.

Best regards,
Tatsuya Kawata