Add LIMIT option to COPY FROM

Started by Shinya Kato · 2 months ago · 7 messages · hackers
#1Shinya Kato
shinya11.kato@gmail.com

Hi hackers,

I'd like to propose adding a LIMIT option to COPY FROM, which limits
the number of rows to load.

With COPY TO, we can use the LIMIT clause in the query to restrict
output rows, but COPY FROM has no equivalent way to limit the number
of rows to load (except using the PROGRAM option with external tools
like head). This patch resolves that asymmetry.

Syntax example:
- COPY t FROM STDIN (LIMIT 100);

This feature is useful for:
- Loading only the first N rows from a huge CSV file to verify data or
table definitions before a full import
- Sampling production data for staging or testing environments
- Preventing unexpectedly large data loads in ETL pipelines

Design:
- The LIMIT count applies after WHERE filtering and ON_ERROR skipping,
so it represents the actual number of rows inserted.
- When the source is STDIN, the server reads and discards the
remaining stream to keep the frontend/backend protocol synchronized.
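In pseudo-Python, the intended row-loop behavior looks roughly like this (a simplified sketch to illustrate the semantics, not the patch itself; the names `copy_from`, `where`, and `on_error_skip` are stand-ins for the real C-level machinery):

```python
def copy_from(rows, where=None, on_error_skip=False, limit=0):
    """Return the rows that would be inserted.  `rows` yields (ok, value)
    pairs, where ok=False stands for a soft parse error."""
    inserted = []
    for ok, value in rows:
        if limit > 0 and len(inserted) >= limit:
            # LIMIT reached: stop early.  For STDIN the server would keep
            # reading and discarding the rest of the stream here to stay
            # in sync with the frontend/backend protocol.
            break
        if not ok:
            if on_error_skip:
                continue          # ON_ERROR ignore: skip the bad row
            raise ValueError(value)
        if where is not None and not where(value):
            continue              # row filtered out by WHERE
        inserted.append(value)
    return inserted

# 10 input rows, one bad row (4), WHERE keeps even values, LIMIT 3:
stream = [(i != 4, i) for i in range(10)]
result = copy_from(stream, where=lambda v: v % 2 == 0,
                   on_error_skip=True, limit=3)
# result == [0, 2, 6]: exactly 3 rows inserted, counted post-filter
```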

The patch is attached. Thoughts?

--
Best regards,
Shinya Kato
NTT OSS Center

Attachments:

- v1-0001-Refactor-defGetCopyRejectLimitOption.patch (application/octet-stream, +15 -14)
- v1-0002-Add-LIMIT-option-to-COPY-FROM.patch (application/octet-stream, +178 -2)
#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Shinya Kato (#1)
Re: Add LIMIT option to COPY FROM

On Tuesday, February 3, 2026, Shinya Kato <shinya11.kato@gmail.com> wrote:

> Hi hackers,
>
> I'd like to propose adding a LIMIT option to COPY FROM, which limits
> the number of rows to load.
>
> With COPY TO, we can use the LIMIT clause in the query to restrict
> output rows, but COPY FROM has no equivalent way to limit the number
> of rows to load (except using the PROGRAM option with external tools
> like head). This patch resolves that asymmetry.
>
> Syntax example:
> - COPY t FROM STDIN (LIMIT 100);
>
> This feature is useful for:
> - Loading only the first N rows from a huge CSV file to verify data or
> table definitions before a full import

Would want it paired with offset for this use case.

> Design:
>
> - The LIMIT count applies after WHERE filtering and ON_ERROR skipping,
> so it represents the actual number of rows inserted.

Not sure about that choice. I’d go with pre-eval or implement both and
default to pre-eval.

David J.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shinya Kato (#1)
Re: Add LIMIT option to COPY FROM

Shinya Kato <shinya11.kato@gmail.com> writes:

> I'd like to propose adding a LIMIT option to COPY FROM, which limits
> the number of rows to load.

Do we really need this? Each random feature we load onto COPY
slows it down for everybody.

regards, tom lane

#4Florents Tselai
florents.tselai@gmail.com
In reply to: Shinya Kato (#1)
Re: Add LIMIT option to COPY FROM

On Tue, Feb 3, 2026 at 3:49 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

> Hi hackers,
>
> I'd like to propose adding a LIMIT option to COPY FROM, which limits
> the number of rows to load.
>
> With COPY TO, we can use the LIMIT clause in the query to restrict
> output rows, but COPY FROM has no equivalent way to limit the number
> of rows to load (except using the PROGRAM option with external tools
> like head). This patch resolves that asymmetry.
>
> Syntax example:
> - COPY t FROM STDIN (LIMIT 100);

I work with such scenarios a lot, and I can't see why COPY should worry
itself about such filtering. IRL most of what goes into COPY's STDIN
has already been filtered extensively, like:
... | head -n 100 | COPY t FROM STDIN

#5Shinya Kato
shinya11.kato@gmail.com
In reply to: Florents Tselai (#4)
Re: Add LIMIT option to COPY FROM

Thank you for the comments.

On Tue, Feb 3, 2026 at 11:12 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

>> Syntax example:
>> - COPY t FROM STDIN (LIMIT 100);
>>
>> This feature is useful for:
>> - Loading only the first N rows from a huge CSV file to verify data or
>> table definitions before a full import
>
> Would want it paired with offset for this use case.

COPY FROM already accepts HEADER <integer> to skip N leading lines, so
you can combine it with LIMIT to get OFFSET + LIMIT semantics today:
=# COPY t FROM 'file.csv' (HEADER 100, LIMIT 50);

>> Design:
>> - The LIMIT count applies after WHERE filtering and ON_ERROR skipping,
>> so it represents the actual number of rows inserted.
>
> Not sure about that choice. I’d go with pre-eval or implement both and
> default to pre-eval.

It is consistent with SQL semantics — SELECT ... WHERE ... LIMIT N
counts rows that pass the filter, not rows scanned. Pre-eval behavior
is already achievable externally (head -n), while post-eval can only
be done server-side, which makes it the more valuable choice for a
built-in option.
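To make the pre-eval/post-eval distinction concrete, here is a toy Python comparison (illustrative only; the predicate stands in for the WHERE clause):

```python
rows = list(range(10))           # input rows 0..9
pred = lambda r: r % 2 == 0      # stand-in for the WHERE clause
n = 3

# Pre-eval (head -n style): truncate the input first, then filter.
# This may insert fewer than n rows.
pre = [r for r in rows[:n] if pred(r)]
# pre == [0, 2] — only 2 rows inserted despite asking for 3

# Post-eval (SQL LIMIT style): filter first, stop after n matches.
# This guarantees exactly n inserted rows (if enough input exists).
post = []
for r in rows:
    if pred(r):
        post.append(r)
        if len(post) == n:
            break
# post == [0, 2, 4] — exactly 3 rows inserted
```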

On Tue, Feb 3, 2026 at 11:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Shinya Kato <shinya11.kato@gmail.com> writes:
>> I'd like to propose adding a LIMIT option to COPY FROM, which limits
>> the number of rows to load.
>
> Do we really need this? Each random feature we load onto COPY
> slows it down for everybody.

When LIMIT is not specified (the default), the only overhead is a
single if (limit > 0) branch per row on a struct field already in L1
cache — the same class of cost as the existing ON_ERROR and WHERE
checks.

This is not a novel feature either. Oracle SQL*Loader provides LOAD,
and SQL Server BULK INSERT provides LASTROW. The absence of LIMIT in
COPY FROM is arguably a gap relative to other RDBMSes.

On Wed, Feb 4, 2026 at 12:07 AM Florents Tselai
<florents.tselai@gmail.com> wrote:

> I work with such scenarios a lot, and I can't see why COPY should worry itself about such filtering.
> IRL most of what goes into COPY's STDIN has already been filtered extensively,
> like ... | head -n 100 | COPY t FROM STDIN

head -n works for the STDIN-pipe workflow, but not for all COPY FROM scenarios:
- Server-side files via remote psql: COPY t FROM
'/server/path/file.csv' is read directly by the server process. A
client connected over the network has no way to interpose a pipe on
that I/O path.
- Interaction with WHERE / ON_ERROR: head -n 100 limits input lines,
but cannot guarantee a specific number of inserted rows when some rows
are filtered by WHERE or skipped by ON_ERROR. That control is only
possible server-side.

The same "do it outside" argument could be made against the WHERE
clause ("just use grep"), yet WHERE was accepted because server-side
filtering provides value that external tools cannot fully replicate.
LIMIT fills the same kind of gap.

--
Best regards,
Shinya Kato
NTT OSS Center

#6David G. Johnston
david.g.johnston@gmail.com
In reply to: Shinya Kato (#5)
Re: Add LIMIT option to COPY FROM

On Tue, Feb 3, 2026 at 6:07 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

>> Would want it paired with offset for this use case.
>
> COPY FROM already accepts HEADER <integer> to skip N leading lines, so
> you can combine it with LIMIT to get OFFSET + LIMIT semantics today:
> =# COPY t FROM 'file.csv' (HEADER 100, LIMIT 50);

Indirect, but fair, assuming the limit is indeed pre-parsed lines. You
need OFFSET to match post-processed counted lines.

>>> Design:
>>> - The LIMIT count applies after WHERE filtering and ON_ERROR skipping,
>>> so it represents the actual number of rows inserted.
>>
>> Not sure about that choice. I’d go with pre-eval or implement both and
>> default to pre-eval.
>
> It is consistent with SQL semantics — SELECT ... WHERE ... LIMIT N
> counts rows that pass the filter, not rows scanned.

Ok, I suppose I'd name it "stop N" instead of "limit N" to avoid this.

> Pre-eval behavior is already achievable externally (head -n), while
> post-eval can only be done server-side, which makes it the more
> valuable choice for a built-in option.

Not seeing the need for either-or; and everywhere else you want to support
this feature by assuming that external tools aren't available.

So; HEADER+STOP, HEADER?+OFFSET+LIMIT

> This is not a novel feature either. Oracle SQL*Loader provides LOAD,
> and SQL Server BULK INSERT provides LASTROW. The absence of LIMIT in
> COPY FROM is arguably a gap relative to other RDBMSes.

There'd be a lot less friction of this sort if someone just bites the
bullet and devises a core-managed ETL tool instead of attaching pieces
one-by-one into COPY. Or at least maybe we get a fast-path version to
handle typical dump-restore commands and then branch to the ETL path if the
command options indicate doing so is needed. There is apparently too much
demand for this stuff for a nonproliferation agreement.

David J.

#7Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#3)
Re: Add LIMIT option to COPY FROM

On Tue, Feb 03, 2026 at 09:41:54AM -0500, Tom Lane wrote:

> Do we really need this? Each random feature we load onto COPY
> slows it down for everybody.

[ ...reads the patch... ]
@@ -1122,6 +1122,32 @@ CopyFrom(CopyFromState cstate)
*/
ResetPerTupleExprContext(estate);

+		/*
+		 * Stop early when LIMIT would be exceeded.  In multi-insert mode,
+		 * include already buffered tuples so we don't overshoot.
+		 */
+		if (cstate->opts.limit > 0)

This is an extra check in the per-row processing of COPY FROM, that
can never be skipped as the limit needs to be checked all the time,
once per row. This is not going to be entirely free, more noticeable
when processing many rows with few attributes.
--
Michael