pgsql: Refactor COPY FROM to use format callback functions.
Refactor COPY FROM to use format callback functions.
This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.
This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.
Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.
Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/7717f63006935de00fafd000bff450280508adf1
Modified Files
--------------
src/backend/commands/copyfrom.c | 192 ++++++++++---
src/backend/commands/copyfromparse.c | 446 +++++++++++++++++--------------
src/include/commands/copy.h | 2 -
src/include/commands/copyapi.h | 50 +++-
src/include/commands/copyfrom_internal.h | 11 +
src/tools/pgindent/typedefs.list | 1 +
6 files changed, 459 insertions(+), 243 deletions(-)
On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions.
This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.Author: Sutou Kouhei<kou@clear-code.com>
Reviewed-by: Masahiko Sawada<sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier<michael@paquier.xyz>
Reviewed-by: Andres Freund<andres@anarazel.de>
Reviewed-by: Tomas Vondra<tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao<zhjwpku@gmail.com>
Discussion:/messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com
This patch has completely broken the file_textarray fdw, which uses
NextCopyFromRawFields(). Removing that from API is not a good thing.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions.
This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.comThis patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.
Thank you for pointing it out.
I've just posted my analysis[1]/messages/by-id/CAD21AoDr13=dx+k8gmQnR5_bY+NskyN4mbSWN0KhQncL6xuPMA@mail.gmail.com and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?
Regards,
[1]: /messages/by-id/CAD21AoDr13=dx+k8gmQnR5_bY+NskyN4mbSWN0KhQncL6xuPMA@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions.
This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.comThis patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.
Thank you for pointing it out.
I've just posted my analysis[1] and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?
Looks OK, I think. You could even use the Internal function further down
in the file and avoid a function call.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 12:14 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions.
This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.comThis patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.
Thank you for pointing it out.
I've just posted my analysis[1] and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?Looks OK, I think. You could even use the Internal function further down
in the file and avoid a function call.
Right. I've attached the updated patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-Re-export-NextCopyFromRawFields-to-copy.h.patchapplication/octet-stream; name=0001-Re-export-NextCopyFromRawFields-to-copy.h.patchDownload+19-4
Hi,
Thanks for following up the patch!
In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Right. I've attached the updated patch.
In general, this approach will work but could you keep
the same signature for backward compatibility?
--- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c
+bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +}
NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?
bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}
@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /* * Read raw fields in the next line for COPY FROM in text or csv mode. * Return false if no more lines. + */ +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +} + +/* + * Workhorse for NextCopyFromRawFields().
It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?
/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}
/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}
Thanks,
--
kou
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,
Thanks for following up the patch!
In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:Right. I've attached the updated patch.
In general, this approach will work but could you keep
the same signature for backward compatibility?--- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c+bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +}NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}
I like this idea.
@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /* * Read raw fields in the next line for COPY FROM in text or csv mode. * Return false if no more lines. + */ +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +} + +/* + * Workhorse for NextCopyFromRawFields().It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}
Agreed.
I've updated the patch. I'm going to push it, barring any objections
and further comments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Re-export-NextCopyFromRawFields-to-copy.h.patchapplication/octet-stream; name=v2-0001-Re-export-NextCopyFromRawFields-to-copy.h.patchDownload+25-6
On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei<kou@clear-code.com> wrote:
Hi,
Thanks for following up the patch!
In<CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada<sawada.mshk@gmail.com> wrote:Right. I've attached the updated patch.
In general, this approach will work but could you keep
the same signature for backward compatibility?--- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +}NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}I like this idea.
@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /* * Read raw fields in the next line for COPY FROM in text or csv mode. * Return false if no more lines. + */ +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +} + +/* + * Workhorse for NextCopyFromRawFields().It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}Agreed.
I've updated the patch. I'm going to push it, barring any objections
and further comments.
I'm OK either way - I have made changes to adapt to the API change, and
tested them.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 3:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,
Thanks for following up the patch!
In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:Right. I've attached the updated patch.
In general, this approach will work but could you keep
the same signature for backward compatibility?--- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c+bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +}NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}I like this idea.
@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /* * Read raw fields in the next line for COPY FROM in text or csv mode. * Return false if no more lines. + */ +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +} + +/* + * Workhorse for NextCopyFromRawFields().It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}Agreed.
I've updated the patch. I'm going to push it, barring any objections
and further comments.I'm OK either way - I have made changes to adapt to the API change, and tested them.
Thank you for the confirmation. Pushed the fix.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com