pgsql: Refactor COPY FROM to use format callback functions.

Started by Masahiko Sawadaover 1 year ago9 messagescomitters
Jump to latest
#1Masahiko Sawada
sawada.mshk@gmail.com

Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/7717f63006935de00fafd000bff450280508adf1

Modified Files
--------------
src/backend/commands/copyfrom.c | 192 ++++++++++---
src/backend/commands/copyfromparse.c | 446 +++++++++++++++++--------------
src/include/commands/copy.h | 2 -
src/include/commands/copyapi.h | 50 +++-
src/include/commands/copyfrom_internal.h | 11 +
src/tools/pgindent/typedefs.list | 1 +
6 files changed, 459 insertions(+), 243 deletions(-)

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#1)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:

Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei<kou@clear-code.com>
Reviewed-by: Masahiko Sawada<sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier<michael@paquier.xyz>
Reviewed-by: Andres Freund<andres@anarazel.de>
Reviewed-by: Tomas Vondra<tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao<zhjwpku@gmail.com>
Discussion:/messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com

This patch has completely broken the file_textarray fdw, which uses
NextCopyFromRawFields(). Removing that from API is not a good thing.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#2)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:

Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com

This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.

Thank you for pointing it out.

I've just posted my analysis[1]/messages/by-id/CAD21AoDr13=dx+k8gmQnR5_bY+NskyN4mbSWN0KhQncL6xuPMA@mail.gmail.com and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?

Regards,

[1]: /messages/by-id/CAD21AoDr13=dx+k8gmQnR5_bY+NskyN4mbSWN0KhQncL6xuPMA@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#3)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote:

On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:

Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com

This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.

Thank you for pointing it out.

I've just posted my analysis[1] and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?

Looks OK, I think. You could even use the Internal function further down
in the file and avoid a function call.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#5Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#4)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On Fri, Feb 28, 2025 at 12:14 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote:

On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:

Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com

This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.

Thank you for pointing it out.

I've just posted my analysis[1] and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?

Looks OK, I think. You could even use the Internal function further down
in the file and avoid a function call.

Right. I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

0001-Re-export-NextCopyFromRawFields-to-copy.h.patchapplication/octet-stream; name=0001-Re-export-NextCopyFromRawFields-to-copy.h.patchDownload+19-4
#6Sutou Kouhei
kou@clear-code.com
In reply to: Masahiko Sawada (#5)
Re: pgsql: Refactor COPY FROM to use format callback functions.

Hi,

Thanks for following up the patch!

In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Right. I've attached the updated patch.

In general, this approach will work but could you keep
the same signature for backward compatibility?

--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+	return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}

NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?

bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}

@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
/*
* Read raw fields in the next line for COPY FROM in text or csv mode.
* Return false if no more lines.
+ */
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+	return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}
+
+/*
+ * Workhorse for NextCopyFromRawFields().

It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?

/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}

/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}

Thanks,
--
kou

#7Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Sutou Kouhei (#6)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:

Hi,

Thanks for following up the patch!

In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Right. I've attached the updated patch.

In general, this approach will work but could you keep
the same signature for backward compatibility?

--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}

NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?

bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}

I like this idea.

@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
/*
* Read raw fields in the next line for COPY FROM in text or csv mode.
* Return false if no more lines.
+ */
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}
+
+/*
+ * Workhorse for NextCopyFromRawFields().

It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?

/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}

/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}

Agreed.

I've updated the patch. I'm going to push it, barring any objections
and further comments.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Re-export-NextCopyFromRawFields-to-copy.h.patchapplication/octet-stream; name=v2-0001-Re-export-NextCopyFromRawFields-to-copy.h.patchDownload+25-6
#8Andrew Dunstan
andrew@dunslane.net
In reply to: Masahiko Sawada (#7)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:

On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei<kou@clear-code.com> wrote:

Hi,

Thanks for following up the patch!

In<CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada<sawada.mshk@gmail.com> wrote:

Right. I've attached the updated patch.

In general, this approach will work but could you keep
the same signature for backward compatibility?

--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}

NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?

bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}

I like this idea.

@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
/*
* Read raw fields in the next line for COPY FROM in text or csv mode.
* Return false if no more lines.
+ */
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}
+
+/*
+ * Workhorse for NextCopyFromRawFields().

It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?

/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}

/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}

Agreed.

I've updated the patch. I'm going to push it, barring any objections
and further comments.

I'm OK either way - I have made changes to adapt to the API change, and
tested them.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andrew Dunstan (#8)
Re: pgsql: Refactor COPY FROM to use format callback functions.

On Fri, Feb 28, 2025 at 3:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:

On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:

Hi,

Thanks for following up the patch!

In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Right. I've attached the updated patch.

In general, this approach will work but could you keep
the same signature for backward compatibility?

--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}

NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?

bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}

I like this idea.

@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
/*
* Read raw fields in the next line for COPY FROM in text or csv mode.
* Return false if no more lines.
+ */
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}
+
+/*
+ * Workhorse for NextCopyFromRawFields().

It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?

/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}

/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}

Agreed.

I've updated the patch. I'm going to push it, barring any objections
and further comments.

I'm OK either way - I have made changes to adapt to the API change, and tested them.

Thank you for the confirmation. Pushed the fix.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com