Binary support for pgoutput plugin
Is there a reason why pgoutput sends data in text format? Seems to me that
sending data in binary would provide a considerable performance improvement.
Dave Cramer
On Mon, Jun 03, 2019 at 10:49:54AM -0400, Dave Cramer wrote:
> Is there a reason why pgoutput sends data in text format? Seems to
> me that sending data in binary would provide a considerable
> performance improvement.
Are you seeing something that suggests that the text output is taking
a lot of time or other resources?
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Dave Cramer
On Mon, 3 Jun 2019 at 20:54, David Fetter <david@fetter.org> wrote:
> On Mon, Jun 03, 2019 at 10:49:54AM -0400, Dave Cramer wrote:
> > Is there a reason why pgoutput sends data in text format? Seems to
> > me that sending data in binary would provide a considerable
> > performance improvement.
>
> Are you seeing something that suggests that the text output is taking
> a lot of time or other resources?

Actually it's on the other end that there is improvement. Parsing text
takes much longer for almost everything except, ironically, text.
To be more transparent, there is some desire to use pgoutput for something
other than logical replication. Change Data Capture clients such as
Debezium have a requirement for a stable plugin which is shipped with core,
as this is what is always available in cloud providers' offerings. There's
no reason that I am aware of that they cannot use pgoutput for this.
There's also no reason that I am aware of that binary outputs can't be
supported. The protocol would have to change slightly, and I am working on
a POC patch.
Thing is, they aren't all written in C, so using binary does provide a
pretty substantial win on the decoding end.
Dave
Hi,
On 2019-06-04 15:47:04 -0400, Dave Cramer wrote:
> On Mon, 3 Jun 2019 at 20:54, David Fetter <david@fetter.org> wrote:
> > Are you seeing something that suggests that the text output is taking
> > a lot of time or other resources?
>
> Actually it's on the other end that there is improvement. Parsing text
> takes much longer for almost everything except, ironically, text.
It's on both sides, I'd say. E.g. float (until v12), timestamp, bytea
are all much more expensive to convert from binary to text.
> To be more transparent there is some desire to use pgoutput for something
> other than logical replication. Change Data Capture clients such as
> Debezium have a requirement for a stable plugin which is shipped with core
> as this is always available in cloud providers' offerings. There's no reason
> that I am aware of that they cannot use pgoutput for this.
Except that that's not pgoutput's purpose, and we shouldn't make it
meaningfully more complicated or slower to achieve this. Don't think
there's a conflict in this case though.
> There's also no reason that I am aware that binary outputs can't be
> supported.
Well, it *does* increase version dependencies, and does make replication
more complicated, because type oids etc cannot be relied on to be the same
on source and target side.
> The protocol would have to change slightly and I am working
> on a POC patch.
Hm, what would have to be changed protocol wise? IIRC that'd just be a
different datum type? Or is that what you mean?
pq_sendbyte(out, 't'); /* 'text' data follows */
IIRC there was code for the binary protocol in a predecessor of
pgoutput.
I think if we were to add binary output - and I think we should - we
ought to only accept a patch if it's also used in core.
Greetings,
Andres Freund
Dave Cramer
On Tue, 4 Jun 2019 at 16:30, Andres Freund <andres.freund@enterprisedb.com>
wrote:
> Hi,
>
> On 2019-06-04 15:47:04 -0400, Dave Cramer wrote:
> > To be more transparent there is some desire to use pgoutput for something
> > other than logical replication. Change Data Capture clients such as
> > Debezium have a requirement for a stable plugin which is shipped with core
> > as this is always available in cloud providers' offerings. There's no
> > reason that I am aware of that they cannot use pgoutput for this.
>
> Except that that's not pgoutput's purpose, and we shouldn't make it
> meaningfully more complicated or slower to achieve this. Don't think
> there's a conflict in this case though.
agreed, my intent was to slightly bend it to my will :)
> > There's also no reason that I am aware that binary outputs can't be
> > supported.
>
> Well, it *does* increase version dependencies, and does make replication
> more complicated, because type oids etc cannot be relied on to be the same
> on source and target side.

I was about to agree with this, but if the type oids change from source to
target you still can't decode the text version properly. Unless I
misunderstand something here?
> > The protocol would have to change slightly and I am working
> > on a POC patch.
>
> Hm, what would have to be changed protocol wise? IIRC that'd just be a
> different datum type? Or is that what you mean?
>
>     pq_sendbyte(out, 't'); /* 'text' data follows */

I haven't really thought this through completely, but one place JDBC has
problems with binary is timestamps with time zone, as we don't know which
time zone to use. Is it safe to assume everything is in UTC, since the
server stores in UTC? Then there are UDFs. My original thought was to use
options to send in the types that I wanted in binary; everything else
could be sent as text.
> IIRC there was code for the binary protocol in a predecessor of
> pgoutput.
Hmmm, that might be a good place to start. I will do some digging through
the git history.
> I think if we were to add binary output - and I think we should - we
> ought to only accept a patch if it's also used in core.
Certainly; as not doing so would make my work completely irrelevant for my
purpose.
Thanks,
Dave
Hi,
On 2019-06-04 16:39:32 -0400, Dave Cramer wrote:
> On Tue, 4 Jun 2019 at 16:30, Andres Freund <andres.freund@enterprisedb.com> wrote:
> > > There's also no reason that I am aware that binary outputs can't be
> > > supported.
> >
> > Well, it *does* increase version dependencies, and does make replication
> > more complicated, because type oids etc cannot be relied on to be the same
> > on source and target side.
>
> I was about to agree with this, but if the type oids change from source
> to target you still can't decode the text version properly. Unless I
> misunderstand something here?
The text format doesn't care about oids. I don't see how it'd be a
problem? Note that some people *intentionally* use different types from
source to target system when logically replicating. So you can't rely on
the target table's types under any circumstance.
I think you really have to use the textual type which we already write
out (cf logicalrep_write_typ()) to call the binary input functions. And
you can send only data as binary that's from builtin types - otherwise
there's no guarantee at all that the target system has something
compatible. And even if you just assumed that all extensions etc are
present, you can't transport arrays / composite types in binary: For
hard to discern reasons we a) embed type oids in them b) verify them. b)
won't ever work for non-builtin types, because oids are assigned
dynamically.
> > I think if we were to add binary output - and I think we should - we
> > ought to only accept a patch if it's also used in core.
>
> Certainly; as not doing so would make my work completely irrelevant for my
> purpose.
What I mean is that the builtin logical replication would have to use
this on the receiving side too.
Greetings,
Andres Freund
On Tue, 4 Jun 2019 at 16:46, Andres Freund <andres@anarazel.de> wrote:
> The text format doesn't care about oids. I don't see how it'd be a
> problem? Note that some people *intentionally* use different types from
> source to target system when logically replicating. So you can't rely on
> the target table's types under any circumstance.
>
> I think you really have to use the textual type which we already write
> out (cf logicalrep_write_typ()) to call the binary input functions. And
> you can send only data as binary that's from builtin types - otherwise
> there's no guarantee at all that the target system has something
> compatible. And even if you just assumed that all extensions etc are
> present, you can't transport arrays / composite types in binary: For
> hard to discern reasons we a) embed type oids in them b) verify them. b)
> won't ever work for non-builtin types, because oids are assigned
> dynamically.
I figured arrays and UDT's would be problematic.
> > > I think if we were to add binary output - and I think we should - we
> > > ought to only accept a patch if it's also used in core.
> >
> > Certainly; as not doing so would make my work completely irrelevant for
> > my purpose.
>
> What I mean is that the builtin logical replication would have to use
> this on the receiving side too.

Got it, thanks for validating that the idea isn't nuts. Now I *have* to
produce a POC.
Thanks,
Dave
On 6/4/19 4:39 PM, Dave Cramer wrote:
> I haven't really thought this through completely, but one place JDBC has
> problems with binary is timestamps with time zone, as we don't know which
> time zone to use. Is it safe to assume everything is in UTC, since the
> server stores in UTC?
PL/Java, when converting to the Java 8 java.time types (because those
are sane), will turn a timestamp with timezone into an OffsetDateTime
with explicit offset zero (UTC), no matter what timezone may have been
used when the value was input (as you've observed, there's no way to
recover that). In the return direction, if given an OffsetDateTime
with any nonzero offset, it will adjust the value to UTC for postgres.
So, yes, say I.
Regards,
-Chap
On Tue, Jun 04, 2019 at 04:55:33PM -0400, Dave Cramer wrote:
> On Tue, 4 Jun 2019 at 16:46, Andres Freund <andres@anarazel.de> wrote:
> > [...] you can't transport arrays / composite types in binary: For
> > hard to discern reasons we a) embed type oids in them b) verify them. b)
> > won't ever work for non-builtin types, because oids are assigned
> > dynamically.
>
> I figured arrays and UDT's would be problematic.
Would it make sense to work toward a binary format that's not
architecture-specific? I recall from COPY that our binary format is
not standardized across, for example, big- and little-endian machines.
Best,
David.
Hi,
On 2019-06-05 00:05:02 +0200, David Fetter wrote:
Would it make sense to work toward a binary format that's not
architecture-specific? I recall from COPY that our binary format is
not standardized across, for example, big- and little-endian machines.
I think you recall wrongly. It's obviously possible that we have bugs
around this, but output/input routines are supposed to handle an
endianness-independent format. That usually means that you have to do
endianness conversions, but that doesn't make it non-standardized.
- Andres
On Tue, 4 Jun 2019 at 18:08, Andres Freund <andres@anarazel.de> wrote:
> On 2019-06-05 00:05:02 +0200, David Fetter wrote:
> > Would it make sense to work toward a binary format that's not
> > architecture-specific? I recall from COPY that our binary format is
> > not standardized across, for example, big- and little-endian machines.
>
> I think you recall wrongly. It's obviously possible that we have bugs
> around this, but output/input routines are supposed to handle an
> endianness-independent format. That usually means that you have to do
> endianness conversions, but that doesn't make it non-standardized.
Additionally there are a number of drivers that already know how to handle
our binary types.
I don't really think there's a win here. I also want to keep the changes
small.
Dave
Hi,
On 05/06/2019 00:08, Andres Freund wrote:
> On 2019-06-05 00:05:02 +0200, David Fetter wrote:
> > Would it make sense to work toward a binary format that's not
> > architecture-specific? I recall from COPY that our binary format is
> > not standardized across, for example, big- and little-endian machines.
>
> I think you recall wrongly. It's obviously possible that we have bugs
> around this, but output/input routines are supposed to handle an
> endianness-independent format. That usually means that you have to do
> endianness conversions, but that doesn't make it non-standardized.
Yeah, there are really three formats of data we have: text protocol, binary
network protocol, and internal on-disk format. The internal on-disk format
will not work across big/little-endian machines, but the network binary
protocol will.
FWIW I don't think the code for binary format was included in the original
logical replication patch (I really tried to keep it as minimal as
possible), but the code and protocol is pretty much ready for adding that.
That said, pglogical has code which handles this (I guess that is what
Andres means by a predecessor of pgoutput), so if you look for example at
write_tuple/read_tuple/decide_datum_transfer in
https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c
that can give you some ideas on how to approach this.
--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions
https://www.2ndQuadrant.com/
Hi,
On Wed, 5 Jun 2019 at 07:18, Petr Jelinek <petr.jelinek@2ndquadrant.com>
wrote:
> Yeah, there are really three formats of data we have: text protocol, binary
> network protocol, and internal on-disk format. The internal on-disk format
> will not work across big/little-endian machines, but the network binary
> protocol will.
>
> FWIW I don't think the code for binary format was included in the original
> logical replication patch (I really tried to keep it as minimal as
> possible), but the code and protocol is pretty much ready for adding that.
Yes, I looked through the public history and could not find it. Thanks for
confirming.
> That said, pglogical has code which handles this (I guess that is what
> Andres means by a predecessor of pgoutput), so if you look for example at
> write_tuple/read_tuple/decide_datum_transfer in
> https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c
> that can give you some ideas on how to approach this.
Thanks for the tip!
Dave Cramer
On Wed, 5 Jun 2019 at 07:21, Dave Cramer <davecramer@gmail.com> wrote:
> > That said, pglogical has code which handles this (I guess that is what
> > Andres means by a predecessor of pgoutput) so if you look for example at
> > write_tuple/read_tuple/decide_datum_transfer in
> > https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c
> > that can give you some ideas on how to approach this.
>
> Thanks for the tip!
this seems completely ignored. What was the intent?
Dave
Hi
On June 5, 2019 8:51:10 AM PDT, Dave Cramer <davecramer@gmail.com> wrote:
> > That said, pglogical has code which handles this [...] so if you look
> > for example at write_tuple/read_tuple/decide_datum_transfer in
> > https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c
>
> this seems completely ignored. What was the intent?
That's about the output of the plugin, not the datatypes. And independent of text/binary output, the protocol contains non-printable chars.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi,
On Wed, 5 Jun 2019 at 12:01, Andres Freund <andres@anarazel.de> wrote:
Hi,

On Wed, 5 Jun 2019 at 12:01, Andres Freund <andres@anarazel.de> wrote:
> > this seems completely ignored. What was the intent?
>
> That's about the output of the plugin, not the datatypes. And independent
> of text/binary output, the protocol contains non-printable chars.
So one of the things they would like added is to get not-null information
in the schema record. This is so they can mark the field Optional in Java.
I presume this would also have some uses in other languages. As I
understand it this would require a protocol bump. If this were to be
accepted, are there any outstanding asks that would be useful to add if we
were going to bump the protocol?
Dave
Hi,
On 2019-06-05 18:47:57 -0400, Dave Cramer wrote:
> So one of the things they would like added is to get not-null information
> in the schema record. This is so they can mark the field Optional in Java.
> I presume this would also have some uses in other languages. As I
> understand it this would require a protocol bump. If this were to be
> accepted, are there any outstanding asks that would be useful to add if we
> were going to bump the protocol?
I'm pretty strongly opposed to this. What's the limiting factor when
adding such information? I think clients that want something like this
ought to query the database for catalog information when getting schema
information.
Greetings,
Andres Freund
Hi,
On Wed, 5 Jun 2019 at 18:50, Andres Freund <andres@anarazel.de> wrote:
> I'm pretty strongly opposed to this. What's the limiting factor when
> adding such information? I think clients that want something like this
> ought to query the database for catalog information when getting schema
> information.
I'm not intimately familiar with their code. I will query them more about
the ask.
I am curious why you are "strongly" opposed, however. We already have the
information; adding it doesn't seem onerous.
Dave
Hi,
On 2019-06-05 19:05:05 -0400, Dave Cramer wrote:
> I am curious why you are "strongly" opposed, however. We already have the
> information; adding it doesn't seem onerous.
(thought I'd already replied with this)
The problem is that I don't recognize a limiting principle:
If we want NOT NULL information for clients, why don't we include the
underlying types for arrays, and the fields in composite types? What
about foreign keys? And unique keys?
And then we suddenly need tracking for all these, so we don't always
send out that information when we previously already did - and in some
of the cases there's no infrastructure for that.
I just don't quite buy that the output plugin built for pg's logical
replication needs is a good place to include a continually increasing
amount of metadata that logical replication doesn't need. That's going
to add overhead and make the code more complicated.
Greetings,
Andres Freund
On 06/07/19 19:27, Andres Freund wrote:
> The problem is that I don't recognize a limiting principle:
> If we want NOT NULL information for clients, why don't we include the
> underlying types for arrays, and the fields in composite types? What
> about foreign keys? And unique keys?
This reminds me of an idea I had for a future fe/be protocol version,
right after a talk by Alyssa Ritchie and Henrietta Dombrovskaya at the
last 2Q PGConf. [1]
It seems they had ended up designing a whole 'nother "protocol level"
involving queries wrapping their results as JSON and an app layer that
unwraps again, after trying a simpler first approach that was foiled by the
inability to see into arrays and anonymous record types in the 'describe'
response.
I thought, in a new protocol rev, why not let the driver send additional
'describe' messages after the first one, to drill into structure of
individual columns mentioned in the first response, before sending the
'execute' message?
If it doesn't want the further detail, it doesn't have to ask.
> And then we suddenly need tracking for all these, so we don't always
> send out that information when we previously already did
If it's up to the client driver, it can track what it needs or already has.
I haven't looked too deeply into the replication protocol ... it happens
under a kind of copy-both, right?, so maybe there's a way for the receiver
to send some inquiries back, but maybe in a windowed, full-duplex way where
it might have to buffer some incoming messages before getting the response
to an inquiry message it sent.
Would those be thinkable thoughts for a future protocol rev?
Regards,
-Chap
[1]: https://www.2qpgconf.com/schedule/information-exchange-techniques-for-javapostgresql-applications/