question regarding copyData containers

Started by Jerome Wagner, over 5 years ago, 4 messages

jerome.wagner@laposte.net

over 5 years ago

Hello,

I have been working on a node.js streaming client for different COPY
scenarios.
Usually, during CopyOut, clients tend to buffer network chunks until they
have gathered a full copyData message, and then pass that to the user.

In some cases, this can lead to very large copyData messages. When there
are very long text fields or bytea fields, it will require a lot of memory
to be handled (up to 1GB I think in the worst case scenario).

In COPY TO, I managed to relax that requirement, considering that copyData
is simply a transparent container. For each network chunk, the relevant
message content is forwarded, which makes for 64KB chunks at most.

If that makes things clearer, here is an example scenario, with 4 network
chunks received and the way they are forwarded to the client.

in: CopyData Int32Len Byten1
in: Byten2
in: Byten3
in: CopyData Int32Len Byten4

out: Byten1
out: Byten2
out: Byten3
out: Byten4
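
A minimal Python sketch of this forwarding, for illustration only (it is not the node.js client discussed here). It strips the CopyData type byte and Int32 length and yields payload bytes as each network chunk arrives. The name forward_copydata_payload is hypothetical, and the sketch assumes the stream contains nothing but CopyData ('d') messages.

```python
import struct
from typing import Iterable, Iterator


def forward_copydata_payload(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield CopyData payload bytes chunk by chunk, never buffering a full message.

    Assumption: only 'd' messages appear; a real client must also handle
    CopyDone, ErrorResponse and other message types.
    """
    header = b""   # partial 5-byte header: type byte + Int32 length
    remaining = 0  # payload bytes still expected for the current message
    for chunk in chunks:
        pos = 0
        while pos < len(chunk):
            if remaining == 0:
                # The header itself may straddle two network chunks.
                take = min(5 - len(header), len(chunk) - pos)
                header += chunk[pos:pos + take]
                pos += take
                if len(header) < 5:
                    break
                msg_type, length = struct.unpack("!cI", header)
                if msg_type != b"d":
                    raise ValueError(f"unexpected message type {msg_type!r}")
                remaining = length - 4  # the length counts itself, not the type byte
                header = b""
            else:
                take = min(remaining, len(chunk) - pos)
                yield chunk[pos:pos + take]
                pos += take
                remaining -= take
```

Memory use is bounded by the size of one network chunk, at the cost of the consumer no longer seeing message (row) boundaries.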

We lose the semantics of the "row" that copyData has according to the
documentation
https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY

The backend sends a CopyOutResponse message to the frontend, followed by
zero or more CopyData messages (**always one per row**), followed by
CopyDone

but it is not a problem because the raw bytes are still parsable (rows +
fields) in text mode (tsv) and in binary mode.

Now I started working on copyBoth and logical decoding scenarios. In this
case, the server sends a series of copyData messages, each copyData
containing 1 message.

At the network chunk level, in the case of large fields, we can observe:

in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1
in: Byten2
in: CopyData Int32 XLogData Int64 Int64 Int64 Byten3
in: CopyData Int32 XLogData Int64 Int64 Int64 Byten4

out: XLogData Int64 Int64 Int64 Byten1
out: Byten2
out: XLogData Int64 Int64 Int64 Byten3
out: XLogData Int64 Int64 Int64 Byten4

But at the XLogData level, the protocol does not self-describe its length,
so there is no real way of knowing where the first XLogData ends apart from
- knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2)
- knowing the internals of the output plugin and benefiting from a plugin
that self-describes its span

When a network chunk contains several copyData messages
in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 CopyData Int32
XLogData Int64 Int64 Int64 Byten2
we have
out: XLogData Int64 Int64 Int64 Byten1 XLogData Int64 Int64 Int64 Byten2

and with test_decoding for example it is impossible to know where the
test_decoding output ends without remembering the original length of the
copyData.
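
The dependency on the copyData length can be sketched in Python as follows. This is an illustration under assumptions: split_xlogdata and the XLogData tuple are hypothetical names, the buffer is assumed to hold only complete CopyData messages, and each is assumed to carry exactly one XLogData ('w') message (keepalives and other messages are ignored here).

```python
import struct
from typing import List, NamedTuple


class XLogData(NamedTuple):
    wal_start: int
    wal_end: int
    send_time: int  # server clock, microseconds since 2000-01-01
    data: bytes     # output plugin payload; its length is implied, never stored


XLOG_HEADER = struct.Struct("!cQQq")  # 'w' + two Int64 LSNs + Int64 time = 25 bytes


def split_xlogdata(buf: bytes) -> List[XLogData]:
    """Split back-to-back, complete CopyData messages into XLogData records."""
    out = []
    pos = 0
    while pos < len(buf):
        msg_type, length = struct.unpack_from("!cI", buf, pos)
        if msg_type != b"d":
            raise ValueError(f"unexpected message type {msg_type!r}")
        # The CopyData length is the only thing that says where the payload ends.
        body = buf[pos + 5:pos + 1 + length]
        tag, start, end, ts = XLOG_HEADER.unpack_from(body)
        if tag != b"w":
            raise ValueError(f"unexpected CopyData content {tag!r}")
        out.append(XLogData(start, end, ts, body[XLOG_HEADER.size:]))
        pos += 1 + length
    return out
```

Once the 'd' header is discarded, nothing in the remaining bytes says how long `data` is, which is the problem described above.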

Now my question is the following:
Is it ok to consider that over the long term copyData is simply a transport
container that exists only to allow the multiplexing of events in the
protocol, but that messages inside could be chunked over several copyData
events?

If we put test_decoding aside, do you consider that output plugins' XLogData
should be self-aware of their length? I suppose (but did not fully verify
yet) that this is the case for pgoutput. I suppose that wal2json could
also be parsed by balancing the brackets.

I am wondering because when a client sends copyData to the server, the
documentation says

The message boundaries are not required to have anything to do with row
boundaries, although that is often a reasonable choice.

I hope that my message will ring a bell on the list.
I tried the best I could to describe my very specific research.
Thank you for your help,
---
Jérôme

Tom Lane

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Jerome Wagner (#1)

Re: question regarding copyData containers

Jerome Wagner <jerome.wagner@laposte.net> writes:

now my question is the following :
is it ok to consider that over the long term copyData is simply a transport
container that exists only to allow the multiplexing of events in the
protocol but that messages inside could be chunked over several copyData
events ?

Yes, the expectation is that clients can send CopyData messages that are
split up however they choose; the message boundaries needn't correspond
to any semantic boundaries in the data stream.
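
As an illustration of the client-to-server direction, here is a small Python sketch that frames a COPY data stream into CopyData messages of arbitrary size. The name frame_copydata and the max_payload parameter are hypothetical; the split points deliberately ignore row boundaries.

```python
import struct
from typing import Iterator


def frame_copydata(data: bytes, max_payload: int = 65536) -> Iterator[bytes]:
    """Wrap a COPY data stream in CopyData ('d') messages of at most max_payload bytes.

    The cut points are purely size-based and may fall in the middle of a row.
    """
    for offset in range(0, len(data), max_payload):
        payload = data[offset:offset + max_payload]
        # Int32 length counts itself (4 bytes) plus the payload, not the type byte.
        yield b"d" + struct.pack("!I", len(payload) + 4) + payload
```

With two tab-separated rows and max_payload=5, the first message ends in the middle of the first row, and the stream is still valid input for COPY FROM STDIN.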

The rule in the other direction, that a message corresponds to one table
row, is something that might not last forever either. As we get more
people working with large data values, there's going to be pressure to
set some smaller limit on message size.

regards, tom lane

Andres Freund

andres@anarazel.de

over 5 years ago

In reply to: Jerome Wagner (#1)

Re: question regarding copyData containers

Hi,

On 2020-06-03 19:28:12 +0200, Jerome Wagner wrote:

I have been working on a node.js streaming client for different COPY
scenarios.
usually, during CopyOut, clients tend to buffer network chunks until they
have gathered a full copyData message and pass that to the user.

In some cases, this can lead to very large copyData messages. when there
are very long text fields or bytea fields it will require a lot of memory
to be handled (up to 1GB I think in the worst case scenario)

In COPY TO, I managed to relax that requirement, considering that copyData
is simply a transparent container. For each network chunk, the relevant
message content is forwarded which makes for 64KB chunks at most.

Uhm.

We lose the semantics of the "row" that copyData has according to the
documentation
https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY

The backend sends a CopyOutResponse message to the frontend, followed by
zero or more CopyData messages (**always one per row**), followed by
CopyDone

but it is not a problem because the raw bytes are still parsable (rows +
fields) in text mode (tsv) and in binary mode.

This seems like an extremely bad idea to me. Are we really going to ask
clients to incur the overhead (both in complexity and runtime) to parse
incoming data just to detect row boundaries? Given the number of
options there are for COPY, that's a seriously complicated task.

I think that's a complete no-go.

Leaving error handling aside (see para below), what does this actually
get you? Either your client cares about getting a row in one sequential
chunk, or it doesn't. If it doesn't care, then there's no need to
allocate a buffer that can contain the whole 'd' message. You can just
hand the clients the chunks incrementally. If it does, then you need to
reassemble either way (or worse, you force the client to reimplement
that).
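
For the case where the client does care about whole rows, the reassembly mentioned here could look like the Python sketch below. It is only an illustration: whole_copydata_payloads is a hypothetical name, and the stream is assumed to contain only CopyData ('d') messages.

```python
import struct
from typing import Iterable, Iterator


def whole_copydata_payloads(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Buffer network chunks and yield one complete CopyData payload at a time."""
    buf = bytearray()
    for chunk in chunks:
        buf += chunk
        while len(buf) >= 5:
            if buf[0:1] != b"d":
                raise ValueError("unexpected message type")
            length = struct.unpack_from("!I", buf, 1)[0]
            if len(buf) < 1 + length:
                break  # message incomplete; wait for more chunks
            yield bytes(buf[5:1 + length])
            del buf[:1 + length]
```

The buffer grows to the size of the largest message, which is the memory cost the original poster was trying to avoid.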

I assume what you're trying to get at is being able to send CopyData
messages before an entire row is assembled? And you want to send
separate CopyData messages to allow for error handling? I think that's
a quite worthwhile goal, but I don't think it can sensibly be solved by
just removing protocol level framing of row boundaries. And that will
mean evolving the protocol in a non-compatible way.

Now I started working on copyBoth and logical decoding scenarios. In this
case, the server send series of copyData. 1 copyData containing 1 message :

at the network chunk level, in the case of large fields, we can observe

in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1
in: Byten2
in: CopyData Int32 XLogData Int64 Int64 Int64 Byten3
in: CopyData Int32 XLogData Int64 Int64 Int64 Byten4

out: XLogData Int64 Int64 Int64 Byten1
out: Byten2
out: XLogData Int64 Int64 Int64 Byten3
out: XLogData Int64 Int64 Int64 Byten4

but at the XLogData level, the protocol is not self-describing its length,
so there is no real way of knowing where the first XLogData ends apart from
- knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2)
- knowing the internals of the output plugin and benefit from a plugin
that self-describe its span

when a network chunks contains several copyDatas
in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 CopyData Int32
XLogData Int64 Int64 Int64 Byten2
we have
out: XLogData Int64 Int64 Int64 Byten1 XLogData Int64 Int64 Int64 Byten2

Right now all 'w' messages should be contained in one CopyData/'d' that
doesn't contain anything but the XLogData/'w'.

Do you just mean that if we'd change the server side code to split 'w'
messages across multiple 'd' messages, then we couldn't make much sense
of the data anymore? If so, then I don't really see a problem. Unless
you do a much larger change, what'd be the point in allowing to split
'w' across multiple 'd' chunks? The input data exists in a linear
buffer already, so you're not going to reduce peak memory usage by
sending smaller CopyData chunks.

Sure, we could evolve the logical decoding interface to be able to send
output in a much more incremental way than the typical per-row basis.
But I think that'd quite substantially increase complexity. And the
message framing seems to be the easier part of such a change.

Greetings,

Andres Freund

Jerome Wagner

jerome.wagner@laposte.net

over 5 years ago

In reply to: Andres Freund (#3)

Re: question regarding copyData containers

Hello,

thank you for your feedback.

I agree that modifying the COPY subprotocols is hard to do because it would
have an impact on the client ecosystem.

My understanding (which seems to be confirmed by what Tom Lane said) is
that the server discards the framing and
manages to make sense of the underlying data.

the expectation is that clients can send CopyData messages that are
split up however they choose; the message boundaries needn't correspond
to any semantic boundaries in the data stream.

So I thought that a client could decide to have the same behavior and could
start parsing the payload of a copyData message without assembling it first.
It works perfectly with COPY TO but I hit a roadblock on copyBoth during
logical replication with test_decoding because the subprotocol doesn't have
any framing.

Right now all 'w' messages should be contained in one CopyData/'d' that
doesn't contain anything but the XLogData/'w'.

The current format of the XLogData/'w' message is
w lsn lsn time byten

and even if it is maybe too late now I was wondering why it was not decided
to be
w lsn lsn time n byten

because it seems to me that the missing n ties the XLogData to the copyData
framing.
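
To make the suggestion concrete, here is a Python sketch of the hypothetical `w lsn lsn time n byten` layout. PostgreSQL does not send the Int32 `n`; it and both function names are assumptions made purely for illustration.

```python
import struct
from typing import List, Tuple

# Hypothetical layout: 'w', Int64 lsn, Int64 lsn, Int64 time, Int32 n, then n bytes.
# The real XLogData message has no n field.
FRAMED_HEADER = struct.Struct("!cQQqI")


def encode_framed_xlogdata(start: int, end: int, ts: int, data: bytes) -> bytes:
    """Build one message in the hypothetical self-describing format."""
    return FRAMED_HEADER.pack(b"w", start, end, ts, len(data)) + data


def decode_framed_xlogdata(stream: bytes) -> List[Tuple[int, int, int, bytes]]:
    """Split concatenated messages using only the embedded n, with no CopyData header."""
    out = []
    pos = 0
    while pos < len(stream):
        tag, start, end, ts, n = FRAMED_HEADER.unpack_from(stream, pos)
        if tag != b"w":
            raise ValueError(f"unexpected tag {tag!r}")
        pos += FRAMED_HEADER.size
        out.append((start, end, ts, stream[pos:pos + n]))
        pos += n
    return out
```

With such a field, the messages could be recovered from the forwarded byte stream even after the copyData headers have been dropped.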

The input data exists in a linear
buffer already, so you're not going to reduce peak memory usage by
sending smaller CopyData chunks.

That is very surprising to me. Do you mean that on the server in COPY TO
mode, a full row is prepared in a linear buffer in memory before
being sent as a copyData/'d'?
I found the code around
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2153
and indeed the whole row seems to be buffered in memory.

Good thing or bad thing, users tend to use bigger fields (text, jsonb,
bytea) and that can be very memory hungry.
Do you know a case in postgres (other than large_objects I suppose) where
the server can flush data from a field without buffering it in memory ?

And then as you noted, there is the multiplexing of events. A very long
copyData makes communication between the client and the server impossible
during the transfer.

I briefly looked at
https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c
and I found

/*
 * Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
 *
 * We don't have a good idea of what a good value would be; there's some
 * overhead per message in both walsender and walreceiver, but on the other
 * hand sending large batches makes walsender less responsive to signals
 * because signals are checked only between messages. 128kB (with
 * default 8k blocks) seems like a reasonable guess for now.
 */
#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)

so I thought that the maximum copyData/'d' I would receive during logical
replication was MAX_SEND_SIZE but it seems that this is not used for
logical decoding.
The whole output of the output plugin seems to be prepared in memory, so for
an insert like

insert into mytable (col) values (repeat('-', pow(2, 27)::int));

a 128MB linear buffer will be created on the server and sent as 1 copyData
over many network chunks.
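
The sizes involved can be checked with a few lines of Python. The 8192-byte block size is the default mentioned in the walsender.c comment, and the 64KB network chunk size is the figure from earlier in the thread; the variable names are made up for this sketch.

```python
XLOG_BLCKSZ = 8192                # default WAL block size (8k)
MAX_SEND_SIZE = XLOG_BLCKSZ * 16  # as defined in walsender.c
FIELD_BYTES = 2 ** 27             # repeat('-', pow(2, 27)::int)
NETWORK_CHUNK = 64 * 1024         # chunk size observed by the client

print(MAX_SEND_SIZE)                  # 131072 bytes, i.e. 128kB
print(FIELD_BYTES // (1024 * 1024))   # 128 (MB for the field alone)
print(FIELD_BYTES // NETWORK_CHUNK)   # 2048 network chunks, at least
print(FIELD_BYTES // MAX_SEND_SIZE)   # 1024 times larger than MAX_SEND_SIZE
```

The output plugin adds its own text around the value, so the actual copyData is somewhat larger than the field itself.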

So I understand that in the long term copyData framing should not carry any
semantics, so that messages can be kept small enough to allow multiplexing,
but that there are many steps to climb before that.

Would it make sense one day to try and do streaming at the sub-field
level? I guess that is a huge undertaking since most of the field unit
interfaces are probably based on a one-to-one buffer/field mapping.

Greetings,
Jérôme
