Minor documentation error regarding streaming replication protocol
While implementing streaming replication client functionality for Npgsql
I stumbled upon a minor documentation error at
https://www.postgresql.org/docs/current/protocol-replication.html
The "content" return value for the TIMELINE_HISTORYcommand is advertised
as bytea while it is in fact raw ASCII bytes.
How did I find out?
Since the value I get doesn't start with a "\x" and contains ascii text,
although I've bytea_outputset to hex, I first thought that the streaming
replication protocol simply doesn't honor bytea_output, but then I
realized that I also get unencoded tabs and newlines which wouldn't be
possible if the value woud be passed through byteaout.
This is certainly a minor problem since the timeline history file only
contains generated strings that are ASCII-only, so just using the
unencoded bytes is actually easier than decoding bytea.
OTOH it did cost me a few hours (writing a bytea decoder and figuring
out why it doesn't work by looking at varlena.c and proving the docs
wrong) so I want to point this out here since it is possibly an
unintended behavior or at least a documentation error.
Also I'm wary of taking dependency on an undocumented implementation
detail that could possibly change at any point.
Regards,
Brar
Brar Piening wrote:
This is certainly a minor problem
After thinking about it a little more I think that at least the fact
that even the row description message advertises the content as bytea
could be considered as a bug - no?
On Sun, Sep 13, 2020 at 10:25:01PM +0200, Brar Piening wrote:
While implementing streaming replication client functionality for Npgsql
I stumbled upon a minor documentation error at
https://www.postgresql.org/docs/current/protocol-replication.htmlThe "content" return value for the TIMELINE_HISTORYcommand is advertised
as bytea while it is in fact raw ASCII bytes.How did I find out?
Since the value I get doesn't start with a "\x" and contains ascii text,
although I've bytea_outputset to hex, I first thought that the streaming
replication protocol simply doesn't honor bytea_output, but then I
realized that I also get unencoded tabs and newlines which wouldn't be
possible if the value woud be passed through byteaout.This is certainly a minor problem since the timeline history file only
contains generated strings that are ASCII-only, so just using the
unencoded bytes is actually easier than decoding bytea.
OTOH it did cost me a few hours (writing a bytea decoder and figuring
out why it doesn't work by looking at varlena.c and proving the docs
wrong) so I want to point this out here since it is possibly an
unintended behavior or at least a documentation error.
Also I'm wary of taking dependency on an undocumented implementation
detail that could possibly change at any point.
I have looked at this. It seems SendTimeLineHistory() is sending raw
bytes from the history file, with no encoding conversion, and
ReceiveXlogStream() is receiving it, again assuming it is just plain
text. I am not sure we really have an SQL data type where we do this.
BYTEA doesn't do encoding conversion, but does backslash procesing, and
TEXT does encoding conversion.
I suppose we either have to document this as BYTEA with no backslash
processing, or TEXT with no encoding conversion --- I think I prefer the
later.
Attached is a patch to update the docs, and the data type passed by
SendTimeLineHistory(). Does this look safe to everyone?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Attachments:
timeline.difftext/x-diff; charset=us-asciiDownload+3-3
On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote:
I have looked at this. It seems SendTimeLineHistory() is sending raw
bytes from the history file, with no encoding conversion, and
ReceiveXlogStream() is receiving it, again assuming it is just plain
text. I am not sure we really have an SQL data type where we do this.
BYTEA doesn't do encoding conversion, but does backslash procesing, and
TEXT does encoding conversion.I suppose we either have to document this as BYTEA with no backslash
processing, or TEXT with no encoding conversion --- I think I prefer the
later.
As StartupXLOG() tells, The timeline history file can include as
reason the recovery target name which may not be made just of ASCII
characters as that's the value specified in pg_create_restore_point by
the user, so bytea is correct, no?
--
Michael
On Fri, Oct 9, 2020 at 08:52:50AM +0900, Michael Paquier wrote:
On Thu, Oct 08, 2020 at 04:23:06PM -0400, Bruce Momjian wrote:
I have looked at this. It seems SendTimeLineHistory() is sending raw
bytes from the history file, with no encoding conversion, and
ReceiveXlogStream() is receiving it, again assuming it is just plain
text. I am not sure we really have an SQL data type where we do this.
BYTEA doesn't do encoding conversion, but does backslash procesing, and
TEXT does encoding conversion.I suppose we either have to document this as BYTEA with no backslash
processing, or TEXT with no encoding conversion --- I think I prefer the
later.As StartupXLOG() tells, The timeline history file can include as
reason the recovery target name which may not be made just of ASCII
characters as that's the value specified in pg_create_restore_point by
the user, so bytea is correct, no?
Good point. The reporter was assuming the data would come to the client
in the bytea output format specified by the GUC, e.g. \x..., so that
doesn't happen either. As I said before, it is more raw bytes, but we
don't have an SQL data type for that.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, Oct 8, 2020 at 11:17:59PM -0400, Bruce Momjian wrote:
Good point. The reporter was assuming the data would come to the client
in the bytea output format specified by the GUC, e.g. \x..., so that
doesn't happen either. As I said before, it is more raw bytes, but we
don't have an SQL data type for that.
I did some more research on this. It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename. The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.
Patch attached. I would like to backpatch this to all supported
versions so we are consistent and people don't think different PG
versions use different return values for this. Is that safe? Looking
at the uses of this in our code, it seems so. We aren't doing BYTEA
escaping or TEXT encoding conversion in any of these cases.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Attachments:
timeline.difftext/x-diff; charset=us-asciiDownload+2-2
Bruce Momjian <bruce@momjian.us> writes:
Patch attached. I would like to backpatch this to all supported
versions so we are consistent and people don't think different PG
versions use different return values for this. Is that safe? Looking
at the uses of this in our code, it seems so. We aren't doing BYTEA
escaping or TEXT encoding conversion in any of these cases.
I do not think a back-patch is a good idea. Yeah, it probably
won't break anything, but by the same token it won't fix anything.
regards, tom lane
On Wed, Oct 14, 2020 at 05:10:30PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Patch attached. I would like to backpatch this to all supported
versions so we are consistent and people don't think different PG
versions use different return values for this. Is that safe? Looking
at the uses of this in our code, it seems so. We aren't doing BYTEA
escaping or TEXT encoding conversion in any of these cases.I do not think a back-patch is a good idea. Yeah, it probably
won't break anything, but by the same token it won't fix anything.
OK, I can instead put in a C comment that it is really TEXT like the
other fields, but update the docs in all branches. Does that work for
everyone?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Bruce Momjian <bruce(at)momjian(dot)us> wrote:
Good point. The reporter was assuming the data would come to the client
in the bytea output format specified by the GUC, e.g. \x..., so that
doesn't happen either. As I said before, it is more raw bytes, but we
don't have an SQL data type for that.
I did some more research on this. It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename. The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.
In the light of what Michael wrote above, I don't think that this is really enough.
If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably make the client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true.
In the worst case this could lead to nasty decoding bugs on the client side which could even result in security issues.
Since you probably can't tell in which encoding the aforementioned "recovery target name" was written to the timeline history file, I agree with Michael that BYTEA is probably the sanest way to send this file.
IMO the best way out of this is to either really encode the content as BYTEA by passing it through byteaout() and by that escaping characters <0x20 and >0x7e, or to document that the file is being sent "as raw bytes that can be read as 'bytea Escape Format' by parsers compatible with byteain()" (this works because byteain() doesn't check whether bytes <0x20 or >0x7e are actually escaped).
Again, reading the raw bytes, either via byteain() or just as raw bytes, isn't really a problem and I don't want to bring you into a situation where the cure is worse than the disease.
Brar Piening <Brar@gmx.de> writes:
Bruce Momjian <bruce(at)momjian(dot)us> wrote:
I did some more research on this. It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename. The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.
If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably make the client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true.
I think this is overthinking the problem, TBH. Yeah, perhaps somebody
put non-ASCII comments into that file, but they'd probably be expecting
the exact same encoding they used there to come out the other end.
We should certainly *not* apply byteaout, as that would break cases that
work fine today; and if we do not do that, it is quite misleading to
suggest that the data is given in bytea format.
I don't really see why our only documentation choices are "text" or
"bytea". Can't we just write "(raw file contents)" or the like?
regards, tom lane
On 2020/10/15 23:03, Tom Lane wrote:
Brar Piening <Brar@gmx.de> writes:
Bruce Momjian <bruce(at)momjian(dot)us> wrote:
I did some more research on this. It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename. The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably make the client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true.
I think this is overthinking the problem, TBH. Yeah, perhaps somebody
put non-ASCII comments into that file, but they'd probably be expecting
the exact same encoding they used there to come out the other end.We should certainly *not* apply byteaout, as that would break cases that
work fine today;
+1
and if we do not do that, it is quite misleading to
suggest that the data is given in bytea format.I don't really see why our only documentation choices are "text" or
"bytea". Can't we just write "(raw file contents)" or the like?
+1
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
Brar Piening <Brar@gmx.de> writes:
Bruce Momjian <bruce(at)momjian(dot)us> wrote:
I did some more research on this. It turns out timeline 'content' is
the only BYTEA listed in the protocol docs, even though it just passes C
strings to pq_sendbytes(), just like many other fields like the field
above it, the timeline history filename. The proper fix is to change
the code to return the timeline history file contents as TEXT instead of
BYTEA.If the timeline history file can contain strings which "may not be made just of ASCII characters" this would probably make the client side assume that the content is being sent as TEXT encoded in client_encoding which again isn't true.
I think this is overthinking the problem, TBH. Yeah, perhaps somebody
put non-ASCII comments into that file, but they'd probably be expecting
the exact same encoding they used there to come out the other end.We should certainly *not* apply byteaout, as that would break cases that
work fine today; and if we do not do that, it is quite misleading to
suggest that the data is given in bytea format.I don't really see why our only documentation choices are "text" or
"bytea". Can't we just write "(raw file contents)" or the like?
Agreed. The reason they are those labels is that the C code has to
label the output fields, so it can only use SQL types. There are many
protocol fields mislabeled in this way, not just timeline history.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Bruce Momjian <bruce@momjian.us> writes:
On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
I don't really see why our only documentation choices are "text" or
"bytea". Can't we just write "(raw file contents)" or the like?
Agreed. The reason they are those labels is that the C code has to
label the output fields, so it can only use SQL types. There are many
protocol fields mislabeled in this way, not just timeline history.
Ah, right. Well, let's label it as text and then in the description
of the message, note that the field contains the raw file contents,
whose encoding is unknown to the server.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, let's label it as text and then in the description
of the message, note that the field contains the raw file contents,
whose encoding is unknown to the server.
+1
This would definitely have solved my initial issue and looks like a sane way forward without over-engineering the problem or causing any backwards-compatibility issues.
Regards,
Brar Piening
On Thu, Oct 15, 2020 at 11:18:33AM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Thu, Oct 15, 2020 at 10:03:46AM -0400, Tom Lane wrote:
I don't really see why our only documentation choices are "text" or
"bytea". Can't we just write "(raw file contents)" or the like?Agreed. The reason they are those labels is that the C code has to
label the output fields, so it can only use SQL types. There are many
protocol fields mislabeled in this way, not just timeline history.Ah, right. Well, let's label it as text and then in the description
of the message, note that the field contains the raw file contents,
whose encoding is unknown to the server.
OK, but this not would need to be in the protocol docs at the top, not
for the timeline history file, since it affects all columns labeled as
TEXT.
We would want the timeline history file contents label changed from
BYTEA to TEXT in the docs changed for all supported versions, add a C
comment to all backbranches that BYTEA is the same as TEXT protocol
fields, and change the C code to return TEXT IN PG 14. Is that what
people want?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Bruce Momjian <bruce@momjian.us> writes:
We would want the timeline history file contents label changed from
BYTEA to TEXT in the docs changed for all supported versions, add a C
comment to all backbranches that BYTEA is the same as TEXT protocol
fields, and change the C code to return TEXT IN PG 14. Is that what
people want?
I still think there is no need to touch back branches. What you
propose here is more likely to confuse people than help them.
Having the documentation disagree with the code about how the
field is labeled is not good either.
Furthermore, it absolutely does not make sense to say (or imply)
that the unknown-encoding business applies to all text fields.
There are a very small number of fields where we should say that.
regards, tom lane
On Thu, Oct 15, 2020 at 12:01:21PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
We would want the timeline history file contents label changed from
BYTEA to TEXT in the docs changed for all supported versions, add a C
comment to all backbranches that BYTEA is the same as TEXT protocol
fields, and change the C code to return TEXT IN PG 14. Is that what
people want?I still think there is no need to touch back branches. What you
propose here is more likely to confuse people than help them.
Having the documentation disagree with the code about how the
field is labeled is not good either.
Understood.
Furthermore, it absolutely does not make sense to say (or imply)
that the unknown-encoding business applies to all text fields.
There are a very small number of fields where we should say that.
Yes, I am seeing now that even IDENTIFY_SYSTEM above it properly does
encoding. I guess TIMELINE_HISTORY works this way because it is pulling
from the file system, not from system tables. I ended up with just a
new doc sentence and C comment in back branches, and a relabeling of the
timeline history 'content' field as TEXT in the C code and docs,
attached.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, Oct 15, 2020 at 12:43:22PM -0400, Bruce Momjian wrote:
On Thu, Oct 15, 2020 at 12:01:21PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
We would want the timeline history file contents label changed from
BYTEA to TEXT in the docs changed for all supported versions, add a C
comment to all backbranches that BYTEA is the same as TEXT protocol
fields, and change the C code to return TEXT IN PG 14. Is that what
people want?I still think there is no need to touch back branches. What you
propose here is more likely to confuse people than help them.
Having the documentation disagree with the code about how the
field is labeled is not good either.Understood.
Furthermore, it absolutely does not make sense to say (or imply)
that the unknown-encoding business applies to all text fields.
There are a very small number of fields where we should say that.Yes, I am seeing now that even IDENTIFY_SYSTEM above it properly does
encoding. I guess TIMELINE_HISTORY works this way because it is pulling
from the file system, not from system tables. I ended up with just a
new doc sentence and C comment in back branches, and a relabeling of the
timeline history 'content' field as TEXT in the C code and docs,
attached.
I have applied the doc-only patch to back branches, and the data type
change to master; the master change should cause no behavioral change.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, 2020-10-08 at 23:17 -0400, Bruce Momjian wrote:
As I said before, it is more raw bytes, but
we
don't have an SQL data type for that.
Sorry to jump in to this thread late.
Can't we just set both "filename" and "content" to be BYTEA, but then
set the format code to 1 (indicating binary)?
For clients that do look at the descriptor, they are more likely to get
it right; and for clients that don't look at the descriptor, there will
be no change.
Regards,
Jeff Davis
On Tue, Dec 1, 2020 at 10:16:31AM -0800, Jeff Davis wrote:
On Thu, 2020-10-08 at 23:17 -0400, Bruce Momjian wrote:
As I said before, it is more raw bytes, but
we
don't have an SQL data type for that.Sorry to jump in to this thread late.
Can't we just set both "filename" and "content" to be BYTEA, but then
set the format code to 1 (indicating binary)?For clients that do look at the descriptor, they are more likely to get
it right; and for clients that don't look at the descriptor, there will
be no change.
Yes, we could, but I thought the format code was not something we set at
this level. Looking at byteasend() it is true it just sends the bytes.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee