Logical Replication and Character encoding

Started by Shinoda, Noriyoshialmost 9 years ago30 messages

noriyoshi.shinoda@hpe.com

almost 9 years ago

Hi hackers,

I tried a committed Logical Replication environment. I found that replication between databases of different encodings did not convert encodings in character type columns. Is this behavior correct?

I expected that the character 0xe6bca2 (UTF-8) would be converted to the same character 0xb4c1 (EUC_JP). The example below replicates from the encoding UTF-8 database to the encoding EUC_JP database. You can see that the character 0xe6bca2 (UTF-8) is stored intact in the SUBSCRIPTION side database.

postgres=> CREATE TABLE encode1(col1 NUMERIC PRIMARY KEY, col2 VARCHAR(10)) ;
CREATE TABLE
postgres=> CREATE PUBLICATION pub1 FOR TABLE encode1 ;
CREATE PUBLICATION
postgres=> INSERT INTO encode1 VALUES (1, '漢') ; -- UTF-8 Character 0xe6bca2
INSERT 0 1

postgres=> CREATE TABLE encode1(col1 NUMERIC PRIMARY KEY, col2 VARCHAR(10)) ;
CREATE TABLE
postgres=# CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres host=localhost port=5432' PUBLICATION pub1 ;
NOTICE: created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
postgres=# SELECT * FROM encode1 ;
ERROR: invalid byte sequence for encoding "EUC_JP": 0xa2
postgres=# SELECT heap_page_items(get_raw_page('encode1', 0)) ;
heap_page_items
-------------------------------------------------------------------
(1,8152,1,33,565,0,0,"(0,1)",2,2306,24,,,"\\x0b0080010009e6bca2") <- stored UTF-8 char 0xe6bca2
(1 row)

Snapshot:
https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz 2017-01-31 00:29:07
Operating System:
Red Hat Enterprise Linux 7 Update 2 (x86-64)

Regards.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Shinoda, Noriyoshi (#1)

1 attachment(s)

Re: Logical Replication and Character encoding

Hello,

At Tue, 31 Jan 2017 12:46:18 +0000, "Shinoda, Noriyoshi" <noriyoshi.shinoda@hpe.com> wrote in <AT5PR84MB0084FAE5976D89CDE9733093EE4A0@AT5PR84MB0084.NAMPRD84.PROD.OUTLOOK.COM>

I tried a committed Logical Replication environment. I found
that replication between databases of different encodings did
not convert encodings in character type columns. Is this
behavior correct?

The output plugin for subscription is pgoutput and it currently
doesn't consider encoding but would easiliy be added if desired
encoding is informed.

The easiest (but somewhat seems fragile) way I can guess is,

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

The attached small patch does this and works with the following
CREATE SUBSCRIPTION.

CREATE SUBSCRIPTION sub1 CONNECTION 'host=/tmp port=5432 dbname=postgres client_encoding=EUC_JP' PUBLICATION pub1;

Also we may have explicit negotiation on, for example,
CREATE_REPLICATION_SLOT.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput ENCODING EUC_JP'

Or output plugin may take options.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput OPTIONS(encoding EUC_JP)'

Any opinions?

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

logrep_consider_client_encoding.patchtext/x-patch; charset=us-asciiDownload

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1f30de6..6a235d7 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqformat.h"
+#include "mb/pg_wchar.h"
 #include "replication/logicalproto.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -442,6 +443,9 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
+		if (pg_get_client_encoding() != GetDatabaseEncoding())
+			outputstr = pg_server_to_client(outputstr, strlen(outputstr));
+
 		len = strlen(outputstr) + 1;	/* null terminated */
 		pq_sendint(out, len, 4);		/* length */
 		appendBinaryStringInfo(out, outputstr, len); /* data */

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#2)

1 attachment(s)

Re: Logical Replication and Character encoding

At Wed, 01 Feb 2017 12:05:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170201.120540.183393194.horiguchi.kyotaro@lab.ntt.co.jp>

Hello,

At Tue, 31 Jan 2017 12:46:18 +0000, "Shinoda, Noriyoshi" <noriyoshi.shinoda@hpe.com> wrote in <AT5PR84MB0084FAE5976D89CDE9733093EE4A0@AT5PR84MB0084.NAMPRD84.PROD.OUTLOOK.COM>

I tried a committed Logical Replication environment. I found
that replication between databases of different encodings did
not convert encodings in character type columns. Is this
behavior correct?

The output plugin for subscription is pgoutput and it currently
doesn't consider encoding but would easiliy be added if desired
encoding is informed.

The easiest (but somewhat seems fragile) way I can guess is,

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

The attached small patch does this and works with the following
CREATE SUBSCRIPTION.

Oops. It forgets to care conversion failure. It is amended in the
attached patch.

CREATE SUBSCRIPTION sub1 CONNECTION 'host=/tmp port=5432 dbname=postgres client_encoding=EUC_JP' PUBLICATION pub1;

Also we may have explicit negotiation on, for example,
CREATE_REPLICATION_SLOT.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput ENCODING EUC_JP'

Or output plugin may take options.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput OPTIONS(encoding EUC_JP)'

Any opinions?

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

logrep_consider_client_encoding_v2.patchtext/x-patch; charset=us-asciiDownload

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1f30de6..3c457a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqformat.h"
+#include "mb/pg_wchar.h"
 #include "replication/logicalproto.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -442,6 +443,20 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
+		if (pg_get_client_encoding() != GetDatabaseEncoding())
+		{
+			char *p;
+
+			p = pg_server_to_client(outputstr, strlen(outputstr));
+
+			/* Send the converted string on when succeeded */
+			if (p)
+			{
+				pfree(outputstr);
+				outputstr = p;
+			}
+		}
+
 		len = strlen(outputstr) + 1;	/* null terminated */
 		pq_sendint(out, len, 4);		/* length */
 		appendBinaryStringInfo(out, outputstr, len); /* data */

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#3)

Re: Logical Replication and Character encoding

At Wed, 01 Feb 2017 12:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170201.121304.267734380.horiguchi.kyotaro@lab.ntt.co.jp>

I tried a committed Logical Replication environment. I found
that replication between databases of different encodings did
not convert encodings in character type columns. Is this
behavior correct?

The output plugin for subscription is pgoutput and it currently
doesn't consider encoding but would easiliy be added if desired
encoding is informed.

The easiest (but somewhat seems fragile) way I can guess is,

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

The attached small patch does this and works with the following
CREATE SUBSCRIPTION.

Oops. It forgets to care conversion failure. It is amended in the
attached patch.

CREATE SUBSCRIPTION sub1 CONNECTION 'host=/tmp port=5432 dbname=postgres client_encoding=EUC_JP' PUBLICATION pub1;

Also we may have explicit negotiation on, for example,
CREATE_REPLICATION_SLOT.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput ENCODING EUC_JP'

Or output plugin may take options.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput OPTIONS(encoding EUC_JP)'

Any opinions?

This patch chokes replication when the publisher finds an
inconvertible character in a tuple to be sent. For the case,
dropping-then-recreating subscription is necessary to go forward.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Shinoda, Noriyoshi

noriyoshi.shinoda@hpe.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#4)

Re: Logical Replication and Character encoding

Thank you for creating patches.
I strongly hope that your patch will be merged into the new version.
Since all databases are not yet based on UTF - 8, I think conversion of character codes is still necessary.

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Wednesday, February 01, 2017 3:31 PM
To: Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Logical Replication and Character encoding

At Wed, 01 Feb 2017 12:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170201.121304.267734380.horiguchi.kyotaro@lab.ntt.co.jp>

I tried a committed Logical Replication environment. I found
that replication between databases of different encodings did not
convert encodings in character type columns. Is this behavior
correct?

The output plugin for subscription is pgoutput and it currently
doesn't consider encoding but would easiliy be added if desired
encoding is informed.

The easiest (but somewhat seems fragile) way I can guess is,

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

The attached small patch does this and works with the following
CREATE SUBSCRIPTION.

Oops. It forgets to care conversion failure. It is amended in the
attached patch.

CREATE SUBSCRIPTION sub1 CONNECTION 'host=/tmp port=5432
dbname=postgres client_encoding=EUC_JP' PUBLICATION pub1;

Also we may have explicit negotiation on, for example,
CREATE_REPLICATION_SLOT.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput ENCODING EUC_JP'

Or output plugin may take options.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput OPTIONS(encoding EUC_JP)'

Any opinions?

This patch chokes replication when the publisher finds an inconvertible character in a tuple to be sent. For the case, dropping-then-recreating subscription is necessary to go forward.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Shinoda, Noriyoshi (#5)

1 attachment(s)

Re: Logical Replication and Character encoding

Hello.

This version makes subscriber automatically set its database
encoding to clinet_encoding (if not explicitly set). And changed
the behavior when pg_server_to_client returns NULL to ERROR from
sending original string.

At Wed, 1 Feb 2017 08:39:41 +0000, "Shinoda, Noriyoshi" <noriyoshi.shinoda@hpe.com> wrote in <AT5PR84MB0084A18D3BF1D93B862E95E4EE4D0@AT5PR84MB0084.NAMPRD84.PROD.OUTLOOK.COM>

Thank you for creating patches.
I strongly hope that your patch will be merged into the new
version. Since all databases are not yet based on UTF - 8, I
think conversion of character codes is still necessary.

Thanks.

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Wednesday, February 01, 2017 3:31 PM
To: Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Logical Replication and Character encoding

At Wed, 01 Feb 2017 12:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170201.121304.267734380.horiguchi.kyotaro@lab.ntt.co.jp>

I tried a committed Logical Replication environment. I found
that replication between databases of different encodings did not
convert encodings in character type columns. Is this behavior
correct?

The output plugin for subscription is pgoutput and it currently
doesn't consider encoding but would easiliy be added if desired
encoding is informed.

The easiest (but somewhat seems fragile) way I can guess is,

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

The attached small patch does this and works with the following
CREATE SUBSCRIPTION.

Oops. It forgets to care conversion failure. It is amended in the
attached patch.

CREATE SUBSCRIPTION sub1 CONNECTION 'host=/tmp port=5432
dbname=postgres client_encoding=EUC_JP' PUBLICATION pub1;

Also we may have explicit negotiation on, for example,
CREATE_REPLICATION_SLOT.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput ENCODING EUC_JP'

Or output plugin may take options.

'CREATE_REPLICATION_SLOT sub1 LOGICAL pgoutput OPTIONS(encoding EUC_JP)'

Any opinions?

This patch chokes replication when the publisher finds an inconvertible character in a tuple to be sent. For the case, dropping-then-recreating subscription is necessary to go forward.

I think this behavior is right and inevitable in order to protect
subscribers from broken strings.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Enable-logical-replication-between-databases-with-di.patchtext/x-patch; charset=us-asciiDownload

From e3af51822d1318743e554e24163390b74b254751 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 2 Feb 2017 09:05:20 +0900
Subject: [PATCH] Enable logical-replication between databases with different
 encodings

Different from physical replication, logical replication may run
between databases with different encodings. This patch makes
subscriber set client_encoding to database encoding and publisher
follow it.
---
 .../libpqwalreceiver/libpqwalreceiver.c            | 35 +++++++++++++++++++++-
 src/backend/replication/logical/proto.c            | 17 +++++++++++
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 44a89c7..ef38af7 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -24,6 +24,7 @@
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "mb/pg_wchar.h"
 #include "replication/logicalproto.h"
 #include "replication/walreceiver.h"
 #include "storage/proc.h"
@@ -112,11 +113,43 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 				 char **err)
 {
 	WalReceiverConn *conn;
+	const char *myconninfo = conninfo;
 	const char *keys[5];
 	const char *vals[5];
 	int			i = 0;
 
 	/*
+	 * For logical replication, set database encoding as client_encoding if
+	 * not specified in conninfo
+	 */
+	if (logical)
+	{
+		PQconninfoOption   *opts = NULL;
+
+		opts = PQconninfoParse(conninfo, NULL);
+
+		if (opts != NULL)
+		{
+			while (opts->keyword != NULL &&
+				   strcmp(opts->keyword, "client_encoding") != 0)
+				opts++;
+
+			/* add client_encoding to conninfo if not set */
+			if (opts->keyword == NULL || opts->val == NULL)
+			{
+				StringInfoData s;
+
+				/* Assuming that the memory context here is properly set */
+				initStringInfo(&s);
+				appendStringInfoString(&s, conninfo);
+				appendStringInfo(&s, " client_encoding=\"%s\"",
+								 GetDatabaseEncodingName());
+				myconninfo = s.data;
+			}
+		}
+	}
+
+	/*
 	 * We use the expand_dbname parameter to process the connection string (or
 	 * URI), and pass some extra options. The deliberately undocumented
 	 * parameter "replication=true" makes it a replication connection. The
@@ -124,7 +157,7 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	 * "replication" for .pgpass lookup.
 	 */
 	keys[i] = "dbname";
-	vals[i] = conninfo;
+	vals[i] = myconninfo;
 	keys[++i] = "replication";
 	vals[i] = logical ? "database" : "true";
 	if (!logical)
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1f30de6..97f928e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqformat.h"
+#include "mb/pg_wchar.h"
 #include "replication/logicalproto.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -442,6 +443,22 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
+		if (pg_get_client_encoding() != GetDatabaseEncoding())
+		{
+			char *p;
+
+			p = pg_server_to_client(outputstr, strlen(outputstr));
+
+			/* Error out if failed */
+			if (!p)
+				ereport(ERROR,
+						(errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
+						 errmsg("character conversion failed")));
+
+			pfree(outputstr);
+			outputstr = p;
+		}
+
 		len = strlen(outputstr) + 1;	/* null terminated */
 		pq_sendint(out, len, 4);		/* length */
 		appendBinaryStringInfo(out, outputstr, len); /* data */
-- 
2.9.2

Euler Taveira

euler@timbira.com.br

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#2)

Re: Logical Replication and Character encoding

On 01-02-2017 00:05, Kyotaro HORIGUCHI wrote:

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

I don't think storage without conversion is an acceptable approach. We
should provide options to users such as ignore tuple or NULL for
column(s) with conversion problem. I wouldn't consider storage data
without conversion because it silently show incorrect data and we
historically aren't flexible with conversion routines.

--
Euler Taveira Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Craig Ringer

craig@2ndquadrant.com

almost 9 years ago

In reply to: Euler Taveira (#7)

Re: Logical Replication and Character encoding

On 2 February 2017 at 11:45, Euler Taveira <euler@timbira.com.br> wrote:

I don't think storage without conversion is an acceptable approach. We
should provide options to users such as ignore tuple or NULL for
column(s) with conversion problem. I wouldn't consider storage data
without conversion because it silently show incorrect data and we
historically aren't flexible with conversion routines.

pglogical and BDR both require identical encoding; they test for this
during startup and refuse to replicate if the encoding differs.

For the first pass at core I suggest a variant on that policy: require
source and destination encoding to be the same. This should probably
be the first change, since it definitively prevents the issue from
arising.

If time permits we could also allow destination encoding to be UTF-8
(including codepage 65001) with any source encoding. This requires
encoding conversion to be performed, of course.

The downside is that this will impact users who use a common subset of
two encodings. This is most common for Windows-1252 <-> ISO-8859-15
(or -1 if you're old-school) but also arises anywhere the common 7 bit
subset is used. Until we can define an encoding exception policy
though, I think we should defer supporting those and make them a
"later" problem.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Craig Ringer (#8)

Re: Logical Replication and Character encoding

Hello,

At Fri, 3 Feb 2017 09:16:47 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YGqn2PjJBCY+RjWWTJ4BZ=FhG0REXq1E1nd8_kjaDGoEQ@mail.gmail.com>

On 2 February 2017 at 11:45, Euler Taveira <euler@timbira.com.br> wrote:

I don't think storage without conversion is an acceptable approach. We
should provide options to users such as ignore tuple or NULL for
column(s) with conversion problem. I wouldn't consider storage data
without conversion because it silently show incorrect data and we
historically aren't flexible with conversion routines.

It is possible technically. But changing the behavior of a
subscript and/or publication requires change of SQL syntax. It
seems a bit too late for proposing such a new feature..

IMHO unintentional silent data loss must not be happen so the
default behavior on conversion failure cannot be other than stop
of replication.

pglogical and BDR both require identical encoding; they test for this
during startup and refuse to replicate if the encoding differs.

For the first pass at core I suggest a variant on that policy: require
source and destination encoding to be the same. This should probably
be the first change, since it definitively prevents the issue from
arising.

If the check is performed by BDR, pglogical itself seems to be
allowed to convert strings when the both end have different
encodings in a non-BDR environment. Is this right?

If time permits we could also allow destination encoding to be UTF-8
(including codepage 65001) with any source encoding. This requires
encoding conversion to be performed, of course.

Does this mean that BDR might work on heterogeneous encoding
environemnt? But anyway some encodings (like SJIS) have
caharacters with the same destination in its mapping so BDR
doesn't seem to work with such conversions. So each encoding
might should have a property to inform its usability under BDR
environment, but...

On the other hand, no prolem is seen in encoding conversions in
non-BDR environments. (except the behavior on failure)

The downside is that this will impact users who use a common subset of
two encodings. This is most common for Windows-1252 <-> ISO-8859-15
(or -1 if you're old-school) but also arises anywhere the common 7 bit
subset is used. Until we can define an encoding exception policy
though, I think we should defer supporting those and make them a
"later" problem.

If the conversion is rejected for now, we should check the
encoding identity instead.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Craig Ringer

craig@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#9)

Re: Logical Replication and Character encoding

On 3 Feb. 2017 15:47, "Kyotaro HORIGUCHI" <horiguchi.kyotaro@lab.ntt.co.jp>
wrote:

Hello,

At Fri, 3 Feb 2017 09:16:47 +0800, Craig Ringer <craig@2ndquadrant.com>
wrote in <CAMsr+YGqn2PjJBCY+RjWWTJ4BZ=FhG0REXq1E1nd8_kjaDGoEQ@mail.gmail.com

On 2 February 2017 at 11:45, Euler Taveira <euler@timbira.com.br> wrote:

I don't think storage without conversion is an acceptable approach. We
should provide options to users such as ignore tuple or NULL for
column(s) with conversion problem. I wouldn't consider storage data
without conversion because it silently show incorrect data and we
historically aren't flexible with conversion routines.

It is possible technically. But changing the behavior of a
subscript and/or publication requires change of SQL syntax. It
seems a bit too late for proposing such a new feature..

IMHO unintentional silent data loss must not be happen so the
default behavior on conversion failure cannot be other than stop
of replication.

Agree. Which is why we should default to disallowing mismatched upstream
and downstream encodings. At least to start with.

pglogical and BDR both require identical encoding; they test for this
during startup and refuse to replicate if the encoding differs.

For the first pass at core I suggest a variant on that policy: require
source and destination encoding to be the same. This should probably
be the first change, since it definitively prevents the issue from
arising.

If the check is performed by BDR, pglogical itself seems to be
allowed to convert strings when the both end have different
encodings in a non-BDR environment. Is this right?

Hm. Maybe it's changed since I last looked. We started off disallowing
mismatched encodings anyway.

Note that I'm referring to pglogical, the tool, not "in core logical
replication for postgres"

If time permits we could also allow destination encoding to be UTF-8
(including codepage 65001) with any source encoding. This requires
encoding conversion to be performed, of course.

Does this mean that BDR might work on heterogeneous encoding
environemnt?

No. Here the "we" was meant to be PG core for V10 or later.

But anyway some encodings (like SJIS) have
caharacters with the same destination in its mapping so BDR
doesn't seem to work with such conversions. So each encoding
might should have a property to inform its usability under BDR
environment, but.

PG doesn't allow SJIS as a db encoding. So it doesn't matter here.

The downside is that this will impact users who use a common subset of
two encodings. This is most common for Windows-1252 <-> ISO-8859-15
(or -1 if you're old-school) but also arises anywhere the common 7 bit
subset is used. Until we can define an encoding exception policy
though, I think we should defer supporting those and make them a
"later" problem.

If the conversion is rejected for now, we should check the
encoding identity instead.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Craig Ringer (#10)

Re: Logical Replication and Character encoding

Hello,

At Fri, 3 Feb 2017 13:47:54 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YFqNLAvdjmgWOMWM9X=FfzcFOL4pLxBeaqtjySbe_UeTg@mail.gmail.com>

On 3 Feb. 2017 15:47, "Kyotaro HORIGUCHI" <horiguchi.kyotaro@lab.ntt.co.jp>
wrote:

Hello,

At Fri, 3 Feb 2017 09:16:47 +0800, Craig Ringer <craig@2ndquadrant.com>
wrote in <CAMsr+YGqn2PjJBCY+RjWWTJ4BZ=FhG0REXq1E1nd8_kjaDGoEQ@mail.gmail.com

On 2 February 2017 at 11:45, Euler Taveira <euler@timbira.com.br> wrote:

I don't think storage without conversion is an acceptable approach. We
should provide options to users such as ignore tuple or NULL for
column(s) with conversion problem. I wouldn't consider storage data
without conversion because it silently show incorrect data and we
historically aren't flexible with conversion routines.

It is possible technically. But changing the behavior of a
subscript and/or publication requires change of SQL syntax. It
seems a bit too late for proposing such a new feature..

IMHO unintentional silent data loss must not be happen so the
default behavior on conversion failure cannot be other than stop
of replication.

Agree. Which is why we should default to disallowing mismatched upstream
and downstream encodings. At least to start with.

pglogical and BDR both require identical encoding; they test for this
during startup and refuse to replicate if the encoding differs.

For the first pass at core I suggest a variant on that policy: require
source and destination encoding to be the same. This should probably
be the first change, since it definitively prevents the issue from
arising.

If the check is performed by BDR, pglogical itself seems to be
allowed to convert strings when the both end have different
encodings in a non-BDR environment. Is this right?

Hm. Maybe it's changed since I last looked. We started off disallowing
mismatched encodings anyway.

Note that I'm referring to pglogical, the tool, not "in core logical
replication for postgres"

Ouch! I'm so sorry for my bad mistake. What I thought that I
mentioned is not pglogical, but pgoutput, the default output
plugin. But what the patch modifies is logical/proto.c. My
correct question was the following.

| Does pglogical requires that the core to reject connections
| with a server with unidentical encoging? Or check condings by
| itself and disconnect by itself?

If time permits we could also allow destination encoding to be UTF-8
(including codepage 65001) with any source encoding. This requires
encoding conversion to be performed, of course.

Does this mean that BDR might work on heterogeneous encoding
environemnt?

No. Here the "we" was meant to be PG core for V10 or later.

But anyway some encodings (like SJIS) have
caharacters with the same destination in its mapping so BDR
doesn't seem to work with such conversions. So each encoding
might should have a property to inform its usability under BDR
environment, but.

PG doesn't allow SJIS as a db encoding. So it doesn't matter here.

Oops. I suppose EUC_JP also has such characters but I'm not sure
now.

The downside is that this will impact users who use a common subset of
two encodings. This is most common for Windows-1252 <-> ISO-8859-15
(or -1 if you're old-school) but also arises anywhere the common 7 bit
subset is used. Until we can define an encoding exception policy
though, I think we should defer supporting those and make them a
"later" problem.

If the conversion is rejected for now, we should check the
encoding identity instead.

Ok, I'll go on this direction for the next patch.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Petr Jelinek

petr.jelinek@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#2)

Re: Logical Replication and Character encoding

On 01/02/17 04:05, Kyotaro HORIGUCHI wrote:

Hello,

At Tue, 31 Jan 2017 12:46:18 +0000, "Shinoda, Noriyoshi" <noriyoshi.shinoda@hpe.com> wrote in <AT5PR84MB0084FAE5976D89CDE9733093EE4A0@AT5PR84MB0084.NAMPRD84.PROD.OUTLOOK.COM>

I tried a committed Logical Replication environment. I found
that replication between databases of different encodings did
not convert encodings in character type columns. Is this
behavior correct?

The output plugin for subscription is pgoutput and it currently
doesn't consider encoding but would easiliy be added if desired
encoding is informed.

The easiest (but somewhat seems fragile) way I can guess is,

- Subscriber connects with client_encoding specification and the
output plugin pgoutput decide whether it accepts the encoding
or not. If the subscriber doesn't, pgoutput send data without
conversion.

Hmm I wonder if we should just make the subscriber send the
client_encoding always (based on server encoding of the subscriber).
That should solve the issue in combination with your patch no?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Petr Jelinek (#12)

1 attachment(s)

Re: Logical Replication and Character encoding

At Sat, 4 Feb 2017 21:27:32 +0100, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <bcc7f7e9-f558-b19e-b544-000ba7cf286c@2ndquadrant.com>

Hmm I wonder if we should just make the subscriber send the
client_encoding always (based on server encoding of the subscriber).
That should solve the issue in combination with your patch no?

Yeah, right. I considered that a subscriber might want to set its
own value for that but that is useless.

The attached patch does the following things to just prevent
making a logical replication connection between databases with
inconsistent encodings.

- added client_encoding with subscriber(or standby)'s encoding at
the last of options in libpqrcv_connect.

- CheckLogicalDecodingRequirements refuses connection for a
request with inconsistent encodings.

ERROR: logical replication requires consistent encodings on both side (publisher = UTF8, subscriber = EUC_JP)

We could check this earlier if involving physical replication but
I think this is a matter of logical replication.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Refuse-logical-replication-with-inconsistent-encodin.patchtext/x-patch; charset=us-asciiDownload

From e8233c47d174261a331718e9434d5fc825523305 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 14 Feb 2017 11:18:20 +0900
Subject: [PATCH] Refuse logical replication with inconsistent encodings

Logical replication requires encoding conversion of characters if the
publisher and subscriber are on different encodings. We could add
character conversion on-the-fly but we just hinhibit a connection for
the case as the first step.
---
 .../replication/libpqwalreceiver/libpqwalreceiver.c        | 14 ++++++++++++++
 src/backend/replication/logical/logical.c                  | 13 +++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 44a89c7..550a76d 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -24,6 +24,7 @@
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "mb/pg_wchar.h"
 #include "replication/logicalproto.h"
 #include "replication/walreceiver.h"
 #include "storage/proc.h"
@@ -134,9 +135,22 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	}
 	keys[++i] = "fallback_application_name";
 	vals[i] = appname;
+
+	/*
+	 * Override clinet_encoding with the database encoding for logical
+	 * replication.
+	 */
+	if (logical)
+	{
+		keys[++i] = "client_encoding";
+		vals[i] = GetDatabaseEncodingName();
+	}
+
 	keys[++i] = NULL;
 	vals[i] = NULL;
 
+	Assert(i < 5); /* size of keys/vals */
+
 	conn = palloc0(sizeof(WalReceiverConn));
 	conn->streamConn = PQconnectdbParams(keys, vals, /* expand_dbname = */ true);
 	if (PQstatus(conn->streamConn) != CONNECTION_OK)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..fc74ff1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -33,6 +33,8 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "mb/pg_wchar.h"
+
 #include "replication/decode.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
@@ -87,6 +89,17 @@ CheckLogicalDecodingRequirements(void)
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("logical decoding requires a database connection")));
 
+	/*
+	 * Currently logical replication refuses subscription that requires
+	 * chacater conversion.
+	 */
+	if (pg_get_client_encoding() != GetDatabaseEncoding())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication requires consistent encodings on both side (publisher = %s, subscriber = %s)",
+						GetDatabaseEncodingName(),
+						pg_get_client_encoding_name())));
+
 	/* ----
 	 * TODO: We got to change that someday soon...
 	 *
-- 
2.9.2

#14

Petr Jelinek

petr.jelinek@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#13)

Re: Logical Replication and Character encoding

On 14/02/17 03:23, Kyotaro HORIGUCHI wrote:

At Sat, 4 Feb 2017 21:27:32 +0100, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <bcc7f7e9-f558-b19e-b544-000ba7cf286c@2ndquadrant.com>

Hmm I wonder if we should just make the subscriber send the
client_encoding always (based on server encoding of the subscriber).
That should solve the issue in combination with your patch no?

Yeah, right. I considered that a subscriber might want to set its
own value for that but that is useless.

The attached patch does the following things to just prevent
making a logical replication connection between databases with
inconsistent encodings.

- added client_encoding with subscriber(or standby)'s encoding at
the last of options in libpqrcv_connect.

- CheckLogicalDecodingRequirements refuses connection for a
request with inconsistent encodings.

ERROR: logical replication requires consistent encodings on both side (publisher = UTF8, subscriber = EUC_JP)

I am not quite convinced that this should be handled by logical decoding
itself. It's quite possible to have output plugins that will handle this
correctly for their use-cases (by doing similar conversion you did in
the original patch) so they should not be prevented doing so.
So it's probably better to check this in the plugin.

I do like the idea of just using client_encoding in libpqrcv_connect though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Petr Jelinek (#14)

1 attachment(s)

Re: Logical Replication and Character encoding

On 2/15/17 17:55, Petr Jelinek wrote:

I am not quite convinced that this should be handled by logical decoding
itself. It's quite possible to have output plugins that will handle this
correctly for their use-cases (by doing similar conversion you did in
the original patch) so they should not be prevented doing so.
So it's probably better to check this in the plugin.

I do like the idea of just using client_encoding in libpqrcv_connect though.

Well, it is sort of a libpq connection, and a proper libpq client should
set the client encoding, and a proper libpq server should do encoding
conversion accordingly. If we just play along with this, it all works
correctly.

Other output plugins are free to ignore the encoding settings (just like
libpq can send binary data in some cases).

The attached patch puts it all together.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Fix-logical-replication-with-different-encodings.patchtext/x-patch; name=0001-Fix-logical-replication-with-different-encodings.patchDownload

From e05b91fa0dd54e4a973da6c462b023f64b7197fd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Fri, 17 Feb 2017 10:08:15 -0500
Subject: [PATCH] Fix logical replication with different encodings

---
 src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 6 ++++++
 src/backend/replication/logical/proto.c                     | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 44a89c73fd..71f57b89f6 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,6 +22,7 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "replication/logicalproto.h"
@@ -134,6 +135,11 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
 	}
 	keys[++i] = "fallback_application_name";
 	vals[i] = appname;
+	if (logical)
+	{
+		keys[++i] = "client_encoding";
+		vals[i] = GetDatabaseEncodingName();
+	}
 	keys[++i] = NULL;
 	vals[i] = NULL;
 
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 142cd993cd..bc6e9b5a98 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -444,7 +444,7 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
 		len = strlen(outputstr) + 1;	/* null terminated */
 		pq_sendint(out, len, 4);		/* length */
-		appendBinaryStringInfo(out, outputstr, len); /* data */
+		pq_sendstring(out, outputstr);	/* data */
 
 		pfree(outputstr);
 
-- 
2.11.1

#16

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Peter Eisentraut (#15)

Re: Logical Replication and Character encoding

On 2/17/17 10:14, Peter Eisentraut wrote:

Well, it is sort of a libpq connection, and a proper libpq client should
set the client encoding, and a proper libpq server should do encoding
conversion accordingly. If we just play along with this, it all works
correctly.

Other output plugins are free to ignore the encoding settings (just like
libpq can send binary data in some cases).

The attached patch puts it all together.

committed

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Shinoda, Noriyoshi

noriyoshi.shinoda@hpe.com

almost 9 years ago

In reply to: Peter Eisentraut (#16)

Re: Logical Replication and Character encoding

From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
Sent: Friday, February 24, 2017 1:32 AM
To: Petr Jelinek <petr.jelinek@2ndquadrant.com>; Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Cc: craig@2ndquadrant.com; Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Logical Replication and Character encoding

On 2/17/17 10:14, Peter Eisentraut wrote:

Well, it is sort of a libpq connection, and a proper libpq client
should set the client encoding, and a proper libpq server should do
encoding conversion accordingly. If we just play along with this, it
all works correctly.

Other output plugins are free to ignore the encoding settings (just
like libpq can send binary data in some cases).

The attached patch puts it all together.

committed

Hi,

Thank you very much for making a new patch. I tried a new committed version.
In the case of PUBLICATION(EUC_JP) and SUBSCRIPTION(UTF-8) environment, it worked as expected. Great!.
However, in the case of PUBLICATION(UTF-8) and SUBSCRIOTION(EUC_JP) environment, the following error was output and the process went down.

- PUBLICATION (UTF-8)
postgres=> INSERT INTO encode1 VALUES (1, 'ascii') ;
INSERT 0 1
postgres=> INSERT INTO encode1 VALUES (2, '漢') ; -- Expect UTF-8 Character 0xE6BCA2 will be convert EUC_JP 0xB4C1
INSERT 0 1

- SUBSCRIPTION (EUC_JP)
postgres=> SELECT * FROM encode1;
c1 | c2
----+-------
1 | ascii
(1 row)

$ tail data.euc/pg_log/postgresql.log
LOG: starting logical replication worker for subscription "sub1"
LOG: logical replication apply for subscription "sub1" has started
ERROR: insufficient data left in message
LOG: worker process: logical replication worker for subscription 16439 (PID 22583) exited with exit code 1

Snapshot:
https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz 2017-02-24 00:28:58
Operating System:
Red Hat Enterprise Linux 7 Update 2 (x86-64)

Regards.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Shinoda, Noriyoshi (#17)

1 attachment(s)

Re: Logical Replication and Character encoding

Sorry for the abesnse.

At Fri, 24 Feb 2017 02:43:14 +0000, "Shinoda, Noriyoshi" <noriyoshi.shinoda@hpe.com> wrote in <AT5PR84MB00847ABEA48EAE9A97D51157EE520@AT5PR84MB0084.NAMPRD84.PROD.OUTLOOK.COM>

From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
Sent: Friday, February 24, 2017 1:32 AM
To: Petr Jelinek <petr.jelinek@2ndquadrant.com>; Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Cc: craig@2ndquadrant.com; Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Logical Replication and Character encoding

On 2/17/17 10:14, Peter Eisentraut wrote:

Well, it is sort of a libpq connection, and a proper libpq client
should set the client encoding, and a proper libpq server should do
encoding conversion accordingly. If we just play along with this, it
all works correctly.

Other output plugins are free to ignore the encoding settings (just
like libpq can send binary data in some cases).

The attached patch puts it all together.

committed

However, in the case of PUBLICATION(UTF-8) and SUBSCRIOTION(EUC_JP) environment, the following error was output and the process went down.

...

LOG: starting logical replication worker for subscription "sub1"
LOG: logical replication apply for subscription "sub1" has started
ERROR: insufficient data left in message
LOG: worker process: logical replication worker for subscription 16439 (PID 22583) exited with exit code 1

Yeah, the patch sends converted string with the length of the
orignal length. Usually encoding conversion changes the length of
a string. I doubt that the reverse case was working correctly.

As the result pg_sendstring is not usable for this case since we
don't have the true length of the string to be sent. So my first
patch did the same thing using pg_server_to_client() explicitly.

That being said, I think that a more important thing is that the
consensus about the policy of logical replication between
databases with different encodings is refusing connection. The
reason for that is it surely breaks BDR usage for some
combinations of encodings.

Anyway the attached patch fixes the current bug about encoding in
logical replication.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

fix_logrep_conversion.patchtext/x-patch; charset=us-asciiDownload

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bc6e9b5..da81a2d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -16,6 +16,7 @@
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqformat.h"
+#include "mb/pg_wchar.h"
 #include "replication/logicalproto.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -442,9 +443,13 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
+
+		if (pg_get_client_encoding() != GetDatabaseEncoding())
+			outputstr = pg_server_to_client(outputstr, strlen(outputstr));
+
 		len = strlen(outputstr) + 1;	/* null terminated */
 		pq_sendint(out, len, 4);		/* length */
-		pq_sendstring(out, outputstr);	/* data */
+		appendBinaryStringInfo(out, outputstr, len); /* data */
 
 		pfree(outputstr);

#19

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#18)

1 attachment(s)

Re: Logical Replication and Character encoding

On 2/27/17 00:23, Kyotaro HORIGUCHI wrote:

Yeah, the patch sends converted string with the length of the
orignal length. Usually encoding conversion changes the length of
a string. I doubt that the reverse case was working correctly.

I think we shouldn't send the length value at all. This might have been
a leftover from an earlier version of the patch.

See attached patch that removes the length value.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Logical-replication-protocol-fix.patchtext/x-patch; name=0001-Logical-replication-protocol-fix.patchDownload

From 11b1ecf696294b891dff36d332db18e1b0c54814 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Fri, 3 Mar 2017 14:35:34 -0500
Subject: [PATCH] Logical replication protocol fix

Don't send the length value for a column value in logical replication
tuple messages.
---
 doc/src/sgml/protocol.sgml              | 10 ----------
 src/backend/replication/logical/proto.c | 15 ++-------------
 2 files changed, 2 insertions(+), 23 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 589b881ef2..bd43c321ae 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -5915,16 +5915,6 @@ <title>Logical Replication Message Formats</title>
 </varlistentry>
 <varlistentry>
 <term>
-        Int32
-</term>
-<listitem>
-<para>
-                Length of the column value.
-</para>
-</listitem>
-</varlistentry>
-<varlistentry>
-<term>
         String
 </term>
 <listitem>
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bc6e9b5a98..a42b799e76 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -417,7 +417,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		Form_pg_type typclass;
 		Form_pg_attribute att = desc->attrs[i];
 		char	   *outputstr;
-		int			len;
 
 		/* skip dropped columns */
 		if (att->attisdropped)
@@ -442,10 +441,7 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
-		len = strlen(outputstr) + 1;	/* null terminated */
-		pq_sendint(out, len, 4);		/* length */
 		pq_sendstring(out, outputstr);	/* data */
-
 		pfree(outputstr);
 
 		ReleaseSysCache(typtup);
@@ -472,7 +468,6 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 	for (i = 0; i < natts; i++)
 	{
 		char		kind;
-		int			len;
 
 		kind = pq_getmsgbyte(in);
 
@@ -486,14 +481,8 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 				tuple->values[i] = (char *) 0xdeadbeef; /* make bad usage more obvious */
 				break;
 			case 't': /* text formatted value */
-				{
-					tuple->changed[i] = true;
-
-					len = pq_getmsgint(in, 4); /* read length */
-
-					/* and data */
-					tuple->values[i] = (char *) pq_getmsgbytes(in, len);
-				}
+				tuple->changed[i] = true;
+				tuple->values[i] = (char *) pq_getmsgrawstring(in);
 				break;
 			default:
 				elog(ERROR, "unknown data representation type '%c'", kind);
-- 
2.12.0

#20

Petr Jelinek

petr.jelinek@2ndquadrant.com

almost 9 years ago

In reply to: Peter Eisentraut (#19)

Re: Logical Replication and Character encoding

On 03/03/17 20:37, Peter Eisentraut wrote:

On 2/27/17 00:23, Kyotaro HORIGUCHI wrote:

Yeah, the patch sends converted string with the length of the
orignal length. Usually encoding conversion changes the length of
a string. I doubt that the reverse case was working correctly.

I think we shouldn't send the length value at all. This might have been
a leftover from an earlier version of the patch.

See attached patch that removes the length value.

Well the length is necessary to be able to add binary format support in
future so it's definitely not an omission.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Petr Jelinek (#20)

Re: Logical Replication and Character encoding

On 3/3/17 14:51, Petr Jelinek wrote:

On 03/03/17 20:37, Peter Eisentraut wrote:

On 2/27/17 00:23, Kyotaro HORIGUCHI wrote:

Yeah, the patch sends converted string with the length of the
orignal length. Usually encoding conversion changes the length of
a string. I doubt that the reverse case was working correctly.

I think we shouldn't send the length value at all. This might have been
a leftover from an earlier version of the patch.

See attached patch that removes the length value.

Well the length is necessary to be able to add binary format support in
future so it's definitely not an omission.

Right. So it appears the right function to use here is
pq_sendcountedtext(). However, this sends the strings without null
termination, so we'd have to do extra processing on the receiving end.
Or we could add an option to pq_sendcountedtext() to make it do what we
want. I'd rather stick to standard pqformat.c functions for handling
the protocol.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Peter Eisentraut (#21)

1 attachment(s)

Re: Logical Replication and Character encoding

At Fri, 3 Mar 2017 21:31:24 -0500, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <88397afa-a8ec-8d8a-1c94-94a4795a3870@2ndquadrant.com>

On 3/3/17 14:51, Petr Jelinek wrote:

On 03/03/17 20:37, Peter Eisentraut wrote:

On 2/27/17 00:23, Kyotaro HORIGUCHI wrote:

Yeah, the patch sends converted string with the length of the
orignal length. Usually encoding conversion changes the length of
a string. I doubt that the reverse case was working correctly.

I think we shouldn't send the length value at all. This might have been
a leftover from an earlier version of the patch.

See attached patch that removes the length value.

Well the length is necessary to be able to add binary format support in
future so it's definitely not an omission.

Right. So it appears the right function to use here is
pq_sendcountedtext(). However, this sends the strings without null
termination, so we'd have to do extra processing on the receiving end.
Or we could add an option to pq_sendcountedtext() to make it do what we
want. I'd rather stick to standard pqformat.c functions for handling
the protocol.

It seems reasonable. I changed the prototype of
pg_sendcountedtext as the following,

| extern void pq_sendcountedtext(StringInfo buf, const char *str, int slen,
| bool countincludesself, bool binary);

I think 'binary' seems fine for the parameter here. The patch
size increased a bit.

By the way, I noticed that postmaster launches logical
replication launcher even if wal_level < logical. Is it right?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

fix_logrep_conversion_20170306.patchtext/x-patch; charset=us-asciiDownload

diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c
index 5fe1c72..62fa70d 100644
--- a/src/backend/access/common/printsimple.c
+++ b/src/backend/access/common/printsimple.c
@@ -96,7 +96,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 					pq_sendcountedtext(&buf,
 									   VARDATA_ANY(t),
 									   VARSIZE_ANY_EXHDR(t),
-									   false);
+									   false, false);
 				}
 				break;
 
@@ -106,7 +106,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 					char	str[12];	/* sign, 10 digits and '\0' */
 
 					pg_ltoa(num, str);
-					pq_sendcountedtext(&buf, str, strlen(str), false);
+					pq_sendcountedtext(&buf, str, strlen(str), false, false);
 				}
 				break;
 
@@ -116,7 +116,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 					char	str[23];	/* sign, 21 digits and '\0' */
 
 					pg_lltoa(num, str);
-					pq_sendcountedtext(&buf, str, strlen(str), false);
+					pq_sendcountedtext(&buf, str, strlen(str), false, false);
 				}
 				break;
 
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index a2ca2d7..1ebc50f 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -353,7 +353,8 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 			char	   *outputstr;
 
 			outputstr = OutputFunctionCall(&thisState->finfo, attr);
-			pq_sendcountedtext(&buf, outputstr, strlen(outputstr), false);
+			pq_sendcountedtext(&buf, outputstr, strlen(outputstr),
+							   false, false);
 		}
 		else
 		{
@@ -442,7 +443,7 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 		Assert(thisState->format == 0);
 
 		outputstr = OutputFunctionCall(&thisState->finfo, attr);
-		pq_sendcountedtext(&buf, outputstr, strlen(outputstr), true);
+		pq_sendcountedtext(&buf, outputstr, strlen(outputstr), true, false);
 	}
 
 	pq_endmessage(&buf);
diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index c8cf67c..f799130 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -129,23 +129,24 @@ pq_sendbytes(StringInfo buf, const char *data, int datalen)
  */
 void
 pq_sendcountedtext(StringInfo buf, const char *str, int slen,
-				   bool countincludesself)
+				   bool countincludesself, bool binary)
 {
-	int			extra = countincludesself ? 4 : 0;
+	int			extra_self		= countincludesself ? 4 : 0;
+	int			extra_binary	= binary ? 1 : 0;
 	char	   *p;
 
 	p = pg_server_to_client(str, slen);
 	if (p != str)				/* actual conversion has been done? */
 	{
 		slen = strlen(p);
-		pq_sendint(buf, slen + extra, 4);
-		appendBinaryStringInfo(buf, p, slen);
+		pq_sendint(buf, slen + extra_self + extra_binary, 4);
+		appendBinaryStringInfo(buf, p, slen + extra_binary);
 		pfree(p);
 	}
 	else
 	{
-		pq_sendint(buf, slen + extra, 4);
-		appendBinaryStringInfo(buf, str, slen);
+		pq_sendint(buf, slen + extra_self + extra_binary, 4);
+		appendBinaryStringInfo(buf, str, slen + extra_binary);
 	}
 }
 
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bc6e9b5..c5e4214 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -417,7 +417,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		Form_pg_type typclass;
 		Form_pg_attribute att = desc->attrs[i];
 		char	   *outputstr;
-		int			len;
 
 		/* skip dropped columns */
 		if (att->attisdropped)
@@ -442,9 +441,12 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
-		len = strlen(outputstr) + 1;	/* null terminated */
-		pq_sendint(out, len, 4);		/* length */
-		pq_sendstring(out, outputstr);	/* data */
+
+		/*
+		 * Here, the string is sent as binary data so the length field counts
+		 * terminating NULL.
+		 */
+		pq_sendcountedtext(out, outputstr, strlen(outputstr), false, true);
 
 		pfree(outputstr);
 
diff --git a/src/backend/tcop/fastpath.c b/src/backend/tcop/fastpath.c
index 2efed95..0ca9522 100644
--- a/src/backend/tcop/fastpath.c
+++ b/src/backend/tcop/fastpath.c
@@ -160,7 +160,8 @@ SendFunctionResult(Datum retval, bool isnull, Oid rettype, int16 format)
 
 			getTypeOutputInfo(rettype, &typoutput, &typisvarlena);
 			outputstr = OidOutputFunctionCall(typoutput, retval);
-			pq_sendcountedtext(&buf, outputstr, strlen(outputstr), false);
+			pq_sendcountedtext(&buf, outputstr, strlen(outputstr),
+							   false, false);
 			pfree(outputstr);
 		}
 		else if (format == 1)
diff --git a/src/include/libpq/pqformat.h b/src/include/libpq/pqformat.h
index 4df87ec..29cba1a 100644
--- a/src/include/libpq/pqformat.h
+++ b/src/include/libpq/pqformat.h
@@ -19,7 +19,7 @@ extern void pq_beginmessage(StringInfo buf, char msgtype);
 extern void pq_sendbyte(StringInfo buf, int byt);
 extern void pq_sendbytes(StringInfo buf, const char *data, int datalen);
 extern void pq_sendcountedtext(StringInfo buf, const char *str, int slen,
-				   bool countincludesself);
+					   bool countincludesself, bool binary);
 extern void pq_sendtext(StringInfo buf, const char *str, int slen);
 extern void pq_sendstring(StringInfo buf, const char *str);
 extern void pq_send_ascii_string(StringInfo buf, const char *str);

#23

Petr Jelinek

petr.jelinek@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#22)

Re: Logical Replication and Character encoding

On 06/03/17 11:06, Kyotaro HORIGUCHI wrote:

At Fri, 3 Mar 2017 21:31:24 -0500, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <88397afa-a8ec-8d8a-1c94-94a4795a3870@2ndquadrant.com>

On 3/3/17 14:51, Petr Jelinek wrote:

On 03/03/17 20:37, Peter Eisentraut wrote:

On 2/27/17 00:23, Kyotaro HORIGUCHI wrote:

Yeah, the patch sends converted string with the length of the
orignal length. Usually encoding conversion changes the length of
a string. I doubt that the reverse case was working correctly.

I think we shouldn't send the length value at all. This might have been
a leftover from an earlier version of the patch.

See attached patch that removes the length value.

Well the length is necessary to be able to add binary format support in
future so it's definitely not an omission.

Right. So it appears the right function to use here is
pq_sendcountedtext(). However, this sends the strings without null
termination, so we'd have to do extra processing on the receiving end.
Or we could add an option to pq_sendcountedtext() to make it do what we
want. I'd rather stick to standard pqformat.c functions for handling
the protocol.

It seems reasonable. I changed the prototype of
pg_sendcountedtext as the following,

| extern void pq_sendcountedtext(StringInfo buf, const char *str, int slen,
| bool countincludesself, bool binary);

I think 'binary' seems fine for the parameter here. The patch
size increased a bit.

Hmm I find it bit weird that binary true means NULL terminated while
false means not NULL terminated, I would think that the behaviour would
be exactly opposite given that text strings are the ones where NULL
termination matters?

By the way, I noticed that postmaster launches logical
replication launcher even if wal_level < logical. Is it right?

Yes, that came up couple of times in various threads. The logical
replication launcher is needed only to start subscriptions and
subscriptions don't need wal_level to be logical, only publication needs
that.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Petr Jelinek (#23)

1 attachment(s)

Re: Logical Replication and Character encoding

At Mon, 6 Mar 2017 11:13:48 +0100, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <a7c0cc59-9c21-238c-b6a7-d6eb519ad9ee@2ndquadrant.com>

Well the length is necessary to be able to add binary format support in
future so it's definitely not an omission.

Right. So it appears the right function to use here is
pq_sendcountedtext(). However, this sends the strings without null
termination, so we'd have to do extra processing on the receiving end.
Or we could add an option to pq_sendcountedtext() to make it do what we
want. I'd rather stick to standard pqformat.c functions for handling
the protocol.

It seems reasonable. I changed the prototype of
pg_sendcountedtext as the following,

| extern void pq_sendcountedtext(StringInfo buf, const char *str, int slen,
| bool countincludesself, bool binary);

I think 'binary' seems fine for the parameter here. The patch
size increased a bit.

Hmm I find it bit weird that binary true means NULL terminated while
false means not NULL terminated, I would think that the behaviour would
be exactly opposite given that text strings are the ones where NULL
termination matters?

You're right. I renamed it as with more straightforward
name. Addition to that, it contained a stupid bug that sends a
garbage byte when non-null-terminated string is not converted.
And the comment is edited to reflect the new behavior.

By the way, I noticed that postmaster launches logical
replication launcher even if wal_level < logical. Is it right?

Yes, that came up couple of times in various threads. The logical
replication launcher is needed only to start subscriptions and
subscriptions don't need wal_level to be logical, only publication needs
that.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

fix_logrep_conversion_20170321.patchtext/x-patch; charset=us-asciiDownload

diff --git a/src/backend/access/common/printsimple.c b/src/backend/access/common/printsimple.c
index 5fe1c72..62fa70d 100644
--- a/src/backend/access/common/printsimple.c
+++ b/src/backend/access/common/printsimple.c
@@ -96,7 +96,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 					pq_sendcountedtext(&buf,
 									   VARDATA_ANY(t),
 									   VARSIZE_ANY_EXHDR(t),
-									   false);
+									   false, false);
 				}
 				break;
 
@@ -106,7 +106,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 					char	str[12];	/* sign, 10 digits and '\0' */
 
 					pg_ltoa(num, str);
-					pq_sendcountedtext(&buf, str, strlen(str), false);
+					pq_sendcountedtext(&buf, str, strlen(str), false, false);
 				}
 				break;
 
@@ -116,7 +116,7 @@ printsimple(TupleTableSlot *slot, DestReceiver *self)
 					char	str[23];	/* sign, 21 digits and '\0' */
 
 					pg_lltoa(num, str);
-					pq_sendcountedtext(&buf, str, strlen(str), false);
+					pq_sendcountedtext(&buf, str, strlen(str), false, false);
 				}
 				break;
 
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index a2ca2d7..1ebc50f 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -353,7 +353,8 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 			char	   *outputstr;
 
 			outputstr = OutputFunctionCall(&thisState->finfo, attr);
-			pq_sendcountedtext(&buf, outputstr, strlen(outputstr), false);
+			pq_sendcountedtext(&buf, outputstr, strlen(outputstr),
+							   false, false);
 		}
 		else
 		{
@@ -442,7 +443,7 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 		Assert(thisState->format == 0);
 
 		outputstr = OutputFunctionCall(&thisState->finfo, attr);
-		pq_sendcountedtext(&buf, outputstr, strlen(outputstr), true);
+		pq_sendcountedtext(&buf, outputstr, strlen(outputstr), true, false);
 	}
 
 	pq_endmessage(&buf);
diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index c8cf67c..a83c73e 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -123,30 +123,39 @@ pq_sendbytes(StringInfo buf, const char *data, int datalen)
  * The data sent to the frontend by this routine is a 4-byte count field
  * followed by the string.  The count includes itself or not, as per the
  * countincludesself flag (pre-3.0 protocol requires it to include itself).
- * The passed text string need not be null-terminated, and the data sent
- * to the frontend isn't either.
+ * The passed text string need not be null-terminated but should not contain
+ * null as a part. sendterminator instructs to send null-terminator.
  * --------------------------------
  */
 void
 pq_sendcountedtext(StringInfo buf, const char *str, int slen,
-				   bool countincludesself)
+				   bool countincludesself, bool sendterminator)
 {
-	int			extra = countincludesself ? 4 : 0;
+	int			extra_self		= countincludesself ? 4 : 0;
+	int			extra_term		= sendterminator ? 1 : 0;
 	char	   *p;
 
 	p = pg_server_to_client(str, slen);
 	if (p != str)				/* actual conversion has been done? */
 	{
 		slen = strlen(p);
-		pq_sendint(buf, slen + extra, 4);
+		pq_sendint(buf, slen + extra_self + extra_term, 4);
 		appendBinaryStringInfo(buf, p, slen);
 		pfree(p);
 	}
 	else
 	{
-		pq_sendint(buf, slen + extra, 4);
+		pq_sendint(buf, slen + extra_self + extra_term, 4);
 		appendBinaryStringInfo(buf, str, slen);
+
 	}
+
+	/*
+	 * str may not be null-terminated. Send the terminator separately from the
+	 * given string.
+	 */
+	if (sendterminator)
+		appendBinaryStringInfo(buf, "",  1);
 }
 
 /* --------------------------------
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bc6e9b5..c5e4214 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -417,7 +417,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		Form_pg_type typclass;
 		Form_pg_attribute att = desc->attrs[i];
 		char	   *outputstr;
-		int			len;
 
 		/* skip dropped columns */
 		if (att->attisdropped)
@@ -442,9 +441,12 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
-		len = strlen(outputstr) + 1;	/* null terminated */
-		pq_sendint(out, len, 4);		/* length */
-		pq_sendstring(out, outputstr);	/* data */
+
+		/*
+		 * Here, the string is sent as binary data so the length field counts
+		 * terminating NULL.
+		 */
+		pq_sendcountedtext(out, outputstr, strlen(outputstr), false, true);
 
 		pfree(outputstr);
 
diff --git a/src/backend/tcop/fastpath.c b/src/backend/tcop/fastpath.c
index 2efed95..0ca9522 100644
--- a/src/backend/tcop/fastpath.c
+++ b/src/backend/tcop/fastpath.c
@@ -160,7 +160,8 @@ SendFunctionResult(Datum retval, bool isnull, Oid rettype, int16 format)
 
 			getTypeOutputInfo(rettype, &typoutput, &typisvarlena);
 			outputstr = OidOutputFunctionCall(typoutput, retval);
-			pq_sendcountedtext(&buf, outputstr, strlen(outputstr), false);
+			pq_sendcountedtext(&buf, outputstr, strlen(outputstr),
+							   false, false);
 			pfree(outputstr);
 		}
 		else if (format == 1)
diff --git a/src/include/libpq/pqformat.h b/src/include/libpq/pqformat.h
index 4df87ec..e724b04 100644
--- a/src/include/libpq/pqformat.h
+++ b/src/include/libpq/pqformat.h
@@ -19,7 +19,7 @@ extern void pq_beginmessage(StringInfo buf, char msgtype);
 extern void pq_sendbyte(StringInfo buf, int byt);
 extern void pq_sendbytes(StringInfo buf, const char *data, int datalen);
 extern void pq_sendcountedtext(StringInfo buf, const char *str, int slen,
-				   bool countincludesself);
+					   bool countincludesself, bool sendterminator);
 extern void pq_sendtext(StringInfo buf, const char *str, int slen);
 extern void pq_sendstring(StringInfo buf, const char *str);
 extern void pq_send_ascii_string(StringInfo buf, const char *str);

#25

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Petr Jelinek (#23)

Re: Logical Replication and Character encoding

Mmm. I shot the previous mail halfway.

At Mon, 6 Mar 2017 11:13:48 +0100, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote in <a7c0cc59-9c21-238c-b6a7-d6eb519ad9ee@2ndquadrant.com>

By the way, I noticed that postmaster launches logical
replication launcher even if wal_level < logical. Is it right?

Yes, that came up couple of times in various threads. The logical
replication launcher is needed only to start subscriptions and
subscriptions don't need wal_level to be logical, only publication needs
that.

I understand, thanks. Anyway the message seems to have hidden
from startup message:)

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#24)

2 attachment(s)

Re: Logical Replication and Character encoding

After further thinking, I prefer the alternative approach of using
pq_sendcountedtext() as is and sticking the trailing zero byte on on the
receiving side. This is a more localized change, and keeps the logical
replication protocol consistent with the main FE/BE protocol. (Also, we
don't need to send a useless byte around.)

Patch attached, and also a test case.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Fix-logical-replication-between-different-encodings.patchinvalid/octet-stream; name=0001-Fix-logical-replication-between-different-encodings.patchDownload

From 57f723a8d201f241410d008d0caee900c1bdeb2f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Wed, 5 Apr 2017 10:44:23 -0400
Subject: [PATCH 1/2] Fix logical replication between different encodings

When sending a tuple attribute, the previous coding erroneously sent the
length byte before encoding conversion, which would lead to protocol
failures on the receiving side if the length did not match the following
string.

To fix that, use pq_sendcountedtext() for sending tuple attributes,
which takes care of all of that internally.  To match the API of
pq_sendcountedtext(), send even text values without a trailing zero byte
and have the receiving end put it in place instead.  This matches how
the standard FE/BE protocol behaves.

Reported-by: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
---
 doc/src/sgml/protocol.sgml              |  7 +++++--
 src/backend/replication/logical/proto.c | 10 ++++------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5f971412ae..9d46d74113 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -6107,11 +6107,14 @@ <title>Logical Replication Message Formats</title>
 </varlistentry>
 <varlistentry>
 <term>
-        String
+        Byte<replaceable>n</replaceable>
 </term>
 <listitem>
 <para>
-                The text value.
+                The value of the column, in text format.  (A future release
+                might support additional formats.)
+                <replaceable>n</replaceable> is the above length.
+
 </para>
 </listitem>
 </varlistentry>
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index bc6e9b5a98..bb607b6147 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -417,7 +417,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		Form_pg_type typclass;
 		Form_pg_attribute att = desc->attrs[i];
 		char	   *outputstr;
-		int			len;
 
 		/* skip dropped columns */
 		if (att->attisdropped)
@@ -442,10 +441,7 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple)
 		pq_sendbyte(out, 't');	/* 'text' data follows */
 
 		outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
-		len = strlen(outputstr) + 1;	/* null terminated */
-		pq_sendint(out, len, 4);		/* length */
-		pq_sendstring(out, outputstr);	/* data */
-
+		pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
 		pfree(outputstr);
 
 		ReleaseSysCache(typtup);
@@ -492,7 +488,9 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 					len = pq_getmsgint(in, 4); /* read length */
 
 					/* and data */
-					tuple->values[i] = (char *) pq_getmsgbytes(in, len);
+					tuple->values[i] = palloc(len + 1);
+					pq_copymsgbytes(in, tuple->values[i], len);
+					tuple->values[i][len] = '\0';
 				}
 				break;
 			default:
-- 
2.12.2

0002-Add-test-for-logical-replication-with-different-enco.patchinvalid/octet-stream; name=0002-Add-test-for-logical-replication-with-different-enco.patchDownload

From bc215b9f9e710e6ac6e0abb374f2531d2373e281 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Wed, 5 Apr 2017 11:05:22 -0400
Subject: [PATCH 2/2] Add test for logical replication with different encodings

---
 src/test/subscription/t/005_encoding.pl | 46 +++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 src/test/subscription/t/005_encoding.pl

diff --git a/src/test/subscription/t/005_encoding.pl b/src/test/subscription/t/005_encoding.pl
new file mode 100644
index 0000000000..42a4eee5b4
--- /dev/null
+++ b/src/test/subscription/t/005_encoding.pl
@@ -0,0 +1,46 @@
+# Test replication between databases with different encodings
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+sub wait_for_caught_up
+{
+	my ($node, $appname) = @_;
+
+	$node->poll_query_until('postgres',
+							"SELECT pg_current_wal_location() <= replay_location FROM pg_stat_replication WHERE application_name = '$appname';")
+		or die "Timed out while waiting for subscriber to catch up";
+}
+
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical', extra => ['--locale=C', '--encoding=UTF8']);
+$node_publisher->start;
+
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical', extra => ['--locale=C', '--encoding=LATIN1']);
+$node_subscriber->start;
+
+my $ddl = "CREATE TABLE test1 (a int, b text);";
+$node_publisher->safe_psql('postgres', $ddl);
+$node_subscriber->safe_psql('postgres', $ddl);
+
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+my $appname = 'encoding_test';
+
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION mypub FOR ALL TABLES;");
+$node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION mysub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION mypub;");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$node_publisher->safe_psql('postgres', q{INSERT INTO test1 VALUES (1, E'Mot\xc3\xb6rhead')});  # hand-rolled UTF-8
+
+wait_for_caught_up($node_publisher, $appname);
+
+is($node_subscriber->safe_psql('postgres', q{SELECT a FROM test1 WHERE b = E'Mot\xf6rhead'}),  # LATIN1
+							   qq(1),
+							   'data replicated to subscriber');
+
+$node_subscriber->stop;
+$node_publisher->stop;
-- 
2.12.2

#27

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Peter Eisentraut (#26)

Re: Logical Replication and Character encoding

At Wed, 5 Apr 2017 11:33:51 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <5401fef6-c0c0-7e8a-d8b1-169e30cbd854@2ndquadrant.com>

After further thinking, I prefer the alternative approach of using
pq_sendcountedtext() as is and sticking the trailing zero byte on on the
receiving side. This is a more localized change, and keeps the logical
replication protocol consistent with the main FE/BE protocol. (Also, we
don't need to send a useless byte around.)

I'm not sure about the significance of the trailing zero in the
the logical replication protocol. Anyway the patch works.

Patch attached, and also a test case.

The problem was revealed when a string is shortened by encoding
conversion. The test covers the situation.

- The patches appliy on the master cleanly.
- The patch works for the UTF-8 => EUC_JP case.
- The test seems proper.

By the way, an untranslatable character on the publisher table
stops walsender with the following error.

ERROR: character with byte sequence 0xe6 0xbc 0xa2 in encoding "UTF8" has no equivalent in encoding "LATIN1"
STATEMENT: COPY public.t TO STDOUT
LOG: could not send data to client: Broken pipe
FATAL: connection to client lost

walreceiver stops on the opposite side with the following
complaint.

ERROR: could not receive data from WAL stream: ERROR: character with byte sequence 0xe6 0xbc 0xa2 in encoding "UTF8" has no equivalent in encoding "LATIN1"
CONTEXT: COPY t, line 1: ""
LOG: worker process: logical replication worker for subscription 16391 sync 16384 (PID 26915) exited with exit code 1

After this, walreceiver repeats reconnecting to master with no
wait. Maybe walreceiver had better refrain from reconnection
after certain kinds of faiure but it is not an urgent issue.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Kyotaro HORIGUCHI (#27)

Re: Logical Replication and Character encoding

On 4/5/17 21:32, Kyotaro HORIGUCHI wrote:

At Wed, 5 Apr 2017 11:33:51 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <5401fef6-c0c0-7e8a-d8b1-169e30cbd854@2ndquadrant.com>

After further thinking, I prefer the alternative approach of using
pq_sendcountedtext() as is and sticking the trailing zero byte on on the
receiving side. This is a more localized change, and keeps the logical
replication protocol consistent with the main FE/BE protocol. (Also, we
don't need to send a useless byte around.)

I'm not sure about the significance of the trailing zero in the
the logical replication protocol.

It doesn't matter. It could be "nice" to have it because then the
receiving side doesn't need to add it if it wants to assemble a C
string. But for the reasons above, it doesn't seem worth doing that.

walreceiver stops on the opposite side with the following
complaint.

ERROR: could not receive data from WAL stream: ERROR: character with byte sequence 0xe6 0xbc 0xa2 in encoding "UTF8" has no equivalent in encoding "LATIN1"
CONTEXT: COPY t, line 1: ""
LOG: worker process: logical replication worker for subscription 16391 sync 16384 (PID 26915) exited with exit code 1

After this, walreceiver repeats reconnecting to master with no
wait. Maybe walreceiver had better refrain from reconnection
after certain kinds of faiure but it is not an urgent issue.

This is controlled by wal_retrieve_retry_interval.

I see what you are saying about treating certain errors as potentially
permanent, but I think a lot more work would need to be done to be able
to manage that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Peter Eisentraut (#28)

Re: Logical Replication and Character encoding

On 4/6/17 11:47, Peter Eisentraut wrote:

On 4/5/17 21:32, Kyotaro HORIGUCHI wrote:

At Wed, 5 Apr 2017 11:33:51 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <5401fef6-c0c0-7e8a-d8b1-169e30cbd854@2ndquadrant.com>

After further thinking, I prefer the alternative approach of using
pq_sendcountedtext() as is and sticking the trailing zero byte on on the
receiving side. This is a more localized change, and keeps the logical
replication protocol consistent with the main FE/BE protocol. (Also, we
don't need to send a useless byte around.)

I'm not sure about the significance of the trailing zero in the
the logical replication protocol.

It doesn't matter. It could be "nice" to have it because then the
receiving side doesn't need to add it if it wants to assemble a C
string. But for the reasons above, it doesn't seem worth doing that.

committed

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Peter Eisentraut (#29)

Re: Logical Replication and Character encoding

At Thu, 6 Apr 2017 14:43:56 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <330a093a-d155-c866-cbf2-8f36fdf51e95@2ndquadrant.com>

On 4/6/17 11:47, Peter Eisentraut wrote:

On 4/5/17 21:32, Kyotaro HORIGUCHI wrote:

At Wed, 5 Apr 2017 11:33:51 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <5401fef6-c0c0-7e8a-d8b1-169e30cbd854@2ndquadrant.com>

After further thinking, I prefer the alternative approach of using
pq_sendcountedtext() as is and sticking the trailing zero byte on on the
receiving side. This is a more localized change, and keeps the logical
replication protocol consistent with the main FE/BE protocol. (Also, we
don't need to send a useless byte around.)

I'm not sure about the significance of the trailing zero in the
the logical replication protocol.

It doesn't matter. It could be "nice" to have it because then the
receiving side doesn't need to add it if it wants to assemble a C
string. But for the reasons above, it doesn't seem worth doing that.

Ok, it's fine for me.

committed

Thanks!

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers