BUG #17142: COPY ignores client_encoding for octal digit characters

Started by PG Bug reporting form, over 4 years ago. 4 messages. List: bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17142
Logged by: Andreas Grob
Email address: vilarion@illarion.org
PostgreSQL version: 13.3
Operating system: Debian GNU/Linux 11 (bullseye)
Description:

Test db and table:
```
CREATE DATABASE test WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'C' LC_CTYPE = 'C';
CREATE TABLE test (text character varying(50));
```

Test program in C:
```
#include <postgresql/libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *conninfo;
    char *errmsg;
    PGconn *conn;
    PGresult *res;
    int a, b;
    ExecStatusType status;
    int enc;

    /* Sent as COPY escapes: the server receives a backslash plus octal digits */
    char buffer[] = "\\304\\366\\337"; //Äöß
    /* Alternative: the raw LATIN1 bytes themselves */
    // char buffer[] = "\304\366\337"; //Äöß

    if (argc > 1)
        conninfo = argv[1];
    else
        conninfo = "user=postgres dbname=test port=5433 client_encoding=LATIN1";

    /* Make a connection to the database */
    conn = PQconnectdb(conninfo);

    /* Check to see that the backend connection was successfully made */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "Connection to database failed: %s",
                PQerrorMessage(conn));
        PQfinish(conn);
        exit(1);
    }

    res = PQexec(conn, "BEGIN");
    PQclear(res);

    /* Start the COPY, send one row, then finish the COPY */
    res = PQexec(conn, "COPY public.test(text) from STDIN;");
    PQclear(res);
    a = PQputCopyData(conn, buffer, strlen(buffer));
    b = PQputCopyEnd(conn, NULL);
    res = PQgetResult(conn);
    status = PQresultStatus(res);
    enc = PQclientEncoding(conn);
    errmsg = PQresultErrorMessage(res);

    printf("status=%d a=%d,b=%d, enc=%d\n", status, a, b, enc);

    if (status != PGRES_COMMAND_OK)
        printf("%s\n", errmsg);
    else
        printf("worked.\n");

    /* errmsg points into res, so free res only after printing */
    PQclear(res);

    res = PQexec(conn, "COMMIT");
    PQclear(res);

    /* close the connection to the database and cleanup */
    PQfinish(conn);

    return 0;
}
```

Output:
```
status=7 a=1,b=1, enc=8
ERROR: invalid byte sequence for encoding "UTF8": 0xc4 0xf6
CONTEXT: COPY test, line 1: "\304\366\337"
```

Expected output:
```
status=1 a=1,b=1, enc=8
worked.
```
(Äöß got inserted into the table.)

Characters in octal digits should be possible as per
https://www.postgresql.org/docs/13/sql-copy.html
When using characters directly (char buffer[] = "\304\366\337") the expected
output is displayed.

My apologies if I misunderstood something.

#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: PG Bug reporting form (#1)
Re: BUG #17142: COPY ignores client_encoding for octal digit characters

On 12/08/2021 00:24, PG Bug reporting form wrote:

> Characters in octal digits should be possible as per
> https://www.postgresql.org/docs/13/sql-copy.html
> When using characters directly (char buffer[] = "\304\366\337") the expected
> output is displayed.
>
> My apologies if I misunderstood something.

The code is pretty clear that the \123 and \x12 escapes are evaluated
after encoding conversion. That means, the escapes are interpreted using
the database encoding, regardless of client encoding. The documentation
doesn't say anything about that, though. We should fix the docs. How
does the attached patch look?
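
To make the ordering concrete, here is a minimal sketch of what happens to the reporter's input when the escapes are evaluated after encoding conversion. It is not PostgreSQL's actual code: `decode_octal_escapes` and `first_invalid_utf8` are hypothetical helper names, and the UTF-8 check is deliberately simplified (it only looks at lead and continuation bytes). The ASCII text `\304\366\337` passes through the LATIN1-to-UTF8 conversion unchanged, and only then do the escapes become the raw bytes 0xC4 0xF6 0xDF, which are checked against the database encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode \ooo escapes (one to three octal digits)
 * into raw bytes; every other character is copied unchanged.
 * 'out' must have room for at least strlen(in) bytes.
 * Returns the number of bytes written. */
size_t decode_octal_escapes(const char *in, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in) {
        if (in[0] == '\\' && in[1] >= '0' && in[1] <= '7') {
            unsigned val = 0;
            int digits = 0;

            in++; /* skip the backslash */
            while (digits < 3 && *in >= '0' && *in <= '7') {
                val = val * 8 + (unsigned)(*in - '0');
                in++;
                digits++;
            }
            out[n++] = (unsigned char)(val & 0xFF);
        } else {
            out[n++] = (unsigned char)*in++;
        }
    }
    return n;
}

/* Simplified structural UTF-8 check (assumption: overlong forms and
 * surrogates are not rejected). Returns the offset of the first byte
 * that starts an invalid sequence, or -1 if none is found. */
long first_invalid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        size_t extra, k;

        if (s[i] < 0x80)
            extra = 0;
        else if ((s[i] & 0xE0) == 0xC0)
            extra = 1;
        else if ((s[i] & 0xF0) == 0xE0)
            extra = 2;
        else if ((s[i] & 0xF8) == 0xF0)
            extra = 3;
        else
            return (long)i; /* stray continuation or invalid lead byte */

        if (i + extra >= len)
            return (long)i; /* sequence is cut off */
        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i; /* not a continuation byte */
        i += extra + 1;
    }
    return -1;
}
```

Decoding the reporter's buffer gives the three bytes 0xC4 0xF6 0xDF. The check fails at offset 0, because 0xC4 announces a two-byte sequence and 0xF6 is not a continuation byte, which is consistent with the `0xc4 0xf6` in the reported error message.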

You could get weird results if you use the escapes for some bytes in a
multi-byte character. Mostly you'd get invalid byte sequence errors, but
I think with the right combination of the client and database encodings,
it could get more strange. I think the wording in the attached docs
patch is enough to cover that, though.
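
One way for a client to avoid escaping only part of a multi-byte character is to build the escapes from the complete byte sequence in the database encoding. The sketch below assumes a UTF8 database and an input string that is already UTF-8; `escape_high_bytes` is a hypothetical helper, not part of libpq, and it handles only bytes >= 0x80 (real COPY text data also needs backslash, tab and newline escaped).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy src to dst, replacing every byte >= 0x80
 * with a three-digit \ooo escape. Returns the length written (excluding
 * the terminating NUL), or 0 if dst is too small. */
size_t escape_high_bytes(const char *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dstlen > 0);
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        size_t need = (c >= 0x80) ? 4 : 1;

        if (n + need + 1 > dstlen)
            return 0; /* no room for this byte plus the NUL */
        if (c >= 0x80) {
            /* writes backslash + 3 octal digits + NUL */
            snprintf(dst + n, 5, "\\%03o", (unsigned)c);
            n += 4;
        } else {
            dst[n++] = (char)c;
        }
    }
    dst[n] = '\0';
    return n;
}
```

For the UTF-8 bytes of Äöß (C3 84 C3 B6 C3 9F) this produces the text `\303\204\303\266\303\237`. Because every byte of each character is escaped, the decoded result is a complete sequence in the database encoding.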

- Heikki

Attachments:

0001-Doc-123-and-x12-escapes-in-COPY-are-in-database-enco.patch (text/x-patch; charset=UTF-8; +8 -3)

#3 Noname
vilarion@illarion.org
In reply to: Heikki Linnakangas (#2)
Re: BUG #17142: COPY ignores client_encoding for octal digit characters

On 12.08.2021 09:40, Heikki Linnakangas wrote:

> On 12/08/2021 00:24, PG Bug reporting form wrote:
>
>> Characters in octal digits should be possible as per
>> https://www.postgresql.org/docs/13/sql-copy.html
>> When using characters directly (char buffer[] = "\304\366\337") the
>> expected output is displayed.
>>
>> My apologies if I misunderstood something.
>
> The code is pretty clear that the \123 and \x12 escapes are evaluated
> after encoding conversion. That means, the escapes are interpreted
> using the database encoding, regardless of client encoding. The
> documentation doesn't say anything about that, though. We should fix
> the docs. How does the attached patch look?
>
> You could get weird results if you use the escapes for some bytes in a
> multi-byte character. Mostly you'd get invalid byte sequence errors,
> but I think with the right combination of the client and database
> encodings, it could get more strange. I think the wording in the
> attached docs patch is enough to cover that, though.
>
> - Heikki

Thanks for clarifying! This patch to the docs will allow me to file a
bug report against the library I am using (pqxx).

Andreas

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Noname (#3)
Re: BUG #17142: COPY ignores client_encoding for octal digit characters

On 12/08/2021 11:01, vilarion@illarion.org wrote:

> Thanks for clarifying! This patch to the docs will allow me to file a
> bug report against the library I am using (pqxx).

Pushed the docs patch now.

- Heikki