pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Started by Nonameover 5 years ago23 messagesbugs
Jump to latest
#1Noname
tutiluren@tutanota.com

After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all.

The issue is that pg_dump refuses to dump the database, with bizarre errors outputted, whenever the "--exclude-table-data" is included in the command *and* its value contains anything but lowercase a-z. See for yourselves:

The database in question is verified to have "UTF8" as encoding. It has one schema called "Test schäma" containing one table called "Test täble", containing one column called "Test cålumn" with one row where the column has the text "This should not be in the dump." (so that I can easily check if it's included in the dump).

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

Now I try this series of commands:

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
[retracted]
pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="test" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
[retracted]
pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="\"Test schäma\".\"Test täble\"" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(Test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(Test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma"."Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma.Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

--exclude-table-data="ä"

= FAILS!

--exclude-table-data="a"

= WORKS!

I looked everywhere for some kind of "client-encoding" option in the pg_dump manual, but there is no such thing. The only thing I can think of is that the client's encoding (that is, pg_dump) is for some reason not set to "UTF8" even though that's the encoding of the "test" database which I'm connecting to.

I believe that I have taken every reasonable step at this point to debug this on my own, and it truly appears as if the issue is that pg_dump has some broken internal logic which fails to account for non-simplistic identifiers, specifically the code which powers the "--exclude-table-data" option. However, it *could* still be "my fault"... although I don't see how!

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Noname (#1)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Thursday, July 23, 2020, <tutiluren@tutanota.com> wrote:

After taking a longer break from my problem, I have now made a fresh,
clean, scientifically conducted experiment in order to truly get to the
bottom of this annoying problem once and for all.

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt"
--exclude-table-data="Test schäma.Test täble" --host="localhost"
--port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR: invalid byte sequence for
encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

I looked everywhere for some kind of "client-encoding" option in the
pg_dump manual, but there is no such thing. The only thing I can think of
is that the client's encoding (that is, pg_dump) is for some reason not set
to "UTF8" even though that's the encoding of the "test" database which I'm
connecting to.

The main thing here is this isn’t properly considered a bug in pg_dump but
rather with most likely any client command run in the cmd.exe shell. Or
even more generally with the data flowing through the pipeline, which is
libpq. Thus what you want is a libpq option.

https://www.postgresql.org/docs/12/libpq-envars.html

And here is an old, similar, bug report, with an apparent work-around (or
solution depending on details that as an English Linux user I am not in a
position to judge).

/messages/by-id/1318589950754-4902202.post@n5.nabble.com

David J.

#3Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: David G. Johnston (#2)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Fri, Jul 24, 2020 at 8:14 AM David G. Johnston <
david.g.johnston@gmail.com> wrote:

On Thursday, July 23, 2020, <tutiluren@tutanota.com> wrote:

After taking a longer break from my problem, I have now made a fresh,
clean, scientifically conducted experiment in order to truly get to the
bottom of this annoying problem once and for all.

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt"
--exclude-table-data="Test schäma.Test täble" --host="localhost"
--port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR: invalid byte sequence for
encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

I looked everywhere for some kind of "client-encoding" option in the
pg_dump manual, but there is no such thing. The only thing I can think of
is that the client's encoding (that is, pg_dump) is for some reason not set
to "UTF8" even though that's the encoding of the "test" database which I'm
connecting to.

The main thing here is this isn’t properly considered a bug in pg_dump but
rather with most likely any client command run in the cmd.exe shell.

The problem is coming from CMD and its UTF8 support, but I guess I am not
making a great job in convincing you. I will try to explain it through an
example:

1. Start a CMD using "CMD /U", and set the code page to UTF8:

C:\postgres>chcp 65001
Active code page: 65001

2. Create a database named 'test', an schema named "Ö" and table named "Ä":

C:\postgres>psql -c "create database test" postgres
CREATE DATABASE

C:\postgres>psql -c "\l test" postgres
List of databases
Name | Owner | Encoding | Collate |
Ctype | Access privileges
------+--------------+----------+----------------------------+----------------------------+-------------------
test | postgres | UTF8 | English_United States.1252 |
English_United States.1252 |
(1 row)

C:\postgres>psql -c "create schema "Ö"" test
CREATE SCHEMA
C:\postgres>psql -c " create table "Ö"."Ä" (c1 char(1))" test
CREATE TABLE

3. Test the table and check your client encoding:

C:\postgres>psql -c "select * from "Ö"."Ä"" test
c1
----
(0 rows)

C:\postgres>psql -c "show client_encoding" test
client_encoding
-----------------
WIN1252
(1 row)

4. Change the client encoding to UTF8 and repeat the test:

C:\postgres>set PGCLIENTENCODING=UTF8

C:\postgres>psql -c "show client_encoding" test
client_encoding
-----------------
UTF8
(1 row)

C:\postgres>psql -c "select * from "Ö"."Ä"" test
2020-07-24 11:25:01.710 CEST [1452] ERROR: invalid byte sequence for
encoding "UTF8": 0xd6 0x2e
ERROR: invalid byte sequence for encoding "UTF8": 0xd6 0x2e

The reason why this happens is because the limited CMD's UTF8 support,
'chcp' only changes the code page of the displayed characters, shell
commands, piping, redirection and most commands are still ANSI only.

Does explicitly setting 'PGCLIENTENCODING=WIN1252' on your shell solves the
issue?

Regards,

Juan José Santamaría Flecha

#4Noname
tutiluren@tutanota.com
In reply to: Noname (#1)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Jul 24, 2020, 12:30 PM by juanjo.santamaria@gmail.com:

On Fri, Jul 24, 2020 at 8:14 AM David G. Johnston <> david.g.johnston@gmail.com> > wrote:

On Thursday, July 23, 2020, <>> tutiluren@tutanota.com>> > wrote:

After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all.

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma.Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

I looked everywhere for some kind of "client-encoding" option in the pg_dump manual, but there is no such thing. The only thing I can think of is that the client's encoding (that is, pg_dump) is for some reason not set to "UTF8" even though that's the encoding of the "test" database which I'm connecting to.

The main thing here is this isn’t properly considered a bug in pg_dump but rather with most likely any client command run in the cmd.exe shell.

The problem is coming from CMD and its UTF8 support, but I guess I am not making a great job in convincing you. I will try to explain it through an example:

1. Start a CMD using "CMD /U", and set the code page to UTF8:

C:\postgres>chcp 65001
Active code page: 65001

2. Create a database named 'test', an schema named "Ö" and table named "Ä":

C:\postgres>psql -c "create database test" postgres
CREATE DATABASE

C:\postgres>psql -c "\l test" postgres
                                              List of databases
 Name |    Owner     | Encoding |          Collate           |           Ctype            | Access privileges
------+--------------+----------+----------------------------+----------------------------+-------------------
 test | postgres     | UTF8     | English_United States.1252 | English_United States.1252 |
(1 row)

C:\postgres>psql -c "create schema "Ö"" test
CREATE SCHEMA
C:\postgres>psql -c "create table "Ö"."Ä" (c1 char(1))" test
CREATE TABLE

3. Test the table and check your client encoding:

C:\postgres>psql -c "select * from "Ö"."Ä"" test
 c1
----
(0 rows)

C:\postgres>psql -c "show client_encoding" test
 client_encoding
-----------------
 WIN1252
(1 row)

4. Change the client encoding to UTF8 and repeat the test:

C:\postgres>set PGCLIENTENCODING=UTF8

C:\postgres>psql -c "show client_encoding" test
 client_encoding
-----------------
 UTF8
(1 row)

C:\postgres>psql -c "select * from "Ö"."Ä"" test
2020-07-24 11:25:01.710 CEST [1452] ERROR:  invalid byte sequence for encoding "UTF8": 0xd6 0x2e
ERROR:  invalid byte sequence for encoding "UTF8": 0xd6 0x2e

The reason why this happens is because the limited CMD's UTF8 support, 'chcp' only changes the code page of the displayed characters, shell commands, piping, redirection and most commands are still ANSI only.

Does explicitly setting 'PGCLIENTENCODING=WIN1252' on your shell solves the issue?

Regards,

Juan José Santamaría Flecha

Thanks for putting in the effort to help.
Thanks for teaching me about the "cmd.exe /U" option; I never even knew that it existed and was always annoyed by how Unicode characters looked messed up even though Unicode worked (outside of this case).
As a matter of fact, 'PGCLIENTENCODING=WIN1252' causes it to stop failing and does dump the database. To make 100% sure that this was it, I set 'PGCLIENTENCODING=UTF8' immediately afterwards and repeated the command: failed. Then changed it back to 'PGCLIENTENCODING=WIN1252' once again and repeated: worked. So it's definitely due to this, which makes zero sense since the database *HAS* the encoding UTF8!
(Lots of too-early negative conclusions by me removed here.) It turns out that, using your =WIN1252 hack, and this specific version of the exclude syntax: --exclude-table-data="\"Test schäma\".\"Test täble\"" ... it works. It dumps correctly and ignores that specific table in that schema.
If anyone responsible for the documentation for PostgreSQL is reading this, please update it to make clear what syntax one is supposed to use for the confusing --exclude-table-data option!
Thanks for finding a workaround, but I sure don't like the fact that I have to lie to PG about the encoding to be able to dump my database. Who knows what kind of unintended consequences this might have? Will it dump the DB's contents incorrectly? Will it remove/garble any non-WIN1252-compatible data? Will it corrupt my backups? It feels horribly wrong to set the wrong encoding like this, when *everything* I do/use works in UTF-8...

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Noname (#4)

On Friday, July 24, 2020, <tutiluren@tutanota.com> wrote:

1. If anyone responsible for the documentation for PostgreSQL is
reading this, please update it to make clear what syntax one is supposed to
use for the confusing --exclude-table-data option!

While not that exact example a sufficient command is already documented

that deals with handling identifiers in shell command.

pg_dump -t "\"MixedCaseName\"" mydb > mytab.sql

David J.

#6Noname
tutiluren@tutanota.com
In reply to: Noname (#1)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Although I now have a work-around, which *appears* to work, I'm still utterly confused as to *why* it works. To be clear: why is it that setting the environment variable 'PGCLIENTENCODING' to 'WIN1252' makes the pg_dump succeed when everything, including the database I'm trying to dump, uses UTF8? The only thing not using UTF8 is the "postgres" database (which indeed uses 'WIN1252', certainly not out of my active choice), but that's not involved here... My command connects to my testdb, does it not?

pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="\"Test schäma\".\"Test täble\"" --host="localhost" --port="5432" --username="postgres" --dbname="test"

The '--dbname="test"' part suggests to me that I'm not connecting to "postgres" or involving it in any way. So why does it (apparently) have something to do with these bizarre errors which only happen if I have the PGCLIENTENCODING set to UTF8 but not WIN1252? My test database is verified UTF8.

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Noname (#6)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On 2020-Jul-25, tutiluren@tutanota.com wrote:

Although I now have a work-around, which *appears* to work, I'm still utterly confused as to *why* it works. To be clear: why is it that setting the environment variable 'PGCLIENTENCODING' to 'WIN1252' makes the pg_dump succeed when everything, including the database I'm trying to dump, uses UTF8? The only thing not using UTF8 is the "postgres" database (which indeed uses 'WIN1252', certainly not out of my active choice), but that's not involved here... My command connects to my testdb, does it not?

What doesn't use UTF8 is your terminal. So when you enter the accented
letter, you're not actually sending the UTF8 character you think you're
sending, but instead a Win1252 character which matches nothing.

When you use "cmd /U", then your terminal *is* UTF8, so the character is
correct and things work.

You can further test this by setting log_statements=all in
postgresql.conf, sending a reload signal to postgres, and then rerunning
the pg_dump command. When run in the non-UTF8 terminal, the server log
file will show different garbage than when run in the UTF8 terminal.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Noname
tutiluren@tutanota.com
In reply to: Alvaro Herrera (#7)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Alright. Sigh. I noticed a huge difference in the file sizes between my normal backups and my latest one. I started getting suspicious, so I made a new test WITHOUT using "--exclude-table-data" at all:

I dumped the same database:

1. With setting PGCLIENTENCODING=WIN1252, the weird "workaround".
2. Without setting that. (That is, UTF8.)

The first one is *MUCH* smaller. Opening it up in a visual diff viewer, I can see that HUGE amounts of my data has simply not been copied in the first case. Which is a nightmare scenario and thank God that I noticed this instead of just assuming that it was working now... my backups would've been worthless.

In other words: setting PGCLIENTENCODING=WIN1252 when the database is UTF8 makes pg_dump ignore massive amounts of the data in the database. For this reason, I cannot possibly use this as a "workaround" for my "--exclude-table-data" problem.

Yes, I have very carefully tried with this with "cmd.exe /U" as well as setting the Unicode codepage; it makes *no difference*. Nothing seems to make a difference; pg_dump doesn't seem to *want* to work. It's "all or nothing". I can't exclude any part of the database. The "PGCLIENTENCODING=WIN1252" workaround is sadly insanely dangerous and unusable.

Frankly, at this point I don't have the slightest idea what I could possibly do/try. I have tried so many things by now. If this is truly due to cmd.exe somehow, it must be doing something extraordinarily evil and broken even for Microsoft.

What do you suggest? Should I just pretend as if the --exclude* options don't exist as I've been forced to do many times with a lot of "crucial" features in a lot of software which just won't work correctly? It's just so incredibly inconvenient to have to either store all that temporary junk data or be forced to TRUNCATE it all for that specific table before each backup; what if I'm working right around midnight when the backup is made and I want the last few hours worth of data? I'd have just seconds worth of data instead since I couldn't exclude the table due to "some bug" which I don't even know where it lives: pg_dump or cmd.exe.

Is it entirely unthinkable that this is a pg_dump bug?

#9Noname
tutiluren@tutanota.com
In reply to: Noname (#8)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

I don't know if anyone is getting or reading my messages, but I just want to make it clear that my problem has
*not* been resolved, and I suspect more strongly than ever that pg_dump is at fault.
I don't know what I could possibly add to what I've already posted describing my issue, so I probably shouldn't repeat myself any further now, but simply hope that somebody will read my previous posts and be able to help me.

It truly looks as if I have to forget that the "--exclude*" options exist at all...

#10David G. Johnston
david.g.johnston@gmail.com
In reply to: Noname (#9)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Sat, Jul 25, 2020 at 6:44 PM <tutiluren@tutanota.com> wrote:

I don't know if anyone is getting or reading my messages,

You can search the mailing list archives to confirm receipt by the mailing
list.

https://www.postgresql.org/list/pgsql-bugs/

but I just want to make it clear that my problem has

*not* been resolved, and I suspect more strongly than ever that pg_dump is
at fault.

I don't know what I could possibly add to what I've already posted
describing my issue, so I probably shouldn't repeat myself any further now,
but simply hope that somebody will read my previous posts and be able to
help me.

Pretty much. Consider too that the people likely to be of help here (a
significant subset of available hackers given that it seems to be Windows
only) are busy wrapping up coding for version 13 due out within a month.

David J.

#11Graham Wideman
gwlist@grahamwideman.com
In reply to: Noname (#9)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

tutiluren:

Since at least part of your struggle has been with cmd.exe not handling UTF8 well, I wonder if you have considered Windows Terminal, which purports to handle UTF8 better?

https://www.microsoft.com/en-us/p/windows-terminal/9n0dx20hk701

I'm not saying it will solve everything, but it might at least remove one layer of confusion.

Graham

At 7/25/2020 06:44 PM, tutiluren@tutanota.com wrote:

Show quoted text

I don't know if anyone is getting or reading my messages, but I just want to make it clear that my problem has
*not* been resolved, and I suspect more strongly than ever that pg_dump is at fault.

I don't know what I could possibly add to what I've already posted describing my issue, so I probably shouldn't repeat myself any further now, but simply hope that somebody will read my previous posts and be able to help me.

It truly looks as if I have to forget that the "--exclude*" options exist at all...

#12Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Noname (#8)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Sat, Jul 25, 2020 at 4:21 AM <tutiluren@tutanota.com> wrote:

Alright. Sigh. I noticed a huge difference in the file sizes between my
normal backups and my latest one. I started getting suspicious, so I made a
new test WITHOUT using "--exclude-table-data" at all:

I dumped the same database:

1. With setting PGCLIENTENCODING=WIN1252, the weird "workaround".
2. Without setting that. (That is, UTF8.)

The first one is *MUCH* smaller. Opening it up in a visual diff viewer, I
can see that HUGE amounts of my data has simply not been copied in the
first case. Which is a nightmare scenario and thank God that I noticed this
instead of just assuming that it was working now... my backups would've
been worthless.

In other words: setting PGCLIENTENCODING=WIN1252 when the database is UTF8
makes pg_dump ignore massive amounts of the data in the database. For this
reason, I cannot possibly use this as a "workaround" for my
"--exclude-table-data" problem.

You are comparing a file that uses a single-byte encoding (WIN1252) with
another file that uses multibyte encoding (UTF8), so the size difference is
not unexplainable.

Also, diff-ing two files with mismatched encodings is not going to work as
expected. What you can do is, change the display code page of the CMD to
match the PGCLIENTENCODING (chcp 1252 & chcp 65001), and use the command
"type" to print on screen the content of the dump files generated with both
encodings. If you find a mismatch, please share.

Yes, I have very carefully tried with this with "cmd.exe /U" as well as
setting the Unicode codepage; it makes *no difference*. Nothing seems to
make a difference; pg_dump doesn't seem to *want* to work. It's "all or
nothing". I can't exclude any part of the database. The
"PGCLIENTENCODING=WIN1252" workaround is sadly insanely dangerous and
unusable.

Your OS code page is WIN1252, that is something with a heavy impact in the
system. In fact, your client is natively WIN1252 and explicitly setting the
PGCLIENTENCODING is not a weird hack, but a regular configuration
parameter. With all the configuring on the CMD code page, we can only
change the encoding of the displayed text.

Is it entirely unthinkable that this is a pg_dump bug?

What you are describing does not look like a bug to me, but a client
encoding problem.

If PGCLIENTENCODING=WIN1252 was failing for pg_dump, it would not do it
silently. You would see something like:

pg_dump: error: Dumping the contents of table "Ä" failed: PQgetResult()
failed.
pg_dump: error: Error message from server: ERROR: character with byte
sequence 0xe5 0x82 0x89 in encoding "UTF8" has no equivalent in encoding
"WIN1252"
pg_dump: error: The command was: COPY "Ö"."Ä" (c1) TO stdout;

If you see this error, then PGCLIENTENCODING=1252 will not be a viable
workaround for you, and will have to resort to any of the possible
solutions that have already been suggested: activate the beta UTF8 support
of the Windows Regional settings or access your database from a system with
true UTF8 terminal support.

Regards,

Juan José Santamaría Flecha

#13Noname
tutiluren@tutanota.com
In reply to: Graham Wideman (#11)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Since at least part of your struggle has been with cmd.exe not handling UTF8 well, I wonder if you have considered Windows Terminal, which purports to handle UTF8 better?

https://www.microsoft.com/en-us/p/windows-terminal/9n0dx20hk701

I'm not saying it will solve everything, but it might at least remove one layer of confusion.

Well, I'll probably start using that "Windows Terminal" thing when it actually is built into Windows. I've heard it mentioned numerous times, but it's some kind of "external download" and involves the vile "Windows Store" on top of it, and I know it will cause numerous issues since everything about my system expects cmd.exe. I spent ages trying to get PowerShell (which is apparently now already outdated if Windows Terminal is the new "hot" terminal to use...) to work properly, but it never did, and thus I'm not very eager to switch.

Plus nothing else other than pg_dump has any problems with Unicode, so that's also a bit... odd.

PS: I actually just tried following the link to download it, but it just gets stuck at a nag screen to make me "log in" to my (nonexistent) "Microsoft account". Ugh. Not a chance. Sorry.

#14Noname
tutiluren@tutanota.com
In reply to: Noname (#1)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Also, diff-ing two files with mismatched encodings is not going to work as expected.

I used WinMerge and visually saw how the "COPY" command was "cut off". It truly had not copied the data. Maybe the error code was "error", but that's what the dumped file looked like. I'm not making this up.

If you see this error, then PGCLIENTENCODING=1252 will not be a viable workaround

Since it skips huge amounts of the data in the database, it's not viable, no... And it never made any sense in the first place, since the database is UTF-8 and everything is UTF-8...

activate the beta UTF8 support of the Windows Regional settings

"Beta UTF8 support"? Regional settings? Nothing other than pg_dump has these problems, yet it's somehow Windows' fault? pg_dump is the program that either errors out or silently ignores data. It's not Windows/cmd.exe not sending the right table/schema names...

To prove my point, also to myself, I opened up a normal cmd.exe, without even using any /U flag, and ran:

php test.php --exclude-table-data="\"Test ❆ schäma\".\"Test ❆ täble\""

My test.php simply outputs the value of the "--exclude-table-data" parameter. Guess what it outputs?

"Test ❆ schäma"."Test ❆ täble"

So, evidently, the standard cmd.exe has no problems sending non-ANSI characters or double quotes to a program/script. Or even to output them (at least in this case). In other words: since my own script receives the Unicode chars and the double quotes, pg_dump also should, unless there's some issue with it. Unless I'm still not understanding what you're saying.

or access your database from a system with true UTF8 terminal support.

I don't have a random system just lying around, ready to migrate my entire "life" machine to. If that were the case, this world would look very different from what it is, and I would have finally have an alternative to this nightmare OS. (No, Linux is not an alternative. I've spent the last 20 years regularly coming back to Windows due to the neverending problems with literally *everything* in Linux/Unix... even considering how horrible Windows is. Which is quite a feat. Please don't think I'm defending modern Windows for one second.)

#15Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Noname (#14)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Mon, Jul 27, 2020 at 1:06 AM <tutiluren@tutanota.com> wrote:

or access your database from a system with true UTF8 terminal support.

I don't have a random system just lying around, ready to migrate my entire
"life" machine to. If that were the case, this world would look very
different from what it is, and I would have finally have an alternative to
this nightmare OS. (No, Linux is not an alternative. I've spent the last 20
years regularly coming back to Windows due to the neverending problems with
literally *everything* in Linux/Unix... even considering how horrible
Windows is. Which is quite a feat. Please don't think I'm defending modern
Windows for one second.)

Ten years ago my first reaction would be to download msys and try the
command there. It looks, however, that the development stalled around
2008: http://www.mingw.org/wiki/MSYS
Now I read that a second generation of that is available, but haven't tried
it: https://www.msys2.org/

You'll have to quote the special characters differently, but that way of
quoting will be familiar to the majority of folks here, I guess.

--
Alex

#16Noname
tutiluren@tutanota.com
In reply to: David G. Johnston (#2)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

Ten years ago my first reaction would be to download msys and try the command there.  It looks, however, that the development stalled around 2008: > http://www.mingw.org/wiki/MSYS
Now I read that a second generation of that is available, but haven't tried it: > https://www.msys2.org/

You'll have to quote the special characters differently, but that way of quoting will be familiar to the majority of folks here, I guess.

I don't want to download or install anything. I thought I would finally convince you all when I showed that my own script is fully able to receive the Unicode characters and quotes from the cmd.exe which doesn't even use the /U flag, so clearly pg_dump is not reading this correctly? I can't see any other explanation at this point. It frankly feels like you don't *want* to support Windows. "Cross-platform" seems to mean "several Linux distributions" these days...

#17David G. Johnston
david.g.johnston@gmail.com
In reply to: Noname (#16)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Mon, Jul 27, 2020 at 4:15 PM <tutiluren@tutanota.com> wrote:

I don't want to download or install anything. I thought I would finally
convince you all when I showed that my own script is fully able to receive
the Unicode characters and quotes from the cmd.exe which doesn't even use
the /U flag, so clearly pg_dump is not reading this correctly? I can't see
any other explanation at this point. It frankly feels like you don't *want*
to support Windows. "Cross-platform" seems to mean "several Linux
distributions" these days...

I do believe there is something here worth exploring. You will just need
to be patient and wait for someone capable to agree. If it's really
important to you paid support services are an option.

The fact that there are few reports being made of this nature, coupled with
the upfront hostility of the reports, makes it easy to put the concern
aside while less difficult problems exist - and, to repeat, while being at
the very end of a development cycle for a new release.

David J.

#18Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Noname (#16)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Tue, Jul 28, 2020 at 1:15 AM <tutiluren@tutanota.com> wrote:

Ten years ago my first reaction would be to download msys and try the
command there. It looks, however, that the development stalled around
2008: http://www.mingw.org/wiki/MSYS
Now I read that a second generation of that is available, but haven't
tried it: https://www.msys2.org/

You'll have to quote the special characters differently, but that way of
quoting will be familiar to the majority of folks here, I guess.

I don't want to download or install anything. I thought I would finally
convince you all when I showed that my own script is fully able to receive
the Unicode characters and quotes from the cmd.exe which doesn't even use
the /U flag, so clearly pg_dump is not reading this correctly? I can't see
any other explanation at this point. It frankly feels like you don't *want*
to support Windows. "Cross-platform" seems to mean "several Linux
distributions" these days...

I'm not entirely convinced that reading the arguments from the command line
and printing them back on the terminal proves anything about the actual
encoding of those characters. In the end it is cmd.exe that interprets
them in both cases. "Garbage in—garbage out", as they say.

I also don't think that pg_dump or any other CLI program has to do anything
special in order to "read it correctly". It reads what the system
(Windows, cmd.exe) gives it. If the system chooses to play tricks, there's
little we can do about it.

I do not have access to a Windows system these days, but I guess the most
reasonable thing at this point is to come up with a minimal and
self-contained reproducer, so that someone who does can try it and see what
you see on your end. So far you have only provided some terminal
invocations accompanied by error messages and vague descriptions of
comparing the actual and expected output of pg_dump.

Regards,
--
Alex

#19Noname
tutiluren@tutanota.com
In reply to: Noname (#4)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

come up with a minimal and self-contained reproducer, so that someone who does can try it and see what you see on your end.  So far you have only provided some terminal invocations accompanied by error messages and vague descriptions of comparing the actual and expected output of pg_dump.

Did you miss my e-mail starting with "After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all."? (I don't now how I'd link to it.) There I go through everything with full commands and a minimal test database (which I also describe how I created).

I don't see how I could possibly create a smaller or clearer example of the issue. Whatever "strange things" cmd.exe does clearly doesn't show up for my cross-platform PHP CLI script. (I guess it's theoretically possible that PHP for Windows does some sort of "processing" automatically which pg_dump lacks, but I doubt it.)

#20David G. Johnston
david.g.johnston@gmail.com
In reply to: Noname (#19)
Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

On Tue, Jul 28, 2020 at 3:40 PM <tutiluren@tutanota.com> wrote:

(I don't now how I'd link to it.)

Messages and threads in the mailing list archive I mentioned before can be
linked to.

https://www.postgresql.org/list/

There I go through everything with full commands and a minimal test

database (which I also describe how I created).

I don't see how I could possibly create a smaller or clearer example of
the issue.

A psql script for setup would increase the self-contained nature of the
test. In Linux I'd also use something like grep to test the output file
and/or script outputs. More generally the test script should probably
include pg_restore and then various psql commands to check that the
resulting restored database contains the expected data.

That said, If I was setup for experimenting in Windows, probably including
compiling, the provided information is enough to put together a complete
test case to experiment with.

Whatever "strange things" cmd.exe does clearly doesn't show up for my

cross-platform PHP CLI script. (I guess it's theoretically possible that
PHP for Windows does some sort of "processing" automatically which pg_dump
lacks, but I doubt it.)

The PHP script you show is a pure client-only script. The observed
behavior is occurring in a client-server architecture, specifically in the
communication interface between the two. It's also, quite probably, not
restricted to pg_dump (a more useful test script would include trying to
demonstrate the problem using just psql, probably using the -v option).

David J.

#21Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Noname (#19)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#21)
#23Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Tom Lane (#22)