In-order pg_dump (or in-order COPY TO)

Started by Dimitrios Apostolou8 months ago31 messagesgeneral
Jump to latest

Hello list,

I am storing dumps of a database (pg_dump custom format) in a
de-duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.

Unfortunately after I did pg_restore to a new server, I notice that the
dumps from the new server are not being de-duplicated, all blocks are
considered new.

This means that the data has been significantly altered. The new dumps
contain the same rows but probably in very different order. Could the
row-order have changed when doing COPY FROM with pg_restore? No idea,
but now that I think about it this can happen by many operations, like
CLUSTER, VACUUM FULL etc so the question still applies.

A *logical* dump of data shouldn't be affected by on-disk order.
Internal representation shouldn't affect the output.

This makes me wonder: Is there a way to COPY TO in primary-key order?

If that is possible, then pg_dump could make use of it.

Thanks in advance,
Dimitris

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Dimitrios Apostolou (#1)
Re: In-order pg_dump (or in-order COPY TO)

On Tue, Aug 26, 2025 at 12:43 PM Dimitrios Apostolou <jimis@gmx.net> wrote:

Could the
row-order have changed when doing COPY FROM with pg_restore?

There is no reliable, meaningful, row ordering when it comes to the
physical files. Sure, cluster does make an attempt, but it is quite
limited in practice.

A *logical* dump of data shouldn't be affected by on-disk order.
Internal representation shouldn't affect the output.

The logical dump has no ordering - it will come out however it comes out.
"COPY <table> TO ..." doesn't have an order by clause - there is no way to
make or communicate to it that ordering is important. For adhoc work you
can use "COPY <query> TO ..." and put and order by in the query.

David J.

#3Ron
ronljohnsonjr@gmail.com
In reply to: Dimitrios Apostolou (#1)
Re: In-order pg_dump (or in-order COPY TO)

On Tue, Aug 26, 2025 at 3:44 PM Dimitrios Apostolou <jimis@gmx.net> wrote:

Hello list,

I am storing dumps of a database (pg_dump custom format) in a
de-duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.

This might be a silly question, but why are you using -Fc to
create multi-TB dumps instead of -Fd? If nothing else, the parallelization
granted by -Fd will greatly speed up the dumps.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitrios Apostolou (#1)
Re: In-order pg_dump (or in-order COPY TO)

Dimitrios Apostolou <jimis@gmx.net> writes:

Unfortunately after I did pg_restore to a new server, I notice that the
dumps from the new server are not being de-duplicated, all blocks are
considered new.

This means that the data has been significantly altered. The new dumps
contain the same rows but probably in very different order. Could the
row-order have changed when doing COPY FROM with pg_restore?

I'd expect pg_dump/pg_restore to preserve the physical row ordering,
simply because it doesn't do anything that would change that.

However, restoring into an empty table would result in a table with
minimal free space, whereas the original table probably had a
meaningful amount of free space thanks to updates and deletes. Thus
for example TIDs would not be the same. If your "rolling checksum"
methodology is at all sensitive to page boundaries, the table would
look quite different to it.

regards, tom lane

#5Ron
ronljohnsonjr@gmail.com
In reply to: David G. Johnston (#2)
Re: In-order pg_dump (or in-order COPY TO)

On Tue, Aug 26, 2025 at 4:31 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

On Tue, Aug 26, 2025 at 12:43 PM Dimitrios Apostolou <jimis@gmx.net>
wrote:

Could the
row-order have changed when doing COPY FROM with pg_restore?

There is no reliable, meaningful, row ordering when it comes to the
physical files. Sure, cluster does make an attempt, but it is quite
limited in practice.

A *logical* dump of data shouldn't be affected by on-disk order.
Internal representation shouldn't affect the output.

The logical dump has no ordering - it will come out however it comes out.
"COPY <table> TO ..." doesn't have an order by clause - there is no way to
make or communicate to it that ordering is important.

Doesn't COPY TO copy out records in the order they appeared in the physical
files? That _seems_ to mean that the records laid down by COPY FROM should
be in the same order as they were in the old dump files.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#6Ron
ronljohnsonjr@gmail.com
In reply to: Tom Lane (#4)
Re: In-order pg_dump (or in-order COPY TO)

On Tue, Aug 26, 2025 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dimitrios Apostolou <jimis@gmx.net> writes:

Unfortunately after I did pg_restore to a new server, I notice that the
dumps from the new server are not being de-duplicated, all blocks are
considered new.

This means that the data has been significantly altered. The new dumps
contain the same rows but probably in very different order. Could the
row-order have changed when doing COPY FROM with pg_restore?

I'd expect pg_dump/pg_restore to preserve the physical row ordering,
simply because it doesn't do anything that would change that.

However, restoring into an empty table would result in a table with
minimal free space, whereas the original table probably had a
meaningful amount of free space thanks to updates and deletes. Thus
for example TIDs would not be the same. If your "rolling checksum"
methodology is at all sensitive to page boundaries, the table would
look quite different to it.

But the rolling checksums are against a pg_dump file, not a pg_basebackup
file.

What probably changed are table OIDs. Would that change the ordering of
COPY data in post-restore dump files?

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron (#5)
Re: In-order pg_dump (or in-order COPY TO)

Ron Johnson <ronljohnsonjr@gmail.com> writes:

On Tue, Aug 26, 2025 at 4:31 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

The logical dump has no ordering - it will come out however it comes out.
"COPY <table> TO ..." doesn't have an order by clause - there is no way to
make or communicate to it that ordering is important.

Doesn't COPY TO copy out records in the order they appeared in the physical
files?

It emits whatever a sequential-scan plan would emit. If you set
synchronize_seqscans = off (which pg_dump does), that will match
physical row order.

At least with our standard table AM. If you're using Aurora or
one of those other PG forks with proprietary storage layers,
you'd have to ask them.

I suspect the OP's problem is not row order per se, but differing
TIDs or XIDs, which are things pg_dump does not endeavor to
replicate. Or, given that he said something about blocks, maybe
he's actually sensitive to where the free space is.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron (#6)
Re: In-order pg_dump (or in-order COPY TO)

Ron Johnson <ronljohnsonjr@gmail.com> writes:

On Tue, Aug 26, 2025 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'd expect pg_dump/pg_restore to preserve the physical row ordering,
simply because it doesn't do anything that would change that.

But the rolling checksums are against a pg_dump file, not a pg_basebackup
file.

Oh, that wasn't clear to me.

What probably changed are table OIDs. Would that change the ordering of
COPY data in post-restore dump files?

It would not change the order of data within any one table. There are
corner cases in which different OID assignments can cause pg_dump to
emit database objects in a different order, see this recent thread:

/messages/by-id/20250707192654.9e.nmisch@google.com

regards, tom lane

#9Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dimitrios Apostolou (#1)
Re: In-order pg_dump (or in-order COPY TO)

On 8/26/25 12:43, Dimitrios Apostolou wrote:

Hello list,

I am storing dumps of a database (pg_dump custom format) in a de-
duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.

Unfortunately after I did pg_restore to a new server, I notice that the
dumps from the new server are not being de-duplicated, all blocks are
considered new.

What are the pg_dump/pg_restore commands?

What are the Postgres versions involved?

Are they community versions of Postgres or something else?

What is the depduplication program?

Thanks in advance,
Dimitris

--
Adrian Klaver
adrian.klaver@aklaver.com

In reply to: Adrian Klaver (#9)
Re: In-order pg_dump (or in-order COPY TO)

On Wednesday 2025-08-27 00:54, Adrian Klaver wrote:

Date: Wed, 27 Aug 2025 00:54:52
From: Adrian Klaver <adrian.klaver@aklaver.com>
To: Dimitrios Apostolou <jimis@gmx.net>, pgsql-general@lists.postgresql.org
Subject: Re: In-order pg_dump (or in-order COPY TO)

On 8/26/25 12:43, Dimitrios Apostolou wrote:

Hello list,

I am storing dumps of a database (pg_dump custom format) in a de-
duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.

Unfortunately after I did pg_restore to a new server, I notice that the
dumps from the new server are not being de-duplicated, all blocks are
considered new.

What are the pg_dump/pg_restore commands?

What are the Postgres versions involved?

Are they community versions of Postgres or something else?

What is the depduplication program?

Dump is from PostgreSQL 16, it's pg_dump writing to stdout:

pg_dump -v --format=custom --compress=none --no-toast-compression --serializable-deferrable db_name | borg create ...

As you can see the backup (and deduplicating) program is borgbackup.

Restore is in PostgreSQL 17:

I first create the empty tables by running the DDL commands in version
control to setup the database. And then I do pg_restore --data-only:

pg_restore -vvvv -j 8 -U db_owner -d db_name --schema=public --section=data dump_file

Worth noting is that the above pg_restore goes through the WAL, i.e. all
writes are done by walwriter, not the backend directly.

Postgres is standard open source running on own server. It has a couple
of custom patches that shouldn't matter in this codepath.

Show quoted text

Thanks in advance,
Dimitris

In reply to: Tom Lane (#4)
Re: In-order pg_dump (or in-order COPY TO)

On Wednesday 2025-08-27 00:08, Tom Lane wrote:

However, restoring into an empty table would result in a table with
minimal free space, whereas the original table probably had a
meaningful amount of free space thanks to updates and deletes. Thus
for example TIDs would not be the same. If your "rolling checksum"
methodology is at all sensitive to page boundaries, the table would
look quite different to it.

Thanks Tom. I'm not following how the empty space in the tables could
affect the custom format dumps. Where can I read more about these TIDs?
Are they stored in the pg_dump custom format archive?

The rolling checksum method should iron out shifting of data chunks that
are around a couple MB in size. Shorter shifts will not be caught, and
I assume that the "page boundaries" changes you mentioned would happen
every 8KB. So that is definitely too fine grained for the deduplicated
algorithm.

FYI something that I forgot to mention is that pg_restore is --data-only
and writes go through walwriter. The database with the tables has been
created from scratch so every table is empty before the pg_restore. Not
sure how this affects the above.

I'd expect pg_dump/pg_restore to preserve the physical row ordering,
simply because it doesn't do anything that would change that.

Regardless of my specific case, it's scary to think that doing VACUUM
FULL, CLUSTER, or who knows what other maintenance command, will modify
the logical dumps. Some implicit ordering could be enforced by pg_dump
if possible, for example when a primary key exists. Does it make sense?
Is it even possible?

Thanks,
Dimitris

In reply to: David G. Johnston (#2)
Re: In-order pg_dump (or in-order COPY TO)

On Tuesday 2025-08-26 22:31, David G. Johnston wrote:

The logical dump has no ordering - it will come out however it comes out.  "COPY <table> TO ..." doesn't have an order by clause - there is no way to make or communicate to it that ordering is important.  For adhoc work you can use "COPY <query> TO ..." and put and order by in the query.

Thank you, so it's not possible currently.

How would "COPY <query> TO" behave for copying very large tables?
Would it make sense to optionally have that in pg_dump?

Or would it make sense as a new feature, to optionally order "COPY
<table> TO ..." based on primary key where available, and use that in
pg_dump option?

Dimitris

In reply to: Ron (#3)
Re: In-order pg_dump (or in-order COPY TO)

On Wednesday 2025-08-27 00:00, Ron Johnson wrote:

This might be a silly question, but why are you using -Fc to create multi-TB dumps instead of -Fd?  If nothing else, the parallelization granted by -Fd will greatly speed up the dumps.

Hi Ron,

the primary reason is space. With -Fc I don't have to store the 10TB
archive locally. And I don't have to wait for pg_dump to finish. I pipe
it directly to the "borg create" command which does compression and
deduplication and sends new chunks to a remote borgbackup server.

Regards,
Dimitris

P.S. please use "Reply-all" when replying, I've missed quite a few
replies that I found on the list archives. :-)

#14Greg Sabino Mullane
greg@turnstep.com
In reply to: Dimitrios Apostolou (#1)
Re: In-order pg_dump (or in-order COPY TO)

On Tue, Aug 26, 2025 at 3:44 PM Dimitrios Apostolou <jimis@gmx.net> wrote:

I am storing dumps of a database (pg_dump custom format) in a
de-duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.

I suggest looking into pgBackRest, and it's block incremental feature,
which sounds similar to what you are doing. But it also does it with
parallel processes, and can do things like grab backup files from your
replicas, plus a lot of other features.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitrios Apostolou (#10)
Re: In-order pg_dump (or in-order COPY TO)

Dimitrios Apostolou <jimis@gmx.net> writes:

Dump is from PostgreSQL 16, it's pg_dump writing to stdout:

pg_dump -v --format=custom --compress=none --no-toast-compression --serializable-deferrable db_name | borg create ...

Don't use --format=custom (and not -v either). That causes pg_dump to
include the OIDs and pg_dump object IDs of all the tables and other
objects, which will all be different in a dump from the new server.
The actual data contents of the tables should be the same, but
apparently the differences in the entry headers are enough to
mislead borgbackup.

You might be well advised to manually examine the data you are
stuffing into borgbackup. Right now we seem to be operating on
hypotheses, not facts, about what that looks like and how it's
different between your old and new server.

regards, tom lane

#16Ron
ronljohnsonjr@gmail.com
In reply to: Tom Lane (#15)
Re: In-order pg_dump (or in-order COPY TO)

On Wed, Aug 27, 2025 at 10:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dimitrios Apostolou <jimis@gmx.net> writes:

Dump is from PostgreSQL 16, it's pg_dump writing to stdout:

pg_dump -v --format=custom --compress=none --no-toast-compression

--serializable-deferrable db_name | borg create ...

Don't use --format=custom (and not -v either). That causes pg_dump to
include the OIDs and pg_dump object IDs of all the tables and other
objects,

That's interesting. Why? (Since isn't it supposed to be Bad to rely on
OIDs?)

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron (#16)
Re: In-order pg_dump (or in-order COPY TO)

Ron Johnson <ronljohnsonjr@gmail.com> writes:

On Wed, Aug 27, 2025 at 10:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Don't use --format=custom (and not -v either). That causes pg_dump to
include the OIDs and pg_dump object IDs of all the tables and other
objects,

That's interesting. Why? (Since isn't it supposed to be Bad to rely on
OIDs?)

-v in a text-format dump includes that data for debugging purposes:

--
-- TOC entry 1401 (class 1255 OID 16499)
-- Name: fipshash(text); Type: FUNCTION; Schema: public; Owner: postgres
--

(The "TOC entry" comment line wouldn't be there without -v.)
Then custom format has to store the same info so that pg_restore
can produce this identical text output on demand.

Yeah, this is all pretty historical, but nobody wants to change it
at this point.

regards, tom lane

#18Ron
ronljohnsonjr@gmail.com
In reply to: Tom Lane (#17)
Re: In-order pg_dump (or in-order COPY TO)

On Wed, Aug 27, 2025 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ron Johnson <ronljohnsonjr@gmail.com> writes:

On Wed, Aug 27, 2025 at 10:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Don't use --format=custom (and not -v either). That causes pg_dump to
include the OIDs and pg_dump object IDs of all the tables and other
objects,

That's interesting. Why? (Since isn't it supposed to be Bad to rely on
OIDs?)

-v in a text-format dump includes that data for debugging purposes:

--
-- TOC entry 1401 (class 1255 OID 16499)
-- Name: fipshash(text); Type: FUNCTION; Schema: public; Owner: postgres
--

(The "TOC entry" comment line wouldn't be there without -v.)
Then custom format has to store the same info so that pg_restore
can produce this identical text output on demand.

Ah, so the culprit is "-v". I like using -v, redirecting it to a log file
(more info is almost always better), but then I rarely use pg_dump, and
never pipe it to de-duplicators. (ExaGrid is supposed to deduplicate, but
that's not going to stop me from using pgbackrest, compression and
encryption; PCI auditors care about that, not deduplication.)

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#19Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dimitrios Apostolou (#10)
Re: In-order pg_dump (or in-order COPY TO)

On 8/27/25 05:09, Dimitrios Apostolou wrote:

On Wednesday 2025-08-27 00:54, Adrian Klaver wrote:

What are the pg_dump/pg_restore commands?

What are the Postgres versions involved?

Are they community versions of Postgres or something else?

What is the depduplication program?

Comments in line below.

Dump is from PostgreSQL 16, it's pg_dump writing to stdout:

pg_dump -v --format=custom --compress=none --no-toast-compression --
serializable-deferrable db_name  |  borg create ...

As you can see the backup (and deduplicating) program is borgbackup.

Ok, I use BorgBackup and it is fairly forgiving of normal changes.

FYI, if you ever want to use compression check out gzip --rsyncable, I
have found it plays well with Borg. For more information see:

https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/

Restore is in PostgreSQL 17:

I first create the empty tables by running the DDL commands in version
control to setup the database. And then I do pg_restore --data-only:

pg_restore -vvvv -j 8 -U db_owner -d db_name --schema=public --
section=data  dump_file

If you are using only the --data section why not --data-only in the pg_dump?

Or is the pg_dump output used for other purposes?

Worth noting is that the above pg_restore goes through the WAL, i.e. all
writes are done by walwriter, not the backend directly.

Please explain the above further.

The problem occurs when you do the pg_dump after this restore, correct?

Is it the same pg_dump command as you show above?

Postgres is standard open source running on own server. It has a couple
of custom patches that shouldn't matter in this codepath.

For completeness and just in case they may affect the output what do the
patches do?

 Thanks in advance,
 Dimitris

--
Adrian Klaver
adrian.klaver@aklaver.com

In reply to: Adrian Klaver (#19)
Re: In-order pg_dump (or in-order COPY TO)

On Wednesday 2025-08-27 17:25, Adrian Klaver wrote:

Comments in line below.

Dump is from PostgreSQL 16, it's pg_dump writing to stdout:

pg_dump -v --format=custom --compress=none --no-toast-compression --
serializable-deferrable db_name  |  borg create ...

As you can see the backup (and deduplicating) program is borgbackup.

Ok, I use BorgBackup and it is fairly forgiving of normal changes.

FYI, if you ever want to use compression check out gzip --rsyncable, I have
found it plays well with Borg. For more information see:

https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/

Yes, zstd has also --rsyncable. In this case I let borg do per-chunk
compression after deduplication, it has worked well so far.

Restore is in PostgreSQL 17:

I first create the empty tables by running the DDL commands in version
control to setup the database. And then I do pg_restore --data-only:

pg_restore -vvvv -j 8 -U db_owner -d db_name --schema=public --
section=data  dump_file

If you are using only the --data section why not --data-only in the pg_dump?

I want the dump to be as complete as possible. Didn't think it would
create issues.

Or is the pg_dump output used for other purposes?

It has happened that I have selectively restored user schemas from that
dump.

Worth noting is that the above pg_restore goes through the WAL, i.e. all
writes are done by walwriter, not the backend directly.

Please explain the above further.

The COPY FROM data is going through the WAL, as usual INSERTS do. The
writes to disk happen by the walwriter process.

OTOH, If you have configured the server with wal_level=minimal and
BEGIN a transaction, CREATE or TRUNCATE a table, and then COPY FROM into
that table, then the backend process writes directly to the table
without logging to the WAL.

This can be much faster, but most importantly it avoids situations of
WAL overflow that are very difficult to predict and can mess your server
up completely. [1]/messages/by-id/076464ad-3d70-dd25-9e8f-e84f27decfba@gmx.net

[1]: /messages/by-id/076464ad-3d70-dd25-9e8f-e84f27decfba@gmx.net

My patches are for activating that codepath in pg_restore, but they were
not used on purpose and I took notice that the writes went via WAL.

The problem occurs when you do the pg_dump after this restore, correct?

Correct. The first pg_dump from the restored pg17 is not deduplicated at
all. Most of the tables have not changed (logically at least; apparently
they have changed physically).

Is it the same pg_dump command as you show above?

Yes.

Postgres is standard open source running on own server. It has a couple of
custom patches that shouldn't matter in this codepath.

For completeness and just in case they may affect the output what do the
patches do?

Two patches for speeding up scanning an archive without TOC, like the
one I'm having (because it is piped into borg, instead of written to
file). These were activated, but shouldn't matter. They only build the
TOC in pg_restore's memory.

https://commitfest.postgresql.org/patch/5809/
https://commitfest.postgresql.org/patch/5817/

And two patches for speeding up pg_restore like mentioned above, under
specific arguments that I didn't provide. (one speedup needs --clean,
and the other needs --freeze).

https://commitfest.postgresql.org/patch/5821/
https://commitfest.postgresql.org/patch/5826/

IIRC I did not activate them (via --clean) because TRUNCATE fails when
foreign keys exist. See the discussion threads.

Dimitris

#21Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dimitrios Apostolou (#20)
In reply to: Adrian Klaver (#21)
#23Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dimitrios Apostolou (#22)
In reply to: Adrian Klaver (#23)
#25Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dimitrios Apostolou (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adrian Klaver (#25)
#27Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom Lane (#26)
#28Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dimitrios Apostolou (#1)
#29Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Alvaro Herrera (#28)
In reply to: Alvaro Herrera (#28)
#31Achilleas Mantzios
a.mantzios@cloud.gatewaynet.com
In reply to: Dimitrios Apostolou (#30)