Pg 16: will pg_dump & pg_restore be faster?

Started by Ron · almost 3 years ago · 11 messages · general
#1Ron
ronljohnsonjr@gmail.com

https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
says "PostgreSQL 16 can also improve the performance of concurrent bulk
loading of data using COPY up to 300%."

Since pg_dump & pg_restore use COPY (or something very similar), will the
speed increase translate to higher speeds for those utilities?

--
Born in Arizona, moved to Babylonia.

#2David Rowley
dgrowleyml@gmail.com
In reply to: Ron (#1)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote:

https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
says "PostgreSQL 16 can also improve the performance of concurrent bulk
loading of data using COPY up to 300%."

Since pg_dump & pg_restore use COPY (or something very similar), will the
speed increase translate to higher speeds for those utilities?

I think the improvements to relation extension only help when multiple
backends need to extend the relation at the same time. pg_restore can
have multiple workers, but the tasks that each worker performs are
only divided as far as an entire table, i.e. 2 workers will never be
working on the same table at the same time. So there is no concurrency
in terms of 2 or more workers working on loading data into the same
table at the same time.

It might be an interesting project now that we have TidRange scans, to
have pg_dump split larger tables into chunks so that they can be
restored in parallel.
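To illustrate the idea, a rough sketch of how such chunking could work: split the table's heap into block ranges (taken from pg_class.relpages in a real implementation) and dump each range with a TID range qual, which PostgreSQL 14+ can satisfy with a TID Range scan. The helper below is hypothetical, not anything pg_dump actually does:

```python
# Hypothetical sketch: divide a table's pages into N chunks so each chunk
# could be dumped independently, e.g. via
#   COPY (SELECT * FROM t WHERE ctid >= '(0,0)' AND ctid < '(1000,0)') TO ...

def tid_range_chunks(total_pages, n_chunks):
    """Return (lo_page, hi_page) block ranges covering [0, total_pages)."""
    chunks = []
    base, extra = divmod(total_pages, n_chunks)
    lo = 0
    for i in range(n_chunks):
        # Spread the remainder over the first `extra` chunks.
        hi = lo + base + (1 if i < extra else 0)
        if hi > lo:  # skip empty chunks when n_chunks > total_pages
            chunks.append((lo, hi))
        lo = hi
    return chunks

def chunk_query(table, lo_page, hi_page):
    """Build a COPY-able SELECT restricted to one TID range."""
    return (f"SELECT * FROM {table} "
            f"WHERE ctid >= '({lo_page},0)' AND ctid < '({hi_page},0)'")

# e.g. a 10-page table split across 3 workers:
for lo, hi in tid_range_chunks(10, 3):
    print(chunk_query("big_table", lo, hi))
```

Each resulting query scans a disjoint page range, so separate restore workers could load the chunks concurrently into the same table, which is exactly the case the relation-extension improvements target.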

David

#3Bruce Momjian
bruce@momjian.us
In reply to: David Rowley (#2)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote:

On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote:

https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
says "PostgreSQL 16 can also improve the performance of concurrent bulk
loading of data using COPY up to 300%."

Since pg_dump & pg_restore use COPY (or something very similar), will the
speed increase translate to higher speeds for those utilities?

I think the improvements to relation extension only help when multiple
backends need to extend the relation at the same time. pg_restore can
have multiple workers, but the tasks that each worker performs are
only divided as far as an entire table, i.e. 2 workers will never be
working on the same table at the same time. So there is no concurrency
in terms of 2 or more workers working on loading data into the same
table at the same time.

It might be an interesting project now that we have TidRange scans, to
have pg_dump split larger tables into chunks so that they can be
restored in parallel.

Uh, the release notes say:

<!--
Author: Andres Freund <andres@anarazel.de>
2023-04-06 [00d1e02be] hio: Use ExtendBufferedRelBy() to extend tables more eff
Author: Andres Freund <andres@anarazel.de>
2023-04-06 [26158b852] Use ExtendBufferedRelTo() in XLogReadBufferExtended()
-->

<listitem>
<para>
Allow more efficient addition of heap and index pages (Andres Freund)
</para>
</listitem>

There is no mention of concurrency being a requirement. Is it wrong? I
think there was a question of whether you had to add _multiple_ blocks
to get a benefit, not if concurrency was needed. This email about the
release notes didn't mention the concurrent requirement:

/messages/by-id/20230521171341.jjxykfsefsek4kzj@awork3.anarazel.de

While the case of extending by multiple pages improved the most, even
extending by a single page at a time got a good bit more scalable. Maybe
just "Improve efficiency of extending relations"?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#4David Rowley
dgrowleyml@gmail.com
In reply to: Bruce Momjian (#3)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Wed, 31 May 2023 at 13:13, Bruce Momjian <bruce@momjian.us> wrote:

There is no mention of concurrency being a requirement. Is it wrong? I
think there was a question of whether you had to add _multiple_ blocks
to get a benefit, not if concurrency was needed. This email about the
release notes didn't mention the concurrent requirement:

My understanding had been that concurrency was required, but I see the
commit message for 00d1e02be mentions:

Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1).

If that's the case then maybe the beta release notes could be edited
slightly to reflect this. Maybe something like:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The current text of "PostgreSQL 16 can also improve the performance of
concurrent bulk loading of data using COPY up to 300%." does lead me
to believe that nothing has been done to improve things when only a
single backend is involved.

David

#5Bruce Momjian
bruce@momjian.us
In reply to: David Rowley (#4)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:

On Wed, 31 May 2023 at 13:13, Bruce Momjian <bruce@momjian.us> wrote:

There is no mention of concurrency being a requirement. Is it wrong? I
think there was a question of whether you had to add _multiple_ blocks
to get a benefit, not if concurrency was needed. This email about the
release notes didn't mention the concurrent requirement:

My understanding had been that concurrency was required, but I see the
commit message for 00d1e02be mentions:

Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1).

If that's the case then maybe the beta release notes could be edited
slightly to reflect this. Maybe something like:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The release notes don't normally get into details on the magnitude of
the improvement in various circumstances.

The current text of "PostgreSQL 16 can also improve the performance of
concurrent bulk loading of data using COPY up to 300%." does lead me
to believe that nothing has been done to improve things when only a
single backend is involved.

Yes, agreed.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#6David Rowley
dgrowleyml@gmail.com
In reply to: Bruce Momjian (#5)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The release notes don't normally get into details on the magnitude of
the improvement in various circumstances.

Sorry, I meant the release announcement rather than the release notes here.

David

#7Bruce Momjian
bruce@momjian.us
In reply to: David Rowley (#6)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote:

On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The release notes don't normally get into details on the magnitude of
the improvement in various circumstances.

Sorry, I meant the release announcement rather than the release notes here.

Oh, yeah, that gets into more details, sure.

There is also the major features list at the top of the release notes
--- that needs adjustment based on release text I have updated recently,
but I don't feel I control that list.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#8Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#7)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Tue, May 30, 2023 at 10:28:58PM -0400, Bruce Momjian wrote:

On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote:

On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The release notes don't normally get into details on the magnitude of
the improvement in various circumstances.

Sorry, I meant the release announcement rather than the release notes here.

Oh, yeah, that gets into more details, sure.

There is also the major features list at the top of the release notes
--- that needs adjustment based on release text I have updated recently,
but I don't feel I control that list.

Oh, I now remember I added that and reworded it when I did, so I have
now adjusted it to match the new OUTER parallelism text.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#9Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#3)
Re: Pg 16: will pg_dump & pg_restore be faster?

Hi,

On 2023-05-30 21:13:08 -0400, Bruce Momjian wrote:

On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote:

On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote:

https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
says "PostgreSQL 16 can also improve the performance of concurrent bulk
loading of data using COPY up to 300%."

Since pg_dump & pg_restore use COPY (or something very similar), will the
speed increase translate to higher speeds for those utilities?

I think the improvements to relation extension only help when multiple
backends need to extend the relation at the same time. pg_restore can
have multiple workers, but the tasks that each worker performs are
only divided as far as an entire table, i.e. 2 workers will never be
working on the same table at the same time. So there is no concurrency
in terms of 2 or more workers working on loading data into the same
table at the same time.

It might be an interesting project now that we have TidRange scans, to
have pg_dump split larger tables into chunks so that they can be
restored in parallel.

Uh, the release notes say:

<!--
Author: Andres Freund <andres@anarazel.de>
2023-04-06 [00d1e02be] hio: Use ExtendBufferedRelBy() to extend tables more eff
Author: Andres Freund <andres@anarazel.de>
2023-04-06 [26158b852] Use ExtendBufferedRelTo() in XLogReadBufferExtended()
-->

<listitem>
<para>
Allow more efficient addition of heap and index pages (Andres Freund)
</para>
</listitem>

There is no mention of concurrency being a requirement. Is it wrong? I
think there was a question of whether you had to add _multiple_ blocks
to get a benefit, not if concurrency was needed. This email about the
release notes didn't mention the concurrent requirement:

/messages/by-id/20230521171341.jjxykfsefsek4kzj@awork3.anarazel.de

There's multiple improvements that work together to get the overall
improvement. One part of that is filesystem interactions, another is holding
the relation extension lock for a *much* shorter time. The former helps
regardless of concurrency, the latter only with concurrency.

Regards,

Andres

#10Jonathan S. Katz
jkatz@postgresql.org
In reply to: David Rowley (#4)
Re: Pg 16: will pg_dump & pg_restore be faster?

On 5/30/23 10:05 PM, David Rowley wrote:

My understanding had been that concurrency was required, but I see the
commit message for 00d1e02be mentions:

Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1).

If that's the case then maybe the beta release notes could be edited
slightly to reflect this. Maybe something like:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The current text of "PostgreSQL 16 can also improve the performance of
concurrent bulk loading of data using COPY up to 300%." does lead me
to believe that nothing has been done to improve things when only a
single backend is involved.

Typically once a release announcement is out, we'll only edit it if it's
inaccurate. I don't think the statement in the release announcement is
inaccurate, as it specifies that concurrent bulk loading is faster.

I had based the description on what Andres described in the original
discussion and on reading [1], which showed a "measurable"
improvement as the commit message said, but it was not to the same
degree as concurrently loading. It does still seem impactful -- the
results show up to 20% improvement on a single backend -- but the bigger
story was around the concurrency.

I'm -0.5 for revising the announcement, but I also don't want people to
miss out on testing this. I'd be OK with this:

"PostgreSQL 16 can also improve the performance of bulk loading of data,
with some tests showing up to a 300% improvement when concurrently
executing `COPY` commands."

Thanks,

Jonathan

[1]: /messages/by-id/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de

#11David Rowley
dgrowleyml@gmail.com
In reply to: Jonathan S. Katz (#10)
Re: Pg 16: will pg_dump & pg_restore be faster?

On Sat, 3 Jun 2023 at 00:14, Jonathan S. Katz <jkatz@postgresql.org> wrote:

Typically once a release announcement is out, we'll only edit it if it's
inaccurate. I don't think the statement in the release announcement is
inaccurate, as it specifies that concurrent bulk loading is faster.

Understood. I had thought the policy might be that if there's room for
improvement, and reason enough to make it, then we probably should. We
do still aim to fix any problem in the software that's the topic of the
announcement, so it seems a little strange that we lock down what we
write about that software just before the beta1 release.

I'm -0.5 for revising the announcement, but I also don't want people to
miss out on testing this. I'd be OK with this:

"PostgreSQL 16 can also improve the performance of bulk loading of data,
with some tests showing up to a 300% improvement when concurrently
executing `COPY` commands."

I might just have misread the release notes because I'd assumed that
Andres's work only improved things when multiple backends were
extending the relation at the same time. The release announcement did
seem to confirm that concurrency was required, so it might be good not
to lead anyone else into thinking that only concurrent cases are
faster. I certainly understand that's where the big wins are.

I'm fine with your proposed wording.

David