Some questions about PostgreSQL’s design.

Started by 陈宗志 over 1 year ago · 7 messages
#1 陈宗志
baotiao@gmail.com

I’ve recently started exploring PostgreSQL implementation. I used to
be a MySQL InnoDB developer, and I find the PostgreSQL community feels
a bit strange.

There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.

For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean.
But some clear shortcomings haven't been addressed, such as the
double-memory problem, where the buffer pool and the OS page cache
store the same page, and the use of full-page writes to deal with
torn pages instead of something like InnoDB's double write buffer.

It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach.
Can anyone explain why?

However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, this discussion on whether lock coupling is needed for
B-link trees, etc.
/messages/by-id/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k+m9G0vffrYw@mail.gmail.com

--
---
Blog: https://baotiao.github.io/
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: 陈宗志 (#1)
Re: Some questions about PostgreSQL’s design.

On Tue, Aug 20, 2024 at 8:55 PM 陈宗志 <baotiao@gmail.com> wrote:

It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO,

For this part: we recently added an experimental option to use direct
I/O (debug_io_direct). We are working on the infrastructure needed to
make it work efficiently before removing the "debug_" prefix:
prediction of future I/O through a "stream" abstraction which we have
some early pieces of already, I/O combining (see new io_combine_limit
setting), and asynchronous I/O (work in progress, basically I/O worker
processes or io_uring or other OS-specific APIs).
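For reference, the settings mentioned above look roughly like this in postgresql.conf (a sketch, not a recommendation: exact names and accepted values may change between releases, so check the documentation for your version; debug_io_direct appeared in PostgreSQL 16 and io_combine_limit in 17):

```ini
# postgresql.conf -- experimental direct I/O and I/O combining
debug_io_direct = 'data'      # bypass the kernel page cache for relation data
io_combine_limit = '128kB'    # merge adjacent block reads into larger I/Os
shared_buffers = '8GB'        # with direct I/O, sizing this matters much more
```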

#3 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: 陈宗志 (#1)
Re: Some questions about PostgreSQL’s design.

On 20/08/2024 11:46, 陈宗志 wrote:

I’ve recently started exploring PostgreSQL implementation. I used to
be a MySQL InnoDB developer, and I find the PostgreSQL community feels
a bit strange.

There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.

For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean.
But some clear shortcomings haven't been addressed, such as the
double-memory problem, where the buffer pool and the OS page cache
store the same page, and the use of full-page writes to deal with
torn pages instead of something like InnoDB's double write buffer.

It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach.
Can anyone explain why?

There are pros and cons. With direct I/O, you cannot take advantage of
the kernel page cache anymore, so it becomes important to tune
shared_buffers more precisely. That's a downside: the system requires
more tuning. For many applications, squeezing the last ounce of
performance just isn't that important. There are also scaling issues
with the Postgres buffer cache, which might need to be addressed first.

With double write buffering, there are also pros and cons. It also
requires careful tuning. And replaying WAL that contains full-page
images can be much faster, because you can write new page images
"blindly" without reading the old pages first. We have WAL prefetching
now, which alleviates that, but it's no panacea.

In summary, those are good solutions but they're not obviously better in
all circumstances.
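To illustrate the "blind write" point, here is a toy redo loop. The record shapes are invented for illustration and are not PostgreSQL's actual WAL format: a full-page image can be applied without reading anything, while a delta record forces a read-modify-write of the old page.

```python
# Toy redo: full-page images overwrite blindly; deltas must read the old page.
PAGE_SIZE = 8192
pages = {}   # page number -> bytes (stands in for the data files)
reads = 0    # how many old pages redo had to fetch

def replay(record):
    global reads
    kind, pageno, payload = record
    if kind == "full_page_image":
        pages[pageno] = payload          # blind overwrite, no read needed
    else:  # "delta": (offset, new_bytes) applied to the existing page
        reads += 1                       # must read the old page first
        old = bytearray(pages.get(pageno, bytes(PAGE_SIZE)))
        offset, new_bytes = payload
        old[offset:offset + len(new_bytes)] = new_bytes
        pages[pageno] = bytes(old)

wal = [
    ("full_page_image", 1, b"\x01" * PAGE_SIZE),
    ("delta", 1, (0, b"\x02\x02")),
    ("full_page_image", 2, b"\x03" * PAGE_SIZE),
]
for rec in wal:
    replay(rec)

print(reads)   # only the delta record needed to read an old page
```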

However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, this discussion on whether lock coupling is needed for
B-link trees, etc.
/messages/by-id/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k+m9G0vffrYw@mail.gmail.com

Yep, there are old threads and patches for double write buffers and
direct IO too :-).

--
Heikki Linnakangas
Neon (https://neon.tech)

#4 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#3)
Re: Some questions about PostgreSQL’s design.

On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:

There are pros and cons. With direct I/O, you cannot take advantage of the
kernel page cache anymore, so it becomes important to tune shared_buffers
more precisely. That's a downside: the system requires more tuning. For many
applications, squeezing the last ounce of performance just isn't that
important. There are also scaling issues with the Postgres buffer cache,
which might need to be addressed first.

With double write buffering, there are also pros and cons. It also requires
careful tuning. And replaying WAL that contains full-page images can be much
faster, because you can write new page images "blindly" without reading the
old pages first. We have WAL prefetching now, which alleviates that, but
it's no panacea.

陈宗志, you might find this blog post helpful:

https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#5 陈宗志
baotiao@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Some questions about PostgreSQL’s design.

For other design choices, such as whether to manage shared_buffers
with an LRU list or a clock sweep, both methods have their pros and
cons. But for these two issues, there is a clearly better solution.
Using direct I/O avoids double-copying data, and the OS page cache's
LRU is optimized for general workloads, while a database kernel
should use its own eviction algorithm. As for the other issue,
full-page writes don't actually reduce the number of page reads; it's
only a question of whether those reads come from the data files or
from the redo log, so the amount of data read is essentially the
same. What they do introduce is significant write amplification on
the critical write path, which severely impacts performance. As a
result, PostgreSQL has to minimize the frequency of checkpoints as
much as possible.

I thought someone could write a demo to show it.
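In the spirit of such a demo, here is a rough back-of-the-envelope model (all sizes and counts are invented assumptions, not measurements): the first touch of each page after a checkpoint logs a full 8 kB image, while later touches log only a small record, so frequent checkpoints inflate WAL volume.

```python
# Rough model of full-page-write amplification; numbers are assumptions.
PAGE_SIZE = 8192
RECORD_SIZE = 100        # assumed size of an ordinary WAL record

def wal_bytes(updates, full_page_writes):
    logged_pages = set() # pages already logged since the last checkpoint
    total = 0
    for pageno in updates:
        if full_page_writes and pageno not in logged_pages:
            total += PAGE_SIZE       # first touch: full-page image
            logged_pages.add(pageno)
        else:
            total += RECORD_SIZE     # later touches: just the record
    return total

# 10,000 updates spread over 1,000 distinct pages, one checkpoint interval
updates = [i % 1000 for i in range(10_000)]
with_fpw = wal_bytes(updates, True)
without = wal_bytes(updates, False)
print(with_fpw / without)   # amplification factor under these assumptions
```

Stretching the checkpoint interval amortizes the full-page images over more ordinary records, which is exactly the incentive described above.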

On Tue, Aug 20, 2024 at 9:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 20/08/2024 11:46, 陈宗志 wrote:

I’ve recently started exploring PostgreSQL implementation. I used to
be a MySQL InnoDB developer, and I find the PostgreSQL community feels
a bit strange.

There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.

For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean.
But some clear shortcomings haven't been addressed, such as the
double-memory problem, where the buffer pool and the OS page cache
store the same page, and the use of full-page writes to deal with
torn pages instead of something like InnoDB's double write buffer.

It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach.
Can anyone explain why?

There are pros and cons. With direct I/O, you cannot take advantage of
the kernel page cache anymore, so it becomes important to tune
shared_buffers more precisely. That's a downside: the system requires
more tuning. For many applications, squeezing the last ounce of
performance just isn't that important. There are also scaling issues
with the Postgres buffer cache, which might need to be addressed first.

With double write buffering, there are also pros and cons. It also
requires careful tuning. And replaying WAL that contains full-page
images can be much faster, because you can write new page images
"blindly" without reading the old pages first. We have WAL prefetching
now, which alleviates that, but it's no panacea.

In summary, those are good solutions but they're not obviously better in
all circumstances.

However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, this discussion on whether lock coupling is needed for
B-link trees, etc.
/messages/by-id/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k+m9G0vffrYw@mail.gmail.com

Yep, there are old threads and patches for double write buffers and
direct IO too :-).

--
Heikki Linnakangas
Neon (https://neon.tech)

--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao

#6 陈宗志
baotiao@gmail.com
In reply to: Bruce Momjian (#4)
Re: Some questions about PostgreSQL’s design.

I disagree with the point made in the article. It mentions that
direct I/O 'prevents the kernel from reordering reads and writes to
optimize performance,' which presumably refers to the file system's
I/O scheduling and merging. But this can be handled within the
database itself, where I/O scheduling and merging can be done even
better.
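The kind of database-side combining being argued for can be sketched roughly like this (an illustration in the spirit of PostgreSQL's io_combine_limit, not actual PostgreSQL code):

```python
# Sort requested block numbers and merge adjacent runs into single larger
# reads, capped at a limit, before issuing them to the storage layer.

def combine(block_numbers, limit=16):
    runs = []
    for b in sorted(set(block_numbers)):
        if runs and b == runs[-1][0] + runs[-1][1] and runs[-1][1] < limit:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)   # extend current run
        else:
            runs.append((b, 1))                          # start a new run
    return runs   # list of (start_block, length) I/Os to issue

print(combine([7, 3, 4, 5, 20, 21]))   # [(3, 3), (7, 1), (20, 2)]
```

Because the database knows which blocks a scan will need next, it can batch them like this itself rather than relying on the kernel to spot the pattern.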

Regarding ‘does not allow free memory to be used as kernel cache,’ I
believe the database itself should manage memory well, and most of the
memory should be managed by the database rather than handed over to
the operating system. Additionally, the database’s use of the page
cache should be restricted.

On Wed, Aug 21, 2024 at 12:55 AM Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:

There are pros and cons. With direct I/O, you cannot take advantage of the
kernel page cache anymore, so it becomes important to tune shared_buffers
more precisely. That's a downside: the system requires more tuning. For many
applications, squeezing the last ounce of performance just isn't that
important. There are also scaling issues with the Postgres buffer cache,
which might need to be addressed first.

With double write buffering, there are also pros and cons. It also requires
careful tuning. And replaying WAL that contains full-page images can be much
faster, because you can write new page images "blindly" without reading the
old pages first. We have WAL prefetching now, which alleviates that, but
it's no panacea.

陈宗志, you might find this blog post helpful:

https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao

#7 Andreas Karlsson
andreas@proxel.se
In reply to: 陈宗志 (#6)
Re: Some questions about PostgreSQL’s design.

On 8/22/24 10:50 AM, 陈宗志 wrote:

I disagree with the point made in the article. The article mentions
that ‘prevents the kernel from reordering reads and writes to optimize
performance,’ which might be referring to the file system’s IO
scheduling and merging. However, this can be handled within the
database itself, where IO scheduling and merging can be done even
better.

The database does not have all the information that the OS has, but
that said, I suspect the advantages of direct I/O outweigh the
disadvantages in this regard. The only way to know for sure would be
for someone to provide a benchmark.

Regarding ‘does not allow free memory to be used as kernel cache,’ I
believe the database itself should manage memory well, and most of the
memory should be managed by the database rather than handed over to
the operating system. Additionally, the database’s use of the page
cache should be restricted.

That all depends on your use case. If the database is running alone,
or almost alone, on a machine, direct I/O is likely the optimal
strategy, but if more services are running on the same machine (e.g.
if you run PostgreSQL on your personal laptop) you want buffered IO.

But as far as I know, the long-term plan of the async IO project is
to support both direct and buffered IO so people can pick the right
choice for their workload.

Andreas