Some questions about PostgreSQL’s design.
I’ve recently started exploring the PostgreSQL implementation. I used to
be a MySQL InnoDB developer, and I find the PostgreSQL community feels
a bit strange.
There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.
For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean.
But there are some clear areas that could be improved but haven’t been
addressed. One is the double memory problem, where the buffer pool and
the page cache store the same page. Another is using full-page writes
to deal with torn page writes instead of something like InnoDB’s
double write buffer.
It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach.
Can anyone explain why?
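To make the torn-page problem concrete, here is a minimal Python sketch of the double write idea mentioned above. It is a toy model, not InnoDB's or PostgreSQL's actual code: the page size, the sector size, the CRC32 trailer, and the names `make_page`, `is_valid`, and `Storage` are all assumptions made for illustration.

```python
import zlib

PAGE_SIZE = 8192   # assumed page size for this toy model
SECTOR = 4096      # assumed atomic write unit of the device


def make_page(payload: bytes) -> bytes:
    """Pad the payload and append a CRC32 so a torn page can be detected."""
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")


def is_valid(page: bytes) -> bool:
    """A page is valid if it is complete and its trailer matches its body."""
    if len(page) != PAGE_SIZE:
        return False
    return zlib.crc32(page[:-4]) == int.from_bytes(page[-4:], "big")


class Storage:
    """Hypothetical storage with a data file and a double write area."""

    def __init__(self):
        self.data = {}         # page number -> bytes in the data file
        self.doublewrite = {}  # page number -> bytes in the double write area

    def write_page(self, page_no, page, crash_after_sectors=None):
        # Step 1: write a full copy to the double write area
        # (assumed to be durable before step 2 starts).
        self.doublewrite[page_no] = page
        # Step 2: write in place. A crash part-way through leaves a mix of
        # new and old sectors, i.e. a torn page.
        if crash_after_sectors is None:
            self.data[page_no] = page
        else:
            old = self.data.get(page_no, bytes(PAGE_SIZE))
            cut = crash_after_sectors * SECTOR
            self.data[page_no] = page[:cut] + old[cut:]

    def recover(self):
        """Replace every torn data page with its intact double write copy."""
        repaired = []
        for page_no, copy in self.doublewrite.items():
            if not is_valid(self.data.get(page_no, b"")) and is_valid(copy):
                self.data[page_no] = copy
                repaired.append(page_no)
        return repaired
```

In this model a crash during step 2 is repaired from the double write copy, while a crash during step 1 would leave the old data page untouched. Full-page writes reach the same goal by putting the page image into the WAL instead.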
However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, there is this discussion on whether lock coupling is needed
for B-link trees:
/messages/by-id/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k+m9G0vffrYw@mail.gmail.com
--
---
Blog: https://baotiao.github.io/
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
On Tue, Aug 20, 2024 at 8:55 PM 陈宗志 <baotiao@gmail.com> wrote:
It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO,
For this part: we recently added an experimental option to use direct
I/O (debug_io_direct). We are working on the infrastructure needed to
make it work efficiently before removing the "debug_" prefix:
prediction of future I/O through a "stream" abstraction which we have
some early pieces of already, I/O combining (see new io_combine_limit
setting), and asynchronous I/O (work in progress, basically I/O worker
processes or io_uring or other OS-specific APIs).
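As a rough illustration of what I/O combining means, the Python sketch below merges runs of consecutive block numbers into larger reads, capped at a maximum size. It is not PostgreSQL's implementation: `combine_reads` and its `limit_blocks` parameter are hypothetical names, and the default cap of 16 blocks assumes 8 kB blocks with a 128 kB combine limit.

```python
def combine_reads(blocks, limit_blocks=16):
    """Merge consecutive block numbers into (start_block, block_count) reads.

    limit_blocks caps the size of one combined read (assumption: 16 blocks
    of 8 kB each, i.e. 128 kB).
    """
    reads = []
    for blk in blocks:
        if reads:
            start, count = reads[-1]
            # Extend the previous read only if this block directly follows
            # it and the combined read is still under the cap.
            if blk == start + count and count < limit_blocks:
                reads[-1] = (start, count + 1)
                continue
        reads.append((blk, 1))
    return reads
```

For example, `combine_reads([10, 11, 12, 20, 21, 30])` issues three reads instead of six. With buffered I/O the kernel can do similar merging and read-ahead on its own; with direct I/O the database has to do it itself.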
On 20/08/2024 11:46, 陈宗志 wrote:
I’ve recently started exploring the PostgreSQL implementation. I used to
be a MySQL InnoDB developer, and I find the PostgreSQL community feels
a bit strange.
There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.
For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean.
But there are some clear areas that could be improved but haven’t been
addressed. One is the double memory problem, where the buffer pool and
the page cache store the same page. Another is using full-page writes
to deal with torn page writes instead of something like InnoDB’s
double write buffer.
It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach.
Can anyone explain why?
There are pros and cons. With direct I/O, you cannot take advantage of
the kernel page cache anymore, so it becomes important to tune
shared_buffers more precisely. That's a downside: the system requires
more tuning. For many applications, squeezing the last ounce of
performance just isn't that important. There are also scaling issues
with the Postgres buffer cache, which might need to be addressed first.
With double write buffering, there are also pros and cons. It also
requires careful tuning. And replaying WAL that contains full-page
images can be much faster, because you can write new page images
"blindly" without reading the old pages first. We have WAL prefetching
now, which alleviates that, but it's no panacea.
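The "blind write" point can be sketched with a toy replay loop in Python. This is a simplified model, not PostgreSQL's recovery code: the record layout, the `replay` function, and the use of strings as page contents are assumptions made for illustration.

```python
def replay(wal, disk):
    """Replay WAL records and return (resulting_pages, pages_read_from_disk).

    Each record is a tuple (page_no, kind, value):
      - kind "fpi":   value is a full page image that overwrites the page
      - kind "delta": value is appended to the current page contents,
                      so the old page has to be available first
    """
    cache = {}
    reads = 0
    for page_no, kind, value in wal:
        if kind == "fpi":
            # Blind write: the old contents are irrelevant, so no read.
            cache[page_no] = value
        else:
            if page_no not in cache:
                # The old page must be fetched before the change applies.
                cache[page_no] = disk[page_no]
                reads += 1
            cache[page_no] = cache[page_no] + value
    return cache, reads
```

Replaying a WAL whose first record for each page is a full-page image needs no reads from the data files in this model, while a delta-only WAL needs one read per distinct page. The model ignores the cost of reading the larger WAL itself.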
In summary, those are good solutions but they're not obviously better in
all circumstances.
However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, there is this discussion on whether lock coupling is needed
for B-link trees:
/messages/by-id/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k+m9G0vffrYw@mail.gmail.com
Yep, there are old threads and patches for double write buffers and
direct IO too :-).
--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:
There are pros and cons. With direct I/O, you cannot take advantage of the
kernel page cache anymore, so it becomes important to tune shared_buffers
more precisely. That's a downside: the system requires more tuning. For many
applications, squeezing the last ounce of performance just isn't that
important. There are also scaling issues with the Postgres buffer cache,
which might need to be addressed first.
With double write buffering, there are also pros and cons. It also requires
careful tuning. And replaying WAL that contains full-page images can be much
faster, because you can write new page images "blindly" without reading the
old pages first. We have WAL prefetching now, which alleviates that, but
it's no panacea.
陈宗志, you might find this blog post helpful:
https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
For other design choices, such as whether to manage shared_buffers
with an LRU list or with a clock sweep, both methods have their pros
and cons. But for these two issues, there is a clearly better
solution. For example, using DirectIO avoids the problem of
double-copying data. The OS’s page cache LRU list is optimized for
general scenarios, while the database kernel should use its own
eviction algorithm.
Regarding the other issue, full-page writes don’t actually reduce the
number of page reads. It’s just a matter of whether those page reads
come from data files or from the redo log; the amount of data read is
essentially the same. However, the problem it introduces is
significant write amplification on the critical write path, which
severely impacts performance. As a result, PostgreSQL has to minimize
the frequency of checkpoints as much as possible.
I thought someone could write a demo to show it.
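A toy model of the write-amplification argument can be sketched in a few lines of Python. It is not a benchmark and not PostgreSQL code: the page size, the 100-byte record size, and the `wal_bytes` function are assumptions, and checkpoints are modelled simply as happening every N updates.

```python
PAGE_SIZE = 8192   # assumed page size in bytes
DELTA_SIZE = 100   # assumed size of an ordinary WAL record in bytes


def wal_bytes(updates, checkpoint_every):
    """Total WAL bytes written for a sequence of updated page numbers.

    The first update of a page after each checkpoint logs a full page
    image in addition to the ordinary record; later updates of the same
    page log only the ordinary record.
    """
    total = 0
    touched = set()
    for i, page_no in enumerate(updates):
        if i % checkpoint_every == 0:
            # Checkpoint: every page needs a fresh full image again.
            touched.clear()
        if page_no not in touched:
            total += PAGE_SIZE
            touched.add(page_no)
        total += DELTA_SIZE
    return total
```

With 1000 updates spread round-robin over 10 pages, one checkpoint interval produces 181920 bytes of WAL in this model, while a checkpoint every 10 updates produces 8292000 bytes. This is the pressure towards infrequent checkpoints described above.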
--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
I disagree with the point made in the article. The article mentions
that ‘prevents the kernel from reordering reads and writes to optimize
performance,’ which might be referring to the file system’s IO
scheduling and merging. However, this can be handled within the
database itself, where IO scheduling and merging can be done even
better.
Regarding ‘does not allow free memory to be used as kernel cache,’ I
believe the database itself should manage memory well, and most of the
memory should be managed by the database rather than handed over to
the operating system. Additionally, the database’s use of the page
cache should be restricted.
--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
On 8/22/24 10:50 AM, 陈宗志 wrote:
I disagree with the point made in the article. The article mentions
that ‘prevents the kernel from reordering reads and writes to optimize
performance,’ which might be referring to the file system’s IO
scheduling and merging. However, this can be handled within the
database itself, where IO scheduling and merging can be done even
better.
The database does not have all the information that the OS has, but that
said, I suspect that the advantages of direct IO outweigh the
disadvantages in this regard. But the only way to know for sure would be
for someone to provide a benchmark.
Regarding ‘does not allow free memory to be used as kernel cache,’ I
believe the database itself should manage memory well, and most of the
memory should be managed by the database rather than handed over to
the operating system. Additionally, the database’s use of the page
cache should be restricted.
That all depends on your use case. If the database is running alone or
almost alone on a machine, direct IO is likely the optimal strategy, but
if more services are running on the same machine (e.g. if you run
PostgreSQL on your personal laptop), you want to use buffered IO.
But as far as I know the long term plan of the async IO project is to
support both direct and buffered IO so people can pick the right choice
for their workload.
Andreas