PERFORMANCE IMPROVEMENT by mapping WAL FILES

Started by Janardhana Reddy, over 24 years ago, 7 messages

jana-reddy@mediaring.com.sg

over 24 years ago

Hi all,
By mapping the WAL files into each backend's address space with the
mmap() system call, there will be big improvements in performance,
for two reasons:
1. Each backend writes directly into the address space obtained by
mapping the WAL files. This saves the write() system call at the
end of every transaction, which transfers 8k of data from user
space to the kernel.
2. A transaction does not necessarily modify all 8k of a WAL page,
so on fsync the kernel transfers only the kernel pages that were
modified. A kernel page is 4k on Linux, so fsync time should be
cut in half.
Any comments?

Regards
jana
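
A minimal sketch of the idea in the message above, written in Python
for brevity (a backend would make the equivalent C calls). The names
create_segment and append_record, the tiny segment size, and the record
contents are assumptions made for illustration; none of them come from
the PostgreSQL source. The record is copied straight into a MAP_SHARED
mapping with no write() call, and only the kernel pages it touches are
passed to msync(), which Python exposes as mmap.flush().

```python
import mmap
import os
import tempfile

WAL_PAGE = 8192               # WAL page size discussed in the thread
SEGMENT_SIZE = 4 * WAL_PAGE   # tiny segment, for illustration only


def create_segment(path, size=SEGMENT_SIZE):
    """Preallocate a zero-filled file; a mapping cannot grow its file."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)


def append_record(mm, offset, record):
    """Copy a record into the mapping and msync only the pages it touches."""
    end = offset + len(record)
    mm[offset:end] = record                    # plain memory copy, no write()
    start = offset - offset % mmap.PAGESIZE    # msync needs a page-aligned start
    mm.flush(start, end - start)               # flush() wraps msync() on Unix
    return end


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal_segment")
    create_segment(path)
    fd = os.open(path, os.O_RDWR)
    try:
        mm = mmap.mmap(fd, SEGMENT_SIZE, flags=mmap.MAP_SHARED)
        pos = append_record(mm, 0, b"BEGIN;")
        pos = append_record(mm, pos, b"COMMIT;")
        mm.close()
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return pos, f.read(pos)
```

Calling demo() returns the end position and the bytes read back from
the file through an ordinary read, which is one way to check that the
mapped writes reached the file.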

Bruce Momjian

pgman@candle.pha.pa.us

over 24 years ago

In reply to: Janardhana Reddy (#1)

Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

This is interesting. We are concerned about using mmap() for all I/O
because we could eat up quite a bit of address space for big tables, but
WAL seems like an ideal use for mmap().

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Tom Lane

tgl@sss.pgh.pa.us

over 24 years ago

In reply to: Janardhana Reddy (#1)

Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

Janardhana Reddy <jana-reddy@mediaring.com.sg> writes:

By mapping the WAL files into each backend's address space with the
mmap() system call,

There are a lot of problems with trying to use mmap for Postgres. One
is portability: not all platforms have mmap, so we'd still have to
support the non-mmap case; and it's not at all clear that fsync/msync
semantics are consistent across platforms, either. A bigger objection
is that mmap'ing a file in one backend does not cause it to become
available to other backends, thus the entire concept of shared buffers
breaks down.

If you think you can make it work, feel free to try it ...

regards, tom lane

Bruce Momjian

pgman@candle.pha.pa.us

over 24 years ago

In reply to: Tom Lane (#3)

Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

OK, I have talked to Tom Lane about this on the phone and we have a few
ideas.

Historically, we have avoided mmap() because of portability problems,
and because using mmap() to write to large tables could consume lots of
address space with little benefit. However, I perhaps can see WAL as
being a good use of mmap.

First, there is the issue of using mmap() at all. On OSes that have
the mmap() MAP_SHARED flag, different backends could mmap the same file
and each see the others' changes. However, keep in mind that we still
have to force WAL to disk, so we would need msync() in place of fsync().
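
The MAP_SHARED behavior described here can be checked with a small
sketch. It is written in Python for brevity, and two independent
mappings inside one process stand in for two backends, which is a
simplification. The helper name shared_mapping and the file name are
made up for this example.

```python
import mmap
import os
import tempfile

SIZE = mmap.PAGESIZE


def shared_mapping(path):
    """Open the file and map it MAP_SHARED, as each backend would on its own."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, SIZE, flags=mmap.MAP_SHARED)
    finally:
        os.close(fd)   # the mapping remains valid after the descriptor is closed


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal_segment")
    with open(path, "wb") as f:
        f.write(b"\0" * SIZE)
    writer = shared_mapping(path)
    reader = shared_mapping(path)
    writer[0:5] = b"hello"        # store through one mapping
    seen = bytes(reader[0:5])     # visible through the other without any sync
    writer.flush()                # msync(): still required for durability
    writer.close()
    reader.close()
    return seen
```

Visibility between mappings and durability on disk are separate
matters: the second mapping sees the bytes immediately, but nothing is
guaranteed to be on disk until the flush() call, which is the msync()
requirement mentioned above.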

So, weighing the benefits of mmap(), we take on the overhead of every
backend having to mmap something that now sits quite comfortably in
shared memory. Now, I can see mmap reducing the copy from user to
kernel, but there are other ways to fix that. We could modify the
write() routines to write() the full 8k on the first write of a WAL
page, and on later writes send only the modified part of the page to
the kernel buffers. The old kernel buffer is probably still around, so
it is unlikely to require a read from the file system to bring in the
rest of the page. This reduces the write from 8k to something probably
less than 4k, which is better than we can do with mmap.
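
The partial write() idea can be sketched as follows, in Python for
brevity. PageWriter and its fields are hypothetical names, not the
PostgreSQL write routines, and the sketch assumes WAL data is only ever
appended within a page, so the modified part is always the new tail.

```python
import os
import tempfile

WAL_PAGE = 8192


class PageWriter:
    """Hypothetical helper: full write() the first time, partial afterwards."""

    def __init__(self, fd):
        self.fd = fd
        self.flushed = {}     # page number -> bytes of that page already sent
        self.bytes_sent = 0   # total bytes copied from user space to the kernel

    def write_page(self, page_no, page, used):
        """page is the full 8k buffer; used is how many bytes hold WAL data."""
        base = page_no * WAL_PAGE
        if page_no not in self.flushed:
            os.pwrite(self.fd, page, base)            # first write: whole page
            self.bytes_sent += len(page)
        else:
            start = self.flushed[page_no]
            os.pwrite(self.fd, page[start:used], base + start)  # new tail only
            self.bytes_sent += used - start
        self.flushed[page_no] = used


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal_segment")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        writer = PageWriter(fd)
        page = bytearray(WAL_PAGE)
        page[0:100] = b"a" * 100          # first transaction's records
        writer.write_page(0, page, 100)
        page[100:150] = b"b" * 50         # second transaction, same page
        writer.write_page(0, page, 150)
        os.fsync(fd)
        on_disk = os.pread(fd, WAL_PAGE, 0)
    finally:
        os.close(fd)
    return writer.bytes_sent, on_disk == bytes(page)
```

In this example the second call hands the kernel 50 bytes instead of
another 8192, while the file contents end up identical to the in-memory
page.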

I will add a TODO item to this effect.

As far as reducing the write to disk from 8k to 4k, if we have to
fsync/msync, we have to wait for the disk to spin to the proper location
and at that point writing 4k or 8k doesn't seem like much of a win.
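
A back-of-the-envelope calculation makes the same point. The drive
figures below (7200 rpm, 20 MB/s sustained transfer) are assumptions
chosen only for illustration, not measurements, and seek time is
ignored.

```python
def sync_cost_ms(nbytes, rpm=7200, transfer_mb_s=20.0):
    """Average rotational latency plus transfer time, in milliseconds."""
    rotational = 0.5 * 60_000.0 / rpm                         # half a revolution
    transfer = nbytes / (transfer_mb_s * 1_000_000) * 1000.0  # time on the wire
    return rotational + transfer


cost_4k = sync_cost_ms(4096)
cost_8k = sync_cost_ms(8192)
```

Under these assumptions the 4k write costs about 4.4 ms and the 8k
write about 4.6 ms, so halving the write size saves under five percent
of the sync time rather than half of it.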

In summary, I think it would be nice to reduce the 8k transfer from user
to kernel on secondary page writes to only the modified part of the
page. I am uncertain if mmap() or anything else will help the physical
write to the disk.

Janardhana Reddy

jana-reddy@mediaring.com.sg

over 24 years ago

In reply to: Bruce Momjian (#4)

Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

I have just completed functional testing of WAL using mmap, and it is
working fine. I tested by commenting out the "CreateCheckPoint"
functionality, so that when I kill postgres and restart, it redoes all
the records from the WAL log file, which is updated using mmap.
I just need to clean up the code and do some stress testing.
By the end of this week I should be able to complete the stress test
and generate the patch file.
As Tom Lane mentioned, I see the problem of portability to all
platforms. What I propose is to use mmap only for WAL, and only on
some platforms such as Linux and FreeBSD. On other platforms we can
keep the existing method, slightly modifying the write() routine to
write only the modified part of the page.

Regards
jana
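
One way the per-platform choice proposed above could be structured,
sketched in Python for brevity. MmapWal, WriteWal, open_wal and
HAVE_MMAP are hypothetical names; the actual patch is not shown in this
thread, and a C implementation would more likely make the choice at
configure time.

```python
import mmap
import os
import tempfile

SEGMENT_SIZE = 8192
HAVE_MMAP = hasattr(mmap, "MAP_SHARED")   # stand-in for a configure-time check


class MmapWal:
    """WAL segment accessed through a shared mapping."""

    def __init__(self, path, size):
        fd = os.open(path, os.O_RDWR)
        try:
            self.mm = mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
        finally:
            os.close(fd)

    def write(self, offset, data):
        self.mm[offset:offset + len(data)] = data

    def sync(self):
        self.mm.flush()          # msync()

    def close(self):
        self.mm.close()


class WriteWal:
    """Fallback: the same interface on top of pwrite() and fsync()."""

    def __init__(self, path, size):
        self.fd = os.open(path, os.O_RDWR)

    def write(self, offset, data):
        os.pwrite(self.fd, data, offset)

    def sync(self):
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)


def open_wal(path, size, use_mmap=HAVE_MMAP):
    return (MmapWal if use_mmap else WriteWal)(path, size)


def demo(use_mmap):
    path = os.path.join(tempfile.mkdtemp(), "wal_segment")
    with open(path, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)
    wal = open_wal(path, SEGMENT_SIZE, use_mmap)
    wal.write(0, b"record-1")
    wal.sync()
    wal.close()
    with open(path, "rb") as f:
        return f.read(8)
```

Both demo(True) and demo(False) leave the same bytes in the file, so
callers do not need to know which path was selected.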

Bruce Momjian

pgman@candle.pha.pa.us

over 24 years ago

In reply to: Janardhana Reddy (#5)

Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

Sounds good. Keep us posted. This will probably not make it into 7.2
but can be added to 7.3. We can perhaps conditionally use your code in
place of what is there. I have also looked at reducing the write() size
for WAL secondary writes. That will have to wait for 7.3 too because we
are so near beta.

Bruce Momjian

pgman@candle.pha.pa.us

over 24 years ago

In reply to: Janardhana Reddy (#5)

Re: PERFORMANCE IMPROVEMENT by mapping WAL FILES

I have added this to TODO.detail/mmap.
