New Linux xfs/reiser file systems
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.
With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [010502 14:01] wrote:
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.
With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
The "problem" with log based filesystems is that they most likely
do not know the consequences of a write so an fsync on a file may
require double writing to both the log and the "real" portion of
the disk. They can also exhibit the problem that an fsync may
cause all pending writes to require scheduling unless the log is
constructed on the fly rather than incrementally.
There was also the problem that was brought up recently that
certain versions (maybe all?) of Linux perform fsync() in a very
non-optimal manner; if the user is able to use the O_FSYNC option
rather than fsync() he may see a performance increase.
But his guess is probably nearly as good as mine. :)
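To make the distinction concrete, here is a minimal C sketch of the two
approaches; the O_SYNC flag (O_FSYNC is the BSD spelling) and whether it
actually helps on any particular kernel are assumptions, not a measured claim:

    #include <fcntl.h>
    #include <unistd.h>

    /* write, then force the data out explicitly */
    void flush_after_write(int fd, const char *buf, size_t len)
    {
        write(fd, buf, len);   /* data may linger in the OS cache... */
        fsync(fd);             /* ...until it is explicitly flushed */
    }

    /* open so every write() is synchronous; no separate fsync() needed */
    int open_synchronous(const char *path)
    {
        return open(path, O_WRONLY | O_SYNC);
    }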
--
-Alfred Perlstein - [alfred@freebsd.org]
http://www.egr.unlv.edu/~slumos/on-netbsd.html
The "problem" with log based filesystems is that they most likely
do not know the consequences of a write so an fsync on a file may
require double writing to both the log and the "real" portion of
the disk. They can also exhibit the problem that an fsync may
cause all pending writes to require scheduling unless the log is
constructed on the fly rather than incrementally.
Yes, this double-writing is a problem. Suppose you have your WAL on a
separate drive. You can fsync() WAL with zero head movement. With a
log based file system, you need two head movements, so you have gone
from zero movements to two.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [010502 15:20] wrote:
The "problem" with log based filesystems is that they most likely
do not know the consequences of a write so an fsync on a file may
require double writing to both the log and the "real" portion of
the disk. They can also exhibit the problem that an fsync may
cause all pending writes to require scheduling unless the log is
constructed on the fly rather than incrementally.
Yes, this double-writing is a problem. Suppose you have your WAL on a
separate drive. You can fsync() WAL with zero head movement. With a
log based file system, you need two head movements, so you have gone
from zero movements to two.
It may be worse depending on how the filesystem actually does
journalling. I wonder if an fsync() may cause ALL pending
meta-data to be updated (even metadata not related to the
postgresql files).
Do you know if reiser or xfs have this problem?
--
-Alfred Perlstein - [alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Yes, this double-writing is a problem. Suppose you have your WAL on a
separate drive. You can fsync() WAL with zero head movement. With a
log based file system, you need two head movements, so you have gone
from zero movements to two.
It may be worse depending on how the filesystem actually does
journalling. I wonder if an fsync() may cause ALL pending
meta-data to be updated (even metadata not related to the
postgresql files).
Do you know if reiser or xfs have this problem?
I don't know, but the Linux user reported xfs was really slow.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.
With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
I did see poor performance on reiserfs; I have not as yet ventured into using
xfs.
It occurs to me that journalizing file systems will almost always be slower on
an application such as postgres. The journalizing file system is trying to
maintain data integrity for an application which is also trying to maintain
data integrity. There will always be extra work involved.
This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on the
raw blocks, bypassing the file system altogether.
On one hand, Postgres is easy to use and maintain because it cooperates with
the native file system, on the other hand it incurs the overhead of whatever
silliness the file system wants to do.
I would bet it is a huge amount of work to use a "table space" system and no
one wants that. lol. However, it should be noted that a bit more control over
database layout would make some great performance improvements.
The ability to put indexes on a separate volume from data.
The ability to put different tables on different volumes.
And so on.
In the short term, I think poor performance on a journalizing file system is to
be expected, unless there is an IOCTL to tell the FS to leave the files alone
(and postgres calls it). A Linux HOWTO which informs people that certain file
systems will have performance issues, and why, should handle the problem.
Perhaps we can convince the Linux community to create a "dbfs" which is a
stripped down simple no nonsense file system designed for applications like
databases?
--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com
On Thu, 3 May 2001, mlw wrote:
I would bet it is a huge amount of work to use a "table space" system
and no one wants that.
From some stracing of 7.1, the most common syscall issued by
postgres is an lseek() to the end of the file, presumably to
find its length, which seems to happen up to about a dozen
times per (pgbench) transaction.
Tablespaces would solve this (not that lseek is a particularly
expensive operation, of course).
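For reference, a minimal sketch of what that length check amounts to at the
syscall level (the helper name is hypothetical, not the actual backend code):

    #include <sys/types.h>
    #include <unistd.h>

    /* seek to the end and take the returned offset as the file size */
    off_t current_file_length(int fd)
    {
        return lseek(fd, 0, SEEK_END);
    }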
Perhaps we can convince the Linux community to create a "dbfs" which
is a stripped down simple no nonsense file system designed for
applications like databases?
Sync-metadata ext2 should be fine. Filesystems fsck pretty
quick when they contain only a few large files.
Otherwise, something like "smugfs" (now obsolete) might do.
Matthew.
Matthew Kirkwood <matthew@hairy.beasts.org> writes:
From some stracing of 7.1, the most common syscall issued by
postgres is an lseek() to the end of the file, presumably to
find its length, which seems to happen up to about a dozen
times per (pgbench) transaction.
Tablespaces would solve this (not that lseek is a particularly
expensive operation, of course).
No, they wouldn't; or at least they'd just create a different problem.
The reason for the lseek is that the file length may have changed since
the current backend last checked it. To avoid lseek we'd need some
shared data structure that maintains the current length of every active
table, which would be a nuisance to maintain and probably a source of
contention delays.
(Of course, such a data structure would just be the tip of the iceberg
of what we'd have to maintain for ourselves if we couldn't depend on the
kernel to do it for us. Reimplementing a filesystem doesn't strike me
as a profitable use of our time.)
regards, tom lane
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
I did see poor performance on reiserfs, I have not as yet ventured into using
xfs.
It occurs to me that journalizing file systems will almost always be slower on
an application such as postgres. The journalizing file system is trying to
maintain data integrity for an application which is also trying to maintain
data integrity. There will always be extra work involved.
Yes, the problem is that extra work is required on PostgreSQL's part.
Log-based file systems make sure all the changes get onto the disk in an
orderly way, but I believe it can delay what gets written to the drive.
PostgreSQL wants to be sure all the data is on the disk, period.
Unfortunately, the _orderly_ part makes the _fsync_ part do more work.
By going from ext2 to a log-based file system, we are getting _farther_
from a raw device than if we just stayed with ext2.
ext2 has serious problems with corrupt file systems after a crash, so I
understand the need to move to another file system type. I have been
waiting for Linux to get a more modern file system. Unfortunately, the
new ones seem to be worse for PostgreSQL.
This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on the
raw blocks, bypassing the file system altogether.
OK, we have considered this, but frankly, the new, modern file systems
like FFS/softupdates have i/o rates near raw speed, with all the
advantages a file system gives us. I believe most commercial dbs are
moving away from raw devices and toward file systems. In the old days
the SysV file system was pretty bad at i/o & fragmentation, so they used
raw devices.
The ability to put indexes on a separate volume from data.
The ability to put different tables on different volumes.
And so on.
We certainly need that, but raw devices would not make this any easier,
I think.
In the short term, I think poor performance on a journalizing file system is to
be expected, unless there is an IOCTL to tell the FS to leave the files alone
(and postgres calls it). A Linux HOWTO which informs people that certain file
systems will have performance issues and why should handle the problem.
Perhaps we can convince the Linux community to create a "dbfs" which is a
stripped down simple no nonsense file system designed for applications like
databases?
It could become a serious problem as people start using reiser/xfs for
their file systems and don't understand the performance problems. Even
more likely is that they will turn off fsync, thinking reiser doesn't
need it, when in fact, I think it does.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Matthew Kirkwood <matthew@hairy.beasts.org> writes:
From some stracing of 7.1, the most common syscall issued by
postgres is an lseek() to the end of the file, presumably to
find its length, which seems to happen up to about a dozen
times per (pgbench) transaction.
Tablespaces would solve this (not that lseek is a particularly
expensive operation, of course).
No, they wouldn't; or at least they'd just create a different problem.
The reason for the lseek is that the file length may have changed since
the current backend last checked it. To avoid lseek we'd need some
shared data structure that maintains the current length of every active
table, which would be a nuisance to maintain and probably a source of
contention delays.
Seems we should cache the file lengths somehow. Not sure how to do it
because our file system cache is local to each backend.
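Purely as a hypothetical sketch of the structure Tom describes (none of these
names exist in the backend), it would be something like the following in shared
memory, and every file extension or truncation by any backend would have to
update it under a lock, which is where the contention comes from:

    /* illustrative only -- not actual backend code */
    typedef struct RelLengthEntry
    {
        unsigned int relid;      /* which relation */
        long         nblocks;    /* cached length in blocks */
    } RelLengthEntry;

    typedef struct RelLengthCache
    {
        /* some lock would guard this whole table */
        int            nentries;
        RelLengthEntry entries[1];   /* variable-length array in shared memory */
    } RelLengthCache;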
(Of course, such a data structure would just be the tip of the iceberg
of what we'd have to maintain for ourselves if we couldn't depend on the
kernel to do it for us. Reimplementing a filesystem doesn't strike me
as a profitable use of our time.)
Ditto. The database is complicated enough.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on the
raw blocks, bypassing the file system altogether.
OK, we have considered this, but frankly, the new, modern file systems
like FFS/softupdates have i/o rates near raw speed, with all the
advantages a file system gives us. I believe most commercial dbs are
moving away from raw devices and toward file systems. In the old days
the SysV file system was pretty bad at i/o & fragmentation, so they used
raw devices.
I'm starting to like the idea of raw FS for a few reasons:
1) Considering that postgresql now does WAL, the need for a logging FS
for the database doesn't seem as needed (is it needed at all?).
2) Given the fact that postgresql is trying to support many OSs,
depending on, for example, XFS on a linux system will cause many
problems. What about Solaris? How about BSD? Etc. Using raw devices MAY be
easier than dealing with the problems that will arise from supporting
multiple filesystems.
That said, the ability to use the system's FS does have its advantages
(backup, moving files, etc).
Just some thoughts..
- Brandon
b. palmer, bpalmer@crimelabs.net
pgp: www.crimelabs.net/bpalmer.pgp5
kernel to do it for us. Reimplementing a filesystem doesn't strike me
as a profitable use of our time.)
Ditto. The database is complicated enough.
Maybe some kind of recommendation would be a good thing. That is, if the
PostgreSQL community has enough knowledge.
A section in the docs that discusses various file systems, so people can make
an intelligent choice.
--
Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
Kaki Data tshirts, merchandize Fax: 3816 2501
Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk
2000 Frederiksberg Lørdag 11.00-17.00 Email: kar@webline.dk
On Thu, 3 May 2001, mlw wrote:
This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on the
raw blocks, bypassing the file system altogether.
On one hand, Postgres is easy to use and maintain because it cooperates with
the native file system, on the other hand it incurs the overhead of whatever
silliness the file system wants to do.
It is not *that* hard to write a 'postgresfs' but you have to look at
the problems it creates. One of the biggest problems facing sys admins of
large sites is that the Oracle/DB2/etc DBA, having created the
purpose-built database filesystem, has not allowed enough room for
growth. Like I said, a basic file system is not difficult, but volume
management tools and the maintenance of the whole thing is. Currently,
postgres administrators are not faced with such a problem.
There is, of course, the argument that pgfs need not be enforced. The
problem is that many people would probably use it so as to have a
'superior' installation. This then entails the problems above, creating
more work for core developers.
Gavin
Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...
Chris
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of mlw
Sent: Thursday, 3 May 2001 8:09 PM
To: Bruce Momjian; Hackers List
Subject: [HACKERS] Re: New Linux xfs/reiser file systems
Bruce Momjian wrote:
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.
With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
I did see poor performance on reiserfs, I have not as yet ventured into
using
xfs.
It occurs to me that journalizing file systems will almost always be slower
on
an application such as postgres. The journalizing file system is trying to
maintain data integrity for an application which is also trying to maintain
data integrity. There will always be extra work involved.
This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on
the
raw blocks, bypassing the file system altogether.
On one hand, Postgres is easy to use and maintain because it cooperates with
the native file system, on the other hand it incurs the overhead of whatever
silliness the file system wants to do.
I would bet it is a huge amount of work to use a "table space" system and no
one wants that. lol. However, it should be noted that a bit more control
over
database layout would make some great performance improvements.
The ability to put indexes on a separate volume from data.
The ability to put different tables on different volumes.
And so on.
In the short term, I think poor performance on a journalizing file system is
to
be expected, unless there is an IOCTL to tell the FS to leave the files
alone
(and postgres calls it). A Linux HOWTO which informs people that certain
file
systems will have performance issues and why should handle the problem.
Perhaps we can convince the Linux community to create a "dbfs" which is a
stripped down simple no nonsense file system designed for applications like
databases?
--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com
There might be a problem, but if no one mentions it to the maintainers of those
fs's, it will not get fixed...
Regards
John
Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Well, arguably if you're setting up a database server then a reasonable DBA
should think about such things...
(My 2c)
Chris
-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Friday, 4 May 2001 9:42 AM
To: Christopher Kings-Lynne
Cc: mlw; Hackers List
Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems
Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Well, arguably if you're setting up a database server then a reasonable DBA
should think about such things...
Yes, but people have trouble installing PostgreSQL. I can't imagine
walking them through a newfs.
(My 2c)
Chris
-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Friday, 4 May 2001 9:42 AM
To: Christopher Kings-Lynne
Cc: mlw; Hackers List
Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems
Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
I have looked at Reiser, and I don't think it is a file system suited for very
large files, or applications such as postgres. The Linux crowd should lobby
against any such trend. It is ok for many moderately small files. ReiserFS
would be great for a cddb server, but poor for a database box.
XFS is a real big file system project, I'd bet that there are file properties
or management tools to tell it to leave directories and files alone. They
should have addressed that years ago.
One last mention..
Having better control over WHERE various files in a database are located can
make it easier to deal with these things.
Just a thought. ;-)
--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com
Just put a note in the installation docs that the place where the
database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now.
I still think this is not necessarily the right approach either. One
major purpose of using a journaling fs is for fast boot up time after
crash. If you have a 100 GB database you may wish to have the data
on XFS. I do think that the WAL log should be on a separate disk and
on a non-journaling fs for performance.
Best Regards,
Carl Garland
mlw wrote:
Bruce Momjian wrote:
Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
I have looked at Reiser, and I don't think it is a file system suited for very
large files, or applications such as postgres. The Linux crowd should lobby
against any such trend. It is ok for many moderately small files. ReiserFS
would be great for a cddb server, but poor for a database box.
XFS is a real big file system project, I'd bet that there are file properties
or management tools to tell it to leave directories and files alone. They
should have addressed that years ago.
One last mention..
Having better control over WHERE various files in a database are located can
make it easier to deal with these things.
I think it's worth noting that Oracle has been petitioning the kernel
developers for better raw device support: in other words, the ability to
write directly to the hard disk, bypassing the filesystem altogether.
If the db is going to assume the responsibility of disk write
verification it seems reasonable to assume you might want to investigate
the raw disk i/o options.
Telling your installers that a major performance gain is attainable by
doing so might be a start in the opposite direction. I've monitored a
lot of discussions and from what I can gather, postgresql does its own
set of journaling operations. I don't think that it's necessary for
writes to be double journalled anyway.
Again, just my two cents worth...
Here is a radical idea...
What is it that is causing Postgres trouble? It is the file system's attempts
to maintain some integrity. So I proposed a simple "dbfs" sort of thing which
was the most basic sort of file system possible.
I'm not sure, but I think we can test this hypothesis on the FAT32 file system
on Linux. As far as I know, FAT32 (FAT in general) is a very simple file system
and does very little during operation, except read and write the files and
manage what's been allocated. Plus, the allocation table is very simple in
comparison all the other file systems.
Would pgbench run on a system using ext2, Reiser, then FAT32 be sufficient to
get a feeling for the type of performance Postgres would get, or am I just off
the wall?
If this idea has some merit, what would be the best way to test it? Move the
pg_xlog directory first, then try base? What's the best methodology to try?
carl garland wrote:
Just put a note in the installation docs that the place where the
database
is initialised to should be on a non-Reiser, non-XFS mount...
Sure, we can do that now.
I still think this is not necessarily the right approach either. One
major purpose of using a journaling fs is for fast boot up time after
crash. If you have a 100 GB database you may wish to have the data
on XFS. I do think that the WAL log should be on a separate disk and
on a non-journaling fs for performance.
Best Regards,
Carl Garland
--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com
On Thu, May 03, 2001 at 11:41:24AM -0400, Bruce Momjian wrote:
ext2 has serious problems with corrupt file systems after a crash, so I
understand the need to move to another file system type. I have been
waiting for Linux to get a more modern file system. Unfortunately, the
new ones seem to be worse for PostgreSQL.
If you fsync() a directory in Linux, all the metadata within that directory
will be written out to disk.
As for filesystem corruption, I can say that e2fsck is among the best fsck
programs out there, and I've only ever had 1 occasion where I've lost any
data on an ext2 filesystem, and that was due to bad sectors causing me to
lose the root directory. (Well, apart from human errors, but that doesn't
count)
OK, we have considered this, but frankly, the new, modern file systems
like FFS/softupdates have i/o rates near raw speed, with all the
advantages a file system gives us. I believe most commercial dbs are
moving away from raw devices and toward file systems. In the old days
the SysV file system was pretty bad at i/o & fragmentation, so they used
raw devices.
And Solaris' 1/01 media has better support for O_DIRECT (?), which they claim
gives you 93% of the speed of a raw device. (Or something like that; I read
this in marketing material a couple of months ago)
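For what it's worth, a minimal sketch of what direct I/O looks like at the
syscall level; whether a given kernel and filesystem honor the flag, and the
exact alignment it demands of buffers and transfer sizes, are assumptions here:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* open a data file for direct I/O, bypassing the kernel page cache;
     * the caller is then responsible for block-aligned buffers and sizes */
    int open_direct(const char *path)
    {
        return open(path, O_RDWR | O_DIRECT);
    }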
Raw devices are designed to have filesystems on them. The only excuses for
userland tools accessing them are fs-specific tools (e.g. dump, fsck, etc.),
or for non-unix filesystem tools, where the unix VFS doesn't handle things
properly (hfstools).
The ability to put indexes on a separate volume from data.
The ability to put different tables on different volumes.
And so on.
We certainly need that, but raw devices would not make this any easier,
I think.
It would be cool if either at compile time or at database creation time, we
could specify a printf-like format for placing tables, indexes, etc.
It could become a serious problem as people start using reiser/xfs for
their file systems and don't understand the performance problems. Even
more likely is that they will turn off fsync, thinking reiser doesn't
need it, when in fact, I think it does.
ReiserFS only supports metadata logging. The performance slowdown must be
due to logging things like mtime or atime, because otherwise ReiserFS is a
very high performance FS. (Although, I admittedly haven't used it since it
was early in its development)
--
Michael Samuel <michael@miknet.net>
Michael Samuel wrote:
ReiserFS only supports metadata logging. The performance slowdown must be
due to logging things like mtime or atime, because otherwise ReiserFS is a
very high performance FS. (Although, I admittedly haven't used it since it
was early in it's development)
The way I understand it is that ReiserFS does not attempt to separate files at
the block level. Multiple files can live in the same disk block. This is cool
if you have many small files, but the extra overhead for large files such as
those used by a database, is a bit much.
I read some stuff about a year ago, and my impressions forced me to conclude
that ReiserFS was geared toward applications, which is a pretty good thing for
applications, but not for databases.
I really think a simple low down dirty file system is just what the doctor
ordered for postgres.
Remember, general purpose file systems must do for files what Postgres is
already doing for records. You will always have extra work. I am seriously
thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
or if there is just something fundamentally stupid about FAT32 that will make
it worse?
--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com
Before we get too involved in speculating, shouldn't we actually measure the
performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync,
we can test whether that's the problem. I don't think that logging file
systems must intrinsically give bad performance on fsync since they only log
metadata changes.
I don't have a machine with XFS installed and it will be at least a week
before I could get around to a build. Any volunteers?
Ken Hirsch
mlw <markw@mohawksoft.com> writes:
I have looked at Reiser, and I don't think it is a file system suited for very
large files, or applications such as postgres.
What's the problem with big files? ReiserFS v2 doesn't seem to support
it, while v3 seems just fine (in terms of the on-disk format).
That said, I'm certainly looking forward to xfs - I believe it will be
the most widely used of the current batch of journaling file systems
(reiserfs, jfs, XFS and ext3, the latter mainly focusing on an easy
migration path for existing systems)
--
Trond Eivind Glomsrød
Red Hat, Inc.
On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote:
The way I understand it is that ReiserFS does not attempt to separate files at
the block level. Multiple files can live in the same disk block. This is cool
if you have many small files, but the extra overhead for large files such as
those used by a database, is a bit much.
It should be at least as fast as other filesystems for large files. I suspect
that it would be faster in fact. The only catch is that the performance of
reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems)
You can read about all this stuff at http://www.namesys.com/
I really think a simple low down dirty file system is just what the doctor
ordered for postgres.
Traditional BSD FFS or Solaris UFS is probably the best bet for postgres.
Remember, general purpose file systems must do for files what Postgres is
already doing for records. You will always have extra work. I am seriously
thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
or if there is just something fundamentally stupid about FAT32 that will make
it worse?
Well, for starters, file permissions...
Ext2 would kick arse over FAT32 for performance.
--
Michael Samuel <michael@miknet.net>
"Bruce" == Bruce Momjian <pgman@candle.pha.pa.us> writes:
Well, arguably if you're setting up a database server then a
reasonable DBA should think about such things...
Bruce> Yes, but people have trouble installing PostgreSQL. I
Bruce> can't imagine walking them through a newfs.
In most of linux-land, the DBA is probably also the sysadmin. In
bigger shops, and those which currently run, say Oracle or Sybase, the
two roles are separate. When they are separate, you don't have to
walk the DBA through it; he just walks over to the sysadmin and says
"I need X megabytes of space on a new Y filesystem."
roland
--
PGP Key ID: 66 BC 3B CD
Roland B. Roberts, PhD RL Enterprises
roland@rlenter.com 76-15 113th Street, Apt 3B
rbroberts@acm.org Forest Hills, NY 11375
I got some information from Stephen Tweedie on this - please keep him
"Cc:" as he's not on this list
************************************************************************
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.
Performance doing what? XFS has known performance problems doing
unlinks and truncates, but not synchronous IO. The user should be
using fdatasync() for databases, btw, not fsync().
First, XFS, ext3 and reiserfs are *NOT* log-based filesystems. They
are journaling filesystems. They have a log, but they are not
log-based because they do not store data permanently in a log
structure. Berkeley LFS, Sprite and Spiralog are log-based
filesystems.
With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?
Not true on ext2 or BSD. Write-aheads are _usually_ close to the
inode, but not always. For true log-based filesystems, writes are
always completely sequential, so the issue just goes away. For
journaling filesystems, depending on the setup there may be a seek to
the journal involved, but some journaling filesystems can use a
separate disk for the journal so no seek is required.
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
A database normally preallocates its data files and then performs most
of its writes using update-in-place. In such cases, fsync() is almost
always the wrong thing to be doing --- the data writes have changed
nothing in the inode except for the timestamps, and there's no need to
flush the timestamps to disk for every write. fdatasync() is
designed for this --- if the only inode change is timestamps,
fdatasync() will skip the seek to the inode and will only update the
data. If any significant inode fields have been changed, then a full
flush is done.
Using fdatasync, most filesystems will incur no seeks for data flush,
regardless of whether the filesystem is journaling or not.
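A toy illustration of the difference, assuming a preallocated segment and
update-in-place writes (illustrative helper name, error handling mostly
omitted):

    #include <unistd.h>

    /* after an in-place write, fdatasync() pushes just the data blocks;
     * fsync() may also have to rewrite the inode for a timestamp change */
    int flush_wal_block(int fd, const void *page, size_t len)
    {
        if (write(fd, page, len) != (ssize_t) len)
            return -1;
        return fdatasync(fd);
    }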
Cheers,
Stephen
************************************************************************
--
Trond Eivind Glomsrød
Red Hat, Inc.
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
What is a 'default file system'? I know that until now, everybody is using
ext2. But that's only because there hasn't been anything comparable. Now we
see ReiserFS, and my SuSE installation offers the choice. In the future, I
believe that people can choose from ext2, ReiserFS, xfs, ext3 and maybe more.
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
No, I don't think it's a big hurdle. If you just want to play with
PostgreSQL, you won't care. If you're serious, you'll repartition.
--
Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
Kaki Data tshirts, merchandize Fax: 3816 2501
Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk
2000 Frederiksberg Lørdag 11.00-17.00 Email: kar@webline.dk
Before we get too involved in speculating, shouldn't we actually measure the
performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync,
we can test whether that's the problem. I don't think that logging file
systems must intrinsically give bad performance on fsync since they only log
metadata changes.
I don't have a machine with XFS installed and it will be at least a week
before I could get around to a build. Any volunteers?
There have been multiple reports of poor PostgreSQL performance on
Reiser and xfs. I don't have numbers, though. Frankly, I think we need
xfs and reiser experts involved to figure out our options here.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote:
The way I understand it is that ReiserFS does not attempt to separate files at
the block level. Multiple files can live in the same disk block. This is cool
if you have many small files, but the extra overhead for large files such as
those used by a database, is a bit much.
It should be at least as fast as other filesystems for large files. I suspect
that it would be faster in fact. The only catch is that the performance of
reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems)
That is pretty standard for most modern file systems. They need that
free space to optimize.
You can read about all this stuff at http://www.namesys.com/
I really think a simple low down dirty file system is just what the doctor
ordered for postgres.
Traditional BSD FFS or Solaris UFS is probably the best bet for postgres.
That is my opinion. BSD FFS seems to be general enough to give good
performance for a large scale of application needs. It is not as fast
as XFS for streaming large files (media), and it doesn't optimize small
files below the 1k size (fragments), and it does require fsck on reboot.
However, looking at all those for PostgreSQL, the costs of the new Linux
file systems seem pretty high, especially considering our need for
fsync().
What I am really concerned about is when xfs/reiser become the default
file systems for Linux, and people complain about PostgreSQL
performance. And if we require special file systems, we lose some of
our ability to easily grow. Because of ext2's problems with crash
recovery, who is going to want to put other data on that file system
when they have xfs/reiser available? And boots are going to have to
fsck that ext2 file system.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
What is a 'default file system'? I know that until now, everybody is using
ext2. But that's only because there hasn't been anything comparable. Now we
se ReiserFS, and my SuSE installation offers the choice. In the future, I
believe that people can choose from ext2, ReiserFS,xfs, ext3 and maybe more.
But some day the default will be a log-based file system, and people
will have to hunt around to create a non-log based one.
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.
No, I don't think it's a big hurdle. If you just want to play with
PostgreSQL, you won't care. If you're serious, you'll repartition.
Yes, but we could get a reputation for slowness on these log-based file
systems.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I got some information from Stephen Tweedie on this - please keep him
"Cc:" as he's not on this list
************************************************************************
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.
Performance doing what? XFS has known performance problems doing
unlinks and truncates, but not synchronous IO. The user should be
using fdatasync() for databases, btw, not fsync().
This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by
default if it is available on a platform.
First, XFS, ext3 and reiserfs are *NOT* log-based filesystems. They
are journaling filesystems. They have a log, but they are not
log-based because they do not store data permanently in a log
structure. Berkeley LFS, Sprite and Spiralog are log-based
filesystems.
Sorry, I get those mixed up.
With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?
Not true on ext2 or BSD. Write-aheads are _usually_ close to the
inode, but not always. For true log-based filesystems, writes are
always completely sequential, so the issue just goes away. For
journaling filesystems, depending on the setup there may be a seek to
the journal involved, but some journaling filesystems can use a
separate disk for the journal so no seek is required.
I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
A database normally preallocates its data files and then performs most
of its writes using update-in-place. In such cases, fsync() is almost
always the wrong thing to be doing --- the data writes have changed
nothing in the inode except for the timestamps, and there's no need to
flush the timestamps to disk for every write. fdatasync() is
designed for this --- if the only inode change is timestamps,
fdatasync() will skip the seek to the inode and will only update the
data. If any significant inode fields have been changed, then a full
flush is done.
We do pre-allocate our log file space in chunks to avoid inode/block
index writes.
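Roughly speaking, the preallocation amounts to something like the sketch below;
the 16MB segment size and the helper name are assumptions for illustration, not
the actual backend code:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define LOG_SEG_SIZE (16 * 1024 * 1024)   /* assumed segment size */

    /* fill a new log segment with zeros so later WAL writes are pure
     * update-in-place and never have to extend the file */
    int preallocate_segment(const char *path)
    {
        char zeros[8192];
        long done;
        int  fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return -1;
        memset(zeros, 0, sizeof(zeros));
        for (done = 0; done < LOG_SEG_SIZE; done += sizeof(zeros))
            write(fd, zeros, sizeof(zeros));
        fsync(fd);    /* make the allocation itself durable */
        return fd;
    }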
Using fdatasync, most filesystems will incur no seeks for data flush,
regardless of whether the filesystem is journaling or not.
Thanks. That is a big help. I wonder if people reporting performance
problems were using 7.0.3. We only added fdatasync() in 7.1.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Michael Samuel wrote:
Remember, general purpose file systems must do for files what Postgres is
already doing for records. You will always have extra work. I am seriously
thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
or if there is just something fundamentally stupid about FAT32 that will make
it worse?
Well, for starters, file permissions...
Ext2 would kick arse over FAT32 for performance.
OK, I'll bite.
In a database environment where file creation is not such an issue, why would ext2
be faster?
The FAT file system has, AFAIK, very little overhead for file writes. It simply
writes the two FAT tables on file extension, and data. Depending on cluster size,
there is probably even less happening there.
I don't think that anyone is saying that FAT is the answer in a production
environment, but maybe we can do a comparison of various file systems and see if any
performance issues show up.
I mentioned FAT only because I was thinking about how postgres would perform on a
very simple file system, one which bypasses most of the normal stuff a "good"
general purpose file system would do. While I was thinking this, it occurred to me
that FAT was about the cheesiest simple file system one could find, short of a ram
disk, and maybe we could use it to test the assumptions about performance impact of
the file system on postgres.
Just a thought. If you know of some reason why ext2 would perform better in the
postgres environment, I would love to hear why, I'm very curious.
Hi,
On Fri, May 04, 2001 at 01:49:54PM -0400, Bruce Momjian wrote:
Performance doing what? XFS has known performance problems doing
unlinks and truncates, but not synchronous IO. The user should be
using fdatasync() for databases, btw, not fsync().
This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by
default if it is available on a platform.
Good --- fdatasync is defined in SingleUnix, so it's probably safe to
probe for it and use it by default if it is there.
The 2.2 Linux kernel does not have fdatasync implemented, but glibc
will fall back to fsync if that's all that the kernel supports. 2.4
implements both with the required semantics.
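So a portable flush routine can probe for it at build time, along these lines
(the HAVE_FDATASYNC symbol is an illustrative configure-style guess, not
necessarily the exact one PostgreSQL defines):

    #include <unistd.h>

    /* use fdatasync() where the platform provides it, else fall back */
    static int flush_data(int fd)
    {
    #ifdef HAVE_FDATASYNC
        return fdatasync(fd);
    #else
        return fsync(fd);
    #endif
    }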
--Stephen
Before we get too involved in speculating, shouldn't we actually measure
the
performance of 7.1 on XFS and Reiserfs? Since it's easy to disable
fsync,
we can test whether that's the problem. I don't think that logging file
systems must intrinsically give bad performance on fsync since they only
log
metadata changes.
I don't have a machine with XFS installed and it will be at least a week
before I could get around to a build. Any volunteers?
There have been multiple reports of poor PostgreSQL performance on
Reiser and xfs. I don't have numbers, though. Frankly, I think we need
xfs and reiser experts involved to figure out our options here.
I've done some testing to see how Reiserfs performs
vs ext2, and also for various values of wal_sync_method while on a
reiserfs partition. The attached graph shows the results. The y axis is
transactions per second and the x axis is the transaction number. It was
clear that, at least for my specific app, ext2 was significantly faster.
The hardware I tested on has an Athlon 1 GHz CPU and 512 MB RAM. The
harddrive is a 2 year old IDE drive. I'm running Red Hat 7 with all the
latest updates, and a freshly compiled 2.4.2 kernel with the latest Reiserfs
patch, and of course PostgreSQL 7.1. The transactions were run in a loop,
700 times per test, to insert sample data into 4 tables. I used a PHP script
running on the same machine to do the inserts.
I'd be happy to provide more detail or try a different variation if anyone
is interested.
- Joe
Attachments:
fs_perf_diff.jpg (image/jpeg)