fsync, ext2 on Linux

Started by Heikki Linnakangas about 21 years ago, 7 messages
#1Heikki Linnakangas
hlinnaka@iki.fi

The Linux fsync man page says:

"It does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync on
the file descriptor of the directory is also needed."

AFAIK, we don't care about it at the moment. The actual behaviour depends
on the filesystem: reiserfs and other journaling filesystems probably
don't need the explicit fsync on the parent directory, but at least ext2
does.

I've experimented with a user-mode-linux installation, crashing it at
specific points. It seems that on ext2, it's possible to get the database
into an inconsistent state.

Especially:

1. start transaction
2. do a lot of updates, so that a new xlog file is created
3. commit
4. crash

Sometimes the creation of the new xlog file is lost, losing the already
committed transaction.

I also got into this situation after one crash test:

template1=# SELECT * FROM foo;
ERROR: could not access status of transaction 1768515945
DETAIL: could not open file "/home/hlinnaka/pgsql/data_broken/pg_clog/0696": No such file or directory

I haven't tried to debug it more deeply.

Should we fix this by fsyncing the parent directory of new files? We could
also declare ext2 broken, but there could be others.
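
To make the proposed fix concrete, here is a minimal sketch in C of creating a file and then fsyncing both the file and its parent directory. The helper names fsync_parent_dir and create_durable are hypothetical and are not taken from the PostgreSQL source. The sketch assumes a Linux-like system where a directory can be opened read-only with open() and the resulting descriptor passed to fsync().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync the directory that contains `path`,
 * so that the directory entry for `path` reaches disk. */
static int fsync_parent_dir(const char *path)
{
    char buf[4096];

    if (strlen(path) >= sizeof(buf)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    strcpy(buf, path);              /* dirname() may modify its argument */
    const char *dir = dirname(buf);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd);
    int saved = errno;              /* keep fsync's errno across close() */
    close(dfd);
    errno = saved;
    return rc;
}

/* Hypothetical helper: create a new file, write `len` bytes, and make
 * both the data and the directory entry durable before returning 0. */
static int create_durable(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    if (fsync(fd) != 0) {           /* flush the file's data and inode */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    return fsync_parent_dir(path);  /* flush the new directory entry */
}
```

The point of the ordering is that success is reported only after both fsync calls have returned, which is what the man page quoted above seems to require on ext2.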

- Heikki

#2Oliver Jowett
oliver@opencloud.com
In reply to: Heikki Linnakangas (#1)
Re: fsync, ext2 on Linux

Heikki Linnakangas wrote:

The Linux fsync man page says:

"It does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync on
the file descriptor of the directory is also needed."

AFAIK, we don't care about it at the moment. The actual behaviour
depends on the filesystem: reiserfs and other journaling filesystems
probably don't need the explicit fsync on the parent directory, but at
least ext2 does.

I've experimented with a user-mode-linux installation, crashing it at
specific points. It seems that on ext2, it's possible to get the
database into an inconsistent state.

Have you experimented with mounting the filesystem with the dirsync
option ('-o dirsync') or marking the log directory as synchronous with
'chattr +D'? (no, it's not a real fix, just another data point..)

-O

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: fsync, ext2 on Linux

Heikki Linnakangas <hlinnaka@iki.fi> writes:

The Linux [ext2] fsync man page says:
"It does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync on
the file descriptor of the directory is also needed."

This seems so broken as to defy belief. A process creating a file
doesn't normally *have* a file descriptor for the parent directory,
and I don't think the concept of an FD for a directory is even
portable (opendir() certainly doesn't return an FD). One might also
ask if we are expected to fsync everything up to the root in order
to be sure that the file remains accessible, and how exactly we should
do that on directories we don't have write access for.

In general we expect the filesystem to take care of its own metadata.
Run ext3 in journaling mode, or something like that.

(It occurs to me that the admin guide really ought to have a few words
about recommended and non-recommended filesystems ...)

regards, tom lane

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: fsync, ext2 on Linux

Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

The Linux [ext2] fsync man page says:
"It does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync on
the file descriptor of the directory is also needed."

This seems so broken as to defy belief. A process creating a file
doesn't normally *have* a file descriptor for the parent directory,
and I don't think the concept of an FD for a directory is even
portable (opendir() certainly doesn't return an FD). One might also
ask if we are expected to fsync everything up to the root in order
to be sure that the file remains accessible, and how exactly we should
do that on directories we don't have write access for.

The notes say this:

When an ext2 file system is mounted with the sync option, directory
entries are also implicitly synced by fsync.

cheers

andrew

#5Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#3)
Re: fsync, ext2 on Linux

In general we expect the filesystem to take care of its own metadata.
Run ext3 in journaling mode, or something like that.

(It occurs to me that the admin guide really ought to have a few words
about recommended and non-recommended filesystems ...)

Well, I am not their admin, but I don't suggest any of the ext filesystems.
Although ext3 is reasonably stable, it is very slow.

Stick with XFS, JFS or even Reiser.

Sincerely,

Joshua D. Drake

#6Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Oliver Jowett (#2)
Re: fsync, ext2 on Linux

On Mon, 1 Nov 2004, Oliver Jowett wrote:

Heikki Linnakangas wrote:

The Linux fsync man page says:

"It does not necessarily ensure that the entry in the directory containing
the file has also reached disk. For that an explicit fsync on the file
descriptor of the directory is also needed."

AFAIK, we don't care about it at the moment. The actual behaviour depends
on the filesystem: reiserfs and other journaling filesystems probably don't
need the explicit fsync on the parent directory, but at least ext2 does.

I've experimented with a user-mode-linux installation, crashing it at
specific points. It seems that on ext2, it's possible to get the database
into an inconsistent state.

Have you experimented with mounting the filesystem with the dirsync option
('-o dirsync') or marking the log directory as synchronous with 'chattr +D'?
(no, it's not a real fix, just another data point..)

A quick experiment shows that both seem to fix it, as expected.

"chattr +D" might not be such a bad idea. A warning would be nice if you
start the postmaster on a filesystem that requires it. Few admins would
remember/know about it otherwise.

- Heikki

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#3)
Re: fsync, ext2 on Linux

On Sun, 31 Oct 2004, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

The Linux [ext2] fsync man page says:
"It does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync on
the file descriptor of the directory is also needed."

This seems so broken as to defy belief. A process creating a file
doesn't normally *have* a file descriptor for the parent directory,
and I don't think the concept of an FD for a directory is even
portable (opendir() certainly doesn't return an FD). One might also
ask if we are expected to fsync everything up to the root in order
to be sure that the file remains accessible, and how exactly we should
do that on directories we don't have write access for.

I agree on the brokenness. Linux is the only OS that's broken that I know
of. Therefore it doesn't really matter if the fix is portable or not, we
would only do it on Linux anyway.

Surely it's not necessary to crawl up to the root. Just fsync the
parent of every new file and directory.
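
On the question of getting a descriptor for a directory: on systems that provide dirfd() (it is standardized in POSIX.1-2008), a descriptor can be obtained from an opendir() stream. The sketch below uses that to fsync a single directory. The name fsync_dir is hypothetical, and whether fsync() on a directory descriptor is accepted is platform-dependent; it is assumed to work here as it does on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: flush the entries of one directory to disk,
 * going through opendir() + dirfd() rather than open(). */
static int fsync_dir(const char *dirpath)
{
    DIR *d = opendir(dirpath);
    if (d == NULL)
        return -1;

    int fd = dirfd(d);              /* owned by `d`; closedir() releases it */
    int rc = (fd < 0) ? -1 : fsync(fd);

    int saved = errno;              /* keep the first error across closedir() */
    closedir(d);
    errno = saved;
    return rc;
}
```

Only the immediate parent is synced here, matching the suggestion above; nothing walks up toward the root.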

In general we expect the filesystem to take care of its own metadata.
Run ext3 in journaling mode, or something like that.

I normally run reiserfs, I set up the ext2 filesystem just to test it.

(It occurs to me that the admin guide really ought to have a few words
about recommended and non-recommended filesystems ...)

That's the least we can do. I wonder if we could check the filesystem at
runtime and issue a warning if it's not in the list of recommended
filesystems.
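
A minimal sketch of such a runtime check on Linux, using statfs() to read the filesystem magic number of the data directory. The function names and the warning text are hypothetical. One caveat: ext2, ext3 and ext4 all report the same magic number (0xEF53), so this check alone cannot tell a journaled ext3 from plain ext2.

```c
#include <stdio.h>
#include <sys/vfs.h>     /* statfs() on Linux */

/* Magic number shared by ext2, ext3 and ext4. */
#define EXT_FAMILY_MAGIC 0xEF53UL

/* Hypothetical helper.
 * Returns 1 if `path` lives on an ext2/ext3/ext4 filesystem,
 * 0 if it does not, and -1 if statfs() fails. */
static int is_ext_family(const char *path)
{
    struct statfs sb;

    if (statfs(path, &sb) != 0)
        return -1;
    return ((unsigned long) sb.f_type == EXT_FAMILY_MAGIC) ? 1 : 0;
}

/* Hypothetical startup hook: warn, but do not refuse to start. */
static void warn_if_risky_fs(const char *datadir)
{
    if (is_ext_family(datadir) == 1)
        fprintf(stderr,
                "WARNING: \"%s\" is on an ext2-family filesystem; "
                "directory entries may not survive a crash unless it is "
                "journaled or marked dirsync\n", datadir);
}
```

A real list of recommended filesystems would need more than a magic number, for instance the mount options in use, so this is only a starting point.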

- Heikki