fstat vs. lseek
In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.
http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html
I tried that on a RHEL 6.1 machine with 64 cores running
2.6.32-131.6.1.el6.x86_64, and it's pretty clear that the locking
characteristics are completely different. At 1 client, the lseek
method appears to be slightly faster, although it's not beyond belief
that the difference could be in the noise. Above 40 cores, however,
the fstat method thumps the lseek method up one side and down the
other.
Patch and test results are attached. Test runs are 5-minute runs with
scale factor 100 and shared_buffers=8GB.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
filesize.patch (application/octet-stream)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9540279..c354ef4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1418,6 +1418,26 @@ FileSeek(File file, off_t offset, int whence)
     return VfdCache[file].seekPos;
 }
 
+off_t
+FileSize(File file)
+{
+    int returnCode;
+    struct stat sb;
+
+    Assert(FileIsValid(file));
+
+    DO_DB(elog(LOG, "FileSize: %d (%s)", file, VfdCache[file].fileName));
+
+    returnCode = FileAccess(file);
+    if (returnCode < 0)
+        return returnCode;
+    returnCode = fstat(VfdCache[file].fd, &sb);
+    if (returnCode < 0)
+        return returnCode;
+
+    return sb.st_size;
+}
+
 /*
  * XXX not actually used but here for completeness
  */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7f44606..3676ee9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1661,11 +1661,11 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
     off_t len;
 
-    len = FileSeek(seg->mdfd_vfd, 0L, SEEK_END);
+    len = FileSize(seg->mdfd_vfd);
     if (len < 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
-                 errmsg("could not seek to end of file \"%s\": %m",
+                 errmsg("could not determine size of file \"%s\": %m",
                         FilePathName(seg->mdfd_vfd))));
     /* note that this calculation will ignore any partial block at EOF */
     return (BlockNumber) (len / BLCKSZ);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8a4d07c..2fb8424 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -68,6 +68,7 @@ extern int FileRead(File file, char *buffer, int amount);
 extern int FileWrite(File file, char *buffer, int amount);
 extern int FileSync(File file);
 extern off_t FileSeek(File file, off_t offset, int whence);
+extern off_t FileSize(File file);
 extern int FileTruncate(File file, off_t offset);
 extern char *FilePathName(File file);
Robert Haas <robertmhaas@gmail.com> writes:
In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.
Patch and test results are attached. Test runs are 5-minute runs with
scale factor 100 and shared_buffers=8GB.
Thoughts?
I'm a bit concerned by the fact that you've only tested this on one
operating system, and thus the performance characteristics could be
quite different elsewhere. The comment in mdextend also points out
a way in which this might not be a win --- did you test anything besides
read-only scenarios?
regards, tom lane
On Monday, August 08, 2011 10:30:38 Robert Haas wrote:
In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.
Thoughts?
I don't think it's a good idea to replace lseek with fstat in the long run. The
likelihood that the lockless generic_file_llseek will get included seems rather
high to me. In contrast to that, fstat will always be more expensive,
as it's going through a security check and then the fs' getattr implementation
(which actually takes a lock on some fs).
On the other hand, fstat is currently lockless if the security subsystem is compiled
out (i.e. no SELinux et al.) for some common fs (ext3/4, xfs).
Andres
On Mon, Aug 8, 2011 at 10:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm a bit concerned by the fact that you've only tested this on one
operating system, and thus the performance characteristics could be
quite different elsewhere. The comment in mdextend also points out
a way in which this might not be a win --- did you test anything besides
read-only scenarios?
Nope.
On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
I don't think it's a good idea to replace lseek with fstat in the long run. The
likelihood that the lockless generic_file_llseek will get included seems rather
high to me. In contrast to that, fstat will always be more expensive,
as it's going through a security check and then the fs' getattr implementation
(which actually takes a lock on some fs).
*scratches head* I understand that stat() would need a security
check, but why would fstat()?
I think both of you raise good points. I wasn't too enthusiastic
about this approach either. It's not very appealing to adopt an
approach where the right performance decision is going to depend on
operating system, file system, kernel version, core count, and
workload. We could add a GUC, but it would be pretty annoying to have
a setting that won't matter for most people at all, except
occasionally when it makes a huge difference.
I wasn't aware that there was any current activity around this on the Linux
side. But Andres' comments made me Google it again, and now I see
this:
https://lkml.org/lkml/2011/6/16/800
Andres, any idea what the status of that patch is? I'm not clear on
how Linux works in terms of things getting upstreamed.
--
Robert Haas
On Monday, August 08, 2011 11:33:29 Robert Haas wrote:
On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
I don't think it's a good idea to replace lseek with fstat in the long
run. The likelihood that the lockless generic_file_llseek will get
included seems rather high to me. In contrast to that, fstat will always
be more expensive, as it's going through a security check and
then the fs' getattr implementation (which actually takes a lock on
some fs).
*scratches head* I understand that stat() would need a security
check, but why would fstat()?
That I am not totally sure of either. I guess Kaigai might know more about
that.
I guess it might be that a forked process possibly is not allowed anymore to
access the information from an inherited file handle? Also I think a process
can change its permissions during runtime.
I think both of you raise good points. I wasn't too enthusiastic
about this approach either. It's not very appealing to adopt an
approach where the right performance decision is going to depend on
operating system, file system, kernel version, core count, and
workload. We could add a GUC, but it would be pretty annoying to have
a setting that won't matter for most people at all, except
occasionally when it makes a huge difference.
I wasn't aware that there was any current activity around this on the Linux
side. But Andres' comments made me Google it again, and now I see
this:
https://lkml.org/lkml/2011/6/16/800
Andres, any idea what the status of that patch is? I'm not clear on
how Linux works in terms of things getting upstreamed.
There doesn't seem to have been any activity to include it in 3.1. The merge
window for 3.1 just ended. The next one will open for about a week after the
release.
It's also not yet included in linux-next, which is a "preview" for the currently
worked-on release + 1. A release takes roughly 3 months.
For upstreaming, somebody needs to be persistent enough to convince one of the
maintainers of the particular area to include the code so that Linus can then
pull it.
I guess citing your numbers would go a long way in that direction. Naturally,
it would be even better to include results with the patch applied.
The largest machine I can reboot often enough to test such a thing has only two
sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
new kernel easily?
Greetings,
Andres
On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund <andres@anarazel.de> wrote:
There doesn't seem to have been any activity to include it in 3.1. The merge
window for 3.1 just ended. The next one will open for about a week after the
release.
It's also not yet included in linux-next, which is a "preview" for the currently
worked-on release + 1. A release takes roughly 3 months.
OK. If it doesn't get into Linux 3.2 we had better start thinking
hard about a workaround on our side. I am not too concerned about
people hitting this with PostgreSQL 9.1 or prior, because you'd
basically need a workload targeted to exercise the problem, which
workload is not that similar to the way people actually do things in
real life. However, in PostgreSQL 9.2devel, it's going to be much
more of a real-world problem, so I'd hate to wait until after our
feature freeze and then decide we've got a problem we have to fix.
For upstreaming, somebody needs to be persistent enough to convince one of the
maintainers of the particular area to include the code so that Linus can then
pull it.
I guess citing your numbers would go a long way in that direction. Naturally,
it would be even better to include results with the patch applied.
The largest machine I can reboot often enough to test such a thing has only two
sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
new kernel easily?
Not really. I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.
--
Robert Haas
Robert Haas <robertmhaas@gmail.com> writes:
Not really. I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.
Red Hat has some test hardware that I can use (... pokes around ...)
Hmm, this one looks promising:
Memory:      64348 MB
NUMA Nodes:  4

Cpu
Vendor:      GenuineIntel
Model Name:  Intel(R) Xeon(R) CPU E7- 4860 @ 2.27GHz
Family:      6
Model:       47
Stepping:    2
Speed:       1064.0
Processors:  80
Cores:       40
Sockets:     4
Hyper:       True
If you can wrap something up to the point where someone else can
run it, I'll give it a shot.
regards, tom lane
On Monday, August 08, 2011 13:19:13 Robert Haas wrote:
On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund <andres@anarazel.de> wrote:
There doesn't seem to have been any activity to include it in 3.1. The
merge window for 3.1 just ended. The next one will open for about a
week after the release.
It's also not yet included in linux-next, which is a "preview" for the
currently worked-on release + 1. A release takes roughly 3 months.
OK. If it doesn't get into Linux 3.2 we had better start thinking
hard about a workaround on our side.
If it's OK, I will write a mail to lkml referencing this thread and your numbers
inline (with attribution, obviously).
I don't think it will be that hard to convince them. But I constantly surprise
myself with my naivety, so I may be wrong.
The largest machine I can reboot often enough to test such a thing has only
two sockets (4-core E5520). I guess you cannot reboot your loaned machine
with a new kernel easily?
Not really. I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.
As I feared. Any chance that the person lending you the machine can give you a
hand?
Although I don't know how that could happen after reading the code, it would be
disappointing to wait for 3.2 with the llseek fixes to appear in $distribution
just to notice fstat is still faster for $unobvious_reason...
Andres
On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
If it's OK, I will write a mail to lkml referencing this thread and your numbers
inline (with attribution, obviously).
That would be great. Please go ahead.
I don't think it will be that hard to convince them. But I constantly surprise
myself with my naivety, so I may be wrong.
Heh, heh, open source is fun.
The largest machine I can reboot often enough to test such a thing has only
two sockets (4-core E5520). I guess you cannot reboot your loaned machine
with a new kernel easily?
Not really. I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.
As I feared. Any chance that the person lending you the machine can give you a
hand?
Uh, maybe, but considering my relative inexperience in compiling the
Linux kernel, I'd be a little worried about having to iterate too many
times.
Although I don't know how that could happen after reading the code, it would be
disappointing to wait for 3.2 with the llseek fixes to appear in $distribution
just to notice fstat is still faster for $unobvious_reason...
Well, the good thing here is that we are really only concerned with
gross effects. It's pretty obvious from the numbers I posted upthread
that the problem is related to lock contention. If that gets fixed,
and lseek is still 20% slower under some set of circumstances, it's
not clear that we're really gonna care. I mean, maybe it would be
nice to avoid going to the kernel at all here just so we're immune to
possible inefficiencies in other operating systems (it would be nice
if someone could repeat these tests on a big SMP box running Windows
and/or one of the BSD systems) and to save the overhead of a system call,
but those effects are pretty tiny. We could spend a lot of time
optimizing other things before that one percolated up to the top of
the heap, at least based on what I've seen so far.
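For anyone who wants to look for the gross effect on another operating system without setting up pgbench, a rough micro-benchmark along the following lines might be a starting point. It is only a sketch under stated assumptions, not the test used for the numbers upthread: it assumes the contention shows up when many processes repeatedly ask for the size of the same file through their own descriptors, and the names run_size_benchmark, probe_loop, size_method and MAX_PROCS are all hypothetical.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 256           /* arbitrary cap for this sketch */

typedef enum { USE_LSEEK, USE_FSTAT } size_method;

/* Child-side loop: open the file and query its size repeatedly.
 * Returns 0 on success, 1 if open() or any size query fails. */
static int
probe_loop(const char *path, size_method method, long iterations)
{
    struct stat sb;
    long        i;
    int         failed = 0;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return 1;
    for (i = 0; i < iterations && !failed; i++)
    {
        if (method == USE_LSEEK)
            failed = (lseek(fd, 0, SEEK_END) < 0);
        else
            failed = (fstat(fd, &sb) < 0);
    }
    close(fd);
    return failed;
}

/* Fork nprocs children that each run probe_loop() on the same file.
 * Returns elapsed wall-clock seconds, or -1.0 if anything failed. */
double
run_size_benchmark(const char *path, size_method method,
                   int nprocs, long iterations)
{
    struct timeval start, end;
    pid_t       pids[MAX_PROCS];
    int         started = 0;
    int         ok = 1;
    int         i;

    if (nprocs < 1 || nprocs > MAX_PROCS)
        return -1.0;

    gettimeofday(&start, NULL);
    for (i = 0; i < nprocs; i++)
    {
        pid_t pid = fork();

        if (pid < 0)
        {
            ok = 0;
            break;
        }
        if (pid == 0)
            _exit(probe_loop(path, method, iterations));
        pids[started++] = pid;
    }
    /* Reap every child that was started, even after a failed fork(). */
    for (i = 0; i < started; i++)
    {
        int status;

        if (waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ok = 0;
    }
    gettimeofday(&end, NULL);

    if (!ok)
        return -1.0;
    return (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
}
```

Calling run_size_benchmark() once with USE_LSEEK and once with USE_FSTAT, at a process count near the number of cores, would give two wall-clock times to compare. Processes are used rather than threads because that is closer to how backends behave; the timing includes fork and open overhead, so only large differences mean anything.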
--
Robert Haas
hi
On 08/08/2011 07:50 PM, Robert Haas wrote:
On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund<andres@anarazel.de> wrote:
If it's OK, I will write a mail to lkml referencing this thread and your numbers
inline (with attribution, obviously).
That would be great. Please go ahead.
I've just stumbled across this thread on lkml [1],
"Improve lseek scalability v3",
and I thought I'd ping the pgsql-hackers list
just in case. More to the point, they're
asking "are there any real workloads which care
[Make generic lseek lockless safe]"
Maybe I've got it wrong, but it seems somewhat
related to what has been discussed here and
also in Robert Haas's "Linux and glibc Scalability"
blog post [2].
[cut]
Andrea
[1]: https://lkml.org/lkml/2011/9/15/399
[2]: http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html
On Friday 16 Sep 2011 15:19:07 Andrea Suisani wrote:
hi
On 08/08/2011 07:50 PM, Robert Haas wrote:
On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund<andres@anarazel.de> wrote:
If it's OK, I will write a mail to lkml referencing this thread and your
numbers inline (with attribution, obviously).
That would be great. Please go ahead.
I've just stumbled across this thread on lkml [1],
"Improve lseek scalability v3",
and I thought I'd ping the pgsql-hackers list
just in case. More to the point, they're
asking "are there any real workloads which care
[Make generic lseek lockless safe]"
I wrote them a mail some time ago (some weeks) regarding an earlier version of
the patch... I can't find it right now, though.
Andres
On Fri, Oct 28, 2011 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote:
The lseek patches just got included in Linus' tree.
Excellent, thanks for the update!
So I guess this will be in Linux 3.2?
--
Robert Haas
Hi,
On Friday, October 28, 2011 09:40:51 PM Robert Haas wrote:
On Fri, Oct 28, 2011 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote:
The lseek patches just got included in Linus' tree.
Excellent, thanks for the update!
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=ef3d0fd27e90f67e35da516dafc1482c82939a60
So I guess this will be in Linux 3.2?
Unless they get reverted for some reason, yes.
Andres