Caching the number of blocks in a relation to avoid lseek
Hello all,
While digging through the code I found a quite interesting comment in
src/backend/access/heap/hio.c, before the function RelationPutHeapTupleAtEnd:
* Eventually, we should cache the number of blocks in a relation somewhere.
* Until that time, this code will have to do an lseek to determine the number
* of blocks in a relation.
As far as I can see there's a field rd_nblocks in Relation.
Question: is this field properly updated? Could it be used instead of
RelationGetNumberOfBlocks, which calls lseek?
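For reference, the lseek-based approach that comment describes boils down to something like this (a simplified standalone sketch, not the actual PostgreSQL code; the function name is made up and BLCKSZ is hard-coded here for illustration):

```c
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192  /* PostgreSQL's default block size */

/* Simplified sketch of what RelationGetNumberOfBlocks effectively does
 * without a cache: seek to the end of the relation file and divide by
 * the block size.  Returns -1 on error. */
long
blocks_via_lseek(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);

    if (end < 0)
        return -1;
    return (long) (end / BLCKSZ);
}
```

This is the kernel call the comment hopes to avoid by caching the block count somewhere.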
--
Sincerely Yours,
Denis Perchine
----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------
> As far as I can see there's a field rd_nblocks in Relation.
> Question: is this field properly updated? Could it be used instead of
> RelationGetNumberOfBlocks, which calls lseek?
Probably should be kept up-to-date. Currently only vacuum sets it. It
would have to be kept up-to-date for every INSERT/UPDATE that adds a block.
--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> Probably should be kept up-to-date. Currently only vacuum sets it. It
> would have to be kept up-to-date for every INSERT/UPDATE that adds a block.
If we can accomplish this it will be quite a big speed improvement. I saw lots of such calls,
and strace shows lots of lseek(..., SEEK_END) calls. I would do this myself, but I do not know
the code well enough to be sure I would cover all the places.
--
Denis Perchine
> As far as I can see there's a field rd_nblocks in Relation.
> Question: is this field properly updated?
No. If it were that easy, this problem would have been fixed long ago.
The problem with the relcache field is that it's backend-local, so it's
not going to get updated when some other backend extends the relation.
We do use rd_nblocks to cache the last table length determined from
lseek, but that can't be trusted in critical cases, like when we are
trying to add a block to the relation.
There has been some talk of keeping a small cache of current relation
lengths in shared memory, but I'm dubious that that'd actually be a win.
In order to save an lseek (which ought to be pretty quick as kernel
calls go) we'd be talking about grabbing a spinlock, searching a shared
hashtable, and releasing a spinlock. Spinlock contention on this
heavily used data structure might well be a problem. Now add the costs
of updating that hashtable every time we extend a relation (or even just
fail to find a relation in it). Not immediately obvious that this is
a net win overall, is it? If it were easier to do, or more obviously
a large performance gain, someone would likely have tried it by now ...
but there is lots of lower-hanging fruit to keep us busy.
regards, tom lane
> No. If it were that easy, this problem would have been fixed long ago.
> The problem with the relcache field is that it's backend-local, so it's
> not going to get updated when some other backend extends the relation.
Sorry, I missed this.
> There has been some talk of keeping a small cache of current relation
> lengths in shared memory, but I'm dubious that that'd actually be a win.
Hard to say what is faster... lots of testing would be needed. You are right.
And what about skipping the lseek when reading a relation sequentially?
Is it possible?
--
Denis Perchine
Denis Perchine <dyp@perchine.com> writes:
> And what about skipping the lseek when reading a relation sequentially?
> Is it possible?
In a pure read scenario the way it's supposed to work is that an
lseek(END) is done once at the start of each sequential scan, and we
save that value in rd_nblocks. Then we read rd_nblocks pages and stop.
By the time we finish the scan there might be more pages in the relation
(added by other backends, or even by ourselves if it's an update query).
But those pages cannot contain any tuples that could be visible to the
current scan, so it's OK if we don't read them. However, we do need a
new lseek() --- or some other way to verify the right table length
--- at least once per transaction start or CommandCounterIncrement.
Either of those events could make new tuples visible to us.
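The protocol described above can be shown with a toy simulation (all names here are made up for illustration; this is not the actual heapam code): the scan snapshots the length once, visits exactly that many blocks, and simply never sees a block appended mid-scan.

```c
/* Toy simulation of the sequential-scan protocol: one "lseek" at scan
 * start fixes the number of blocks the scan will read. */

static int relation_len = 0;          /* authoritative length, i.e. "lseek(END)" */

/* Another backend (or our own update) appends a block. */
static void
extend_relation(void)
{
    relation_len++;
}

/* Snapshot the length once at scan start (rd_nblocks) and visit exactly
 * that many blocks, even though the relation grows while we scan. */
static int
scan_visible_blocks(void)
{
    int rd_nblocks = relation_len;    /* the one "lseek" at scan start */
    int visited = 0;
    int blk;

    for (blk = 0; blk < rd_nblocks; blk++)
    {
        visited++;
        if (blk == 0)
            extend_relation();        /* concurrent extension mid-scan */
    }
    return visited;
}
```

The appended block cannot contain tuples visible to the running scan, so skipping it is correct; only at transaction start or CommandCounterIncrement must the length be re-checked.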
I think there may be some code paths that cause us to do more than just
one lseek(END) during scan startup, and if so it'd be worthwhile to try
to get rid of the extra lseeks(). But you'd have to be careful that
you didn't remove any lseeks that are essential in some other paths.
regards, tom lane
On Tue, 13 Jun 2000, Tom Lane wrote:
> In a pure read scenario the way it's supposed to work is that an
> lseek(END) is done once at the start of each sequential scan, and we
> save that value in rd_nblocks. Then we read rd_nblocks pages and stop.
No... you did not get me. I am talking about a completely different thing:
I strace'd the postgres binary while doing queries and found out that it
does an lseek after each read, and the difference in the position is 8096.
It means that we are at the correct position anyway and do not need the additional lseek.
--
Denis Perchine
Denis Perchine <dyp@perchine.com> writes:
> I strace'd the postgres binary while doing queries and found out that it
> does an lseek after each read, and the difference in the position is 8096.
> It means that we are at the correct position anyway and do not need the additional lseek.
Oh. Hmm. Not sure if it's really worth the trouble, but you could try
having fd.c keep track of the current seek position of VFDs when they
are open as well as when they are closed, and optimize away the lseek
call in FileSeek if the position is already right. You'd have to think
carefully about what to do if a read or write fails, however --- where
has the kernel left its seek position in that case? Possibly this could
be dealt with by setting the internal position to "unknown" anytime
we're not perfectly sure where the kernel is.
regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [000612 19:37] wrote:
> Oh. Hmm. Not sure if it's really worth the trouble, but you could try
> having fd.c keep track of the current seek position of VFDs when they
> are open as well as when they are closed, and optimize away the lseek
> call in FileSeek if the position is already right.
Have you thought of using pread/pwrite which are available on many
modern platforms:
ssize_t
pread(int d, void *buf, size_t nbytes, off_t offset)
Pread() performs the same function, but reads from the specified
position in the file without modifying the file pointer.
I'm unsure how the postgresql system uses its fds; however, if they
are really shared via dup() or across fork/exec they will share
the same offset across multiple instances of the fd. The only way
around this behavior is to reopen the file in each process to get
a private offset.
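As a minimal illustration of that property (a standalone sketch, not PostgreSQL code; the block size and helper name are made up), pread() reads at an absolute offset and leaves the descriptor's file position untouched, so descriptors shared via dup() or inherited across fork see no change:

```c
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ 8192

/* Read block blkno at an absolute offset.  The descriptor's own seek
 * position is not moved, so descriptors shared via dup()/fork() are
 * unaffected.  Returns the byte count read, or -1 on error. */
ssize_t
read_block(int fd, void *buf, long blkno)
{
    return pread(fd, buf, BLKSZ, (off_t) (blkno * BLKSZ));
}
```

With pread/pwrite the seek-position bookkeeping in fd.c would not be needed at all for reads and writes, at the cost of requiring platform support.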
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
> You'd have to think carefully about what to do if a read or write fails,
> however --- where has the kernel left its seek position in that case?
> Possibly this could be dealt with by setting the internal position to
> "unknown" anytime we're not perfectly sure where the kernel is.
If a read or write fails, the position will be left the same. This situation is already tracked
in the File routines, but a little bit incorrectly.
Here is the full patch. It reduces the number of lseek calls ten times for an update statement
and twenty times for a select statement. I tested a joined update and a count(*) select on a
table with more than 170000 rows and 10 indices.
I think this is worth trying. Before, lseek calls accounted for more than 5% of the time; now
they are 0.89% and 0.15% respectively.
Since only one file is modified, the patch should be applied in the src/backend/storage/file/ directory.
--
Denis Perchine
Attachments:
fd.c.patch (text/plain)
Index: fd.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.59
diff -c -b -w -r1.59 fd.c
*** fd.c 2000/06/02 15:57:24 1.59
--- fd.c 2000/06/13 08:16:06
***************
*** 841,849 ****
elog(ERROR, "FileSeek: invalid whence: %d", whence);
break;
}
! }
! else
VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
return VfdCache[file].seekPos;
}
--- 841,863 ----
elog(ERROR, "FileSeek: invalid whence: %d", whence);
break;
}
! } else
! switch (whence) {
! case SEEK_SET:
! if (VfdCache[file].seekPos != offset)
! VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
! break;
! case SEEK_CUR:
! if (offset != 0)
VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ break;
+ case SEEK_END:
+ VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ break;
+ default:
+ elog(ERROR, "FileSeek: invalid whence: %d", whence);
+ break;
+ }
return VfdCache[file].seekPos;
}
> If a read or write fails, the position will be left the same. This situation
> is already tracked in the File routines, but a little bit incorrectly.
After a small survey of the Linux kernel code, I am not sure about that.
The new patch sets the position to unknown in case a read or write fails,
and then does the lseek again.
--
Denis Perchine
Attachments:
fd.c.patch (text/x-c)
Index: fd.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.59
diff -c -b -w -r1.59 fd.c
*** fd.c 2000/06/02 15:57:24 1.59
--- fd.c 2000/06/13 08:34:55
***************
*** 95,100 ****
--- 95,102 ----
#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
+ #define FileUnknownPos (-1)
+
typedef struct vfd
{
signed short fd; /* current FD, or VFD_CLOSED if none */
***************
*** 790,795 ****
--- 792,799 ----
returnCode = read(VfdCache[file].fd, buffer, amount);
if (returnCode > 0)
VfdCache[file].seekPos += returnCode;
+ else
+ VfdCache[file].seekPos = FileUnknownPos;
return returnCode;
}
***************
*** 806,816 ****
FileAccess(file);
returnCode = write(VfdCache[file].fd, buffer, amount);
! if (returnCode > 0)
VfdCache[file].seekPos += returnCode;
-
/* mark the file as needing fsync */
VfdCache[file].fdstate |= FD_DIRTY;
return returnCode;
}
--- 810,821 ----
FileAccess(file);
returnCode = write(VfdCache[file].fd, buffer, amount);
! if (returnCode > 0) {
VfdCache[file].seekPos += returnCode;
/* mark the file as needing fsync */
VfdCache[file].fdstate |= FD_DIRTY;
+ } else
+ VfdCache[file].seekPos = FileUnknownPos;
return returnCode;
}
***************
*** 840,849 ****
default:
elog(ERROR, "FileSeek: invalid whence: %d", whence);
break;
- }
}
! else
VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
return VfdCache[file].seekPos;
}
--- 845,870 ----
default:
elog(ERROR, "FileSeek: invalid whence: %d", whence);
break;
}
! } else
! switch (whence) {
! case SEEK_SET:
! if (offset < 0)
! elog(ERROR, "FileSeek: invalid offset: %ld", offset);
! if (VfdCache[file].seekPos != offset)
! VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
! break;
! case SEEK_CUR:
! if ((offset != 0) || (VfdCache[file].seekPos == FileUnknownPos))
VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ break;
+ case SEEK_END:
+ VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ break;
+ default:
+ elog(ERROR, "FileSeek: invalid whence: %d", whence);
+ break;
+ }
return VfdCache[file].seekPos;
}
Yoo, hoo. No one complained about this patch, so in it goes. I bet
this will be a real speedup on some platforms. I know some OS's don't
do fseek() as fast as we would like.
Thanks.
--
Bruce Momjian