Caching number of blocks in relation to avoid lseek.

Started by Denis Perchine over 25 years ago · 12 messages
#1 Denis Perchine
dyp@perchine.com

Hello all,

While digging through the code I found a rather interesting comment in
src/backend/access/heap/hio.c, before the function RelationPutHeapTupleAtEnd:

* Eventually, we should cache the number of blocks in a relation somewhere.
* Until that time, this code will have to do an lseek to determine the number
* of blocks in a relation.

As far as I can see, there's a field rd_nblocks in Relation.

Question: is this field properly updated? Could it be used instead of
RelationGetNumberOfBlocks, which calls lseek?

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

#2 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Denis Perchine (#1)
Re: Caching number of blocks in relation to avoid lseek.

> Hello all,
>
> While digging through the code I found a rather interesting comment in
> src/backend/access/heap/hio.c, before the function RelationPutHeapTupleAtEnd:
>
> * Eventually, we should cache the number of blocks in a relation somewhere.
> * Until that time, this code will have to do an lseek to determine the number
> * of blocks in a relation.
>
> As far as I can see, there's a field rd_nblocks in Relation.
>
> Question: is this field properly updated? Could it be used instead of
> RelationGetNumberOfBlocks, which calls lseek?

It probably should be kept up-to-date. Currently only vacuum sets it; it
would have to be kept up-to-date for every INSERT/UPDATE that adds a block.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#3 Denis Perchine
dyp@perchine.com
In reply to: Bruce Momjian (#2)
Re: Caching number of blocks in relation to avoid lseek.

> > Question: is this field properly updated? Could it be used instead of
> > RelationGetNumberOfBlocks, which calls lseek?
>
> It probably should be kept up-to-date. Currently only vacuum sets it; it
> would have to be kept up-to-date for every INSERT/UPDATE that adds a block.

If we can accomplish this, it will be quite a big speed improvement. I saw
lots of such calls, and strace shows lots of lseek(..., SEEK_END) calls.
I would do this myself, but I do not know the code well enough to be sure
I would cover all the places.

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
Re: Caching number of blocks in relation to avoid lseek.

> * Eventually, we should cache the number of blocks in a relation somewhere.
> * Until that time, this code will have to do an lseek to determine the number
> * of blocks in a relation.
>
> As far as I can see, there's a field rd_nblocks in Relation.
>
> Question: is this field properly updated?

No. If it were that easy, this problem would have been fixed long ago.
The problem with the relcache field is that it's backend-local, so it's
not going to get updated when some other backend extends the relation.

We do use rd_nblocks to cache the last table length determined from
lseek, but that can't be trusted in critical cases, like when we are
trying to add a block to the relation.

There has been some talk of keeping a small cache of current relation
lengths in shared memory, but I'm dubious that that'd actually be a win.
In order to save an lseek (which ought to be pretty quick as kernel
calls go) we'd be talking about grabbing a spinlock, searching a shared
hashtable, and releasing a spinlock. Spinlock contention on this
heavily used datastructure might well be a problem. Now add the costs
of updating that hashtable every time we extend a relation (or even just
fail to find a relation in it). Not immediately obvious that this is
a net win overall, is it? If it were easier to do, or more obviously
a large performance gain, someone would likely have tried it by now ...
but there is lots of lower-hanging fruit to keep us busy.
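
To make those costs concrete, here is a minimal sketch in plain C of what
such a shared cache would involve, assuming a hypothetical fixed-size table
protected by a POSIX spinlock (RelLenCache, rel_len_lookup, and
rel_len_update are illustrative names, not PostgreSQL code):

/* Sketch of a shared-memory relation-length cache. Every lookup pays
 * a lock acquire, a hash probe, and a lock release: the costs weighed
 * above against a single lseek(). */
#include <stdint.h>
#include <pthread.h>

#define CACHE_SLOTS 64          /* "small cache": fixed, trivially hashed */

typedef struct RelLenEntry
{
    uint32_t relid;             /* which relation (0 means empty slot) */
    uint32_t nblocks;           /* cached length, in blocks */
} RelLenEntry;

typedef struct RelLenCache
{
    pthread_spinlock_t lock;    /* the contended spinlock */
    RelLenEntry entries[CACHE_SLOTS];
} RelLenCache;

static void
rel_len_init(RelLenCache *cache)
{
    pthread_spin_init(&cache->lock, PTHREAD_PROCESS_SHARED);
    for (int i = 0; i < CACHE_SLOTS; i++)
        cache->entries[i].relid = 0;
}

/* Returns the cached length, or -1 on a miss (caller falls back to lseek). */
static long
rel_len_lookup(RelLenCache *cache, uint32_t relid)
{
    long result = -1;
    unsigned slot = relid % CACHE_SLOTS;

    pthread_spin_lock(&cache->lock);
    if (cache->entries[slot].relid == relid)
        result = cache->entries[slot].nblocks;
    pthread_spin_unlock(&cache->lock);
    return result;
}

/* Every relation extension must take the same lock to stay coherent. */
static void
rel_len_update(RelLenCache *cache, uint32_t relid, uint32_t nblocks)
{
    unsigned slot = relid % CACHE_SLOTS;

    pthread_spin_lock(&cache->lock);
    cache->entries[slot].relid = relid;
    cache->entries[slot].nblocks = nblocks;
    pthread_spin_unlock(&cache->lock);
}

Note that misses and every relation extension serialize through the same
lock, which is exactly the contention worry raised above.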

regards, tom lane

#5 Denis Perchine
dyp@perchine.com
In reply to: Tom Lane (#4)
Re: Caching number of blocks in relation to avoid lseek.

> No. If it were that easy, this problem would have been fixed long ago.
> The problem with the relcache field is that it's backend-local, so it's
> not going to get updated when some other backend extends the relation.

Sorry. I missed this.

> We do use rd_nblocks to cache the last table length determined from
> lseek, but that can't be trusted in critical cases, like when we are
> trying to add a block to the relation.
>
> There has been some talk of keeping a small cache of current relation
> lengths in shared memory, but I'm dubious that that'd actually be a win.
> In order to save an lseek (which ought to be pretty quick as kernel
> calls go) we'd be talking about grabbing a spinlock, searching a shared
> hashtable, and releasing a spinlock. Spinlock contention on this
> heavily used datastructure might well be a problem. Now add the costs
> of updating that hashtable every time we extend a relation (or even just
> fail to find a relation in it). Not immediately obvious that this is
> a net win overall, is it? If it were easier to do, or more obviously
> a large performance gain, someone would likely have tried it by now ...
> but there is lots of lower-hanging fruit to keep us busy.

Hard to say which is faster... lots of testing would be needed; you are right.
And what about skipping the lseek when continuously reading a relation?
Is that possible?

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Denis Perchine (#5)
Re: Caching number of blocks in relation to avoid lseek.

Denis Perchine <dyp@perchine.com> writes:

> And what about skipping the lseek when continuously reading a relation?
> Is that possible?

In a pure read scenario the way it's supposed to work is that an
lseek(END) is done once at the start of each sequential scan, and we
save that value in rd_nblocks.  Then we read rd_nblocks pages and stop.
By the time we finish the scan there might be more pages in the relation
(added by other backends, or even by ourselves if it's an update query).
But those pages cannot contain any tuples that could be visible to the
current scan, so it's OK if we don't read them.  However, we do need a
new lseek() --- or some other way to verify the right table length
--- at least once per transaction start or CommandCounterIncrement.
Either of those events could make new tuples visible to us.

I think there may be some code paths that cause us to do more than just
one lseek(END) during scan startup, and if so it'd be worthwhile to try
to get rid of the extra lseeks(). But you'd have to be careful that
you didn't remove any lseeks that are essential in some other paths.
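
As a rough sketch of that pattern (hypothetical helper names, not the actual
heapam code), the scan takes one length snapshot up front and then never asks
the kernel again:

/* Sketch of the intended seqscan pattern: one lseek(SEEK_END) at scan
 * start, then a fixed number of sequential page reads. */
#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* One length check per scan, cached the way rd_nblocks is. */
static long
scan_start_nblocks(int fd)
{
    off_t bytes = lseek(fd, 0, SEEK_END);
    return (bytes < 0) ? -1 : (long) (bytes / BLCKSZ);
}

static void
seq_scan(int fd, void (*process_page)(const char *))
{
    char page[BLCKSZ];
    long nblocks = scan_start_nblocks(fd);  /* snapshot of the length */

    /* Pages appended after this point cannot hold tuples visible to
     * this scan, so we deliberately stop at the snapshot length. */
    lseek(fd, 0, SEEK_SET);
    for (long blkno = 0; blkno < nblocks; blkno++)
    {
        if (read(fd, page, BLCKSZ) != BLCKSZ)
            break;              /* short read: stop the scan */
        process_page(page);
    }
}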

regards, tom lane

#7 Denis Perchine
dyp@perchine.com
In reply to: Tom Lane (#6)
Re: Caching number of blocks in relation to avoid lseek.

On Tue, 13 Jun 2000, Tom Lane wrote:

> Denis Perchine <dyp@perchine.com> writes:
>
> > And what about skipping the lseek when continuously reading a relation?
> > Is that possible?
>
> In a pure read scenario the way it's supposed to work is that an
> lseek(END) is done once at the start of each sequential scan, and we
> save that value in rd_nblocks.  Then we read rd_nblocks pages and stop.
> By the time we finish the scan there might be more pages in the relation
> (added by other backends, or even by ourselves if it's an update query).
> But those pages cannot contain any tuples that could be visible to the
> current scan, so it's OK if we don't read them.  However, we do need a
> new lseek() --- or some other way to verify the right table length
> --- at least once per transaction start or CommandCounterIncrement.
> Either of those events could make new tuples visible to us.
>
> I think there may be some code paths that cause us to do more than just
> one lseek(END) during scan startup, and if so it'd be worthwhile to try
> to get rid of the extra lseeks(). But you'd have to be careful that
> you didn't remove any lseeks that are essential in some other paths.

No... you did not get me; I am talking about a completely different thing.
I strace'd the postgres binary while running queries and found that it
does an lseek after each read, and the difference in position is 8192,
i.e. exactly one block. That means we are already at the correct position
and do not need the additional lseek.
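
In other words, the syscall pattern looks roughly like the following sketch
(illustrative only, not the actual smgr code): when reads are sequential,
the lseek before each read moves the offset to exactly where it already is:

/* Illustration of the redundant pattern observed under strace: the
 * storage layer seeks to blkno * BLCKSZ before every read, even when
 * sequential reads have already left the kernel offset right there. */
#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ 8192

static ssize_t
read_block(int fd, long blkno, char *page)
{
    /* After reading block N the offset is already (N + 1) * BLCKSZ,
     * so for a sequential scan this lseek moves nowhere, yet it is
     * still one system call per block. */
    if (lseek(fd, (off_t) blkno * BLCKSZ, SEEK_SET) < 0)
        return -1;
    return read(fd, page, BLCKSZ);
}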

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Denis Perchine (#7)
Re: Caching number of blocks in relation to avoid lseek.

Denis Perchine <dyp@perchine.com> writes:

> No... you did not get me; I am talking about a completely different thing.
> I strace'd the postgres binary while running queries and found that it
> does an lseek after each read, and the difference in position is 8192,
> i.e. exactly one block. That means we are already at the correct position
> and do not need the additional lseek.

Oh. Hmm. Not sure if it's really worth the trouble, but you could try
having fd.c keep track of the current seek position of VFDs when they
are open as well as when they are closed, and optimize away the lseek
call in FileSeek if the position is already right. You'd have to think
carefully about what to do if a read or write fails, however --- where
has the kernel left its seek position in that case? Possibly this could
be dealt with by setting the internal position to "unknown" anytime
we're not perfectly sure where the kernel is.

regards, tom lane

#9 Alfred Perlstein
bright@wintelcom.net
In reply to: Tom Lane (#8)
Re: Caching number of blocks in relation to avoid lseek.

* Tom Lane <tgl@sss.pgh.pa.us> [000612 19:37] wrote:

> Denis Perchine <dyp@perchine.com> writes:
>
> > No... you did not get me; I am talking about a completely different thing.
> > I strace'd the postgres binary while running queries and found that it
> > does an lseek after each read, and the difference in position is 8192,
> > i.e. exactly one block. That means we are already at the correct position
> > and do not need the additional lseek.
>
> Oh. Hmm. Not sure if it's really worth the trouble, but you could try
> having fd.c keep track of the current seek position of VFDs when they
> are open as well as when they are closed, and optimize away the lseek
> call in FileSeek if the position is already right. You'd have to think
> carefully about what to do if a read or write fails, however --- where
> has the kernel left its seek position in that case? Possibly this could
> be dealt with by setting the internal position to "unknown" anytime
> we're not perfectly sure where the kernel is.

Have you thought of using pread/pwrite, which are available on many
modern platforms?

ssize_t
pread(int d, void *buf, size_t nbytes, off_t offset)

pread() performs the same function as read(), but reads from the
specified position in the file without modifying the file pointer.

I'm unsure how the PostgreSQL system uses its fds; however, if they
are really shared via dup() or across fork/exec, they will share
the same offset across multiple instances of the fd. The only way
around this behavior is to reopen the file in each process to get
a private offset.
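
For reference, a small standalone C program (a hypothetical example; any
readable file will do in place of /etc/hosts) demonstrating that pread()
leaves the kernel's file offset untouched:

/* Minimal demonstration that pread() does not move the file offset,
 * so processes sharing an fd cannot disturb each other's position. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char    buf[16];
    int     fd = open("/etc/hosts", O_RDONLY);

    if (fd < 0)
        return 1;

    /* Read 16 bytes starting at offset 32 without seeking. */
    ssize_t n = pread(fd, buf, sizeof(buf), 32);

    /* The kernel's file offset is still 0: pread never moved it. */
    printf("read %zd bytes; offset is now %ld\n",
           n, (long) lseek(fd, 0, SEEK_CUR));

    close(fd);
    return 0;
}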

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

#10 Denis Perchine
dyp@perchine.com
In reply to: Tom Lane (#8)
1 attachment(s)
Patch for Re: [HACKERS] Caching number of blocks in relation to avoid lseek.

> Oh. Hmm. Not sure if it's really worth the trouble, but you could try
> having fd.c keep track of the current seek position of VFDs when they
> are open as well as when they are closed, and optimize away the lseek
> call in FileSeek if the position is already right. You'd have to think
> carefully about what to do if a read or write fails, however --- where
> has the kernel left its seek position in that case? Possibly this could
> be dealt with by setting the internal position to "unknown" anytime
> we're not perfectly sure where the kernel is.

If a read or write fails, the position will be left the same. This
situation is already tracked in the File routines, but a little bit
incorrectly.

Here is the full patch for this. It reduces the number of lseek calls
tenfold for an update statement and twentyfold for a select statement.
I tested a joined update and a count(*) select on a table with more than
170,000 rows and 10 indices. I think this is worth trying: before, lseek
calls accounted for more than 5% of the time; now they are 0.89% and
0.15% respectively.

Since only one file is modified, the patch should be applied in the
src/backend/storage/file/ directory.

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Attachments:

fd.c.patch (text/plain)
Index: fd.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.59
diff -c -b -w -r1.59 fd.c
*** fd.c	2000/06/02 15:57:24	1.59
--- fd.c	2000/06/13 08:16:06
***************
*** 841,849 ****
  				elog(ERROR, "FileSeek: invalid whence: %d", whence);
  				break;
  		}
! 	}
! 	else
  		VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
  	return VfdCache[file].seekPos;
  }
  
--- 841,863 ----
  				elog(ERROR, "FileSeek: invalid whence: %d", whence);
  				break;
  		}
! 	} else
! 		switch (whence) {
! 			case SEEK_SET:
! 				if (VfdCache[file].seekPos != offset)
! 					VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
! 				break;
! 			case SEEK_CUR:
! 				if (offset != 0)
  					VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ 				break;
+ 			case SEEK_END:
+ 				VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ 				break;
+ 			default:
+ 				elog(ERROR, "FileSeek: invalid whence: %d", whence);
+ 				break;
+ 		}
  	return VfdCache[file].seekPos;
  }
  
#11 Denis Perchine
dyp@perchine.com
In reply to: Denis Perchine (#10)
1 attachment(s)
Patch 0.2 for Re: [HACKERS] Caching number of blocks in relation to avoid lseek.

> If a read or write fails, the position will be left the same. This
> situation is already tracked in the File routines, but a little bit
> incorrectly.

After a small survey of the Linux kernel code, I am not sure about that.
The new patch sets the position to unknown in case a read/write fails,
and then does the lseek again.

> Here is the full patch for this. It reduces the number of lseek calls
> tenfold for an update statement and twentyfold for a select statement.
> I tested a joined update and a count(*) select on a table with more than
> 170,000 rows and 10 indices. I think this is worth trying: before, lseek
> calls accounted for more than 5% of the time; now they are 0.89% and
> 0.15% respectively.
>
> Since only one file is modified, the patch should be applied in the
> src/backend/storage/file/ directory.

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Attachments:

fd.c.patch (text/x-c)
Index: fd.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/storage/file/fd.c,v
retrieving revision 1.59
diff -c -b -w -r1.59 fd.c
*** fd.c	2000/06/02 15:57:24	1.59
--- fd.c	2000/06/13 08:34:55
***************
*** 95,100 ****
--- 95,102 ----
  
  #define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
  
+ #define FileUnknownPos (-1)
+ 
  typedef struct vfd
  {
  	signed short fd;			/* current FD, or VFD_CLOSED if none */
***************
*** 790,795 ****
--- 792,799 ----
  	returnCode = read(VfdCache[file].fd, buffer, amount);
  	if (returnCode > 0)
  		VfdCache[file].seekPos += returnCode;
+ 	else
+ 		VfdCache[file].seekPos = FileUnknownPos;
  
  	return returnCode;
  }
***************
*** 806,816 ****
  
  	FileAccess(file);
  	returnCode = write(VfdCache[file].fd, buffer, amount);
! 	if (returnCode > 0)
  		VfdCache[file].seekPos += returnCode;
- 
  	/* mark the file as needing fsync */
  	VfdCache[file].fdstate |= FD_DIRTY;
  
  	return returnCode;
  }
--- 810,821 ----
  
  	FileAccess(file);
  	returnCode = write(VfdCache[file].fd, buffer, amount);
! 	if (returnCode > 0) {
  		VfdCache[file].seekPos += returnCode;
  		/* mark the file as needing fsync */
  		VfdCache[file].fdstate |= FD_DIRTY;
+ 	} else
+ 		VfdCache[file].seekPos = FileUnknownPos;
  
  	return returnCode;
  }
***************
*** 840,849 ****
  			default:
  				elog(ERROR, "FileSeek: invalid whence: %d", whence);
  				break;
- 		}
  	}
! 	else
  		VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
  	return VfdCache[file].seekPos;
  }
  
--- 845,870 ----
  			default:
  				elog(ERROR, "FileSeek: invalid whence: %d", whence);
  				break;
  		}
! 	} else
! 		switch (whence) {
! 			case SEEK_SET:
! 				if (offset < 0)
! 					elog(ERROR, "FileSeek: invalid offset: %ld", offset);
! 				if (VfdCache[file].seekPos != offset)
! 					VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
! 				break;
! 			case SEEK_CUR:
! 				if ((offset != 0) || (VfdCache[file].seekPos == FileUnknownPos))
  					VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ 				break;
+ 			case SEEK_END:
+ 				VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+ 				break;
+ 			default:
+ 				elog(ERROR, "FileSeek: invalid whence: %d", whence);
+ 				break;
+ 		}
  	return VfdCache[file].seekPos;
  }
  
#12 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Denis Perchine (#11)
Re: Patch 0.2 for Re: [HACKERS] Caching number of blocks in relation to avoid lseek.

Yoo, hoo. No one complained about this patch, so in it goes. I bet
this will be a real speedup on some platforms. I know some OSes don't
do lseek() as fast as we would like.

Thanks.

> > If a read or write fails, the position will be left the same. This
> > situation is already tracked in the File routines, but a little bit
> > incorrectly.
>
> After a small survey of the Linux kernel code, I am not sure about that.
> The new patch sets the position to unknown in case a read/write fails,
> and then does the lseek again.
>
> > Here is the full patch for this. It reduces the number of lseek calls
> > tenfold for an update statement and twentyfold for a select statement.
> > I tested a joined update and a count(*) select on a table with more than
> > 170,000 rows and 10 indices. I think this is worth trying: before, lseek
> > calls accounted for more than 5% of the time; now they are 0.89% and
> > 0.15% respectively.
> >
> > Since only one file is modified, the patch should be applied in the
> > src/backend/storage/file/ directory.
>
> --
> Sincerely Yours,
> Denis Perchine
>
> ----------------------------------
> E-Mail: dyp@perchine.com
> HomePage: http://www.perchine.com/dyp/
> FidoNet: 2:5000/120.5
> ----------------------------------

[ Attachment, skipping... ]

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026