Implementation Proposal For Add Free Behind Capability For Large Sequential Scan

Started by Amit Kumar Kharealmost 24 years ago4 messages

skamit2000@yahoo.com

almost 24 years ago

MP2 amitk Normal amitk 2 355 2002-02-23T13:22:00Z 2002-02-23T13:22:00Z 4 1910 10890 90 21 13373 9.2720 1
HiAll,

Hereis at your disposal my implementation Idea of a TODO item titled ��AddFree-Behind Capability For Large Sequential Scans��. I cordially invite yourcomments.

Part I:

Problem: The present default page replacement policy of PostgreSQLis Least Recently Used (LRU). This policy may be good for any other databaseaccess patterns but it is not efficient for Large Sequential Scan, since itmakes cache useless. For example:

Let shared buffer size be 1MB and the size of relation is 2MB. Lets assume sequential scan is applied onthis relation over and over again. The cache becomes useless because by the time the second read of the table happensthe first 1mb has already been forced out of the cache.

The problemappeared in the TODO list of PostgreSQL under section CACHE titled:

��AddFree-Behind Capability For Large Sequential Scans ��

Solution: In hispaper ��Operating System Support For Database Management��, ProfessorMichael Stonebraker identifies various database access patterns in INGRES(grand father of PostgreSQL) suggest that for sequential access to blocks,which will be cyclically rereferenced, the buffer management strategy should beto Toss Immediately, unless all n pages of relation can be kept in the cache.He further suggest that for buffer management to be of some use to applicationlike DBMS there should be some provision that it accepts ��advice�� from anapplication concerning replacement strategy.

Considering thesesuggestions as our baseline we have decided to implement Most Recently Used(MRU) cache replacement policy. Our task will be to identify a large sequentialscan and then advice the buffer manager to apply MRU for cache replacement.

Detailed Plan ofImplementation:

(I) Identifyingthe query execution flow control

Goal of identifying the query execution flow: (i) This is necessary as we have to figure out a spotwhere we will identify that the block access is actually a large sequentialscan and from this point onward we think to send advice (in form of parameter )to buffer manager.��

(ii) Through this query execution trace our aim is toreach at one of the files listed below named ��freelist.c��. This is the filewhere all the funcitions related to replacement policy are implemented. Henceour trace ends after reaching ��freelist.c��.

(1) Files Involved (Important files listed):

(1) bin/psql/mainloop.c

(2) bin/psql/common.c

(3) src/interfaces/libpq/fe-exec.c

(4) src/backend/tcop/postgres.c

(5) src/backend/tcop/pquery.c

(6) src/backend/executor/execMain.c

(7) src/backend/executor/execProcnode.c

(8) src/backend/access/heap/heapam.c

(9) src/backend/utils/cache/relcache.c

(10)src/backend/utils/cache/relcache.c

(11)src/backend/utils/hash/dynahash.c

(12)src/backend/storage/smgr/smgr.c

(13)src/include/utils/rel.h

(14)src/backend/access/common/scankey.c

(15)src/backend/storage/buffer/bufmgr.c

(16)src/backend/storage/buffer/buf_table.c

(17)src/backend/storage/buffer/freelist.c

(18)src/backend/executor/nodeSeqscan.c

(19)src/backend/executor/execScan.c

(20)src/backend/executor/execTuples.c

(2) Function Call Trace:

Notation Convention: (i) Arrow �� will represent call

�� (ii)Functions are listed as

��<FileName>.<FunctionName> for example common. SendQuery�� i.e. SendQuery function is defined in file name common

(i) Querysubmitted to psql

mainloop.Mainloop�� common.SendQuery

common.SendQuery �� fe-exec.PQexec

- SendQuery send the querystring to the backen and is a ��front door�� way to send a query.

- PQexec sends the queryto the backend and package up the results.

(ii) QueryExecution from ��postgres�� backend process

postgres. pg_exec_query_string �� pquery.ProcessQuery

pquery.ProcessQuery �� execMain.ExecutorStart

(Thisroutine is called at the beginning of any execution of any

queryplan)

pquery.ProcessQuery �� execMain.ExecutorRun

(This is the main routine of the executor module. It accepts

�� the query descriptor from the traffic copand executes the

�� query plan.)

pquery.ProcessQuery �� execMain.ExecutorEnd

(Thisroutine must be called at the end of execution of any

��query plan)

we will follow the executiontrace from execMain.ExecutorRun

execMain.ExecutorRun �� execMain.ExecutePlan ��execProcnode.ExecProcNode �� nodeSeqscan.ExecSeqScan �� execScan.ExecScan

ExecSeqScan, scans therelation sequentially and returns the next qualifying

tuple. It passes�� SeqNext�� as the accessmethod to ExecScan as one of the parameters.

SeqNext is actually afunction defined in the file nodeSeqscan this function is called through afunction pointer (slot=(*accessMtd) (node)) in an infinite forloop untill the slot returned by it contains NULL.

execScan.ExecScan �� nodeSeqscan.SeqNext ��heapam.heap_getnext �� heapam.heapgettup

heapgettup returnsimmediately if the relation is empty after making a call tobufmgr.ReleaseBuffer. Otherwise bufmgr.ReleaseAndReadBuffer is called.

bufmgr.ReleaseAndReadBuffer�� bufmgr.ReadBufferInternal

ReadBufferInternal callssmgr.smgrnblocks to get the accurate number of POSTGRES blocks in the suppliedrelation if caller asks for P_NEW.

bufmgr.ReadBufferInternal�� bufmgr.BufferAlloc

BufferAlloc creates a newbuffer tag and search (buf_table.BufTableLookup) it in the bufferpool whether it is already present in it. If found in the buffer pool then itpins the buffer and starts the buffer IO (bufmgr.StartBufferIO)

��If it is not found in the buffer pool. It�� initializes a new buffer. Grabs one from thefree list by calling freelist.GetFreeBuffer which returns NULL if all thebuffers are already pinned.

BufferAlloc at times callsfreelist.UnpinBuffer for example if somebody pins the buffer while the currenttransaction was doing I/O. it clears its I/O flags, remove its pin and startall over again. (see comments and code in bufmgr.c from line 456 onward).

bufmgr.BufferAlloc �� freelist.UnpinBuffer ��freelist.AddBufferToFreelist.

( 3) Analysis of�� Execution trace:

(i) freelist.AddBufferToFreelist is the right place to implement any bufferreplacement policy. Look at the comments of this function it clearly states

��In theory, this is the only routine that needs tobe changed if the buffer replacement strategy changes. Just change

themanner in which buffers are added to the freelist queue. Currently, they areadded on an LRU basis.��

(ii)We feel bufmgr.ReadBufferInternalis the right place where we can guess whether the buffer access is LargeSequential Scan or not. We say this because

(a) We get a pointer (reln)to the ��Relation�� as one of the parameter through which we canget the size of the relation (size = reln -> rd_nblocks)

(b) Secondly if�� the ��blockNum�� (one of theparameters passesed to it) is equal to P_NEW (this contains ��InvalidBlockNumber��aconstant defined in block.h, means grow the file to get a new page), ReadBufferInternalmakes a call to the function smgr.smgrnblocks to get accuratefile length.

Henceat this point we have accurate knowledge of size of the relation being scaned.

(iii)�� An external varable ��NBuffers��declared in bufmgr.h and defined in globals.c,contains the size of shared buffer.

(iv) Hence in this functionwe can make a guess whether the scan is large sequential or not by comparingrelation size and shared buffer size. If relation��s size is�� greater than shared buffer��s sizeclearly it is a large sequential scan.

(4) First ProbableImplementation Details:

(i) We are thinking ofdeclaring constant in bufmgr.h as following

#define LRU ��1

#define MRU�� 2

(ii) These constant can bepasses further upto freelist.AddBufferToFreelist. In thisfunction after making a check appropriate policy can be applied.

(iii) Functionsaffected by the implementation:

Nearlyall the functions in the freelist.c will be changed to implement MRU inaddition two functions bufmgr.ReadBufferInternal and bufmgr.BufferAllocare most likely to be changed.

Second ProbableImplementation Details:

We assume that LRU��s ��SharedFreeList��circular queue keeps all the least recently used buffers at the head of thequeue and most recently used buffers at the tail. If this is the case thenthere is absolutely no need to change the AddBufferToFreelist function infreelist.c, let it add the buffers to shared free list as it is presentlydoing. We just change the freelist.GetFreeBuffer so that insteadof removing (LRU) buffers from head of the queue for MRU scheme we will removethe (MRU) buffers from its tail. ( we are afraid whether GetFreeBuffer isthe only function that gives out the buffer from free list. Just in case ifthere is some other function too, then, yet we have to trace it out )��

PART II:

Testing Objective:

(i) First, We have to test performance of new alogrithm MRUthat we intend to implement. Then we have to look whether we need specificsequential scan code or not.

(ii) Second, We would like to compare the performance of MRUwith that of LRU-K algorithm and more specifically LRU-2.

LRU-K a brief introduction:

LRU-K is a new approach to database disk buffering proposed byElizabeth J. O��Neil, Patrick E. O��Neil and Gerhard Weikum. The basic idea ofLRU-K is to keep track of the times of the last K references to populardatabase pages, using this information to statistically estimate theinterarrival times of references on a page to page basis. The algorithm isself-tuning and adaptive to evolving access patterns.

A patch for LRU-2 was provided to us by Mr. Bruce Momjian.The algorithm was implemented and tested by Mr Tom Lane. Both are fromPostgreSQL development group. Mr. Tom Lane had tested the algorithm but histest were not that good and the tests did not contain any sequential scan. Soit would be an interesting exercise.

Testing Plan:

(i) We plan to do a big sequential scan and at the same time do asmall sequential scan What shouldhappen is that the small sequntial scan should keep its pages in the bufferwhile the large scan is happening if itdoesn't, we have a problem.

(ii) We have populated two tables namely simple and simple1 withonly one integer field. We have populated these tables with 8192 values. Wehave to make one of the table even more bigger so that one will be used forlarge sequential scan and another will be used for small sequential scan.

(iii) Then we plan to put some simple queries like given below in afile say ��samplequeries.txt��

Select *from simple where x = 98765432;

Select *from simple1 where y = 567890432;

Where lets assume simple is largetable and simple1 is smaller table. The large file is should have to be largerthan shared buffer size and smaller file should have to be say 10% smaller thanshared buffer size.

(iv) Then we have to run psql with ��f flag giving thefile as parameter. While doing that we will keep track of the time.

(v) The other way is that we have set the various statistics flagstrue like parser statistics, planner statistics, executor statistics etc. Allthe statistics gets collected into a logfile.

Following is one of the samplestatistics generated during our sample query test run on default LRU algorithm:

2002-02-2218:40:58 DEBUG:�� EXECUTOR STATISTICS

! systemusage stats:

!�� 0.050108 elapsed 0.000000 user 0.000000system sec

!�� [0.050000 user 0.130000 sys total]

!�� 4/0 [50/0] filesystem blocks in/out

!�� 4/0 [50/0] page faults/reclaims, 0 [0]swaps

!�� 0 [0] signals rcvd, 0/0 [2/5] messagesrcvd/sent

!�� 4/0 [149/17] voluntary/involuntarycontext switches

!postgres usage stats:

!�� Shared blocks:�� 3 read,�� 0 written, buffer hit rate = 25.00%

!�� Local��blocks:�� 0 read,�� 0 written, buffer hit rate = 0.00%

!�� Direct blocks:�� 0 read,�� 0 written

(vi) Another way is to run the query with EXPLAIN command inVERBOSE and ANALYZE mode. This prints the total execution time ofthe query along with other matrics like cost of sequential scan startup andtotal cost etc.

Yardstick of Measurement:

(i) We will run the test queries on MRU and LRU-2 implementation.If the execution time for MRU is less than that of LRU-2 then We have to thinkof having a specific code for sequential scan.

(ii) Otherwise if the execution time of LRU-2 is less than or closeto the MRU execution time than we would suggest replacing the default LRU codeof PostgreSQL with LRU-2 algorithm. Since than there will be only one alogrithmfor all the access patterns and no need of ��advicing�� the buffer manager forreplacement policy and no need of having different algorithms for largesequential access.

��

Conclusion:

�� Toconclude, fetching data from disk requires at least a factor of 1000 more timethan fetching data from a RAM buffer. For this reason, good use of the buffercan significantly improve the throughput and response time fo anydata-intensive system. Equally important is choosing an efficient buffer replacementalgorithm. Thus we aim at improving the performance of PostgreSQL by studyingwhich algorithm among MRU and LRU-2 will prove to be the best bet.

We thank Professor Kevin Chang and TAs for introducing us tosuch a life time opportunity and for encouraging us to enhance PostgreSQL.

We thank Mr. Bruce Momjian for all his timely help andadvices that�� he had rendered to uswhole heartedly. It is his encouragement that our efforts are directed twordsreally contributing to PostgreSQL development.

Amit Kumar Khare

---------------------------------
Do You Yahoo!?
Yahoo! Sports - Coverage of the 2002 Olympic Games

Neil Padgett

npadgett@redhat.com

almost 24 years ago

In reply to: Amit Kumar Khare (#1)

Re: Implementation Proposal For Add Free Behind

On Sat, 2002-02-23 at 08:30, Amit Kumar Khare wrote:

Considering thesesuggestions as our baseline we have decided to

implement Most Recently Used(MRU) cache replacement policy. Our task
will be to identify a large sequentialscan and then advice the buffer
manager to apply MRU for cache replacement.

This could lead to poor buffer performance in the case of multiple
backends simultaneously executing queries. How will you handle this?

Neil

--
Neil Padgett
Red Hat Canada Ltd. E-Mail: npadgett@redhat.com
2323 Yonge Street, Suite #300,
Toronto, ON M4P 2C9

Amit Kumar Khare

skamit2000@yahoo.com

almost 24 years ago

In reply to: Neil Padgett (#2)

Re: Implementation Proposal For Add Free Behind Capability For Large Sequential Scan

--- Neil Padgett <npadgett@redhat.com> wrote:

On Sat, 2002-02-23 at 08:30, Amit Kumar Khare wrote:

Considering thesesuggestions as our baseline we

have decided to
implement Most Recently Used(MRU) cache replacement
policy. Our task
will be to identify a large sequentialscan and then
advice the buffer
manager to apply MRU for cache replacement.

Neil wrote:

This could lead to poor buffer performance in the
case of multiple
backends simultaneously executing queries. How will
you handle this?

Certainly it's cause of concern. I am looking for
literature regarding it. I kindly request you all to
inform me if any body finds some thing related to it.

Thanks and Regards
Amit Khare

--
Neil Padgett
Red Hat Canada Ltd. E-Mail:
npadgett@redhat.com
2323 Yonge Street, Suite #300,
Toronto, ON M4P 2C9

__________________________________________________
Do You Yahoo!?
Yahoo! Sports - Coverage of the 2002 Olympic Games
http://sports.yahoo.com

Bruce Momjian

pgman@candle.pha.pa.us

almost 24 years ago

In reply to: Amit Kumar Khare (#3)

Re: Implementation Proposal For Add Free Behind Capability

Amit Kumar Khare wrote:

--- Neil Padgett <npadgett@redhat.com> wrote:
On Sat, 2002-02-23 at 08:30, Amit Kumar Khare wrote:

Considering thesesuggestions as our baseline we

have decided to
implement Most Recently Used(MRU) cache replacement
policy. Our task
will be to identify a large sequentialscan and then
advice the buffer
manager to apply MRU for cache replacement.

Neil wrote:

This could lead to poor buffer performance in the
case of multiple
backends simultaneously executing queries. How will
you handle this?

Certainly it's cause of concern. I am looking for
literature regarding it. I kindly request you all to
inform me if any body finds some thing related to it.

My guess is that we should test MRU for large sequenatial scan vs. LRU-K
and see how the compare in various tests.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026