Race condition within _bt_findinsertloc()? (new page split code)
While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():
        /*
         * If this page was incompletely split, finish the split now. We
         * do this while holding a lock on the left sibling, which is not
         * good because finishing the split could be a fairly lengthy
         * operation. But this should happen very seldom.
         */
        if (P_INCOMPLETE_SPLIT(lpageop))
        {
            _bt_finish_split(rel, rbuf, stack);
            rbuf = InvalidBuffer;
            continue;
        }
The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement. It's the
first sibling iff we're on our first iteration of the nested for(;;)
loop in _bt_findinsertloc(). So the buffer lock held on this left
sibling may constitute what in the past I've called a "value lock";
we've established the right to insert our value into the unique index
at this point, and the lock will only be released when we're done
(regardless of whether or not that buffer/page value lock is on the
buffer/page we'll ultimately insert into, or an earlier one).
Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:
* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.
* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.
* A third inserter of the value 7 comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this). It exclusive locks that later page/buffer before the second guy
gets a chance to lock it once again. It establishes the right to
insert with _bt_check_unique(), undeterred by the second inserter's
buffer lock/"value lock". The value lock is effectively skipped over.
* Both the second and third inserters have "established the right" to
insert the same value, 7, and both do so. The unique index has an
MVCC-snapshot-wise spurious duplicate, and so is corrupt.
Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different to the
existing one.
This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.
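In code terms, the hardening I have in mind is something like this
(just a sketch -- the "return false" convention is made up for
illustration, since the real function doesn't report failure today;
_bt_relbuf() and _bt_finish_split() are the existing routines):

        if (P_INCOMPLETE_SPLIT(lpageop))
        {
            Buffer      splitbuf = rbuf;    /* page whose split must be finished */

            /* give up the original "value lock" before doing any extra work */
            _bt_relbuf(rel, buf);

            /* complete the interrupted split; this releases splitbuf's lock */
            _bt_finish_split(rel, splitbuf, stack);

            /* report "no location found"; _bt_doinsert() re-descends the tree */
            return false;
        }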
It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But take another look:
_bt_finish_split() releases its locks, so once again only the original
value lock is held when it returns, and the new downlink is there to
let a later descent skip the value-locked page. So the bug does appear
to exist (assuming that I haven't failed to consider some third factor
and am not otherwise mistaken).
Thoughts?
--
Peter Geoghegan
On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():

        /*
         * If this page was incompletely split, finish the split now. We
         * do this while holding a lock on the left sibling, which is not
         * good because finishing the split could be a fairly lengthy
         * operation. But this should happen very seldom.
         */
        if (P_INCOMPLETE_SPLIT(lpageop))
        {
            _bt_finish_split(rel, rbuf, stack);
            rbuf = InvalidBuffer;
            continue;
        }

The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement.
No, it's not "the first page this key could be on".
_bt_findinsertloc() does *not* hold a lock on the first valid page the
key could go to. It merely ensures that when it steps to the next page,
it releases the lock on the previous page only after acquiring the lock
on the next page. Throughout the operation, it will hold a lock on
*some* page that could legally hold the inserted value, and it acquires
the locks in left-to-right order. This is sufficient for the uniqueness
checks, because _bt_check_unique() scans all the pages, and
_bt_check_unique() *does* hold the first page locked while it scans the
rest of the pages. But _bt_findinsertloc() does not.
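To make that concrete, the stepping-right part of _bt_findinsertloc()
looks roughly like this (a simplified sketch of the 9.4 code; the outer
free-space and high-key checks, and the fell-off-the-end error case,
are not shown):

        rbuf = InvalidBuffer;
        rblkno = lpageop->btpo_next;
        for (;;)
        {
            rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
            page = BufferGetPage(rbuf);
            lpageop = (BTPageOpaque) PageGetSpecialPointer(page);

            /* finish an interrupted split of the page we just stepped onto */
            if (P_INCOMPLETE_SPLIT(lpageop))
            {
                _bt_finish_split(rel, rbuf, stack);
                rbuf = InvalidBuffer;
                continue;       /* re-lock and re-examine the same page */
            }

            if (!P_IGNORE(lpageop))
                break;          /* found a live page to move onto */

            rblkno = lpageop->btpo_next;
        }
        _bt_relbuf(rel, buf);   /* release the previous page only now */
        buf = rbuf;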
Also note that _bt_moveright() also finishes any incomplete splits it
encounters (when called for an insertion). So _bt_findinsertloc() never
gets called on a page with the incomplete-split flag set. It might
encounter one when it moves right, but never the first page.
Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.
If I understood correctly, the tree looks like this before the insertion:
Parent page:
+-------------+
| |
| 9 -> A |
+-------------+
Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+
And after insertion and incomplete split:
Parent page
+-------------+
| |
| 9 -> A |
+-------------+
Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+
where 7* is the newly inserted key, with value 7.
(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above
split)
* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.
Hmm, I think you got confused at this step. When inserting a 7, you
cannot "look further right" to find a page with more space, because the
HI-key, 8, on the first page stipulates that 7 must go on that page, not
some later page.
* A third inserter of the value comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this).
That's a contradiction: the downlink with value 8 points to the first
page, not some later page. After the split is finished, the tree looks
like this:
Parent page
+-------------+
| 8 -> A |
| 9 -> B |
+-------------+
Leaf A Leaf B
+-------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | <-> | |
| 7 7* 8 | | 9 |
+-------------+ +-------------+
Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different to the
existing one.

This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.
Yeah, that would work too. It seems safe enough as it is, though, so I
don't see the point.
It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But when you take another
look, and realize that _bt_finish_split() releases its locks, and so
once again only the original value lock will be held when
_bt_finish_split() returns, and so the downlink is there to skip the
value locked page, you realize that the bug does exist (assuming that
I haven't failed to consider some third factor and am not otherwise
mistaken).
When inserting, the scan for the right insert location always begins
from the first page where the key can legally go to. Inserting a missing
downlink doesn't change what that page is - it just makes it faster to
find, by reducing the number of right-links you need to follow.
PS. Thanks for looking into this again! These B-tree changes really need
thorough review.
- Heikki
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement.No, it's not "the first page this key could be on".
Well, it may be initially. I could have been more cautious about the
terminology here.
Also note that _bt_moveright() also finishes any incomplete splits it
encounters (when called for an insertion). So _bt_findinsertloc() never gets
called on a page with the incomplete-split flag set. It might encounter one
when it moves right, but never the first page.
Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?). It's a nice property, since it
makes the extra locking while completing a split within
_bt_findinsertloc() particularly infrequent. But, that might also be a
bad thing, when considered from a different perspective.
If I understood correctly, the tree looks like this before the insertion:
Parent page:
+-------------+
| |
| 9 -> A |
+-------------+

Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+

And after insertion and incomplete split:
Parent page
+-------------+
| |
| 9 -> A |
+-------------+

Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+
After the split is finished, the tree looks like this:
Parent page
+-------------+
| 8 -> A |
| 9 -> B |
+-------------+

Leaf A Leaf B
+-------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | <-> | |
| 7 7* 8 | | 9 |
+-------------+ +-------------+
How did the parent page change between before and after the final
atomic operation (page split completion)? What happened to "9 -> A"?
Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict.

Yeah, that would work too. It seems safe enough as it is, though, so I don't
see the point.
Well, it would be nice to not have to finish the page split in what is
a particularly critical path, with that extra buffer lock. It's not
strictly necessary, but then it is theoretically safer, and certainly
much clearer. The fact that this code is so seldom executed is one
issue that made me revisit this. On the other hand, I can see why
you'd want to avoid cluttering up the relatively comprehensible
_bt_doinsert() function if it could be avoided. I defer to you.
PS. Thanks for looking into this again! These B-tree changes really need
thorough review.
You're welcome. Hopefully my questions will lead you in a useful
direction, even if my concerns turn out to be, in the main, unfounded.
:-)
It previously wasn't in evidence that you'd considered these
interactions, and I feel better knowing that you have.
--
Peter Geoghegan
On 05/27/2014 09:47 PM, Peter Geoghegan wrote:
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Also note that _bt_moveright() also finishes any incomplete splits it
encounters (when called for an insertion). So _bt_findinsertloc() never gets
called on a page with the incomplete-split flag set. It might encounter one
when it moves right, but never the first page.

Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?).
Right.
If I understood correctly, the tree looks like this before the insertion:
Parent page:
+-------------+
| |
| 9 -> A |
+-------------+

Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+

And after insertion and incomplete split:
Parent page
+-------------+
| |
| 9 -> A |
+-------------+

Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+

After the split is finished, the tree looks like this:
Parent page
+-------------+
| 8 -> A |
| 9 -> B |
+-------------+

Leaf A Leaf B
+-------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | <-> | |
| 7 7* 8 | | 9 |
+-------------+ +-------------+

How did the parent page change between before and after the final
atomic operation (page split completion)? What happened to "9 -> A"?
Ah, sorry, I got that wrong. The downlinks store the *low* key of the
child page, not the high key as I depicted. Let me try again:
On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.
So, initially the tree looks like this:
Parent page:
+-------------+
| |
| 7 -> A |
+-------------+
Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+
And after insertion and incomplete split:
Parent page
+-------------+
| |
| 7 -> A |
+-------------+
Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+
where 7* is the newly inserted key, with value 7.
(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above
split)
* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.
The downlink of the original page cannot contain 9. Because, as I now
remember ;-), the downlinks contain low-keys, not high keys.
- Heikki
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
page, not the high key as I depicted. Let me try again:
Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.
--
Peter Geoghegan
On 05/27/2014 11:30 PM, Peter Geoghegan wrote:
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.
Sure:
Parent page
+-------------+
| 7 -> A |
| 8 -> B |
+-------------+
Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | | |
| | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+
- Heikki
Claudio Freire and I are proposing new functionality for Postgresql
to extend the scope of prefetching and also exploit posix asynchronous IO
when doing prefetching, and have a patch based on 9.4dev
ready for consideration.
This topic has cropped up at irregular intervals over the years,
e.g. this thread back in 2012
www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
and this thread more recently
/messages/by-id/CAGTBQpaFC_z=zdWVAXD8wWss3v6jxZ5pNmrrYPsD23LbrqGvgQ@mail.gmail.com
We now have an implementation which gives useful performance improvement
as well as other advantages compared to what is currently available,
at least for certain environments.
Below I am pasting the README we have written for this new functionality,
which mentions some of the measurements, advantages (and disadvantages).
We welcome any and all comments on this.
I will send the patch to commitfest later, once this email is posted to hackers,
so that anyone who wishes can try it, or apply directly to me if you wish.
The patch is currently based on 9.4dev but a version based on 9.3.4
will be available soon if anyone wants that. The patch is large (43 files)
so non-trivial to review, but any comments on it (when posted) will be
appreciated and acted on. Note that at present the only environment
in which it has been applied and tested is linux.
John Lumby
__________________________________________________________________________________________________
Postgresql -- Extended Prefetching using Asynchronous IO
============================================================
Postgresql currently (9.3.4) provides a limited prefetching capability
using posix_fadvise to give hints to the Operating System kernel
about which pages it expects to read in the near future.
This capability is used only during the heap-scan phase of bitmap-index scans.
It is controlled via the effective_io_concurrency configuration parameter.
This capability is now extended in two ways :
. use asynchronous IO into Postgresql shared buffers as an
alternative to posix_fadvise
. Implement prefetching in other types of scan :
. non-bitmap (i.e. simple) index scans - index pages
currently only for B-tree indexes.
(developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
. non-bitmap (i.e. simple) index scans - heap pages
currently only for B-tree indexes.
. simple heap scans
Posix asynchronous IO is chosen as the function library for asynchronous IO,
since this is well supported and also fits very well with the model of
the prefetching process, particularly as regards checking for completion
of an asynchronous read. On linux, Posix asynchronous IO is provided
in the librt library. librt uses independently-schedulable threads to
achieve the asynchronicity, rather than kernel functionality.
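As a reminder of the programming model, a minimal standalone example of
the librt calls involved looks like this (illustration only, not
Postgresql code; error handling abbreviated; compile with "gcc -lrt"):

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char                 buf[8192];
        struct aiocb         cb;
        const struct aiocb  *list[1];
        int                  fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0)             /* start the asynchronous read */
            return 1;

        /* ... the caller can do useful work here while librt reads ... */

        list[0] = &cb;
        aio_suspend(list, 1, NULL);         /* wait for completion if needed */

        if (aio_error(&cb) == 0)
            printf("read %zd bytes asynchronously\n", aio_return(&cb));
        return 0;
    }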
In this implementation, use of asynchronous IO is limited to prefetching
while performing one of the three types of scan
. B-tree bitmap index scan - heap pages (as already exists)
. B-tree non-bitmap (i.e. simple) index scans - index and heap pages
. simple heap scans
on permanent relations. It is not used on temporary tables nor for writes.
The advantages of Posix asynchronous IO into shared buffers
compared to posix_fadvise are :
. Beneficial for non-sequential access patterns as well as sequential
. No restriction on the kinds of IO which can be used
(other kinds of asynchronous IO impose restrictions such as
buffer alignment, use of non-buffered IO).
. Does not interfere with standard linux kernel read-ahead functionality.
(It has been stated in
www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
that :
"the kernel stops doing read-ahead when a call to posix_fadvise comes.
I noticed the performance hit, and checked the kernel's code.
It effectively changes the prediction mode from sequential to fadvise,
negating the (assumed) kernel's prefetch logic")
. When the read request is issued after a prefetch has completed,
no delay associated with a kernel call to copy the page from
kernel page buffers into the Postgresql shared buffer,
since it is already there.
Also, in a memory-constrained environment, there is a greater
probability that the prefetched page will "stick" in memory
since the linux kernel victimizes the filesystem page cache in preference
to swapping out user process pages.
. Statistics on prefetch success can be gathered (see "Statistics" below)
which helps the administrator to tune the prefetching settings.
These benefits are most likely to be obtained in a system whose usage profile
(e.g. from iostat) shows:
. high IO wait from mostly-read activity
. disk access pattern is not entirely sequential
(so kernel readahead can't predict it but postgresql can)
. sufficient spare idle CPU to run the librt pthreads
or, stated another way, the CPU subsystem is relatively powerful
compared to the disk subsystem.
In such ideal conditions, and with a workload with plenty of index scans,
around 10% - 20% improvement in throughput has been achieved.
In an admittedly extreme environment measured by this author, with a workload
consisting of 8 client applications each running similar complex queries
(same query structure but different predicates and constants),
including 2 Bitmap Index Scans and 17 non-bitmap index scans,
on a dual-core Intel laptop (4 hyperthreads) with the database on a single
USB3-attached 500GB disk drive, and no part of the database in filesystem buffers
initially (filesystem freshly mounted), comparing an unpatched build
using posix_fadvise with effective_io_concurrency 4 against the same build patched
with async IO, effective_io_concurrency 4 and max_async_io_prefetchers 32,
elapsed time repeatably improved from around 640-670 seconds to around 530-550 seconds,
a 17% - 18% improvement.
The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
. probably higher CPU utilization:
Firstly, the extra work performed by the librt threads adds CPU
overhead, and secondly, if the asynchronous prefetching is effective,
then it will deliver better (greater) overlap of CPU with IO, which
will reduce elapsed times and hence increase CPU utilization percentage
still more (during that shorter elapsed time).
. more context switching, because of the additional threads.
Statistics:
___________
A number of additional statistics relating to effectiveness of asynchronous IO
are provided as an extension of the existing pg_stat_statements loadable module.
Refer to the appendix "Additional Supplied Modules" in the current
PostgreSQL Documentation for details of this module.
The following additional statistics are provided for asynchronous IO prefetching:
. aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
. aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
. aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
. aio_read_noblok : number of prefetches for which no available BufferAiocb control block
. aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
. aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
. aio_read_waited : number of aio reads for which disk block used but had to wait for it
. aio_read_ontime : number of aio reads for which disk block used and ready on time when requested
Some of these are (hopefully) self-explanatory. Some additional notes:
. aio_read_discrd and aio_read_forgot :
prefetch was wasted work since the buffer was not subsequently read
The discrd case indicates that the scanner realized this and discarded the buffer,
whereas the forgot case indicates that the scanner did not realize it,
which should not normally occur.
A high number in either suggests lowering effective_io_concurrency.
. aio_read_noblok :
Any significant number in relation to all the other numbers indicates that
max_async_io_prefetchers should be increased.
. aio_read_waited :
The page was prefetched but the asynchronous read had not completed by the time the
scanner requested to read it. This causes extra overhead in waiting and indicates
that prefetching is not providing much, if any, benefit.
The disk subsystem may be underpowered/overloaded in relation to the available CPU power.
. aio_read_ontime :
The page was prefetched and the asynchronous read had completed by the time the
scanner requested to read it. Optimal behaviour. If this number is large
in relation to all the other numbers except (possibly) aio_read_noneed,
then prefetching is working well.
To create the extension with support for these additional statistics, use the following syntax:
CREATE EXTENSION pg_stat_statements VERSION '1.3'
or, if you run the new code against an existing database which already has the extension
( see installation and migration below ), you can
ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'
A suggested set of commands for displaying these statistics might be :
/* OPTIONALLY */ DROP extension pg_stat_statements;
CREATE extension pg_stat_statements VERSION '1.3';
/* run your workload */
select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
, aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
from pg_stat_statements where shared_blks_read > 0;
Installation and Build Configuration:
_____________________________________
1. First - a prerequisite:
# as well as requiring all the usual package build tools such as gcc , make etc,
# as described in the instructions for building postgresql,
# the following is required :
gnu autoconf at version 2.69 :
# run the following command
autoconf -V
# it *must* return
autoconf (GNU Autoconf) 2.69
2. If you don't have it or it is a different version,
then you must obtain version 2.69 (which is the current version)
from your distribution provider or from the gnu software download site.
3. Also you must have the source tree for postgresql version 9.4 (development version).
# all the following commands assume your current working directory is the top of the source tree.
4. cd to top of source tree :
# check it appears to be a postgresql source tree
ls -ld configure.in src
# should show both the file and the directory
grep PostgreSQL COPYRIGHT
# should show PostgreSQL Database Management System
5. Apply the patch :
patch -b -p0 -i <patch_file_path>
# should report no errors, 43 files patched (see list at bottom of this README)
# and all hunks applied
# check the patch was applied to configure.in
ls -ld configure.in.orig configure.in
# should show both files
6. Rebuild the configure script with the patched configure.in :
mv configure configure.orig;
autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
ls -lrt configure.orig configure;
7. run the new configure script :
# if you have run configure before,
# then you may first want to save existing config.status and config.log if they exist,
# and then specify same configure flags and options as you specified before.
# the patch does not alter or extend the set of configure options
# if unsure, run ./configure --help
# if still unsure, run ./configure
./configure <other configure options as desired>
8. now check that configure decided that this environment supports asynchronous IO :
grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
# it should show
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
# if not, apparently your environment does not support asynch IO -
# the config.log will show how it came to that conclusion,
# also check for :
# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
# do not proceed without this define being set.
9. do you want to use the new code on an existing cluster
that was created using the same code base but without the patch?
If so then run this nasty-looking command :
(cut-and-paste it into a terminal window or a shell-script file)
Otherwise continue to step 10.
see Migration note below for explanation.
###############################################################################################
fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
}; done;
if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
elif [[ -a $fl ]]; then {
mv $fl $bkfl && {
sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
str="diff -w $bkfl $fl";echo "$str"; eval "$str";
};
};
else echo "ooopppss $fl is missing";
fi;
###############################################################################################
# it should report something like
diff -w Makefile.global.bak0 Makefile.global
222c222
< CFLAGS = XXXX
---
> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
# where XXXX is some set of flags
10. now run the rest of the build process as usual -
follow instructions in file INSTALL if that file exists,
else e.g. run
make && make install
If the build fails with the following error:
undefined reference to `aio_init'
Then edit the following file
src/include/pg_config_manual.h
and add the following line at the bottom:
#define DONT_HAVE_AIO_INIT
and then run
make clean && make && make install
See notes to section Runtime Configuration below for more information on this.
Migration , Runtime Configuration, and Use:
___________________________________________
Database Migration:
___________________
The new prefetching code for non-bitmap index scans introduces a new btree-index
function named btpeeknexttuple. The correct way to add such a function involves
also adding it to the catalog as an internal function in pg_proc.
However, this results in the new built code considering an existing database to be
incompatible, i.e requiring backup on the old code and restore on the new.
This is normal behaviour for migration to a new version of postgresql, and is
also a valid way of migrating a database for use with this asynchronous IO feature,
but in this case it may be inconvenient.
As an alternative, the new code may be compiled with the macro define
AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
which does what it says by not altering the catalog. The patched build can then
be run against an existing database cluster initdb'd using the unpatched build.
There are no known ill-effects of so doing, but :
. in any case, it is strongly suggested to make a backup of any precious database
before accessing it with a patched build
. be aware that if this asynchronous IO feature is eventually released as part of postgresql,
migration will probably be required anyway.
This option to avoid catalog migration is intended as a convenience for a quick test,
and also makes it easier to obtain performance comparisons on the same database.
Runtime Configuration:
______________________
One new configuration parameter settable in postgresql.conf and
in any other way as described in the postgresql documentation :
max_async_io_prefetchers
Maximum number of background processes concurrently using asynchronous
librt threads to prefetch pages into shared memory buffers
This number can be thought of as the maximum number
of librt threads concurrently active, each working on a list of
from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).
In practice, this number simply controls how many prefetch requests in total
may be active concurrently :
max_async_io_prefetchers * target_prefetch_pages ( see note 1)
default is max_connections/6
and recall that the default for max_connections is 100
note 1 a number based on effective_io_concurrency and approximately n * ln(n)
where n is effective_io_concurrency
note 2 Provided that the gnu extension to Posix AIO which provides the
aio_init() function is present, then aio_init() is called
to set the librt maximum number of threads to max_async_io_prefetchers,
and to set the maximum number of concurrent aio read requests to the product of
max_async_io_prefetchers * target_prefetch_pages
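For illustration, the aio_init() call described in note 2 amounts to
something like the following (a sketch only; the real code derives the
values from the configuration parameters at startup):

    #define _GNU_SOURCE             /* for struct aioinit / aio_init() */
    #include <aio.h>
    #include <string.h>

    void
    configure_librt_aio(int max_async_io_prefetchers, int target_prefetch_pages)
    {
        struct aioinit init;

        memset(&init, 0, sizeof(init));
        init.aio_threads = max_async_io_prefetchers;    /* librt worker threads */
        init.aio_num = max_async_io_prefetchers *
                       target_prefetch_pages;           /* concurrent requests */
        aio_init(&init);
    }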
As well as this regular configuration parameter,
there are several other parameters that can be set via environment variable.
The reason why they are environment vars rather than regular configuration parameters
is that it is not expected that they should need to be set, but they may be useful :
variable name values default meaning
PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
also numeric size of list of prefetched blocks
also whether to prefetch forward-sequential-pattern index pages
PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans
The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
It can be set to Y or N to control prefetching of non-bitmap index scans;
But in addition it can be set to an integer, which both implies Y
and also sets the size of a list used to remember prefetched but unread heap pages.
This list is an optimization used to avoid re-prefetching and maximise the potential
set of prefetchable blocks indexed by one index page.
And if set to an integer, this integer may be followed by either ,Y or ,N
to specify to prefetch index pages which are being accessed forward-sequentially.
It has been found that prefetching is not of great benefit for this access pattern,
and so it is not the default, but also does no harm (provided sufficient CPU capacity).
Usage :
______
There are no changes in usage other than as noted under Configuration and Statistics.
However, in order to assess benefit from this feature, it will be useful to
understand the query access plans of your workload using EXPLAIN. Before doing that,
make sure that statistics are up to date using ANALYZE.
Internals:
__________
Internal changes span two areas and the interface between them :
. buffer manager layer
. programming interface for scanner to call buffer manager
. scanner layer
. buffer manager layer
____________________
changes comprise :
. allocating, pinning , unpinning buffers
this is complex and discussed briefly below in "Buffer Management"
. acquiring and releasing a BufferAiocb, the control block
associated with a single aio_read, and checking for its completion
a new file, backend/storage/buffer/buf_async.c, provides three new functions,
BufStartAsync BufReleaseAsync BufCheckAsync
which handle this.
. calling librt asynch io functions
this follows the example of all other filesystem interfaces
and is straightforward.
two new functions are provided in fd.c:
FileStartaio FileCompleteaio
and corresponding interfaces in smgr.c
. programming interface for scanner to call buffer manager
________________________________________________________
. calling interface for existing function PrefetchBuffer is modified :
. one new argument, BufferAccessStrategy strategy
. now returns an int return code which indicates :
whether pin count on buffer has been increased by 1
whether block was already present in a buffer
. new function DiscardBuffer
. discard buffer used for a previously prefetched page
which scanner decides it does not want to read.
. same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
. note - this is different from the existing function ReleaseBuffer
in that ReleaseBuffer takes a buffer_descriptor as argument
for a buffer which has been read, but has similar purpose.
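As a rough illustration (a sketch, not actual patch code), a scanner
uses this interface in the following pattern; "PrefetchList" and its
helper functions are hypothetical bookkeeping standing in for whatever
each real scanner keeps, as described in the next section:

    static void
    prefetch_then_read(Relation rel, BlockNumber nextblk,
                       BufferAccessStrategy strategy, PrefetchList *prefetched)
    {
        Buffer  buf;
        int     rc;

        /* issue the prefetch well before the block is needed */
        rc = PrefetchBuffer(rel, MAIN_FORKNUM, nextblk, strategy);
        if (rc != 0)            /* assumed: pin taken or already in a buffer */
            prefetch_list_remember(prefetched, nextblk);

        /* ... later, when the scanner actually needs the block ... */
        buf = ReadBuffer(rel, nextblk);
        prefetch_list_forget(prefetched, nextblk);

        /* use the page, then release it as usual */
        ReleaseBuffer(buf);
    }

    /* and at end of scan, for every remembered (prefetched but unread) block:
     *     DiscardBuffer(rel, MAIN_FORKNUM, blk);
     */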
. scanner layer
_____________
common to all scanners is that the scanner which wishes to prefetch must do two things:
. decide which pages to prefetch and call PrefetchBuffer to prefetch them
nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
. remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
removing each page from this list if and when it subsequently reads the page.
. at end of scan, call DiscardBuffer for every remembered (i.e. prefetched not unread) page
how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
. bitmap index scan - heap pages
. non-bitmap (i.e. simple) index scans - index pages
. non-bitmap (i.e. simple) index scans - heap pages
. simple heap scans
The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
. counted in aio_read_forgot (see "Statistics" above)
. may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
(the buffer is released at commit)
This does sometimes happen ...
Buffer Management
_________________
With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
but also every other part of buffer manager must know about the possibility that a buffer may be in
a state of async_io_in_progress state and be prepared to determine the possible completion.
That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
a separate set of shared control blocks, the BufferAiocb list -
refer to include/storage/buf_internals.h
Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
Read the commentary for this function for more details.
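The underlying completion-check idiom is plain POSIX AIO (illustration
only, not the patch's code): poll with aio_error(), and only block in
aio_suspend() when the read, whether started by this backend or another,
is still in progress:

    #include <aio.h>
    #include <errno.h>

    static ssize_t
    wait_for_aio(struct aiocb *cb)
    {
        const struct aiocb *list[1];

        if (aio_error(cb) == EINPROGRESS)
        {
            list[0] = cb;
            aio_suspend(list, 1, NULL);     /* block until it completes */
        }
        if (aio_error(cb) != 0)
            return -1;                      /* the read failed */
        return aio_return(cb);              /* bytes read; retires the request */
    }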
Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
When a backend BK2 requests ReadBuffer of a page for which asynch read is in progress,
buffer manager has to determine which backend BK1 pinned this buffer during the previous PrefetchBuffer,
and, for example, must not pin it a second time if BK2 is BK1.
Information concerning which backend initiated the prefetch is held in the BufferAiocb.
The trickiest case concerns the scenario in which :
. BK1 initiates prefetch and acquires a pin
. BK2 possibly waits for completion and then reads the buffer, and perhaps later on
releases it by ReleaseBuffer.
. Since the asynchronous IO is no longer in progress, there is no longer any
BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
"prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
. The solution to this problem is to invent the concept of a "banked" pin,
which is a pin obtained when the prefetch was issued, identified as being in "banked" status only if and when
the associated asynchronous IO terminates, and redeemable by the next use by same task,
either by ReadBuffer or DiscardBuffer.
The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
is stored in the buffer descriptor.
This is done without increasing size of the buffer descriptor, which is important since
there may be a very large number of these. This does overload the relevant field in the descriptor.
Refer to include/storage/buf_internals.h for more details
and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
______________________________________________________________________________
The following 43 files are changed in this feature (output of the patch command) :
patching file configure.in
patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
patching file contrib/pg_stat_statements/Makefile
patching file contrib/pg_stat_statements/pg_stat_statements.c
patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
patching file config/c-library.m4
patching file src/backend/postmaster/postmaster.c
patching file src/backend/executor/nodeBitmapHeapscan.c
patching file src/backend/executor/nodeIndexscan.c
patching file src/backend/executor/instrument.c
patching file src/backend/storage/buffer/Makefile
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/buf_async.c
patching file src/backend/storage/buffer/buf_init.c
patching file src/backend/storage/smgr/md.c
patching file src/backend/storage/smgr/smgr.c
patching file src/backend/storage/file/fd.c
patching file src/backend/storage/lmgr/proc.c
patching file src/backend/access/heap/heapam.c
patching file src/backend/access/heap/syncscan.c
patching file src/backend/access/index/indexam.c
patching file src/backend/access/index/genam.c
patching file src/backend/access/nbtree/nbtsearch.c
patching file src/backend/access/nbtree/nbtinsert.c
patching file src/backend/access/nbtree/nbtpage.c
patching file src/backend/access/nbtree/nbtree.c
patching file src/backend/nodes/tidbitmap.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/mmgr/aset.c
patching file src/include/executor/instrument.h
patching file src/include/storage/bufmgr.h
patching file src/include/storage/smgr.h
patching file src/include/storage/fd.h
patching file src/include/storage/buf_internals.h
patching file src/include/catalog/pg_am.h
patching file src/include/catalog/pg_proc.h
patching file src/include/pg_config_manual.h
patching file src/include/access/nbtree.h
patching file src/include/access/heapam.h
patching file src/include/access/relscan.h
patching file src/include/nodes/tidbitmap.h
patching file src/include/utils/rel.h
patching file src/include/pg_config.h.in
Future Possibilities:
____________________
There are several possible extensions of this feature :
. Extend prefetching of index scans to types of index
other than B-tree.
This should be fairly straightforward, but requires some
good base of benchmarkable workloads to prove the value.
. Investigate why asynchronous IO prefetching does not greatly
improve sequential relation heap scans and possibly find how to
achieve a benefit.
. Build knowledge of asynchronous IO prefetching into the
Query Planner costing.
This is far from straightforward. The Postgresql Query Planner's
costing model is based on resource consumption rather than elapsed time.
Use of asynchronous IO prefetching is intended to improve elapsed time
at the expense of (probably) higher resource consumption.
Although Costing understands about the reduced cost of reading buffered
blocks, it does not take asynchronicity or overlap of CPU with disk
into account. A naive approach might be to try to tweak the Query
Planner's Cost Constant configuration parameters
such as seq_page_cost , random_page_cost
but this is hazardous as explained in the Documentation.
John Lumby, johnlumby(at)hotmail(dot)com
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?).

Right.
So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
reason to not do so in all circumstances though, assuming that it's
desirable to do so as soon as possible (which I *don't* actually
assume). If I'm not mistaken, it's also true that it would be strictly
speaking correct to never do it there. Do you think it would be a fair
stress-test if I was to hack Postgres so that this call never happens
(within _bt_moveright())? I'd also have an incomplete page split occur
at random with a probability of, say, 1% per split. The mechanism
would be much the same as your original test-case for the split patch
- I'd throw an error at the wrong place, although only 1% of the time,
and over many hours.
The concern with the _bt_moveright() call of _bt_finish_split() is
that it might conceal a problem without reliably fixing it,
potentially making isolating that problem much harder.
--
Peter Geoghegan
On 05/28/2014 02:15 AM, Peter Geoghegan wrote:
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?).

Right.
So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
reason to not do so in all circumstances though, assuming that it's
desirable to do so as soon as possible (which I *don't* actually
assume).
You can't write in a hot standby, so that's one reason to only do it
when inserting, not when querying. Even when you're not in a hot
standby, it seems like a nice property that a read-only query doesn't do
writes. I know we do hint bit updates and other kinds of write-action
when reading anyway, but still.
If I'm not mistaken, it's also true that it would be strictly
speaking correct to never do it there.
Hmm. No, it wouldn't. It is not allowed to have two incompletely-split
pages adjacent to each other. If you move right to the right-half of an
incomplete split, i.e. a page that does not have a downlink in its
parent, and then try to split the page again, _bt_insert_parent() would
fail to find the location to insert the new downlink to. It assumes that
there is a downlink to the page being split in the parent, and uses that
to find the location for the new downlink.
- Heikki
The patch is attached.
It is based on clone of today's 9.4dev source.
I have noticed that this source is
(not surprisingly) quite a moving target at present,
meaning that this patch becomes stale quite quickly.
So although this copy is fine for reviewing,
it may quite probably soon no longer apply cleanly
to the current source tree.
As mentioned before, if anyone wishes to try this feature out
on 9.3.4, I will be making a patch for that soon
which I can supply on request.
John Lumby
Attachments:
postgresql-9.4.140528.async_io_prefetching.patch (application/octet-stream)
--- configure.in.orig 2014-05-28 08:29:09.146829394 -0400
+++ configure.in 2014-05-28 16:45:42.406505235 -0400
@@ -1771,6 +1771,12 @@ operating system; use --disable-thread-
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+ AC_DEFINE(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP, 1, [Define to select librt-style async io and the gcc atomic compare_and_swap.])
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
--- contrib/pg_stat_statements/pg_stat_statements--1.3.sql.orig 2014-05-28 08:50:32.110559768 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.3.sql 2014-05-28 16:45:42.570505896 -0400
@@ -0,0 +1,52 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_stat_statements VERSION '1.3'" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_stat_statements_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+-- Register a view on the function for ease of use.
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
+
+-- Don't want this to be available to non-superusers.
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
--- contrib/pg_stat_statements/Makefile.orig 2014-05-28 08:29:09.166829383 -0400
+++ contrib/pg_stat_statements/Makefile 2014-05-28 16:45:42.590505977 -0400
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
OBJS = pg_stat_statements.o
EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
+DATA = pg_stat_statements--1.3.sql pg_stat_statements--1.2--1.3.sql \
+ pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
ifdef USE_PGXS
--- contrib/pg_stat_statements/pg_stat_statements.c.orig 2014-05-28 08:29:09.166829383 -0400
+++ contrib/pg_stat_statements/pg_stat_statements.c 2014-05-28 16:45:42.630506138 -0400
@@ -117,6 +117,7 @@ typedef enum pgssVersion
PGSS_V1_0 = 0,
PGSS_V1_1,
PGSS_V1_2
+ ,PGSS_V1_3
} pgssVersion;
/*
@@ -148,6 +149,16 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ int64 aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb control block */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which in-progress aio cancelled and disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -274,7 +285,7 @@ void _PG_init(void);
void _PG_fini(void);
PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
-PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
PG_FUNCTION_INFO_V1(pg_stat_statements);
static void pgss_shmem_startup(void);
@@ -1026,7 +1037,25 @@ pgss_ProcessUtility(Node *parsetree, con
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage_start.aio_read_noneed;
+ bufusage.aio_read_discrd =
+ pgBufferUsage.aio_read_discrd - bufusage_start.aio_read_discrd;
+ bufusage.aio_read_forgot =
+ pgBufferUsage.aio_read_forgot - bufusage_start.aio_read_forgot;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage_start.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage_start.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage_start.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage_start.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage_start.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -1041,6 +1070,7 @@ pgss_ProcessUtility(Node *parsetree, con
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1224,6 +1254,16 @@ pgss_store(const char *query, uint32 que
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_discrd += bufusage->aio_read_discrd;
+ e->counters.aio_read_forgot += bufusage->aio_read_forgot;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1257,7 +1297,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
#define PG_STAT_STATEMENTS_COLS_V1_0 14
#define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS_V1_2 19
-#define PG_STAT_STATEMENTS_COLS 19 /* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_3 27
+#define PG_STAT_STATEMENTS_COLS 27 /* maximum of above */
/*
* Retrieve statement statistics.
@@ -1270,6 +1311,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
* function. Unfortunately we weren't bright enough to do that for 1.1.
*/
Datum
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+{
+ bool showtext = PG_GETARG_BOOL(0);
+
+ pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
+
+ return (Datum) 0;
+}
+
+Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
bool showtext = PG_GETARG_BOOL(0);
@@ -1358,6 +1409,10 @@ pg_stat_statements_internal(FunctionCall
if (api_version != PGSS_V1_2)
elog(ERROR, "incorrect number of output arguments");
break;
+ case PG_STAT_STATEMENTS_COLS_V1_3:
+ if (api_version != PGSS_V1_3)
+ elog(ERROR, "incorrect number of output arguments");
+ break;
default:
elog(ERROR, "incorrect number of output arguments");
}
@@ -1534,11 +1589,24 @@ pg_stat_statements_internal(FunctionCall
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
+
+ if (api_version >= PGSS_V1_3)
+ {
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_discrd);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_forgot);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+ }
}
Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
+ api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
-1 /* fail if you forget to update this assert */ ));
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql.orig 2014-05-28 08:50:32.110559768 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql 2014-05-28 16:45:42.658506251 -0400
@@ -0,0 +1,51 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
--- postgresql-prefetching-asyncio.README.orig 2014-05-28 09:18:59.386318460 -0400
+++ postgresql-prefetching-asyncio.README 2014-05-28 16:45:42.702506429 -0400
@@ -0,0 +1,542 @@
+Postgresql -- Extended Prefetching using Asynchronous IO
+============================================================
+
+Postgresql currently (9.3.4) provides a limited prefetching capability
+using posix_fadvise to give hints to the Operating System kernel
+about which pages it expects to read in the near future.
+This capability is used only during the heap-scan phase of bitmap-index scans.
+It is controlled via the effective_io_concurrency configuration parameter.
+
+This capability is now extended in two ways :
+ . use asynchronous IO into Postgresql shared buffers as an
+ alternative to posix_fadvise
+ . Implement prefetching in other types of scan :
+ . non-bitmap (i.e. simple) index scans - index pages
+ currently only for B-tree indexes.
+ (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
+ . non-bitmap (i.e. simple) index scans - heap pages
+ currently only for B-tree indexes.
+ . simple heap scans
+
+Posix asynchronous IO is chosen as the function library for asynchronous IO,
+since this is well supported and also fits very well with the model of
+the prefetching process, particularly as regards checking for completion
+of an asynchronous read. On linux, Posix asynchronous IO is provided
+in the librt library. librt uses independently-schedulable threads to
+achieve the asynchronicity, rather than kernel functionality.
+
+In this implementation, use of asynchronous IO is limited to prefetching
+while performing one of the three types of scan
+ . B-tree bitmap index scan - heap pages (as already exists)
+ . B-tree non-bitmap (i.e. simple) index scans - index and heap pages
+ . simple heap scans
+on permanent relations. It is not used on temporary tables nor for writes.
+
+The advantages of Posix asynchronous IO into shared buffers
+compared to posix_fadvise are :
+ . Beneficial for non-sequential access patterns as well as sequential
+ . No restriction on the kinds of IO which can be used
+ (other kinds of asynchronous IO impose restrictions such as
+ buffer alignment, use of non-buffered IO).
+ . Does not interfere with standard linux kernel read-ahead functionality.
+ (It has been stated in
+ www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
+ that :
+ "the kernel stops doing read-ahead when a call to posix_fadvise comes.
+ I noticed the performance hit, and checked the kernel's code.
+ It effectively changes the prediction mode from sequential to fadvise,
+ negating the (assumed) kernel's prefetch logic")
+ . When the read request is issued after a prefetch has completed,
+ there is no delay from the kernel call that copies the page from
+ the kernel page cache into the Postgresql shared buffer,
+ since it is already there.
+ Also, in a memory-constrained environment, there is a greater
+ probability that the prefetched page will "stick" in memory
+ since the linux kernel victimizes the filesystem page cache in preference
+ to swapping out user process pages.
+ . Statistics on prefetch success can be gathered (see "Statistics" below)
+ which helps the administrator to tune the prefetching settings.
+
+These benefits are most likely to be obtained in a system whose usage profile
+(e.g. from iostat) shows:
+ . high IO wait from mostly-read activity
+ . disk access pattern is not entirely sequential
+ (so kernel readahead can't predict it but postgresql can)
+ . sufficient spare idle CPU to run the librt pthreads
+ or, stated another way, the CPU subsystem is relatively powerful
+ compared to the disk subsystem.
+In such ideal conditions, and with a workload with plenty of index scans,
+around 10% - 20% improvement in throughput has been achieved.
+In one admittedly extreme environment measured by this author, the workload
+consisted of 8 client applications each running similar complex queries
+(same query structure but different predicates and constants),
+including 2 Bitmap Index Scans and 17 non-bitmap index scans,
+on a dual-core Intel laptop (4 hyperthreads) with the database on a single
+USB3-attached 500GB disk drive, and with no part of the database in filesystem
+buffers initially (filesystem freshly mounted). Comparing the unpatched build
+using posix_fadvise with effective_io_concurrency 4 against the same build patched
+with async IO, effective_io_concurrency 4 and max_async_io_prefetchers 32,
+elapsed time repeatably improved from around 640-670 seconds to around 530-550 seconds,
+a 17% - 18% improvement.
+
+The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
+ . probably higher CPU utilization:
+ Firstly, the extra work performed by the librt threads adds CPU
+ overhead, and secondly, if the asynchronous prefetching is effective,
+ then it will deliver better (greater) overlap of CPU with IO, which
+ will reduce elapsed times and hence increase CPU utilization percentage
+ still more (during that shorter elapsed time).
+ . more context switching, because of the additional threads.
+
+
+Statistics:
+___________
+
+A number of additional statistics relating to effectiveness of asynchronous IO
+are provided as an extension of the existing pg_stat_statements loadable module.
+Refer to the appendix "Additional Supplied Modules" in the current
+PostgreSQL Documentation for details of this module.
+
+The following additional statistics are provided for asynchronous IO prefetching:
+
+ . aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
+ . aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
+ . aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
+ . aio_read_noblok : number of prefetches for which no available BufferAiocb control block
+ . aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
+ . aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
+ . aio_read_waited : number of aio reads for which disk block used but had to wait for it
+ . aio_read_ontime : number of aio reads for which disk block used and ready on time when requested
+
+Some of these are (hopefully) self-explanatory. Some additional notes:
+
+ . aio_read_discrd and aio_read_forgot :
+ the prefetch was wasted work since the buffer was not subsequently read.
+ The discrd case indicates that the scanner realized this and discarded the buffer,
+ whereas the forgot case indicates that the scanner did not realize it,
+ which should not normally occur.
+ A high number in either suggests lowering effective_io_concurrency.
+
+ . aio_read_noblok :
+ Any significant number in relation to all the other numbers indicates that
+ max_async_io_prefetchers should be increased.
+
+ . aio_read_waited :
+ The page was prefetched but the asynchronous read had not completed by the time the
+ scanner requested to read it. This causes extra overhead in waiting and indicates
+ that prefetching is not providing much, if any, benefit.
+ The disk subsystem may be underpowered/overloaded in relation to the available CPU power.
+
+ . aio_read_ontime :
+ The page was prefetched and the asynchronous read had completed by the time the
+ scanner requested to read it. This is the optimal behaviour. If this number is large
+ in relation to all the other numbers except (possibly) aio_read_noneed,
+ then prefetching is working well.
+
+To create the extension with support for these additional statistics, use the following syntax:
+ CREATE EXTENSION pg_stat_statements VERSION '1.3'
+or, if you run the new code against an existing database which already has the extension
+(see Installation and Migration below), you can
+ ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'
+
+A suggested set of commands for displaying these statistics might be :
+
+ /* OPTIONALLY */ DROP extension pg_stat_statements;
+ CREATE extension pg_stat_statements VERSION '1.3';
+ /* run your workload */
+ select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
+ , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
+ from pg_stat_statements where shared_blks_read > 0;
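+
+ (If the extension was already installed before a test run, it may also be convenient
+ to clear the previous counters first with
+ select pg_stat_statements_reset();
+ noting that, per the REVOKE in the install script, this function is not available
+ to non-superusers.)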
+
+
+Installation and Build Configuration:
+_____________________________________
+
+1. First - a prerequisite:
+# as well as requiring all the usual package build tools such as gcc , make etc,
+# as described in the instructions for building postgresql,
+# the following is required :
+ gnu autoconf at version 2.69 :
+# run the following command
+autoconf -V
+# it *must* return
+autoconf (GNU Autoconf) 2.69
+
+2. If you don't have it or it is a different version,
+then you must obtain version 2.69 (which is the current version)
+from your distribution provider or from the gnu software download site.
+
+3. Also you must have the source tree for postgresql version 9.4 (development version).
+# all the following commands assume your current working directory is the top of the source tree.
+
+4. cd to top of source tree :
+# check it appears to be a postgresql source tree
+ls -ld configure.in src
+# should show both the file and the directory
+grep PostgreSQL COPYRIGHT
+# should show PostgreSQL Database Management System
+
+5. Apply the patch :
+patch -b -p0 -i <patch_file_path>
+# should report no errors, 43 files patched (see list at bottom of this README)
+# and all hunks applied
+# check the patch was applied to configure.in
+ls -ld configure.in.orig configure.in
+# should show both files
+
+6. Rebuild the configure script with the patched configure.in :
+mv configure configure.orig;
+autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
+ls -lrt configure.orig configure;
+
+7. run the new configure script :
+# if you have run configure before,
+# then you may first want to save existing config.status and config.log if they exist,
+# and then specify same configure flags and options as you specified before.
+# the patch does not alter or extend the set of configure options
+# if unsure, run ./configure --help
+# if still unsure, run ./configure
+./configure <other configure options as desired>
+
+
+
+8. now check that configure decided that this environment supports asynchronous IO :
+grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
+# it should show
+#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
+# if not, apparently your environment does not support asynch IO -
+# the config.log will show how it came to that conclusion,
+# also check for :
+# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
+# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
+# do not proceed without this define being set.
+
+9. do you want to use the new code on an existing cluster
+ that was created using the same code base but without the patch?
+ If so then run this nasty-looking command :
+ (cut-and-paste it into a terminal window or a shell-script file)
+ Otherwise continue to step 10.
+ see Migration note below for explanation.
+###############################################################################################
+ fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx -lt 200 ]]; do {
+ bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
+ }; done;
+ if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
+ elif [[ -a $fl ]]; then {
+ mv $fl $bkfl && {
+ sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
+ str="diff -w $bkfl $fl";echo "$str"; eval "$str";
+ };
+ };
+ else echo "ooopppss $fl is missing";
+ fi;
+###############################################################################################
+# it should report something like
+diff -w Makefile.global.bak0 Makefile.global
+222c222
+< CFLAGS = XXXX
+---
+> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+# where XXXX is some set of flags
+
+
+10. now run the rest of the build process as usual -
+ follow instructions in file INSTALL if that file exists,
+ else e.g. run
+make && make install
+
+If the build fails with the following error:
+undefined reference to `aio_init'
+Then edit the following file
+src/include/pg_config_manual.h
+and add the following line at the bottom:
+
+#define DONT_HAVE_AIO_INIT
+
+and then run
+make clean && make && make install
+See notes to section Runtime Configuration below for more information on this.
+
+
+
+Migration , Runtime Configuration, and Use:
+___________________________________________
+
+
+Database Migration:
+___________________
+
+The new prefetching code for non-bitmap index scans introduces a new btree-index
+function named btpeeknexttuple. The correct way to add such a function involves
+also adding it to the catalog as an internal function in pg_proc.
+However, this results in the new built code considering an existing database to be
+incompatible, i.e. requiring backup on the old code and restore on the new.
+This is normal behaviour for migration to a new version of postgresql, and is
+also a valid way of migrating a database for use with this asynchronous IO feature,
+but in this case it may be inconvenient.
+
+As an alternative, the new code may be compiled with the macro define
+AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+which does what it says by not altering the catalog. The patched build can then
+be run against an existing database cluster initdb'd using the unpatched build.
+
+There are no known ill-effects of so doing, but :
+ . in any case, it is strongly suggested to make a backup of any precious database
+ before accessing it with a patched build
+ . be aware that if this asynchronous IO feature is eventually released as part of postgresql,
+ migration will probably be required anyway.
+
+This option to avoid catalog migration is intended as a convenience for a quick test,
+and also makes it easier to obtain performance comparisons on the same database.
+
+
+
+Runtime Configuration:
+______________________
+
+One new configuration parameter is added. It is settable in postgresql.conf and
+in any of the other ways described in the postgresql documentation :
+
+max_async_io_prefetchers
+ Maximum number of background processes concurrently using asynchronous
+ librt threads to prefetch pages into shared memory buffers
+
+This number can be thought of as the maximum number
+of librt threads concurrently active, each working on a list of
+from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).
+
+In practice, this number simply controls how many prefetch requests in total
+may be active concurrently :
+ max_async_io_prefetchers * target_prefetch_pages ( see note 1)
+
+The default is max_connections/6
+(and recall that the default for max_connections is 100).
+
+
+note 1 : target_prefetch_pages is derived from effective_io_concurrency and is approximately n * ln(n),
+ where n is effective_io_concurrency
+
+note 2 Provided that the gnu extension to Posix AIO which provides the
+aio_init() function is present, then aio_init() is called
+to set the librt maximum number of threads to max_async_io_prefetchers,
+and to set the maximum number of concurrent aio read requests to the product of
+ max_async_io_prefetchers * target_prefetch_pages
+
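+As a rough worked example of this product (using the approximate n * ln(n)
+relation of note 1, so the numbers are illustrative only) : with
+effective_io_concurrency = 10, target_prefetch_pages is about 10 * ln(10),
+i.e. roughly 23, so with max_async_io_prefetchers = 32 up to roughly
+32 * 23 = 736 asynchronous reads could be in flight at one time.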
+
+As well as this regular configuration parameter,
+there are several other parameters that can be set via environment variables.
+The reason they are environment variables rather than regular configuration parameters
+is that they are not expected to need setting, but they may be useful :
+ variable name values default meaning
+ PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
+ PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
+ also numeric size of list of prefetched blocks
+ also whether to prefetch forward-sequential-pattern index pages
+ PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
+ PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans
+
+
+The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
+It can be set to Y or N to control prefetching of non-bitmap index scans;
+But in addition it can be set to an integer, which both implies Y
+and also sets the size of a list used to remember prefetched but unread heap pages.
+This list is an optimization used to avoid re-prefetching and maximise the potential
+set of prefetchable blocks indexed by one index page.
+And if set to an integer, this integer may be followed by either ,Y or ,N
+to specify to prefetch index pages which are being accessed forward-sequentially.
+It has been found that prefetching is not of great benefit for this access pattern,
+and so it is not the default, but also does no harm (provided sufficient CPU capacity).
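+
+As an illustration of this syntax (not a recommendation of particular values),
+starting the postmaster with PG_TRY_PREFETCHING_FOR_ISCAN=512,Y in its environment
+would enable prefetching for non-bitmap index scans, use a 512-entry list of
+prefetched-but-unread heap blocks, and also prefetch index pages which are being
+accessed forward-sequentially.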
+
+
+
+Usage:
+______
+
+
+There are no changes in usage other than as noted under Configuration and Statistics.
+However, in order to assess benefit from this feature, it will be useful to
+understand the query access plans of your workload using EXPLAIN. Before doing that,
+make sure that statistics are up to date using ANALYZE.
+
+
+
+Internals:
+__________
+
+
+Internal changes span two areas and the interface between them :
+
+ . buffer manager layer
+ . programming interface for scanner to call buffer manager
+ . scanner layer
+
+ . buffer manager layer
+ ____________________
+
+ changes comprise :
+ . allocating, pinning , unpinning buffers
+ this is complex and discussed briefly below in "Buffer Management"
+ . acquiring and releasing a BufferAiocb, the control block
+ associated with a single aio_read, and checking for its completion.
+ a new file, backend/storage/buffer/buf_async.c, provides three new functions,
+ BufStartAsync, BufReleaseAsync and BufCheckAsync,
+ which handle this.
+ . calling librt asynch io functions
+ this follows the example of all other filesystem interfaces
+ and is straightforward.
+ two new functions are provided in fd.c,
+ FileStartaio and FileCompleteaio,
+ with corresponding interfaces in smgr.c
+
+ . programming interface for scanner to call buffer manager
+ ________________________________________________________
+ . calling interface for existing function PrefetchBuffer is modified :
+ . one new argument, BufferAccessStrategy strategy
+ . now returns an int return code which indicates :
+ whether pin count on buffer has been increased by 1
+ whether block was already present in a buffer
+ . new function DiscardBuffer
+ . discards the buffer used for a previously prefetched page
+ which the scanner decides it does not want to read.
+ . same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
+ . note - this is different from the existing function ReleaseBuffer
+ in that ReleaseBuffer takes a buffer_descriptor as argument
+ for a buffer which has been read, but has similar purpose.
+
+ . scanner layer
+ _____________
+ common to all scanners is that a scanner which wishes to prefetch must do the following
+ (a minimal sketch of this calling pattern appears after this list):
+ . decide which pages to prefetch and call PrefetchBuffer to prefetch them
+ nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
+ . remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
+ removing each page from this list if and when it subsequently reads the page.
+ . at end of scan, call DiscardBuffer for every remembered (i.e. prefetched but not read) page
+ how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
+ . bitmap index scan - heap pages
+ . non-bitmap (i.e. simple) index scans - index pages
+ . non-bitmap (i.e. simple) index scans - heap pages
+ . simple heap scans
+ The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
+ . counted in aio_read_forgot (see "Statistics" above)
+ . may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
+ (the buffer is released at commit)
+ This does sometimes happen ...
+
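+ The following fragment is a minimal sketch of that calling pattern, not taken
+ verbatim from any of the patched scanners; "scan", "blkno", "strategy" and
+ MAX_REMEMBERED stand in for whatever the particular scanner uses, and the real
+ scanners keep their own list structures as described above.
+
+    BlockNumber remembered[MAX_REMEMBERED]; /* prefetched but not yet read */
+    int         nremembered = 0;
+    int         rc;
+    int         i;
+
+    /* the scanner decides it will soon want block blkno of the scanned heap */
+    rc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, blkno, strategy);
+    if (rc & PREFTCHRC_BUF_PIN_INCREASED)
+        remembered[nremembered++] = blkno;  /* a pin we must later redeem */
+
+    /* ... later, if and when the scanner actually reads the page with
+    ** ReadBuffer / ReadBufferExtended, it removes blkno from remembered[] ...
+    */
+
+    /* at end of scan, discard every prefetched page that was never read */
+    for (i = 0; i < nremembered; i++)
+        DiscardBuffer(scan->rs_rd, MAIN_FORKNUM, remembered[i]);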
+
+
+Buffer Management
+_________________
+
+With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
+but in addition every other part of the buffer manager must know that a buffer may be in
+an async-io-in-progress state and be prepared to detect and handle completion of that io.
+That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
+Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
+namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
+and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+a separate set of shared control blocks, the BufferAiocb list -
+refer to include/storage/buf_internals.h
+Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
+Read the commentary for this function for more details.
+
+Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
+and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
+When a backend BK2 requests ReadBuffer of a page for which an asynch read is in progress,
+the buffer manager has to determine which backend BK1 pinned this buffer during the previous PrefetchBuffer,
+and, for example, must not pin the buffer a second time if BK2 is BK1.
+Information concerning which backend initiated the prefetch is held in the BufferAiocb.
+
+The trickiest case concerns the scenario in which :
+ . BK1 initiates prefetch and acquires a pin
+ . BK2 possibly waits for completion and then reads the buffer, and perhaps later on
+ releases it by ReleaseBuffer.
+ . Since the asynchronous IO is no longer in progress, there is no longer any
+ BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
+ "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
+ . The solution to this problem is to invent the concept of a "banked" pin,
+ which is a pin obtained when the prefetch was issued, identified as being in "banked" status only if and when
+ the associated asynchronous IO terminates, and redeemable by the next use by the same task,
+ either by ReadBuffer or DiscardBuffer.
+ The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
+ is stored in the buffer descriptor.
+ This is done without increasing size of the buffer descriptor, which is important since
+ there may be a very large number of these. This does overload the relevant field in the descriptor.
+ Refer to include/storage/buf_internals.h for more details
+ and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
+
+______________________________________________________________________________
+The following 43 files are changed in this feature (output of the patch command) :
+
+patching file configure.in
+patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
+patching file contrib/pg_stat_statements/Makefile
+patching file contrib/pg_stat_statements/pg_stat_statements.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
+patching file config/c-library.m4
+patching file src/backend/postmaster/postmaster.c
+patching file src/backend/executor/nodeBitmapHeapscan.c
+patching file src/backend/executor/nodeIndexscan.c
+patching file src/backend/executor/instrument.c
+patching file src/backend/storage/buffer/Makefile
+patching file src/backend/storage/buffer/bufmgr.c
+patching file src/backend/storage/buffer/buf_async.c
+patching file src/backend/storage/buffer/buf_init.c
+patching file src/backend/storage/smgr/md.c
+patching file src/backend/storage/smgr/smgr.c
+patching file src/backend/storage/file/fd.c
+patching file src/backend/storage/lmgr/proc.c
+patching file src/backend/access/heap/heapam.c
+patching file src/backend/access/heap/syncscan.c
+patching file src/backend/access/index/indexam.c
+patching file src/backend/access/index/genam.c
+patching file src/backend/access/nbtree/nbtsearch.c
+patching file src/backend/access/nbtree/nbtinsert.c
+patching file src/backend/access/nbtree/nbtpage.c
+patching file src/backend/access/nbtree/nbtree.c
+patching file src/backend/nodes/tidbitmap.c
+patching file src/backend/utils/misc/guc.c
+patching file src/backend/utils/mmgr/aset.c
+patching file src/include/executor/instrument.h
+patching file src/include/storage/bufmgr.h
+patching file src/include/storage/smgr.h
+patching file src/include/storage/fd.h
+patching file src/include/storage/buf_internals.h
+patching file src/include/catalog/pg_am.h
+patching file src/include/catalog/pg_proc.h
+patching file src/include/pg_config_manual.h
+patching file src/include/access/nbtree.h
+patching file src/include/access/heapam.h
+patching file src/include/access/relscan.h
+patching file src/include/nodes/tidbitmap.h
+patching file src/include/utils/rel.h
+patching file src/include/pg_config.h.in
+
+
+Future Possibilities:
+____________________
+
+There are several possible extensions of this feature :
+ . Extend prefetching of index scans to types of index
+ other than B-tree.
+ This should be fairly straightforward, but requires some
+ good base of benchmarkable workloads to prove the value.
+ . Investigate why asynchronous IO prefetching does not greatly
+ improve sequential relation heap scans and possibly find how to
+ achieve a benefit.
+ . Build knowledge of asynchronous IO prefetching into the
+ Query Planner costing.
+ This is far from straightforward. The Postgresql Query Planner's
+ costing model is based on resource consumption rather than elapsed time.
+ Use of asynchronous IO prefetching is intended to improve elapsed time
+ at the expense of (probably) higher resource consumption.
+ Although Costing understands about the reduced cost of reading buffered
+ blocks, it does not take asynchronicity or overlap of CPU with disk
+ into account. A naive approach might be to try to tweak the Query
+ Planner's Cost Constant configuration parameters
+ such as seq_page_cost , random_page_cost
+ but this is hazardous as explained in the Documentation.
+
+
+
+John Lumby, johnlumby(at)hotmail(dot)com
--- config/c-library.m4.orig 2014-05-28 08:29:09.142829396 -0400
+++ config/c-library.m4 2014-05-28 16:45:42.746506606 -0400
@@ -367,3 +367,50 @@ if test "$pgac_cv_type_locale_t" = 'yes
AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
[Define to 1 if `locale_t' requires <xlocale.h>.])
fi])])# PGAC_HEADER_XLOCALE
+
+
+# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+# ---------------------------------------
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+#
+AC_DEFUN([PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP],
+[AC_MSG_CHECKING([whether have both librt-style async io and the gcc atomic compare_and_swap])
+AC_CACHE_VAL(pgac_cv_aio_atomic_builtin_comp_swap,
+pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+[AC_TRY_RUN([#include <stdio.h>
+#include <unistd.h>
+#include "aio.h"
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* return zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ return rc;
+}],
+[pgac_cv_aio_atomic_builtin_comp_swap=yes],
+[pgac_cv_aio_atomic_builtin_comp_swap=no],
+[pgac_cv_aio_atomic_builtin_comp_swap=cross])
+])dnl AC_CACHE_VAL
+AC_MSG_RESULT([$pgac_cv_aio_atomic_builtin_comp_swap])
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+])# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
--- src/backend/postmaster/postmaster.c.orig 2014-05-28 08:29:09.322829301 -0400
+++ src/backend/postmaster/postmaster.c 2014-05-28 16:45:42.814506880 -0400
@@ -123,6 +123,11 @@
#include "storage/spin.h"
#endif
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+void ReportFreeBAiocbs(void);
+int CountInuseBAiocbs(void);
+extern int hwmBufferAiocbs; /* high water mark of in-use BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Possible types of a backend. Beyond being the possible bkend_type values in
@@ -1493,9 +1498,15 @@ ServerLoop(void)
fd_set readmask;
int nSockets;
time_t now,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time,
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
last_touch_time;
last_touch_time = time(NULL);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time = time(NULL);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
nSockets = initMasks(&readmask);
@@ -1654,6 +1665,19 @@ ServerLoop(void)
last_touch_time = now;
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* maintain the hwm of used baiocbs every 10 seconds */
+ if ((now - count_baiocb_time) >= 10)
+ {
+ int inuseBufferAiocbs; /* current in-use BufferAiocbs in pool */
+ inuseBufferAiocbs = CountInuseBAiocbs();
+ if (inuseBufferAiocbs > hwmBufferAiocbs) {
+ hwmBufferAiocbs = inuseBufferAiocbs;
+ }
+ count_baiocb_time = now;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
@@ -3444,6 +3468,9 @@ PostmasterStateMachine(void)
signal_child(PgStatPID, SIGQUIT);
}
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ReportFreeBAiocbs();
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
}
}
--- src/backend/executor/nodeBitmapHeapscan.c.orig 2014-05-28 08:29:09.270829328 -0400
+++ src/backend/executor/nodeBitmapHeapscan.c 2014-05-28 16:45:42.834506961 -0400
@@ -34,6 +34,8 @@
* ExecEndBitmapHeapScan releases all storage.
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "access/relscan.h"
#include "access/transam.h"
@@ -47,6 +49,10 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_bitmap_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
@@ -111,10 +117,21 @@ BitmapHeapNext(BitmapHeapScanState *node
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
- if (target_prefetch_pages > 0)
- {
+ if ( prefetch_bitmap_scans
+ && (target_prefetch_pages > 0)
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ /* sufficient number of blocks - at least twice the target_prefetch_pages */
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
node->prefetch_iterator = prefetch_iterator = tbm_begin_iterate(tbm);
node->prefetch_pages = 0;
+ if (prefetch_iterator) {
+ tbm_zero(prefetch_iterator); /* zero list of prefetched and unread blocknos */
+ }
node->prefetch_target = -1;
}
#endif /* USE_PREFETCH */
@@ -138,12 +155,14 @@ BitmapHeapNext(BitmapHeapScanState *node
}
#ifdef USE_PREFETCH
+ if (prefetch_iterator) {
if (node->prefetch_pages > 0)
{
/* The main iterator has closed the distance by one page */
node->prefetch_pages--;
+ tbm_subtract(prefetch_iterator, tbmres->blockno); /* remove this blockno from list of prefetched and unread blocknos */
}
- else if (prefetch_iterator)
+ else
{
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -151,6 +170,7 @@ BitmapHeapNext(BitmapHeapScanState *node
if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
+ }
#endif /* USE_PREFETCH */
/*
@@ -239,16 +259,26 @@ BitmapHeapNext(BitmapHeapScanState *node
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ int PrefetchBufferRc; /* return value from PrefetchBuffer - refer to bufmgr.h */
+
if (tbmpre == NULL)
{
/* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = prefetch_iterator = NULL;
+ /* let ExecEndBitmapHeapScan terminate the prefetch_iterator
+ ** tbm_end_iterate(prefetch_iterator);
+ ** node->prefetch_iterator = NULL;
+ */
+ prefetch_iterator = NULL;
break;
}
node->prefetch_pages++;
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno , 0);
+ /* add this blockno to list of prefetched and unread blocknos
+ ** if pin count did not increase then indicate so in the Unread_Pfetched list
+ */
+ tbm_add(prefetch_iterator
+ ,( (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) ? tbmpre->blockno : InvalidBlockNumber ) );
}
}
#endif /* USE_PREFETCH */
@@ -482,12 +512,31 @@ ExecEndBitmapHeapScan(BitmapHeapScanStat
{
Relation relation;
HeapScanDesc scanDesc;
+ TBMIterator *prefetch_iterator;
/*
* extract information from the node
*/
relation = node->ss.ss_currentRelation;
scanDesc = node->ss.ss_currentScanDesc;
+ prefetch_iterator = node->prefetch_iterator;
+
+#ifdef USE_PREFETCH
+ /* before any other cleanup, discard any prefetched but unread buffers */
+ if (prefetch_iterator != NULL) {
+ TBMIterateResult *tbmpre = tbm_locate_IterateResult(prefetch_iterator);
+ BlockNumber *Unread_Pfetched_base = tbmpre->Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = tbmpre->Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = tbmpre->Unread_Pfetched_count;
+
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scanDesc->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* Free the exprcontext
--- src/backend/executor/nodeIndexscan.c.orig 2014-05-28 08:29:09.270829328 -0400
+++ src/backend/executor/nodeIndexscan.c 2014-05-28 16:45:42.858507057 -0400
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* whether, and with what block-list size, to prefetch non-bitmap index scans */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,12 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+ /* note - at this point all scan controlblock resources have been freed by IndexScanEnd called by index_endscan */
+
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +619,33 @@ ExecInitIndexScan(IndexScan *node, EStat
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ indexstate->iss_ScanDesc->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_block_item_list = (struct pfch_block_item*)0;
+ if ( prefetch_index_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_index_page_list = palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ indexstate->iss_ScanDesc->pfch_block_item_list = palloc( prefetch_index_scans * sizeof(struct pfch_block_item) );
+ if ( ( (struct pfch_index_pagelist*)0 != indexstate->iss_ScanDesc->pfch_index_page_list )
+ && ( (struct pfch_block_item*)0 != indexstate->iss_ScanDesc->pfch_block_item_list )
+ ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = prefetch_index_scans; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_pagelist_next = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_item_count = 0;
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
--- src/backend/executor/instrument.c.orig 2014-05-28 08:29:09.266829330 -0400
+++ src/backend/executor/instrument.c 2014-05-28 16:45:42.882507154 -0400
@@ -41,6 +41,14 @@ InstrAlloc(int n, int instrument_options
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_discrd = 0;
+ instr[i].bufusage_start.aio_read_forgot = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -143,6 +151,16 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_discrd += add->aio_read_discrd - sub->aio_read_discrd;
+ dst->aio_read_forgot += add->aio_read_forgot - sub->aio_read_forgot;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
--- src/backend/storage/buffer/Makefile.orig 2014-05-28 08:29:09.330829297 -0400
+++ src/backend/storage/buffer/Makefile 2014-05-28 16:45:42.942507396 -0400
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o buf_async.o
include $(top_srcdir)/src/backend/common.mk
--- src/backend/storage/buffer/bufmgr.c.orig 2014-05-28 08:29:09.334829294 -0400
+++ src/backend/storage/buffer/bufmgr.c 2014-05-28 16:45:42.978507541 -0400
@@ -29,7 +29,7 @@
* buf_table.c -- manages the buffer lookup table
*/
#include "postgres.h"
-
+#include <sys/types.h>
#include <sys/file.h>
#include <unistd.h>
@@ -50,7 +50,6 @@
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
-
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -63,6 +62,8 @@
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+
#define DROP_RELS_BSEARCH_THRESHOLD 20
/* GUC variables */
@@ -78,26 +79,33 @@ bool track_io_timing = false;
*/
int target_prefetch_pages = 0;
-/* local state for StartBufferIO and related functions */
+/* local state for StartBufferIO and related functions
+** but ONLY for synchronous IO - not altered for aio
+*/
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+pid_t this_backend_pid = 0; /* pid of this backend */
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
-
-static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+extern int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+extern int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc, int intention
+ ,BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
-static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
-static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+ bool *hit , int index_for_aio);
+bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+void PinBuffer_Locked(volatile BufferDesc *buf);
+void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
-static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+static bool StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio );
+void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -106,24 +114,66 @@ static volatile BufferDesc *BufferAlloc(
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
+ int *foundPtr , int index_for_aio );
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static int rnode_comparator(const void *p1, const void *p2);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
- * This is named by analogy to ReadBuffer but doesn't actually allocate a
- * buffer. Instead it tries to ensure that a future ReadBuffer for the given
- * block will not be delayed by the I/O. Prefetching is optional.
+ * This is named by analogy to ReadBuffer but allocates a buffer only if using asynchronous I/O.
+ * Its purpose is to try to ensure that a future ReadBuffer for the given block
+ * will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
- */
-void
-PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
-{
+ *
+ * Originally the prefetch simply called posix_fadvise() to recommend read-ahead into kernel page cache.
+ * Extended to provide an alternative of issuing an asynchronous aio_read() to read into a buffer.
+ * This extension has an implication on how this bufmgr component manages concurrent requests
+ * for the same disk block.
+ *
+ * Synchronous IO (read()) does not provide a means for waiting on another task's read if in progress,
+ * and bufmgr implements its own scheme in StartBufferIO, WaitIO, and TerminateBufferIO.
+ *
+ * Asynchronous IO (aio_read()) provides a means for waiting on this or another task's read if in progress,
+ * namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+ * are called as part of asynchronous prefetching, their role is limited to maintaining the buffer desc flags,
+ * and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+ * a separate set of shared control blocks, the BufferAiocb list -
+ * refer to include/storage/buf_internals.h and storage/buffer/buf_init.c
+ *
+ * Another implication of asynchronous IO concerns buffer pinning.
+ * The buffer used for the prefetch is pinned before aio_read is issued.
+ * It is expected that the same task (and possibly others) will later ask to read the page
+ * and eventually release and unpin the buffer.
+ * However, if the task which issued the aio_read later decides not to read the page,
+ * and return code indicates delta_pin_count > 0 (see below)
+ * it *must* instead issue a DiscardBuffer() (see function later in this file)
+ * so that its pin is released.
+ * Therefore, each client which uses the PrefetchBuffer service must either always read all
+ * prefetched pages, or keep track of prefetched pages and discard unread ones at end of scan.
+ *
+ * return code: is an int bitmask defined in bufmgr.h
+ PREFTCHRC_BUF_PIN_INCREASED 0x01 pin count on buffer has been increased by 1
+ PREFTCHRC_BLK_ALREADY_PRESENT 0x02 block was already present in a buffer
+ *
+ * PREFTCHRC_BLK_ALREADY_PRESENT is a hint to caller that the prefetch may be unnecessary
+ */
+int
+PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy)
+{
+ Buffer buf_id; /* indicates buffer containing the requested block */
+ int PrefetchBufferRc = 0; /* return value as described above */
+ int PinCountOnEntry = 0; /* pin count on entry */
+ int PinCountdelta = 0; /* pin count delta increase */
+
+
#ifdef USE_PREFETCH
+
+ buf_id = -1;
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
@@ -145,8 +195,13 @@ PrefetchBuffer(Relation reln, ForkNumber
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
+ int BufStartAsyncrc = -1; /* retcode from BufStartAsync :
+ ** 0 if started successfully (which implies buffer was newly pinned )
+ ** -1 if failed for some reason
+ ** 1+PrivateRefCount if we found desired buffer in buffer pool
+ ** and we set it likewise if we find buffer in buffer pool
+ */
LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
@@ -158,28 +213,119 @@ PrefetchBuffer(Relation reln, ForkNumber
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ if (buf_id >= 0) {
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ BufStartAsyncrc = 1 + PinCountOnEntry; /* indicate this backends pin count - see above comment */
+ PrefetchBufferRc = PREFTCHRC_BLK_ALREADY_PRESENT; /* indicate buffer present */
+ } else {
+ PrefetchBufferRc = 0; /* indicate buffer not present */
+ }
LWLockRelease(newPartitionLock);
+ not_in_buffers:
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (buf_id < 0) {
+ /* try using async aio_read with a buffer */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BufStartAsyncrc = BufStartAsync( reln, forkNum, blockNum , strategy );
+ if (BufStartAsyncrc < 0) {
+ pgBufferUsage.aio_read_noblok++;
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP so try the alternative that does not read the block into a postgresql buffer */
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ }
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
+ if ( (buf_id >= 0) || (BufStartAsyncrc >= 1) ) {
+ /* The block *is* in buffers. */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ pgBufferUsage.aio_read_noneed++;
+#ifndef USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT /* jury is out on whether the following wins but it ought to ... */
+ /*
+ ** If this backend already had pinned it,
+ ** or another backend had banked a pin on it,
+ ** or there is an IO in progress,
+ ** or it is not marked valid,
+ ** then do nothing.
+ ** Otherwise pin it and mark the buffer's pin as banked by this backend.
+ ** Note - it may or not be pinned by another backend -
+ ** it is ok for us to bank a pin on it
+ ** *provided* the other backend did not bank its pin.
+ ** The reason for this is that the banked-pin indicator is global -
+ ** it can identify at most one process.
+ */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ if (BufStartAsyncrc == 1) { /* not pinned by me */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ /* note - all we can say with certainty is that the buffer is not pinned by me
+ ** we cannot be sure that it is still in buffer pool
+ ** so must go through the entire locking and searching all over again ...
*/
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ /* since the block is now present,
+ ** save the current pin count to ensure final delta is calculated correctly
+ */
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ if ( PinCountOnEntry == 0) { /* paranoid check it's still not pinned by me */
+ volatile BufferDesc *buf_desc;
+
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ LockBufHdr(buf_desc);
+ if ( (buf_desc->flags & BM_VALID) /* buffer is valid */
+ && (!(buf_desc->flags & (BM_IO_IN_PROGRESS|BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))) /* buffer is not any of ... */
+ ) {
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* note - we can call PinBuffer_Locked with the BM_AIO_PREFETCH_PIN_BANKED flag set because it is not yet pinned by me */
+ buf_desc->freeNext = -(this_backend_pid); /* remember which pid banked it */
+ /* pgBufferUsage.aio_read_wasted--; overload counter - not wasted after all - only for debugging */
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ PinBuffer_Locked(buf_desc);
}
-#endif /* USE_PREFETCH */
+ else {
+ UnlockBufHdr(buf_desc);
+ }
+ }
+ }
+ LWLockRelease(newPartitionLock);
+ /* although unlikely, it may have been evicted while we were re-checking above */
+ if (buf_id < 0) {
+ pgBufferUsage.aio_read_noneed--; /* back out the accounting */
+ goto not_in_buffers; /* and try again */
+ }
+ }
+#endif /* USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT */
+
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ }
+
+ if (buf_id >= 0) {
+ PinCountdelta = PrivateRefCount[buf_id] - PinCountOnEntry; /* pin count delta increase */
+ if ( (PinCountdelta < 0) || (PinCountdelta > 1) ) {
+ elog(ERROR,
+ "PrefetchBuffer #%d : incremented pin count by %d on bufdesc %p refcount %u localpins %d\n"
+ ,(buf_id+1) , PinCountdelta , &BufferDescriptors[buf_id] ,BufferDescriptors[buf_id].refcount , PrivateRefCount[buf_id]);
}
+ } else
+ if (BufStartAsyncrc == 0) { /* aio started successfully (which implies buffer was newly pinned ) */
+ PinCountdelta = 1;
+ }
+
+ /* fold the pin-count delta into the final PrefetchBufferRc */
+ PrefetchBufferRc |= PinCountdelta; /* set the PREFTCHRC_BUF_PIN_INCREASED bit */
+ }
+
+#endif /* USE_PREFETCH */
+ return PrefetchBufferRc; /* return value as described above */
+}
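As an aside on how a caller can consume the new integer return value of PrefetchBuffer, here is an illustrative sketch that is not part of the patch. It assumes the patched PrefetchBuffer(reln, forkNum, blockNum, strategy) signature and the PREFTCHRC_BUF_PIN_INCREASED bit mentioned above (both defined in bufmgr.h by this patch); the tracking structure and function name are invented. A matching teardown that calls DiscardBuffer is sketched after that function further below.

/* Sketch only - not part of the patch. Remember whether a prefetch call
 * increased this backend's pin count (i.e. left a prefetch pin behind),
 * so the scan can later discard the block if it never reads it.
 */
typedef struct PrefetchTracker
{
	BlockNumber blocknum;		/* block we prefetched, or InvalidBlockNumber */
	bool		pin_taken;		/* did the call leave a prefetch pin behind? */
} PrefetchTracker;

static void
prefetch_and_track(Relation rel, BlockNumber blkno, PrefetchTracker *t)
{
	int			rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno, NULL);

	t->blocknum = blkno;
	t->pin_taken = (rc & PREFTCHRC_BUF_PIN_INCREASED) != 0;
}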
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -252,7 +398,7 @@ ReadBufferExtended(Relation reln, ForkNu
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
+ forkNum, blockNum, mode, strategy, &hit , 0);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
@@ -280,7 +426,7 @@ ReadBufferWithoutRelcache(RelFileNode rn
Assert(InRecovery);
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
- mode, strategy, &hit);
+ mode, strategy, &hit , 0);
}
@@ -288,15 +434,18 @@ ReadBufferWithoutRelcache(RelFileNode rn
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
+ * index_for_aio, if negative, is -(index of the aiocb in the BufferAiocbs array + 3);
+ * it is passed through to StartBufferIO
*/
-static Buffer
+Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy, bool *hit , int index_for_aio )
{
volatile BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ int allocrc; /* retcode from BufferAlloc */
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -328,16 +477,40 @@ ReadBuffer_common(SMgrRelation smgr, cha
}
else
{
+ allocrc = mode; /* pass mode to BufferAlloc since it must not wait for async io if RBM_NOREAD_FOR_PREFETCH */
/*
* lookup the buffer. IO_IN_PROGRESS is set if the requested block is
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
- if (found)
- pgBufferUsage.shared_blks_hit++;
+ strategy, &allocrc , index_for_aio );
+ if (allocrc < 0) {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ bufBlock = BufHdrGetBlock(bufHdr);
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ found = true;
+ }
+ else if (allocrc > 0) {
+ pgBufferUsage.shared_blks_hit++;
+ found = true;
+ }
+ else {
pgBufferUsage.shared_blks_read++;
+ found = false;
+ }
}
/* At this point we do NOT hold any locks. */
@@ -410,7 +583,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
Assert(bufHdr->flags & BM_VALID);
bufHdr->flags &= ~BM_VALID;
UnlockBufHdr(bufHdr);
- } while (!StartBufferIO(bufHdr, true));
+ } while (!StartBufferIO(bufHdr, true, 0));
}
}
@@ -430,6 +603,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (mode != RBM_NOREAD_FOR_PREFETCH) {
if (isExtend)
{
/* new buffers are zero-filled */
@@ -499,6 +673,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
VacuumPageMiss++;
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageMiss;
+ }
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -520,21 +695,39 @@ ReadBuffer_common(SMgrRelation smgr, cha
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
- * The returned buffer is pinned and is already marked as holding the
- * desired page. If it already did have the desired page, *foundPtr is
- * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
+ * index_for_aio is an input parameter which, if non-zero, identifies a BufferAiocb
+ * acquired by the caller and to be used for any StartBufferIO performed by this routine.
+ * In this case, if the block is not found in the buffer pool and we allocate a new buffer,
+ * then we must keep the spinlock on the buffer and pass it back to the caller.
+ *
+ * foundPtr is both input and output :
+ * . input - indicates the read-buffer mode (see bufmgr.h)
+ * . output - indicates the status of the buffer - see below
+ *
+ * Except when the mode is RBM_NOREAD_FOR_PREFETCH and the buffer is found,
+ * the returned buffer is pinned and is already marked as holding the
+ * desired page.
+ * If it already did have the desired page and page content is valid,
+ * *foundPtr is set to 1
+ * If it already did have the desired page and mode is RBM_NOREAD_FOR_PREFETCH
+ * and StartBufferIO returned false
+ * (meaning it could not initialise the buffer for aio)
+ * *foundPtr is set to 2
+ * If it already did have the desired page but page content is invalid,
+ * *foundPtr is set to -1
+ * this can happen only if the buffer was read by an async read
+ * and the aio is still in progress or pinned by the issuer of the startaio.
+ * Otherwise, *foundPtr is set to 0 and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
- *
- * No locks are held either at entry or exit.
+ * No locks are held either at entry or exit EXCEPT for the case noted above
+ * of passing an empty buffer back to an async-io caller (index_for_aio set).
*/
static volatile BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ int *foundPtr , int index_for_aio )
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
@@ -546,6 +739,13 @@ BufferAlloc(SMgrRelation smgr, char relp
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ int IntentionBufferrc; /* retcode from BufCheckAsync */
+ bool StartBufferIOrc; /* retcode from StartBufferIO */
+ ReadBufferMode mode;
+
+
+ mode = *foundPtr;
+ *foundPtr = 0;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -560,21 +760,53 @@ BufferAlloc(SMgrRelation smgr, char relp
if (buf_id >= 0)
{
/*
- * Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * Found it.
*/
+ *foundPtr = 1;
buf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
+ /* If prefetch mode, then return immediately indicating found,
+ ** and NOTE that in this case only, we did not pin the buffer.
+ ** In theory we might try to check whether the buffer is valid, IO is in progress, etc.,
+ ** but in practice it is simpler to abandon the prefetch if the buffer exists.
+ */
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ /* release the mapping lock and return */
LWLockRelease(newPartitionLock);
+ } else {
+ /* note that the current request is for the same tag as the one associated with the aio -
+ ** so simply complete the aio and we have our buffer.
+ ** If an aio was started on this buffer,
+ ** check whether it has completed and wait for it if not.
+ ** And, if an aio had been started, then the task
+ ** which issued the start aio already pinned the buffer for this read,
+ ** so if that task was me and the aio was successful,
+ ** pass the current pin to this read without dropping and re-acquiring it.
+ ** All of this is done by BufCheckAsync.
+ */
+ IntentionBufferrc = BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_WANT , strategy , index_for_aio , false , newPartitionLock );
- *foundPtr = TRUE;
+ /* check to see if the correct data has been loaded into the buffer. */
+ valid = (IntentionBufferrc == BUF_INTENT_RC_VALID);
- if (!valid)
- {
+ /* check for serious IO errors */
+ if (!valid) {
+ if ( (IntentionBufferrc != BUF_INTENT_RC_INVALID_NO_AIO)
+ && (IntentionBufferrc != BUF_INTENT_RC_INVALID_AIO)
+ ) {
+ *foundPtr = -1; /* inform caller of serious error */
+ }
+ else
+ if (IntentionBufferrc == BUF_INTENT_RC_INVALID_AIO) {
+ goto proceed_with_not_found; /* yes, a goto - think of it as a break out of the enclosing if */
+ }
+ }
+
+ /* BufCheckAsync pinned the buffer */
+ /* so can now release the mapping lock */
+ LWLockRelease(newPartitionLock);
+
+ if (!valid) {
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
@@ -582,19 +814,21 @@ BufferAlloc(SMgrRelation smgr, char relp
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(buf, true, index_for_aio))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ }
}
}
return buf;
}
+ proceed_with_not_found:
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
@@ -619,8 +853,10 @@ BufferAlloc(SMgrRelation smgr, char relp
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf->flags;
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ /* If an aio was started on this buffer,
+ ** check whether it has completed and cancel it if not.
+ */
+ BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_REJECT_OBTAIN_PIN , 0 , index_for_aio, true , 0 );
/* Now it's safe to release the freelist lock */
if (lock_held)
@@ -791,13 +1027,18 @@ BufferAlloc(SMgrRelation smgr, char relp
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc)
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ } else
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
}
}
@@ -860,10 +1101,17 @@ BufferAlloc(SMgrRelation smgr, char relp
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
- if (StartBufferIO(buf, true))
- *foundPtr = FALSE;
- else
- *foundPtr = TRUE;
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc) {
+ *foundPtr = 0;
+ } else {
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
+ } else {
+ *foundPtr = 1;
+ }
+ }
return buf;
}
@@ -970,6 +1218,10 @@ retry:
/*
* Insert the buffer at the head of the list of free buffers.
*/
+ /* avoid confusing the freelist with an overloaded freeNext value */
+ if (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN) { /* freeNext was used to hold an aiocb index */
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ }
StrategyFreeBuffer(buf);
}
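The negative freeNext encodings used above and throughout the patch (-pid for a banked pin, FREENEXT_BAIOCB_ORIGIN - index while an aio owns the buffer) could be wrapped in small helpers. The sketch below only makes the encoding explicit and is not part of the patch; the helper names are invented, while FREENEXT_BAIOCB_ORIGIN, BAiocbAnchr and this_backend_pid are identifiers the patch already introduces.

/* Sketch only - not part of the patch. While an aio owns the buffer,
 * freeNext holds FREENEXT_BAIOCB_ORIGIN - <index into BAiocbAnchr->BufferAiocbs>;
 * while a pin is banked with no aio in progress, it holds -(banking pid).
 */
static inline int
FreeNextForAiocbIndex(int aiocb_index)
{
	return FREENEXT_BAIOCB_ORIGIN - aiocb_index;	/* always <= FREENEXT_BAIOCB_ORIGIN */
}

static inline int
AiocbIndexFromFreeNext(int freeNext)
{
	Assert(freeNext <= FREENEXT_BAIOCB_ORIGIN);
	return FREENEXT_BAIOCB_ORIGIN - freeNext;		/* inverse of the above */
}

static inline pid_t
BankingPidFromFreeNext(int freeNext)
{
	return (pid_t) (-freeNext);						/* set via freeNext = -(this_backend_pid) */
}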
@@ -1022,6 +1274,56 @@ MarkBufferDirty(Buffer buffer)
UnlockBufHdr(bufHdr);
}
+/* Return the block number of the block in a buffer, if the buffer is valid.
+** If it is a shared buffer, it must be pinned.
+*/
+BlockNumber
+BlocknumOfBuffer(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ BlockNumber rc = 0;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc = bufHdr->tag.blockNum;
+ }
+
+ return rc;
+}
+
+/* Report whether the specified buffer does NOT contain the given block of the relation's main fork.
+** If it is a shared buffer, it must be pinned.
+*/
+bool
+BlocknotinBuffer(Buffer buffer,
+ Relation relation,
+ BlockNumber blockNum)
+{
+ volatile BufferDesc *bufHdr;
+ bool rc = false;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc =
+ ( (bufHdr->tag.blockNum != blockNum)
+ || (!(RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) ))
+ || (bufHdr->tag.forkNum != MAIN_FORKNUM)
+ );
+ }
+
+ return rc;
+}
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
@@ -1040,18 +1342,18 @@ ReleaseAndReadBuffer(Buffer buffer,
Relation relation,
BlockNumber blockNum)
{
- ForkNumber forkNum = MAIN_FORKNUM;
volatile BufferDesc *bufHdr;
+ bool isDifferentBlock; /* requesting different block from that already in buffer ? */
if (BufferIsValid(buffer))
{
+ /* if a shared buff, we have pin, so it's ok to examine tag without spinlock */
+ isDifferentBlock = BlocknotinBuffer(buffer,relation,blockNum); /* requesting different block from that already in buffer ? */
if (BufferIsLocal(buffer))
{
Assert(LocalRefCount[-buffer - 1] > 0);
bufHdr = &LocalBufferDescriptors[-buffer - 1];
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ if (!isDifferentBlock)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
LocalRefCount[-buffer - 1]--;
@@ -1060,12 +1362,12 @@ ReleaseAndReadBuffer(Buffer buffer,
{
Assert(PrivateRefCount[buffer - 1] > 0);
bufHdr = &BufferDescriptors[buffer - 1];
- /* we have pin, so it's ok to examine tag without spinlock */
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ BufCheckAsync(0 , relation , bufHdr , ( isDifferentBlock ? BUF_INTENTION_REJECT_FORGET
+ : BUF_INTENTION_REJECT_KEEP_PIN )
+ , 0 , 0 , false , 0 ); /* end any IO and maybe unpin */
+ if (!isDifferentBlock) {
return buffer;
- UnpinBuffer(bufHdr, true);
+ }
}
}
@@ -1090,11 +1392,12 @@ ReleaseAndReadBuffer(Buffer buffer,
* Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows
* some callers to avoid an extra spinlock cycle.
*/
-static bool
+bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
int b = buf->buf_id;
bool result;
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
if (PrivateRefCount[b] == 0)
{
@@ -1116,12 +1419,34 @@ PinBuffer(volatile BufferDesc *buf, Buff
else
{
/* If we previously pinned the buffer, it must surely be valid */
+ /* That is not necessarily true any more :
+ ** what if I pin, start an IO that is still in progress, and then mistakenly pin again?
result = true;
+ */
+ LockBufHdr(buf);
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
}
+ }
+ result = (buf->flags & BM_VALID) != 0;
+ UnlockBufHdr(buf);
+ }
+
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
return result;
}
@@ -1138,19 +1463,36 @@ PinBuffer(volatile BufferDesc *buf, Buff
* to save a spin lock/unlock cycle, because we need to pin a buffer before
* its state can change under us.
*/
-static void
+void
PinBuffer_Locked(volatile BufferDesc *buf)
{
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (PrivateRefCount[b] == 0)
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (PrivateRefCount[b] == 0) {
buf->refcount++;
+ }
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer_Locked : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
}
+}
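The banked-pin test above is repeated, with minor variations, in PinBuffer, PinBuffer_Locked, UnpinBuffer and IncrBufferRefCount. As a possible cleanup, here is a sketch (not part of the patch) of how it could be centralised; the helper name is invented, the caller must hold the buffer header spinlock, and PinBuffer would additionally check PrivateRefCount[b] > 0.

/* Sketch only - not part of the patch. Centralise the "is the banked pin
 * mine?" test that the pin/unpin routines above repeat inline. Caller must
 * hold the buffer header spinlock; all identifiers other than the helper
 * name come from this patch.
 */
static inline bool
BankedPinIsMine(volatile BufferDesc *buf)
{
	pid_t		banking_pid;

	if (!(buf->flags & BM_AIO_PREFETCH_PIN_BANKED))
		return false;

	if (buf->flags & BM_AIO_IN_PROGRESS)
		banking_pid = ((BAiocbAnchr->BufferAiocbs)
					   + (FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio;
	else
		banking_pid = (pid_t) (-(buf->freeNext));

	return banking_pid == this_backend_pid;
}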
/*
* UnpinBuffer -- make buffer available for replacement.
@@ -1160,29 +1502,68 @@ PinBuffer_Locked(volatile BufferDesc *bu
* Most but not all callers want CurrentResourceOwner to be adjusted.
* Those that don't should pass fixOwner = FALSE.
*/
-static void
+void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
+
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (fixOwner)
+ if (fixOwner) {
ResourceOwnerForgetBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
{
+
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
LockBufHdr(buf);
+ /* this backend has released its last pin - the buffer should not have a pin banked by me,
+ ** and if AIO is in progress then there should be a pin from another backend
+ */
+ pin_already_banked_by_me = ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+ ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext))
+ ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ /* this is a strange situation - the caller had a banked pin (which callers are not supposed to know about)
+ ** but has either discovered it or has over-counted how many pins it holds
+ */
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the pin although it is now of no use since about to release */
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+
+ /* temporarily suppress logging this condition to avoid performance degradation -
+ ** either this task really does not need the buffer, in which case the condition is harmless,
+ ** or a more severe error will be detected later (possibly immediately below)
+ elog(LOG, "UnpinBuffer : released last this-backend pin on buffer %d rel=%s, blockNum=%u, but had banked pin flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ */
+ }
+
/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;
+ if ( (buf->refcount == 0) && (buf->flags & BM_AIO_IN_PROGRESS) ) {
+
+ elog(ERROR, "UnpinBuffer : released last any-backend pin on buffer %d rel=%s, blockNum=%u, but AIO in progress flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ }
+
+
/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
@@ -1657,6 +2038,7 @@ SyncOneBuffer(int buf_id, bool skip_rece
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
+
/*
* Check whether buffer needs writing.
*
@@ -1789,6 +2171,8 @@ PrintBufferLeakWarning(Buffer buffer)
char *path;
BackendId backend;
+
+
Assert(BufferIsValid(buffer));
if (BufferIsLocal(buffer))
{
@@ -1799,12 +2183,28 @@ PrintBufferLeakWarning(Buffer buffer)
else
{
buf = &BufferDescriptors[buffer - 1];
+#ifdef USE_PREFETCH
+ /* If reason that this buffer is pinned
+ ** is that it was prefetched with async_io
+ ** and never read or discarded, then omit the
+ ** warning, because this is expected in some
+ ** cases when a scan is closed abnormally.
+ ** Note that the buffer will be released soon by our caller.
+ */
+ if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED) {
+ pgBufferUsage.aio_read_forgot++; /* account for it */
+ return;
+ }
+#endif /* USE_PREFETCH */
loccount = PrivateRefCount[buffer - 1];
backend = InvalidBackendId;
}
+/* #if defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
/* theoretically we should lock the bufhdr here */
path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+
+
elog(WARNING,
"buffer refcount leak: [%03d] "
"(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1812,6 +2212,7 @@ PrintBufferLeakWarning(Buffer buffer)
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
pfree(path);
+/* #endif defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
}
/*
@@ -1928,7 +2329,7 @@ FlushBuffer(volatile BufferDesc *buf, SM
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, 0))
return;
/* Setup error traceback support for ereport() */
@@ -2512,6 +2913,70 @@ FlushDatabaseBuffers(Oid dbid)
}
}
+#ifdef USE_PREFETCH
+/*
+ * DiscardBuffer -- discard shared buffer used for a previously
+ * prefetched but unread block of a relation
+ *
+ * If the buffer is found and pinned with a banked pin, then :
+ * . if AIO in progress, terminate AIO without waiting
+ * . if AIO had already completed successfully,
+ * then mark buffer valid (in case someone else wants it)
+ * . redeem the banked pin and unpin it.
+ *
+ * This function is similar in purpose to ReleaseBuffer (below)
+ * but sufficiently different that it is a separate function.
+ * Two important differences are :
+ * . caller identifies buffer by blocknumber, not buffer number
+ * . we unpin buffer *only* if the pin is banked,
+ * *never* if pinned but not banked.
+ * This is essential as caller may perform a sequence of
+ * SCAN1 . PrefetchBuffer (and remember block was prefetched)
+ * SCAN2 . ReadBuffer (but fails to connect this read to the prefetch by SCAN1)
+ * SCAN1 . DiscardBuffer (SCAN1 terminates early)
+ * SCAN2 . access tuples in buffer
+ * Clearly the Discard *must not* unpin the buffer since SCAN2 needs it!
+ *
+ *
+ * The caller may pass InvalidBlockNumber as blockNum, in which case this function does nothing.
+ */
+void
+DiscardBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLockId newPartitionLock; /* buffer partition lock for it */
+ Buffer buf_id;
+ volatile BufferDesc *buf_desc;
+
+ if (!SmgrIsTemp(reln->rd_smgr)) {
+ Assert(RelationIsValid(reln));
+ if (BlockNumberIsValid(blockNum)) {
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ BufCheckAsync(0 , reln, buf_desc , BUF_INTENTION_REJECT_UNBANK , 0 , 0 , false , 0); /* end the IO and unpin if banked */
+ pgBufferUsage.aio_read_discrd++; /* account for it */
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
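To make the "unpin only if banked" rule concrete, here is an illustrative teardown for a scan that tracked the blocks it prefetched (see the PrefetchBuffer sketch earlier). It is not part of the patch; the queue structure and function name are invented, while DiscardBuffer is the function above.

/* Sketch only - not part of the patch. Drain prefetched-but-unread blocks
 * when a scan shuts down early. Because DiscardBuffer unpins only a pin
 * that is still banked, calling it for a block that another code path has
 * since read (thereby redeeming the pin) is harmless.
 */
typedef struct PrefetchQueue
{
	int			nblocks;
	BlockNumber	blocks[64];		/* prefetched but not yet read */
} PrefetchQueue;

static void
discard_outstanding_prefetches(Relation rel, PrefetchQueue *q)
{
	int			i;

	for (i = 0; i < q->nblocks; i++)
		DiscardBuffer(rel, MAIN_FORKNUM, q->blocks[i]);
	q->nblocks = 0;
}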
/*
* ReleaseBuffer -- release the pin on a buffer
*/
@@ -2520,26 +2985,23 @@ ReleaseBuffer(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
if (BufferIsLocal(buffer))
{
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]--;
return;
}
-
- bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- PrivateRefCount[buffer - 1]--;
else
- UnpinBuffer(bufHdr, false);
+ {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ BufCheckAsync(0 , 0 , bufHdr , BUF_INTENTION_REJECT_NOADJUST , 0 , 0 , false , 0 );
+ }
}
/*
@@ -2565,14 +3027,41 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
+ volatile BufferDesc *buf; /* descriptor for a shared buffer */
+
Assert(BufferIsPinned(buffer));
+
+ if (!(BufferIsLocal(buffer))) {
+ buf = &BufferDescriptors[buffer - 1];
+ LockBufHdr(buf);
+ pin_already_banked_by_me =
+ ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ }
+
+ if (!pin_already_banked_by_me) {
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
+ }
+
if (BufferIsLocal(buffer))
LocalRefCount[-buffer - 1]++;
- else
+ else {
+ if (pin_already_banked_by_me) {
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[buffer - 1]++;
}
+ }
+}
/*
* MarkBufferDirtyHint
@@ -2994,61 +3483,138 @@ WaitIO(volatile BufferDesc *buf)
*
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
- * has already started I/O on this buffer then we will block on the
+ * has already started synchronous I/O on this buffer then we will block on the
* io_in_progress lock until he's done.
*
+ * If an async io is in progress and we are doing synchronous io,
+ * then ReadBuffer uses a call to smgrcompleteaio to wait,
+ * and so we treat this request as if no io were in progress.
+ *
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
* so we can always tell if the work is already done.
*
+ * index_for_aio is an input parameter which, if non-zero, identifies a BufferAiocb
+ * acquired by the caller and to be attached to the buffer header for use with async io.
+ *
* Returns TRUE if we successfully marked the buffer as I/O busy,
* FALSE if someone else already did the work.
*/
static bool
-StartBufferIO(volatile BufferDesc *buf, bool forInput)
+StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio )
{
+#ifdef USE_PREFETCH
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+#endif /* USE_PREFETCH */
+
+ if (!index_for_aio)
Assert(!InProgressBuf);
for (;;)
{
+ if (!index_for_aio) {
/*
* Grab the io_in_progress lock so that other processes can wait for
* me to finish the I/O.
*/
LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
+ }
LockBufHdr(buf);
- if (!(buf->flags & BM_IO_IN_PROGRESS))
+ /* the following test is intended to distinguish between :
+ ** . a buffer which :
+ ** . has io in progress
+ ** AND is not associated with a current or recent aio
+ ** . anything else
+ ** Here, "recent" means an aio marked by buf->freeNext <= FREENEXT_BAIOCB_ORIGIN but no longer in progress -
+ ** this situation arises when the aio has just been cancelled and this process now wishes to recycle the buffer.
+ ** In this case, the first such would-be recycler (i.e. me) must :
+ ** . avoid waiting for the cancelled aio to complete
+ ** . if not itself doing the async read, assume responsibility for posting future readbuffers.
+ */
+ if ( (buf->flags & BM_AIO_IN_PROGRESS)
+ || (!(buf->flags & BM_IO_IN_PROGRESS))
+ )
break;
/*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
+ * The only way BM_IO_IN_PROGRESS (without AIO in progress) could be set when the io_in_progress
* lock isn't held is if the process doing the I/O is recovering from
* an error (see AbortBufferIO). If that's the case, we must wait for
* him to get unwedged.
*/
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
WaitIO(buf);
}
- /* Once we get here, there is definitely no I/O active on this buffer */
-
+#ifdef USE_PREFETCH
+ /* Once we get here, there is definitely no synchronous I/O active on this buffer,
+ ** but if we are being asked to attach a BufferAiocb to the buf header,
+ ** then we must also check whether there is any async io currently in progress,
+ ** or a banked pin, set up by a different task.
+ */
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext);
+ if ( (buf->flags & (BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))
+ && (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN)
+ && (BAiocb->pidOfAio != this_backend_pid)
+ ) {
+ /* someone else already doing async I/O */
+ UnlockBufHdr(buf);
+ return false;
+ }
+ }
+#endif /* USE_PREFETCH */
if (forInput ? (buf->flags & BM_VALID) : !(buf->flags & BM_DIRTY))
{
/* someone else already did the I/O */
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
return false;
}
buf->flags |= BM_IO_IN_PROGRESS;
+#ifdef USE_PREFETCH
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - index_for_aio);
+ /* insist that no other buffer is using this BufferAiocb for async IO */
+ if (BAiocb->BAiocbbufh == (struct sbufdesc *)0) {
+ BAiocb->BAiocbbufh = buf;
+ }
+ if (BAiocb->BAiocbbufh != buf) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block %p to be used by %p already in use by %p"
+ ,BAiocb ,buf , BAiocb->BAiocbbufh)));
+ }
+ /* note - there is no need to register ourselves as a dependent of the BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ buf->flags |= BM_AIO_IN_PROGRESS;
+ buf->freeNext = index_for_aio;
+ /* at this point, this buffer appears to have an in-progress aio_read,
+ ** and any other task which is able to look inside the buffer might try waiting on that aio -
+ ** except we have not yet issued the aio! So we must keep the buffer header locked
+ ** from here all the way back to the BufStartAsync caller
+ */
+ } else {
+#endif /* USE_PREFETCH */
+
UnlockBufHdr(buf);
InProgressBuf = buf;
IsForInput = forInput;
+#ifdef USE_PREFETCH
+ }
+#endif /* USE_PREFETCH */
return true;
}
@@ -3058,7 +3624,7 @@ StartBufferIO(volatile BufferDesc *buf,
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
+ * if no async IO is in progress, then we hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
@@ -3070,26 +3636,32 @@ StartBufferIO(volatile BufferDesc *buf,
* BM_IO_ERROR in a failure case. For successful completion it could
* be 0, or BM_VALID if we just finished reading in the page.
*/
-static void
+void
TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits)
{
- Assert(buf == InProgressBuf);
+ int flags_on_entry;
LockBufHdr(buf);
+ flags_on_entry = buf->flags;
+
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) )
+ Assert( buf == InProgressBuf );
+
Assert(buf->flags & BM_IO_IN_PROGRESS);
- buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
+ buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) ) {
InProgressBuf = NULL;
-
LWLockRelease(buf->io_in_progress_lock);
}
+}
/*
* AbortBufferIO: Clean up any active buffer I/O after an error.
--- src/backend/storage/buffer/buf_async.c.orig 2014-05-28 08:50:32.446571884 -0400
+++ src/backend/storage/buffer/buf_async.c 2014-05-28 16:45:43.014507687 -0400
@@ -0,0 +1,920 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_async.c
+ * buffer manager asynchronous disk read routines
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/buffer/buf_async.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * Principal entry points:
+ *
+ * BufStartAsync() -- start an asynchronous read of a block into a buffer and
+ * pin it so that no one can destroy it while this process is using it.
+ *
+ * BufCheckAsync() -- check completion of an asynchronous read
+ * and either claim buffer or discard it
+ *
+ * Private helper
+ *
+ * BufReleaseAsync() -- release the BAiocb resources used for an asynchronous read
+ *
+ * See also these files:
+ * bufmgr.c -- main buffer manager functions
+ * buf_init.c -- initialisation of resources
+ */
+#include "postgres.h"
+#include <sys/types.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "catalog/catalog.h"
+#include "common/relpath.h"
+#include "executor/instrument.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "storage/standby.h"
+#include "utils/rel.h"
+#include "utils/resowner_private.h"
+
+/*
+ * GUC parameters
+ */
+int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
+
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+extern int maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+extern int maxRelBAiocbTries; /* max times we will try to release a BufferAiocb back to freelist */
+extern pid_t this_backend_pid; /* pid of this backend */
+
+extern bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+extern void PinBuffer_Locked(volatile BufferDesc *buf);
+extern Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+ ForkNumber forkNum, BlockNumber blockNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy,
+ bool *hit , int index_for_aio);
+extern void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+extern void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+ int set_flag_bits);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+int BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc
+ ,int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
+static struct BufferAiocb volatile * cachedBAiocb = (struct BufferAiocb*)0; /* backend-local cache of BufferAiocbs that could not be returned to the shared freelist */
+
+#ifdef USE_PREFETCH
+/* BufReleaseAsync releases a BufferAiocb and returns 0 if successful, else non-zero.
+** it *must* be called :
+** EITHER with a valid BAiocb->BAiocbbufh -> buf_desc
+** and that buf_desc must be spin-locked
+** OR with BAiocb->BAiocbbufh == 0
+*/
+static int
+BufReleaseAsync(struct BufferAiocb volatile * BAiocb)
+{
+ int LockTries; /* max times we will try to release the BufferAiocb */
+ volatile struct BufferAiocb *BufferAiocbs;
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+
+ int failed = 1; /* by end of this function, non-zero will indicate if we failed to return the BAiocb */
+
+
+ if ( ( BAiocb == (struct BufferAiocb*)0 )
+ || ( BAiocb == (struct BufferAiocb*)BAIOCB_OCCUPIED )
+ || ( ((unsigned long)BAiocb) & 0x1 )
+ ) {
+ elog(ERROR,
+ "AIO control block corruption on release of aiocb %p - invalid BAiocb"
+ ,BAiocb);
+ }
+ else
+ if ( (0 == BAiocb->BAiocbDependentCount) /* no dependents */
+ && ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) /* not already on freelist */
+ ) {
+
+ if ((struct sbufdesc*)0 != BAiocb->BAiocbbufh) { /* if a buffer was attached */
+ volatile BufferDesc *buf_desc = BAiocb->BAiocbbufh;
+
+ /* spinlock held so instead of TerminateBufferIO(buf, false , 0); ... */
+ if (buf_desc->flags & BM_AIO_PREFETCH_PIN_BANKED) { /* if a pid banked the pin */
+ buf_desc->freeNext = -(BAiocb->pidOfAio); /* then remember which pid */
+ }
+ else if (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* disconnect BufferAiocb from buf_desc */
+ }
+ buf_desc->flags &= ~BM_AIO_IN_PROGRESS;
+ }
+
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* disconnect buf_desc from BufferAiocb */
+ BAiocb->pidOfAio = 0; /* clean */
+ LockTries = maxRelBAiocbTries; /* max times we will try to release the BufferAiocb */
+ do {
+ register long long int dividend , remainder;
+
+ /* retrieve old value of FreeBAiocbs */
+ BAiocb->BAiocbnext = oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* this is a volatile value unprotected by any lock, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs
+ ** to do so, verify by direct division that its address offset from first control block
+ ** is an integral multiple of the control block size
+ ** that lies within the range [ 0 , (numBufferAiocbs-1) ]
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ failed = (int)remainder;
+ if (!failed) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ failed = ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) );
+ if (!failed) {
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, BAiocb)) {
+ LockTries = 0; /* end the do loop */
+
+ goto cheering; /* can't simply break because then failed would be set incorrectly */
+ }
+ }
+ }
+ /* if we reach here, the compare-and-swap either was not attempted or did not succeed,
+ ** so mark this attempt as failed before possibly retrying
+ */
+ failed = 1;
+
+ cheering: ;
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while (LockTries-- > 0);
+
+ if (failed) {
+#ifdef LOG_RELBAIOCB_DEPLETION
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p unreleased after tries= %d\n"
+ ,BAiocb,maxRelBAiocbTries);
+#endif /* LOG_RELBAIOCB_DEPLETION */
+ }
+
+ }
+ else
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p either has dependents= %d or is already on freelist %p or has no buf_header %p\n"
+ ,BAiocb , BAiocb->BAiocbDependentCount , BAiocb->BAiocbnext , BAiocb->BAiocbbufh);
+ return failed;
+}
+
+/* try using asynchronous aio_read to prefetch into a buffer
+** return code :
+** 0 if started successfully
+** -1 if failed for some reason
+** 1+PrivateRefCount if we found desired buffer in buffer pool
+**
+** There is a harmless race condition here :
+** two different backends may both arrive here simultaneously
+** to prefetch the same buffer. This is not unlikely when a syncscan is in progress.
+** . One will acquire the buffer and issue the smgrstartaio
+** . Other will find the buffer on return from ReadBuffer_common with hit = true
+** Only the first task has a pin on the buffer, since ReadBuffer_common knows not to pin
+** a found buffer in prefetch mode.
+** Therefore the second task must simply abandon the prefetch if it finds the buffer in the buffer pool.
+**
+** If we fail to acquire a BAiocb because of concurrent theft from the freelist by another backend,
+** retry up to maxGetBAiocbTries times, provided that there actually was at least one BAiocb on the freelist.
+*/
+int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy) {
+
+ int retcode = -1;
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+ int smgrstartaio_rc = -1; /* retcode from smgrstartaio */
+ bool do_unpin_buffer = false; /* unpin must be deferred until after buffer descriptor is unlocked */
+ Buffer buf_id;
+ bool hit = false;
+ volatile BufferDesc *buf_desc = (BufferDesc *)0;
+
+ int LockTries; /* max times we will try to get a free BufferAiocb */
+
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+ struct BufferAiocb volatile * newFreeBAiocb; /* new value of FreeBAiocbs */
+
+
+ /* return immediately if no async io resources */
+ if (numBufferAiocbs > 0) {
+ buf_id = (Buffer)0;
+
+ if ( (struct BAiocbAnchor *)0 != BAiocbAnchr ) {
+
+ volatile struct BufferAiocb *BufferAiocbs;
+
+ if ((struct BufferAiocb*)0 != cachedBAiocb) { /* any cached BufferAiocb ? */
+ BAiocb = cachedBAiocb; /* yes use it */
+ cachedBAiocb = BAiocb->BAiocbnext;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+ } else {
+
+ LockTries = maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+ do {
+ register long long int dividend = -1 , remainder;
+ /* check if we have a free BufferAiocb */
+
+ oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* check if we have a free BufferAiocb */
+
+ /* BAiocbAnchr->FreeBAiocbs is a volatile value unprotected by any lock,
+ ** and use of compare-and-swap to add and remove items from the list has
+ ** two potential pitfalls, both relating to the fact that we must
+ ** access data de-referenced from this pointer before the compare-and-swap.
+ ** 1) The value we load may be corrupt, e.g. mixture of bytes from
+ ** two different values, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+ ** to do so, verify by direct division that its address offset from
+ ** first control block is an integral multiple of the control block size
+ ** that lies within the range [ 0 , (numBufferAiocbs-1) ]
+ ** Thus we completely prevent this pitfall.
+ ** 2) The content of the item's next pointer may have changed between the
+ ** time we de-reference it and the time of the compare-and-swap.
+ ** Thus even though the compare-and-swap succeeds, we might set the
+ ** new head of the freelist to an invalid value (either a free item
+ ** that is not the first in the free chain - resulting only in
+ ** loss of the orphaned free items, or, much worse, an in-use item).
+ ** In practice this is extremely unlikely because of the implied huge delay
+ ** in this window interval in this (current) process. Here are two scenarios:
+ ** legend:
+ ** P0 - this (current) process, P1, P2 , ... other processes
+ ** content of freelist shown as BAiocbAnchr->FreeBAiocbs -> first item -> 2nd item ...
+ ** @[X] means address of X
+ ** | timeline of window of exposure to problems
+ ** successive lines in chronological order content of freelist
+ ** 2.1 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 IS IN USE !! CORRUPT !!
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had become in-use during the window.
+ ** 2.2 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P3 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I2] F -> I2 -> I3 ...
+ ** | P3 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I3] F -> I2 -> I3 ...
+ ** | P3 swap-remove I2, place I3 at head of list F -> I3 ...
+ ** | P2 complete aio, replace I1 at head of list F -> I1 -> I3 ...
+ ** | P3 complete aio, replace I2 at head of list F -> I2 -> I1 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I1 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 -> I3 ... ! I2 is orphaned !
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had moved further down the free list during the window.
+ ** Unfortunately, we cannot prevent this pitfall but we can detect it (after the fact),
+ ** by checking that the next pointer of the item we have just removed for our use still points to the same item.
+ ** This test is not subject to any timing or uncertainty since :
+ ** . The fact that the compare-and-swap succeeded implies that the item we removed
+ ** was defintely on the freelist (at the head) when it was removed,
+ ** and therefore cannot be in use, and therefore its next pointer is no longer volatile.
+ ** . Although pointers of the anchor and items on the freelist are volatile,
+ ** the addresses of items never change - they are in an allocated array and never move.
+ ** E.g. in the above two scenarios, the test is that I0.next still -> I1,
+ ** and this is true if and only if the second item on the freelist is
+ ** still the same at the end of the window as it was at the start of the window.
+ ** Note that we do not insist that it did not change during the window,
+ ** only that it is still the correct new head of freelist.
+ ** If this test fails, we abort immediately as the subsystem is damaged and cannot be repaired.
+ ** Note that at least one aio must have been issued *and* completed during the window
+ ** for this to occur, and since the window is just one single machine instruction,
+ ** it is very unlikely in practice.
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ if (remainder == 0) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ }
+ if ( (remainder == 0)
+ && ( (dividend >= 0 ) && ( dividend < numBufferAiocbs) )
+ )
+ {
+ newFreeBAiocb = oldFreeBAiocb->BAiocbnext; /* tentative new value is second on free list */
+ /* Here we are in the exposure window referred to in the above comments,
+ ** so moving along rapidly ...
+ */
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, newFreeBAiocb)) { /* did we get it ? */
+ /* We have successfully swapped head of freelist pointed to by oldFreeBAiocb off the list;
+ ** Here we check that the item we just placed at head of freelist, pointed to by newFreeBAiocb,
+ ** is the right one
+ **
+ ** also check that the BAiocb we have acquired was not in use
+ ** i.e. that scenario 2.1 above did not occur just before our compare-and-swap
+ ** The test is that the BAiocb is not in use.
+ **
+ ** in one hypothetical case,
+ ** we can be certain that there is no corruption -
+ ** the case where newFreeBAiocb == 0 and oldFreeBAiocb->BAiocbnext != BAIOCB_OCCUPIED -
+ ** i.e. we have set the freelist to empty but we have a baiocb chained from ours.
+ ** in this case our comp_swap removed all BAiocbs from the list (including ours)
+ ** so the others chained from ours are either orphaned (no harm done)
+ ** or in use by another backend and will eventually be returned (fine).
+ */
+ if ((struct BufferAiocb *)0 == newFreeBAiocb) {
+ if ((struct BufferAiocb *)BAIOCB_OCCUPIED == oldFreeBAiocb->BAiocbnext) {
+ goto baiocb_corruption;
+ } else if ((struct BufferAiocb *)0 != oldFreeBAiocb->BAiocbnext) {
+ elog(LOG,
+ "AIO control block inconsistency on acquiring aiocb %p - its next free %p may be orphaned (no corruption has occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext);
+ }
+ } else {
+ /* case of newFreeBAiocb not null - so must check more carefully ... */
+ remainder = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ dividend = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+
+ if ( (newFreeBAiocb != oldFreeBAiocb->BAiocbnext)
+ || (remainder != 0)
+ || ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) )
+ ) {
+ goto baiocb_corruption;
+ }
+ }
+ BAiocb = oldFreeBAiocb;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+
+ LockTries = 0; /* end the do loop */
+
+ }
+ }
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while ( ((struct BufferAiocb*)0 == BAiocb) /* did not get a BAiocb */
+ && ((struct BufferAiocb*)0 != oldFreeBAiocb) /* there was a free BAiocb */
+ && (LockTries-- > 0) /* told to retry */
+ );
+ }
+ }
+
+ if ( BAiocb != (struct BufferAiocb*)0 ) {
+ /* try an async io */
+ BAiocb->BAiocbthis.aio_fildes = -1; /* necessary to ensure any thief realizes aio not yet started */
+ BAiocb->pidOfAio = this_backend_pid;
+
+ /* now try to acquire a buffer :
+ ** note - ReadBuffer_common returns hit=true if the block is found in the buffer pool,
+ ** in which case there is no need to prefetch.
+ ** otherwise ReadBuffer_common pins the returned buffer and calls StartBufferIO,
+ ** and StartBufferIO :
+ ** . sets buf_desc->freeNext to -(index of the aiocb in the BufferAiocbs array + 3)
+ ** . sets BAiocb->BAiocbbufh -> buf_desc
+ ** and in this case the buffer spinlock is held.
+ ** This is essential, as no other task may act on
+ ** the buffer until we have started the aio_read.
+ ** Also note that ReadBuffer_common handles enlarging the ResourceOwner buffer list as needed,
+ ** so we do not need to do that here.
+ */
+ buf_id = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
+ forkNum, blockNum
+ ,RBM_NOREAD_FOR_PREFETCH /* tells ReadBuffer not to do any read, just alloc buf */
+ ,strategy , &hit , (FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))));
+ buf_desc = &BufferDescriptors[buf_id-1]; /* find buffer descriptor */
+
+ /* normally hit will be false as presumably it was not in the pool
+ ** when our caller looked - but it could be there now ...
+ */
+ if (hit) {
+ /* see earlier comments - we must abandon the prefetch */
+ retcode = 1 + PrivateRefCount[buf_id];
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ } else
+ if ( (buf_id > 0) && ((BufferDesc *)0 != buf_desc) && (buf_desc == BAiocb->BAiocbbufh) ) {
+ /* the buffer descriptor header lock should be held.
+ ** However, just to be safe, validate that
+ ** we are still the owner and no other task has already stolen it.
+ */
+
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* ensure no banked pin */
+ /* there should not be any other pid waiting on this buffer,
+ ** so check that neither BM_VALID nor BM_PIN_COUNT_WAITER is set
+ */
+ if ( ( !(buf_desc->flags & (BM_VALID|BM_PIN_COUNT_WAITER) ) )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) /* it is still mine */
+ && (-1 == BAiocb->BAiocbthis.aio_fildes) /* no thief stole it */
+ && (0 == BAiocb->BAiocbDependentCount) /* no dependent */
+ ) {
+ /* we have an empty buffer for our use */
+
+ BAiocb->BAiocbthis.aio_buf = (void *)(BufHdrGetBlock(buf_desc)); /* Location of actual buffer. */
+
+ /* note - there is no need to register ourselves as a dependent of the BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ /* smgrstartaio retcode is returned in smgrstartaio_rc -
+ ** it indicates whether started or not
+ */
+ smgrstartaio(reln->rd_smgr, forkNum, blockNum , (char *)&(BAiocb->BAiocbthis) , &smgrstartaio_rc );
+
+ if (smgrstartaio_rc == 0) {
+ retcode = 0;
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* we did not register ourselves as a dependent of the BAiocb, so there is no need to unregister */
+ } else {
+ /* failed - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ /* spinlock held so instead of TerminateBufferIO(buf_desc, false , 0); ... */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS | BM_AIO_PREFETCH_PIN_BANKED | BM_VALID);
+ /* we did not register ourselves as a dependent of the BAiocb, so there is no need to unregister */
+
+ /* return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+
+ pgBufferUsage.aio_read_failed++;
+ smgrstartaio_rc = 1; /* to distinguish from aio not even attempted */
+ }
+ }
+ else {
+ /* buffer was stolen or in use by other task - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ }
+
+ UnlockBufHdr(buf_desc);
+ if (do_unpin_buffer) {
+ if (smgrstartaio_rc >= 0) { /* if aio was attempted */
+ TerminateBufferIO(buf_desc, false , 0);
+ }
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+ else {
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ }
+
+ if ((struct sbufdesc*)0 == BAiocb->BAiocbbufh) { /* we did not associate a buffer */
+ /* so return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+ }
+ }
+
+ return retcode;
+
+ baiocb_corruption:;
+ elog(PANIC,
+ "AIO control block corruption on acquiring aiocb %p - its next free %p conflicts with new freelist pointer %p which may be invalid (corruption may have occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext , newFreeBAiocb);
+}
+#endif /* USE_PREFETCH */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+/*
+ * BufCheckAsync -- act upon caller's intention regarding a shared buffer,
+ * primarily in connection with any async io in progress on the buffer.
+ * The intention argument has two main classes, each with subvalues :
+ * +ve ( 1 ) want <=> caller wants the buffer :
+ * wait for in-progress aio and then always pin
+ * -ve ( -1, -2, -3, ... see below ) reject <=> caller does not want the buffer :
+ * if there are no dependents, then cancel the aio
+ * and then optionally unpin
+ * Used when there may have been a previous fetch or prefetch.
+ *
+ * buffer is assumed to be an existing member of the shared buffer pool
+ * as returned by BufTableLookup.
+ * if AIO in progress, then :
+ * . terminate AIO, waiting for completion if +ve intention, else without waiting
+ * . if the AIO had already completed successfully, then mark buffer valid
+ * . pin/unpin as requested
+ *
+ * +ve intention indicates that buffer must be pinned :
+ * if the strategy parameter is null, then use the PinBuffer_Locked optimization
+ * to pin and unlock in one operation. But always update buffer usage count.
+ *
+ * -ve intention indicates whether and how to unpin :
+ * BUF_INTENTION_REJECT_KEEP_PIN -1 pin already held, do not unpin, (caller wants to keep it)
+ * BUF_INTENTION_REJECT_OBTAIN_PIN -2 obtain pin, caller wants it for same buffer
+ * BUF_INTENTION_REJECT_FORGET -3 unpin and tell resource owner to forget
+ * BUF_INTENTION_REJECT_NOADJUST -4 unpin and call ResourceOwnerForgetBuffer myself
+ * instead of telling UnpinBuffer to adjust CurrentResourceOwner
+ * (quirky simulation of ReleaseBuffer logic)
+ * BUF_INTENTION_REJECT_UNBANK -5 unpin only if pin banked by caller
+ * The behaviour for the -ve case is based on that of ReleaseBuffer, adding handling of async io.
+ *
+ * pin/unpin action must take account of whether this backend holds a "disposable" pin on the particular buffer.
+ * A "disposable" pin is a pin acquired by the buffer manager without the caller knowing, such as :
+ * when required to safeguard an async AIO - pin can be held across multiple bufmgr calls
+ * when required to safeguard waiting for an async AIO - pin acquired and released within this function
+ * if a disposable pin is held, then :
+ * if a new pin is requested, the disposable pin must be retained (redeemed) and any flags relating to it unset
+ * if an unpin is requested, then :
+ * if either no AIO in progress or this backend did not initiate the AIO
+ * then the disposable pin must be dropped (redeemed) and any flags relating to it unset
+ * else log warning and do nothing
+ * i.e. in either case, there is no longer a disposable pin after this function has completed.
+ * Note that if intention is BUF_INTENTION_REJECT_UNBANK,
+ * then caller expects there to be a disposable banked pin
+ * and if there isn't one, we do nothing
+ * for all other intentions, if there is no disposable pin, we pin/unpin normally.
+ *
+ * index_for_aio indicates the BAiocb to be used for next aio (see PrefetchBuffer)
+ * BufFreelistLockHeld indicates whether freelistlock is held
+ * spinLockHeld indicates whether buffer header spinlock is held
+ * PartitionLock is the buffer partition lock to be used
+ *
+ * return code (meaningful ONLY if intention is +ve) indicates validity of buffer :
+ * -1 buffer is invalid and failed PageHeaderIsValid check
+ * 0 buffer is not valid
+ * 1 buffer is valid
+ * 2 buffer is valid but tag changed - (so content does not match the relation block that caller expects)
+ */
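To make the prologue easier to digest, here is a minimal standalone sketch (not part of the patch) that simply encodes the intention classes and subvalues listed above as a lookup. The negative constant names and values are the ones documented in this comment; the function, main() and the wording of the strings are illustrative only.

    #include <stdio.h>

    #define BUF_INTENTION_REJECT_KEEP_PIN    (-1)
    #define BUF_INTENTION_REJECT_OBTAIN_PIN  (-2)
    #define BUF_INTENTION_REJECT_FORGET      (-3)
    #define BUF_INTENTION_REJECT_NOADJUST    (-4)
    #define BUF_INTENTION_REJECT_UNBANK      (-5)

    static const char *
    describe_intention(int intention)
    {
        if (intention > 0)
            return "want: wait for any in-progress aio, then pin";
        switch (intention)
        {
            case BUF_INTENTION_REJECT_KEEP_PIN:   return "reject: keep the pin already held";
            case BUF_INTENTION_REJECT_OBTAIN_PIN: return "reject: but obtain a pin";
            case BUF_INTENTION_REJECT_FORGET:     return "reject: unpin and forget";
            case BUF_INTENTION_REJECT_NOADJUST:   return "reject: unpin, adjusting refcount directly";
            case BUF_INTENTION_REJECT_UNBANK:     return "reject: unpin only if a pin was banked";
            default:                              return "unknown";
        }
    }

    int
    main(void)
    {
        int i;

        for (i = 1; i >= -5; i--)
        {
            if (i == 0)
                continue;               /* 0 is not a documented intention */
            printf("%2d  %s\n", i, describe_intention(i));
        }
        return 0;
    }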
+int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, BufferDesc volatile * buf_desc, int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock )
+{
+
+ int retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ bool valid = false;
+
+ BufferTag origTag = buf_desc->tag; /* original identity of selected buffer */
+
+#ifdef USE_PREFETCH
+ int smgrcompleteaio_rc; /* retcode from smgrcompleteaio */
+ SMgrRelation smgr = caller_smgr;
+ int aio_successful = -1; /* did the aio_read succeed ? -1 = no aio, 0 unsuccessful , 1 successful */
+ BufFlags flags_on_entry; /* for debugging - can be printed in gdb */
+ int freeNext_on_entry; /* for debugging - can be printed in gdb */
+ int BAiocbDependentCount_after_aio_finished = -1; /* for debugging - can be printed in gdb */
+ bool disposable_pin = false; /* this backend had a disposable pin on entry or pins the buffer while waiting for aio_read to complete */
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
+ int local_intention;
+#endif /* USE_PREFETCH */
+
+
+
+#ifdef USE_PREFETCH
+ if (!spinLockHeld) {
+ /* lock buffer header */
+ LockBufHdr(buf_desc);
+ }
+
+ flags_on_entry = buf_desc->flags;
+ freeNext_on_entry = buf_desc->freeNext;
+ pin_already_banked_by_me =
+ ( (flags_on_entry & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (flags_on_entry & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - freeNext_on_entry))->pidOfAio )
+ : (-(freeNext_on_entry)) ) == this_backend_pid )
+ );
+
+ if (pin_already_banked_by_me) {
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* but do we actually have a pin ?? */
+ /* this is an anomalous situation - somehow our disposable pin was lost without us noticing
+ ** if AIO is in progress and we started it,
+ ** then this is disastrous - two backends might both issue IO on same buffer
+ ** otherwise, it is harmless, and simply means we have no disposable pin,
+ ** but we must update flags to "notice" the fact now
+ */
+ if (flags_on_entry & BM_AIO_IN_PROGRESS) {
+ elog(ERROR, "BufCheckAsync : AIO control block issuer of aio_read lost pin with BM_AIO_IN_PROGRESS on buffer %d rel=%s, blockNum=%u, flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ } else {
+ elog(LOG, "BufCheckAsync : AIO control block issuer of aio_read lost pin on buffer %d rel=%s, blockNum=%u, with flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* since AIO not in progress, disconnect the buffer from banked pin */
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ pin_already_banked_by_me = false;
+ }
+ } else {
+ disposable_pin = true;
+ }
+ }
+
+ /* the case of BUF_INTENTION_REJECT_UNBANK is handled specially :
+ ** if this backend has a banked pin, then proceed just as for BUF_INTENTION_REJECT_FORGET
+ ** else the call is a no-op -- unlock buf header and return immediately
+ */
+ local_intention = intention;
+ if (intention == BUF_INTENTION_REJECT_UNBANK) {
+ if (pin_already_banked_by_me) {
+ local_intention = BUF_INTENTION_REJECT_FORGET;
+ } else {
+ goto unlock_buf_header; /* code following the unlock will do nothing since local_intention still set to BUF_INTENTION_REJECT_UNBANK */
+ }
+ }
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* we do not expect that BM_AIO_IN_PROGRESS is set without freeNext identifying the BAiocb */
+ if ( (buf_desc->flags & BM_AIO_IN_PROGRESS) && (buf_desc->freeNext == FREENEXT_NOT_IN_LIST) ) {
+
+ elog(ERROR, "BufCheckAsync : found BM_AIO_IN_PROGRESS without a BAiocb on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ }
+ /* check whether aio in progress */
+ if ( ( (struct BAiocbAnchor *)0 != BAiocbAnchr )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) /* has a valid BAiocb */
+ && ((FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext) < numBufferAiocbs) /* double-check */
+ ) { /* this is aio */
+ struct BufferAiocb volatile * BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext); /* BufferAiocb associated with this aio */
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) { /* ensure BAiocb is occupied */
+ aio_successful = 0; /* tentatively the aio_read did not succeed */
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+
+ if (smgr == NULL) {
+ if (caller_reln == NULL) {
+ smgr = smgropen(buf_desc->tag.rnode, InvalidBackendId);
+ } else {
+ smgr = caller_reln->rd_smgr;
+ }
+ }
+
+ /* assert that this AIO is not using the same BufferAiocb as the one caller asked us to use */
+ if ((index_for_aio < 0) && (index_for_aio == buf_desc->freeNext)) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block index %d to be used by %p already in use by %p"
+ ,index_for_aio, buf_desc, BAiocb->BAiocbbufh)));
+ }
+
+ /* Call smgrcompleteaio only if either we want the buffer or there are no dependents.
+ ** In the other case (reject with dependents),
+ ** one of them will do it.
+ */
+ if ( (local_intention > 0) || (0 == BAiocb->BAiocbDependentCount) ) {
+ if (local_intention > 0) {
+ /* wait for in-progress aio and then pin
+ ** OR if I did not issue the aio and do not have a pin
+ ** then pin now before waiting to ensure the buffer does not become unpinned while I wait
+ ** we may potentially wait for io to complete
+ ** so release buf header lock so that others may also wait here
+ */
+ BAiocb->BAiocbDependentCount++; /* register self as dependent */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* if this buffer not pinned by me */
+ disposable_pin = true; /* this backend has pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ } else {
+ UnlockBufHdr(buf_desc);
+ }
+ LWLockRelease(PartitionLock);
+
+ smgrcompleteaio_rc = 1; /* tell smgrcompleteaio to wait */
+ } else {
+ smgrcompleteaio_rc = 0; /* tell smgrcompleteaio to cancel */
+ }
+
+ smgrcompleteaio( smgr , (char *)&(BAiocb->BAiocbthis) , &smgrcompleteaio_rc );
+ if ( (smgrcompleteaio_rc == 0) || (smgrcompleteaio_rc == 1) ) {
+ aio_successful = 1;
+ }
+
+ /* statistics */
+ if (local_intention > 0) {
+ if (smgrcompleteaio_rc == 0) {
+ /* completed successfully and did not have to wait */
+ pgBufferUsage.aio_read_ontime++;
+ } else if (smgrcompleteaio_rc == 1) {
+ /* completed successfully and did have to wait */
+ pgBufferUsage.aio_read_waited++;
+ } else {
+ /* bad news - read failed and so buffer not usable
+ ** the buffer is still pinned so unpin and proceed with "not found" case
+ */
+ pgBufferUsage.aio_read_failed++;
+ }
+
+ /* regain locks and handle the validity of the buffer and intention regarding it */
+ LWLockAcquire(PartitionLock, LW_SHARED);
+ LockBufHdr(buf_desc);
+ BAiocb->BAiocbDependentCount--; /* unregister self as dependent */
+ } else {
+ pgBufferUsage.aio_read_wasted++; /* regardless of whether aio_successful */
+ }
+
+
+ if (local_intention > 0) {
+ /* verify the buffer is still ours and has same identity
+ ** There is one slightly tricky point here -
+ ** if there are other dependents, then each of them will perform this same check.
+ ** This is unavoidable, as the correct setting of retcode and the BM_VALID flag
+ ** is required by each dependent, so we may not leave it to the last one to do it.
+ ** It should not do any harm, and it is easier to let them all do it than to try to avoid it.
+ */
+ if ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) { /* it is still mine */
+
+ if (aio_successful) {
+ /* validate page header. If valid, then mark the buffer as valid */
+ if (PageIsVerified((Page)(BufHdrGetBlock(buf_desc)) , ((BAiocb->BAiocbthis).aio_offset/BLCKSZ))) {
+ buf_desc->flags |= BM_VALID;
+ if (BUFFERTAGS_EQUAL(origTag , buf_desc->tag)) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else {
+ retcode = BUF_INTENT_RC_CHANGED_TAG;
+ }
+ } else {
+ retcode = BUF_INTENT_RC_BADPAGE;
+ }
+ }
+ }
+ }
+
+ BAiocbDependentCount_after_aio_finished = BAiocb->BAiocbDependentCount;
+
+ /* if no dependents, then disconnect the BAiocb and update buffer header */
+ if (BAiocbDependentCount_after_aio_finished == 0 ) {
+
+
+ /* return the BufferAiocb to the free list */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+
+ }
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ /* note whether buffer is valid before unlocking spinlock */
+ valid = ((buf_desc->flags & BM_VALID) != 0);
+
+ /* if there was a disposable pin on entry to this function (i.e. marked in buffer flags)
+ ** then unmark it - refer to prologue comments talking about :
+ ** if a disposable pin is held, then :
+ ** ...
+ ** i.e. in either case, there is no longer a disposable pin after this function has completed.
+ */
+ if (pin_already_banked_by_me) {
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* if AIO not in progress, then disconnect the buffer from BAiocb and/or banked pin */
+ if (!(buf_desc->flags & BM_AIO_IN_PROGRESS)) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ }
+ /********** for debugging *****************
+ else elog(LOG, "BufCheckAsync : found BM_AIO_IN_PROGRESS when redeeming banked pin on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ ********** for debugging *****************/
+ }
+
+ /* If we are to obtain new pin, then use pin optimization - pin and unlock.
+ ** However, if the caller is the same backend who issued the aio_read,
+ ** then he ought to have obtained the pin at that time and must not acquire
+ ** a "second" one since this is logically the same read - he would have obtained
+ ** a single pin if using synchronous read and we emulate that behaviour.
+ ** It's important to understand that the caller is not aware that he already obtained a pin -
+ ** because calling PrefetchBuffer did not imply a pin -
+ ** so we must track that via the pidOfAio field in the BAiocb.
+ ** And to add one further complication :
+ ** we assume that although PrefetchBuffer pinned the buffer,
+ ** it did not increment the usage count.
+ ** (because it called PinBuffer_Locked which does not do that)
+ ** so in this case, we must increment the usage count without double-pinning.
+ ** yes, it's ugly - and there's a goto!
+ */
+ if ( (local_intention > 0)
+ || (local_intention == BUF_INTENTION_REJECT_OBTAIN_PIN)
+ ) {
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* here we really want a version of PinBuffer_Locked which updates usage count ... */
+ if ( (PrivateRefCount[buf_desc->buf_id] == 0) /* if this buffer not previously pinned by me */
+ || pin_already_banked_by_me /* or I had a disposable pin on entry */
+ ) {
+ if (strategy == NULL)
+ {
+ if (buf_desc->usage_count < BM_MAX_USAGE_COUNT)
+ buf_desc->usage_count++;
+ }
+ else
+ {
+ if (buf_desc->usage_count == 0)
+ buf_desc->usage_count = 1;
+ }
+ }
+
+ /* now pin buffer unless we have a disposable */
+ if (!disposable_pin) { /* this backend neither banked pin for aio nor pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ goto unlocked_it;
+ }
+ else
+ /* if this task previously issued the aio or pinned the buffer while waiting for aio_read to complete
+ ** and aio was unsuccessful, then release the pin
+ */
+ if ( disposable_pin
+ && (aio_successful == 0) /* aio_read failed ? */
+ ) {
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+
+ unlock_buf_header:
+ UnlockBufHdr(buf_desc);
+ unlocked_it:
+#endif /* USE_PREFETCH */
+
+ /* now do any requested pin (if not done immediately above) or unpin/forget */
+ if (local_intention == BUF_INTENTION_REJECT_KEEP_PIN) {
+ /* the caller is supposed to hold a pin already so there should be nothing to do ... */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) {
+ elog(LOG, "request to keep pin on unpinned buffer %d",buf_desc->buf_id);
+
+ valid = PinBuffer(buf_desc, strategy);
+ }
+ }
+ else
+ if ( ( (local_intention == BUF_INTENTION_REJECT_FORGET)
+ || (local_intention == BUF_INTENTION_REJECT_NOADJUST)
+ )
+ && (PrivateRefCount[buf_desc->buf_id] > 0) /* if this buffer was previously pinned by me ... */
+ ) {
+
+ if (local_intention == BUF_INTENTION_REJECT_FORGET) {
+ UnpinBuffer(buf_desc, true); /* ... then release the pin */
+ } else
+ if (local_intention == BUF_INTENTION_REJECT_NOADJUST) {
+ /* following code moved from ReleaseBuffer :
+ ** not sure why we can't simply UnpinBuffer(buf_desc, true)
+ ** but better leave it the way it was
+ */
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf_desc));
+ if (PrivateRefCount[buf_desc->buf_id] > 1) {
+ PrivateRefCount[buf_desc->buf_id]--;
+ } else {
+ UnpinBuffer(buf_desc, false);
+ }
+ }
+ }
+
+ /* if retcode has not been set to one of the unusual conditions
+ ** namely failed header validity or tag changed
+ ** then the setting of valid takes precedence
+ ** over whatever retcode may be currently set to.
+ */
+ if ( ( (retcode == BUF_INTENT_RC_INVALID_NO_AIO) || (retcode == BUF_INTENT_RC_INVALID_AIO) ) && valid) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else
+ if ((retcode == BUF_INTENT_RC_VALID) && (!valid)) {
+ if (aio_successful == -1) { /* aio not attempted */
+ retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ } else {
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+ }
+ }
+
+ return retcode;
+}
--- src/backend/storage/buffer/buf_init.c.orig 2014-05-28 08:29:09.330829297 -0400
+++ src/backend/storage/buffer/buf_init.c 2014-05-28 16:45:43.038507784 -0400
@@ -13,15 +13,89 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
-
+#include <stdlib.h> /* for getenv() */
+#include <errno.h> /* for strtoul() */
BufferDesc *BufferDescriptors;
char *BufferBlocks;
-int32 *PrivateRefCount;
+int32 *PrivateRefCount; /* array of counts per buffer of how many times this task has pinned this buffer */
+
+volatile struct BAiocbAnchor *BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+int CountInuseBAiocbs(void); /* keep compiler happy */
+void ReportFreeBAiocbs(void); /* keep compiler happy */
+
+extern int MaxConnections; /* max number of client connections which postmaster will allow */
+int numBufferAiocbs = 0; /* total number of BufferAiocbs in pool (0 <=> no async io) */
+int hwmBufferAiocbs = 0; /* high water mark of in-use BufferAiocbs in pool
+ ** (not required to be accurate, kindly maintained for us somehow by postmaster)
+ */
+
+#ifdef USE_PREFETCH
+unsigned int prefetch_dbOid = 0; /* database oid of relations on which prefetching to be done - 0 means all */
+unsigned int prefetch_bitmap_scans = 1; /* boolean whether to prefetch bitmap heap scans */
+unsigned int prefetch_heap_scans = 0; /* boolean whether to prefetch non-bitmap heap scans */
+unsigned int prefetch_sequential_index_scans = 0; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+unsigned int prefetch_index_scans = 256; /* boolean whether to prefetch non-bitmap index scans; also the numeric size of pfch_list */
+unsigned int prefetch_btree_heaps = 1; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+#endif /* USE_PREFETCH */
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int maxGetBAiocbTries = 1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = 1; /* max times we will try to release a BufferAiocb back to freelist */
+
+/* locking protocol for manipulating the BufferAiocbs and FreeBAiocbs list :
+** 1. ownership of a BufferAiocb :
+** to gain ownership of a BufferAiocb, a task must
+** EITHER remove it from FreeBAiocbs (it is now temporary owner and no other task can find it)
+** if decision is to attach it to a buffer descriptor header, then
+** . lock the buffer descriptor header
+** . check NOT flags & BM_AIO_IN_PROGRESS
+** . attach to buffer descriptor header
+** . increment the BufferAiocb.dependent_count
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to unlock
+** OR locate it by dereferencing the pointer in a buffer descriptor,
+** in which case :
+** . lock the buffer descriptor header
+** . check flags & BM_AIO_IN_PROGRESS
+** . increment the BufferAiocb.dependent_count
+** . if decision is to return to FreeBAiocbs,
+** then (with buffer descriptor header still locked)
+** . turn off BM_AIO_IN_PROGRESS
+** . IF the BufferAiocb.dependent_count == 1 (I am sole dependent)
+** . THEN
+** . . decrement the BufferAiocb.dependent_count
+** . return to FreeBAiocbs (see below)
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to either return to FreeBAiocbs or unlock
+** 2. adding and removing from FreeBAiocbs :
+** two alternative methods - controlled by conditional macro definition LOCK_BAIOCB_FOR_GET_REL
+** 2.1 LOCK_BAIOCB_FOR_GET_REL is defined - use a lock
+** . lock BufFreelistLock exclusive
+** . add / remove from FreeBAiocbs
+** . unlock BufFreelistLock exclusive
+** advantage of this method - never fails to add or remove
+** 2.2 LOCK_BAIOCB_FOR_GET_REL is not defined - use compare_and_swap
+** . retrieve the current Freelist pointer and validate
+** . compare_and_swap on/off the FreeBAiocb list
+** (no BufFreelistLock is taken in this method)
+** advantage of this method - never waits
+** to avoid losing a free BAiocb when the swap fails, save it in a process-local cache and reuse it
+*/
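As an aside, here is a minimal standalone sketch (not part of the patch) of the method described in 2.2 above: a LIFO free list manipulated with the GCC __sync_bool_compare_and_swap builtin, which is presumably the kind of atomic USE_AIO_ATOMIC_BUILTIN_COMP_SWAP refers to. The node type, list name and main() are invented for illustration.

    #include <stddef.h>
    #include <stdio.h>

    struct node
    {
        struct node *next;
        int          payload;
    };

    static struct node *freelist;   /* head of the LIFO free list */

    /* push: a single CAS attempt, which may fail under contention */
    static int
    freelist_push(struct node *n)
    {
        struct node *oldhead = freelist;

        n->next = oldhead;
        return __sync_bool_compare_and_swap(&freelist, oldhead, n) ? 0 : -1;
    }

    /* pop: returns NULL if the list is empty or the single CAS attempt loses a race */
    static struct node *
    freelist_pop(void)
    {
        struct node *oldhead = freelist;

        if (oldhead == NULL)
            return NULL;
        if (__sync_bool_compare_and_swap(&freelist, oldhead, oldhead->next))
            return oldhead;
        return NULL;
    }

    int
    main(void)
    {
        struct node a = {NULL, 1}, b = {NULL, 2};
        struct node *got;

        freelist_push(&a);
        freelist_push(&b);
        got = freelist_pop();
        printf("popped payload %d\n", got ? got->payload : -1);
        return 0;
    }

A single CAS attempt can lose a race (and a naive pop like this is exposed to the classic ABA problem), which is presumably why the patch bounds its attempts with maxGetBAiocbTries/maxRelBAiocbTries and parks a BAiocb in the process-local cache when releasing it to the free list fails.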
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ struct BAiocbAnchor dummy_BAiocbAnchr = { (struct BufferAiocb*)0 , (struct BufferAiocb*)0 };
+int maxGetBAiocbTries = -1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = -1; /* max times we will try to release a BufferAiocb back to freelist */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Data Structures:
@@ -73,7 +147,14 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs
+ , foundAiocbs
+ ;
+#if defined(USE_PREFETCH) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+ char *envvarpointer = (char *)0; /* might point to an environment variable string */
+ char *charptr;
+#endif /* USE_PREFETCH || USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
BufferDescriptors = (BufferDesc *)
ShmemInitStruct("Buffer Descriptors",
@@ -83,6 +164,142 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+ if (max_async_io_prefetchers < 0) { /* negative value indicates to initialize to something sensible during buf_init */
+ max_async_io_prefetchers = MaxConnections/6; /* default allows for average of MaxConnections/6 concurrent prefetchers - reasonable ??? */
+ }
+
+ if ((target_prefetch_pages > 0) && (max_async_io_prefetchers > 0)) {
+ int ix;
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb * volatile FreeBAiocbs;
+
+ numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers); /* target_prefetch_pages per prefetcher */
+ BAiocbAnchr = (struct BAiocbAnchor *)
+ ShmemInitStruct("Buffer Aiocbs",
+ sizeof(struct BAiocbAnchor) + (numBufferAiocbs * sizeof(struct BufferAiocb)), &foundAiocbs);
+ if (BAiocbAnchr) {
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs = (struct BufferAiocb*)(((char *)BAiocbAnchr) + sizeof(struct BAiocbAnchor));
+ FreeBAiocbs = (struct BufferAiocb*)0;
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbnext = FreeBAiocbs; /* init the free list, last one -> 0 */
+ (BufferAiocbs+ix)->BAiocbbufh = (struct sbufdesc*)0;
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0;
+ (BufferAiocbs+ix)->pidOfAio = 0;
+ FreeBAiocbs = (BufferAiocbs+ix);
+
+ }
+ BAiocbAnchr->FreeBAiocbs = FreeBAiocbs;
+ envvarpointer = getenv("PG_MAX_GET_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxGetBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ envvarpointer = getenv("PG_MAX_REL_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxRelBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+
+ /* init the aio subsystem max number of threads and max number of requests
+ ** max number of threads <--> max_async_io_prefetchers
+ ** max number of requests <--> numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers)
+ ** there is no return code so we just hope.
+ */
+ smgrinitaio(max_async_io_prefetchers , numBufferAiocbs);
+
+ }
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ BAiocbAnchr = &dummy_BAiocbAnchr;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
+#ifdef USE_PREFETCH
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BITMAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_ISCAN");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_index_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_index_scans = 1;
+ } else
+ if ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) ) {
+ prefetch_index_scans = strtol(envvarpointer, &charptr, 10);
+ if (charptr && (',' == *charptr)) { /* optional sequential prefetch in index scans */
+ charptr++; /* following the comma ... */
+ if ( ('Y' == *charptr) || ('y' == *charptr) || ('1' == *charptr) ) {
+ prefetch_sequential_index_scans = 1;
+ }
+ }
+ }
+ /* if prefetching for ISCAN, then we require size of pfch_list to be at least target_prefetch_pages */
+ if ( (prefetch_index_scans > 0)
+ && (prefetch_index_scans < target_prefetch_pages)
+ ) {
+ prefetch_index_scans = target_prefetch_pages;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BTREE");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_btree_heaps = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_btree_heaps = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_HEAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_heap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_heap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_PREFETCH_DBOID");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ errno = 0; /* required in order to distinguish error from 0 */
+ prefetch_dbOid = (unsigned int)strtoul((const char *)envvarpointer, 0, 10);
+ if (errno) {
+ prefetch_dbOid = 0;
+ }
+ }
+ elog(LOG, "prefetching initialised with target_prefetch_pages= %d "
+ ", max_async_io_prefetchers= %d implying aio concurrency= %d "
+ ", prefetching_for_bitmap= %s "
+ ", prefetching_for_heap= %s "
+ ", prefetching_for_iscan= %d with sequential_index_page_prefetching= %s "
+ ", prefetching_for_btree= %s"
+ ,target_prefetch_pages ,max_async_io_prefetchers ,numBufferAiocbs
+ ,(prefetch_bitmap_scans ? "Y" : "N")
+ ,(prefetch_heap_scans ? "Y" : "N")
+ ,prefetch_index_scans
+ ,(prefetch_sequential_index_scans ? "Y" : "N")
+ ,(prefetch_btree_heaps ? "Y" : "N")
+ );
+#endif /* USE_PREFETCH */
+
+
if (foundDescs || foundBufs)
{
/* both should be present or neither */
@@ -176,3 +393,80 @@ BufferShmemSize(void)
return size;
}
+
+/* imprecise count of number of in-use BAiocbs at any time
+ * we scan the array read-only without latching, so the result may be unstable
+ * (but since the array is in well-known contiguous storage,
+ * we are not subject to segmentation violation).
+ * This function may be called at any time and just does its best;
+ * it returns the count of what it saw.
+ */
+int
+CountInuseBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ int count = 0;
+ int ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->BufferAiocbs; /* start of list */
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == (BAiocb+ix)->BAiocbnext) { /* not on freelist ? */
+ count++;
+ }
+ }
+ }
+ return count;
+}
+
+/*
+ * report how many free BAiocbs at shutdown
+ * DO NOT call this while backends are actively working!!
+ * this report is useful when compare_and_swap method used (see above)
+ * as it can be used to deduce how many BAiocbs were in process-local caches -
+ * (original_number_on_freelist_at_startup - this_reported_number_at_shutdown)
+ */
+void
+ReportFreeBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ volatile struct BufferAiocb *BufferAiocbs;
+ int count = 0;
+ int fx , ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->FreeBAiocbs; /* start of free list */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0; /* use this as marker for finding it on freelist */
+ }
+ for (fx = (numBufferAiocbs-1); ( (fx>=0) && ( BAiocb != (struct BufferAiocb*)0 ) ); fx--) {
+
+ /* check if it is a valid BufferAiocb */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((BufferAiocbs+ix) == BAiocb) { /* is it this one ? */
+ break;
+ }
+ }
+ if (ix >= 0) {
+ if (BAiocb->BAiocbDependentCount) { /* seen it already ? */
+ elog(LOG, "ReportFreeBAiocbs closed cycle on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ BAiocb->BAiocbDependentCount = 1; /* use this as marker for finding it on freelist */
+ count++;
+ BAiocb = BAiocb->BAiocbnext;
+ } else {
+ elog(LOG, "ReportFreeBAiocbs invalid item on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ }
+ }
+ elog(LOG, "ReportFreeBAiocbs AIO control block list : poolsize= %d in-use-hwm= %d final-free= %d" ,numBufferAiocbs , hwmBufferAiocbs , count);
+}
--- src/backend/storage/smgr/md.c.orig 2014-05-28 08:29:09.338829292 -0400
+++ src/backend/storage/smgr/md.c 2014-05-28 16:45:43.070507912 -0400
@@ -647,6 +647,62 @@ mdprefetch(SMgrRelation reln, ForkNumber
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * mdinitaio() -- init the aio subsystem max number of threads and max number of requests
+ */
+void
+mdinitaio(int max_aio_threads, int max_aio_num)
+{
+ FileInitaio( max_aio_threads, max_aio_num );
+}
+
+/*
+ * mdstartaio() -- start aio read of the specified block of a relation
+ */
+void
+mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+#ifdef USE_PREFETCH
+ off_t seekpos;
+ MdfdVec *v;
+ int local_retcode;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+
+ seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ local_retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ , aiocbp);
+ if (retcode) {
+ *retcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+
+
+/*
+ * mdcompleteaio() -- complete aio read of the specified block of a relation
+ * on entry, *inoutcode should indicate :
+ * . non-0 <=> check if complete and wait if not
+ * . 0 <=> cancel io immediately
+ */
+void
+mdcompleteaio( char *aiocbp , int *inoutcode )
+{
+#ifdef USE_PREFETCH
+ int local_retcode;
+
+ local_retcode = FileCompleteaio(aiocbp, (inoutcode ? *inoutcode : 0));
+ if (inoutcode) {
+ *inoutcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
/*
* mdread() -- Read the specified block from a relation.
*/
--- src/backend/storage/smgr/smgr.c.orig 2014-05-28 08:29:09.338829292 -0400
+++ src/backend/storage/smgr/smgr.c 2014-05-28 16:45:43.094508008 -0400
@@ -49,6 +49,12 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ void (*smgr_initaio) (int max_aio_threads, int max_aio_num);
+ void (*smgr_startaio) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode );
+ void (*smgr_completeaio) ( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
@@ -66,7 +72,11 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+ mdprefetch
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ,mdinitaio, mdstartaio, mdcompleteaio
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ , mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
mdpreckpt, mdsync, mdpostckpt
}
};
@@ -612,6 +622,35 @@ smgrprefetch(SMgrRelation reln, ForkNumb
(*(smgrsw[reln->smgr_which].smgr_prefetch)) (reln, forknum, blocknum);
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * smgrinitaio() -- initialize the aio subsystem max number of threads and max number of requests
+ */
+void
+smgrinitaio(int max_aio_threads, int max_aio_num)
+{
+ (*(smgrsw[0].smgr_initaio)) ( max_aio_threads, max_aio_num );
+}
+
+/*
+ * smgrstartaio() -- Initiate aio read of the specified block of a relation.
+ */
+void
+smgrstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_startaio)) (reln, forknum, blocknum , aiocbp , retcode );
+}
+
+/*
+ * smgrcompleteaio() -- Complete aio read of the specified block of a relation.
+ */
+void
+smgrcompleteaio(SMgrRelation reln, char *aiocbp , int *inoutcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_completeaio)) ( aiocbp , inoutcode );
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* smgrread() -- read a particular block from a relation into the supplied
* buffer.
--- src/backend/storage/file/fd.c.orig 2014-05-28 08:29:09.334829294 -0400
+++ src/backend/storage/file/fd.c 2014-05-28 16:45:43.122508122 -0400
@@ -77,6 +77,9 @@
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* We must leave some file descriptors free for system(), the dynamic loader,
@@ -1239,6 +1242,10 @@ FileClose(File file)
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
* to read into.
+ * Also note that a new, different implementation of asynchronous prefetch
+ * using librt, not libaio, is provided by the two functions following this one,
+ * FileStartaio and FileCompleteaio. These also require a buffer to be provided
+ * to read into, which the new async_io support provides.
*/
int
FilePrefetch(File file, off_t offset, int amount)
@@ -1266,6 +1273,139 @@ FilePrefetch(File file, off_t offset, in
#endif
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * FileInitaio - initialize the aio subsystem max number of threads and max number of requests
+ * input parms
+ * max_aio_threads; maximum number of threads
+ * max_aio_num; maximum number of concurrent aio read requests
+ *
+ * on linux, the man page for the librt implementation of aio_init() says :
+ * This function is a GNU extension.
+ * If your posix aio does not have it, then add the following line to
+ * src/include/pg_config_manual.h
+ * #define DONT_HAVE_AIO_INIT
+ * to render it as a no-op
+ */
+void
+FileInitaio(int max_aio_threads, int max_aio_num )
+{
+#ifndef DONT_HAVE_AIO_INIT
+ struct aioinit aioinit_struct; /* structure to pass to aio_init */
+
+ aioinit_struct.aio_threads = max_aio_threads; /* maximum number of threads */
+ aioinit_struct.aio_num = max_aio_num; /* maximum number of concurrent aio read requests */
+ aioinit_struct.aio_idle_time = 1; /* we don't want to alter this, but aio_init does not ignore it, so set it to the default */
+ aio_init(&aioinit_struct);
+#endif /* ndef DONT_HAVE_AIO_INIT */
+ return;
+}
+
+/*
+ * FileStartaio - initiate asynchronous read of a given range of the file.
+ * The logical seek position is unaffected.
+ *
+ * use standard posix aio (librt)
+ * ASSUME the caller has already set the aiocb's aio_buf to point at the buffer
+ * return 0 if successfully started, else non-zero
+ */
+int
+FileStartaio(File file, off_t offset, int amount , char *aiocbp )
+{
+ int returnCode;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartaio: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode >= 0) {
+
+ my_aiocbp->aio_fildes = VfdCache[file].fd;
+ my_aiocbp->aio_lio_opcode = LIO_READ;
+ my_aiocbp->aio_nbytes = amount;
+ my_aiocbp->aio_offset = offset;
+ returnCode = aio_read(my_aiocbp);
+ }
+
+ return returnCode;
+}
+
+/*
+ * FileCompleteaio - complete asynchronous aio read
+ * normal_wait indicates whether to cancel or wait -
+ * 0 <=> cancel
+ * 1 <=> wait
+ *
+ * use standard posix aio (librt)
+ * return 0 if successful and did not have to wait,
+ * 1 if successful and had to wait,
+ * else x'ff'
+ */
+int
+FileCompleteaio( char *aiocbp , int normal_wait )
+{
+ int returnCode;
+ int aio_errno;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+ const struct aiocb *cblist[1];
+ int fd;
+ struct timespec my_timeout = { 0 , 10000 };
+ int max_polls;
+
+ fd = my_aiocbp->aio_fildes;
+ cblist[0] = my_aiocbp;
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* note that aio_error returns 0 if op already completed successfully */
+
+ /* first handle normal case of waiting for op to complete */
+ if (normal_wait) {
+ while (aio_errno == EINPROGRESS) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , &my_timeout);
+ while ((returnCode < 0) && (EAGAIN == errno) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , &my_timeout);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* now returnCode is from aio_error */
+ if (returnCode == 0) {
+ returnCode = 1; /* successful but had to wait */
+ }
+ }
+ if (aio_errno) {
+ elog(LOG, "FileCompleteaio: %d %d", fd, returnCode);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ } else {
+ if (aio_errno == EINPROGRESS) {
+ do {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ while ((returnCode == AIO_NOTCANCELED) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ } while (aio_errno == EINPROGRESS);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ if (returnCode != 0)
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+ DO_DB(elog(LOG, "FileCompleteaio: %d %d",
+ fd, returnCode));
+
+ return returnCode;
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
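For readers unfamiliar with librt, here is a minimal standalone example (not part of the patch) of the POSIX AIO calls that FileStartaio and FileCompleteaio wrap: aio_read to start the read, aio_error/aio_suspend to wait for completion, plus aio_return to show the byte count (FileCompleteaio itself only checks aio_error, or aio_cancel on the reject path). The file name and block size are arbitrary; on glibc, compile with -lrt.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int
    main(void)
    {
        static char         buf[BLCKSZ];
        struct aiocb        cb;
        const struct aiocb *cblist[1];
        int                 fd, err;

        fd = open("/etc/hosts", O_RDONLY);      /* any readable file will do */
        if (fd < 0)
            return 1;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0)                 /* start the asynchronous read */
            return 1;

        cblist[0] = &cb;
        while ((err = aio_error(&cb)) == EINPROGRESS)
            aio_suspend(cblist, 1, NULL);       /* wait; FileCompleteaio uses a short timeout and re-polls */

        if (err == 0)
            printf("read %zd bytes asynchronously\n", aio_return(&cb));
        else
            printf("aio failed: %s\n", strerror(err));

        close(fd);
        return 0;
    }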
+
int
FileRead(File file, char *buffer, int amount)
{
--- src/backend/storage/lmgr/proc.c.orig 2014-05-28 08:29:09.338829292 -0400
+++ src/backend/storage/lmgr/proc.c 2014-05-28 16:45:43.146508219 -0400
@@ -52,6 +52,7 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
+extern pid_t this_backend_pid; /* pid of this backend */
/* GUC variables */
int DeadlockTimeout = 1000;
@@ -361,6 +362,7 @@ InitProcess(void)
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
+ this_backend_pid = getpid(); /* pid of this backend */
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
--- src/backend/access/heap/heapam.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/heap/heapam.c 2014-05-28 16:45:43.202508444 -0400
@@ -71,6 +71,28 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "executor/instrument.h"
+
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_heap_scans; /* boolean whether to prefetch non-bitmap heap scans */
+
+/* special values for scan->rs_prefetch_target indicating as follows : */
+#define PREFETCH_MAYBE 0xffffffff /* prefetch permitted but not yet in effect */
+#define PREFETCH_DISABLED 0xfffffffe /* prefetch disabled and not permitted */
+/* PREFETCH_WRAP_POINT indicates a prefetcher which has reached the point where the scan would wrap -
+** at this point the prefetcher runs on the spot until scan catches up.
+** This *must* be < maximum valid setting of target_prefetch_pages aka effective_io_concurrency.
+*/
+#define PREFETCH_WRAP_POINT 0x0fffffff
+
+#endif /* USE_PREFETCH */
+
/* GUC variable */
bool synchronize_seqscans = true;
@@ -115,6 +137,8 @@ static XLogRecPtr log_heap_new_cid(Relat
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
bool *copy);
+static void heap_unread_add(HeapScanDesc scan, BlockNumber blockno);
+static void heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -292,9 +316,148 @@ initscan(HeapScanDesc scan, ScanKey key,
* Currently, we don't have a stats counter for bitmap heap scans (but the
* underlying bitmap index scans will be counted).
*/
- if (!scan->rs_bitmapscan)
+#ifdef USE_PREFETCH
+ /* by default, no prefetching on any scan */
+ scan->rs_prefetch_target = PREFETCH_DISABLED; /* tentatively disable */
+ scan->rs_pfchblock = 0; /* scanner will reset this to be ahead of scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)0; /* list of prefetched but unread blocknos */
+ scan->rs_Unread_Pfetched_next = 0; /* next unread blockno */
+ scan->rs_Unread_Pfetched_count = 0; /* number of valid unread blocknos */
+#endif /* USE_PREFETCH */
+ if (!scan->rs_bitmapscan) {
+
pgstat_count_heap_scan(scan->rs_rd);
+#ifdef USE_PREFETCH
+ /* bitmap scans do their own prefetching -
+ ** for others, set up prefetching now
+ */
+ if ( prefetch_heap_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(scan->rs_rd))
+ ) {
+ /* prefetch_dbOid may be set to a database Oid to specify only prefetch in that db */
+ if ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ ) {
+ scan->rs_prefetch_target = PREFETCH_MAYBE; /* permitted but let the scan decide */
+ }
+ else {
+ }
+ }
+#endif /* USE_PREFETCH */
+ }
+}
+
+/* add this blockno to the list of prefetched and unread blocknos.
+** use the slot at index ((next+count) modulo circumference) if it is unused,
+** else search for the first available slot if there is one,
+** else error.
+*/
+static void
+heap_unread_add(HeapScanDesc scan, BlockNumber blockno)
+{
+ BlockNumber *available_P; /* where to store new blockno */
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next
+ + scan->rs_Unread_Pfetched_count; /* index of next unused slot */
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if (blockno != InvalidBlockNumber) {
+
+ /* ensure there is some room somewhere */
+ if (scan->rs_Unread_Pfetched_count < target_prefetch_pages) {
+
+ /* try the "next+count" one */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages; /* modulo circumference */
+ }
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ goto store_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ /* before storing this blockno,
+ ** since the next pointer did not locate an unused slot,
+ ** set it to one which is more likely to be so for the next time
+ */
+ scan->rs_Unread_Pfetched_next = Unread_Pfetched_index;
+ goto store_blockno;
+ }
+ }
+ }
+ }
+
+ /* if we reach here, either there was no available slot
+ ** or we thought there was one and didn't find any.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("heap_unread_add overflowed list cannot add blockno %d", blockno)));
+
+ }
+
+ return; /* blockno was InvalidBlockNumber - nothing to add */
+
+ store_blockno:
+ *available_P = blockno;
+ scan->rs_Unread_Pfetched_count++; /* update count */
+
+ return;
+}
+
+/* remove specified blockno from the list of prefetched and unread blocknos.
+** Usually this will be found at the rs_Unread_Pfetched_next item -
+** else search for it. If not found, ignore it - no error results.
+*/
+static void
+heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno)
+{
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next; /* index of next unread blockno */
+ BlockNumber *candidate_P; /* location of callers blockno - maybe */
+ BlockNumber nextUnreadPfetched;
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if ( (blockno != InvalidBlockNumber)
+ && ( scan->rs_Unread_Pfetched_count > 0 ) /* if the list is not empty */
+ ) {
+
+ /* take modulo of the circumference.
+ ** actually rs_Unread_Pfetched_next should never exceed the circumference but check anyway.
+ */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages;
}
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);
+ nextUnreadPfetched = *candidate_P;
+
+ if ( nextUnreadPfetched == blockno ) {
+ goto remove_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* candidate location of the blockno */
+ if (*candidate_P == blockno) { /* found it ? */
+ goto remove_blockno;
+ }
+ }
+ }
+
+ remove_blockno:
+ *candidate_P = InvalidBlockNumber;
+
+ scan->rs_Unread_Pfetched_next = (Unread_Pfetched_index+1); /* update next pfchd unread */
+ if (scan->rs_Unread_Pfetched_next >= target_prefetch_pages) {
+ scan->rs_Unread_Pfetched_next = 0;
+ }
+ scan->rs_Unread_Pfetched_count--; /* update count */
+ }
+
+ return;
+}
+
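The two helpers above maintain a small fixed-size circular list of prefetched-but-unread block numbers. Here is a standalone sketch (not part of the patch) of the same idea: try the slot at ((next + count) mod N) first, fall back to a linear search, and use a sentinel for empty slots. N, the names and main() are illustrative; the patch uses target_prefetch_pages and InvalidBlockNumber instead.

    #include <stdint.h>
    #include <stdio.h>

    #define N      8
    #define EMPTY  UINT32_MAX

    static uint32_t slots[N];
    static unsigned next_slot;      /* index of the oldest unread entry */
    static unsigned count;          /* number of occupied slots */

    static int
    unread_add(uint32_t blockno)
    {
        unsigned ix = (next_slot + count) % N;

        if (count >= N)
            return -1;                          /* list full */
        if (slots[ix] != EMPTY)                 /* preferred slot taken: linear search */
            for (ix = 0; ix < N && slots[ix] != EMPTY; ix++)
                ;                               /* count < N guarantees an empty slot exists */
        slots[ix] = blockno;
        count++;
        return 0;
    }

    static void
    unread_remove(uint32_t blockno)
    {
        unsigned ix = next_slot % N;

        if (count == 0)
            return;
        if (slots[ix] != blockno)               /* not the oldest entry: linear search */
            for (ix = 0; ix < N && slots[ix] != blockno; ix++)
                ;
        if (ix >= N)
            return;                             /* not present: ignore, as the patch does */
        slots[ix] = EMPTY;
        next_slot = (ix + 1) % N;
        count--;
    }

    int
    main(void)
    {
        unsigned i;

        for (i = 0; i < N; i++)
            slots[i] = EMPTY;
        unread_add(100);
        unread_add(101);
        unread_remove(100);
        printf("count=%u next=%u\n", count, next_slot);   /* prints count=1 next=1 */
        return 0;
    }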
/*
* heapgetpage - subroutine for heapgettup()
@@ -304,7 +467,7 @@ initscan(HeapScanDesc scan, ScanKey key,
* which tuples on the page are visible.
*/
static void
-heapgetpage(HeapScanDesc scan, BlockNumber page)
+heapgetpage(HeapScanDesc scan, BlockNumber page , BlockNumber prefetchHWM)
{
Buffer buffer;
Snapshot snapshot;
@@ -314,6 +477,10 @@ heapgetpage(HeapScanDesc scan, BlockNumb
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
+#ifdef USE_PREFETCH
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+#endif /* USE_PREFETCH */
+
Assert(page < scan->rs_nblocks);
@@ -336,6 +503,98 @@ heapgetpage(HeapScanDesc scan, BlockNumb
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
+#ifdef USE_PREFETCH
+
+ heap_unread_subtract(scan, page);
+
+ /* maybe prefetch some pages starting with rs_pfchblock */
+ if (scan->rs_prefetch_target >= 0) { /* prefetching enabled on this scan ? */
+ int next_block_to_be_read = (page+1); /* next block to be read = lowest possible prefetchable block */
+ int num_to_pfch_this_time; /* eventually holds the number of blocks to prefetch now */
+ int prefetchable_range; /* size of the area ahead of the current prefetch position */
+
+ /* check if prefetcher reached wrap point and the scan has now wrapped */
+ if ( (page == 0) && (scan->rs_prefetch_target == PREFETCH_WRAP_POINT) ) {
+ scan->rs_prefetch_target = 1;
+ scan->rs_pfchblock = next_block_to_be_read;
+ } else
+ if (scan->rs_pfchblock < next_block_to_be_read) {
+ scan->rs_pfchblock = next_block_to_be_read; /* next block to be prefetched must be ahead of one we just read */
+ }
+
+ /* now we know where we would start prefetching -
+ ** next question - if this is a sync scan, ensure we do not prefetch behind the HWM
+ ** debatable whether to require strict inequality or >= - >= works better in practice
+ */
+ if ( (!scan->rs_syncscan) || (scan->rs_pfchblock >= prefetchHWM) ) {
+
+ /* now we know where we will start prefetching -
+ ** next question - how many?
+ ** apply two limits :
+ ** 1. target prefetch distance
+ ** 2. number of available blocks ahead of us
+ */
+
+ /* 1. target prefetch distance */
+ num_to_pfch_this_time = next_block_to_be_read + scan->rs_prefetch_target; /* page beyond prefetch target */
+ num_to_pfch_this_time -= scan->rs_pfchblock; /* convert to offset */
+
+ /* first do prefetching up to our current limit ...
+ ** highest page number that a scan (pre)-fetches is scan->rs_nblocks-1
+ ** note - prefetcher does not wrap a prefetch range -
+ ** instead it just stops and then starts again if and when the main scan wraps
+ */
+ if (scan->rs_pfchblock <= scan->rs_startblock) { /* if on second leg towards startblock */
+ prefetchable_range = ((int)(scan->rs_startblock) - (int)(scan->rs_pfchblock));
+ }
+ else { /* on first leg towards nblocks */
+ prefetchable_range = ((int)(scan->rs_nblocks) - (int)(scan->rs_pfchblock));
+ }
+ if (prefetchable_range > 0) { /* if there's a range to prefetch */
+
+ /* 2. number of available blocks ahead of us */
+ if (num_to_pfch_this_time > prefetchable_range) {
+ num_to_pfch_this_time = prefetchable_range;
+ }
+ while (num_to_pfch_this_time-- > 0) {
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_pfchblock, scan->rs_strategy);
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ if (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) {
+ heap_unread_add(scan, scan->rs_pfchblock);
+ }
+ scan->rs_pfchblock++;
+ /* if syncscan and requested block was already in buffer pool,
+ ** this suggests that another scanner is ahead of us and we should advance
+ */
+ if ( (scan->rs_syncscan) && (PrefetchBufferRc & PREFTCHRC_BLK_ALREADY_PRESENT) ) {
+ scan->rs_pfchblock++;
+ num_to_pfch_this_time--;
+ }
+ }
+ }
+ else {
+ /* we must not modify scan->rs_pfchblock here
+ ** because it is needed for possible DiscardBuffer at end of scan ...
+ ** ... instead ...
+ */
+ scan->rs_prefetch_target = PREFETCH_WRAP_POINT; /* mark this prefetcher as waiting to wrap */
+ }
+
+ /* ... then adjust prefetching limit : by doubling on each iteration */
+ if (scan->rs_prefetch_target == 0) {
+ scan->rs_prefetch_target = 1;
+ }
+ else {
+ scan->rs_prefetch_target *= 2;
+ if (scan->rs_prefetch_target > target_prefetch_pages) {
+ scan->rs_prefetch_target = target_prefetch_pages;
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
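The tail of the block above ramps rs_prefetch_target up by doubling after each page until it hits target_prefetch_pages. A tiny standalone sketch (not part of the patch), with 32 standing in for target_prefetch_pages, shows the resulting ramp:

    #include <stdio.h>

    int
    main(void)
    {
        int target_prefetch_pages = 32;     /* illustrative cap */
        int prefetch_target = 1;
        int page;

        for (page = 0; page < 8; page++)
        {
            printf("page %d: prefetch up to %d blocks ahead\n", page, prefetch_target);
            prefetch_target *= 2;
            if (prefetch_target > target_prefetch_pages)
                prefetch_target = target_prefetch_pages;
        }
        return 0;                           /* prints 1, 2, 4, 8, 16, 32, 32, 32 */
    }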
+
+
if (!scan->rs_pageatatime)
return;
@@ -452,6 +711,8 @@ heapgettup(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineoff, given scan direction
@@ -470,7 +731,25 @@ heapgettup(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineoff = FirstOffsetNumber; /* first offnum */
scan->rs_inited = true;
}
@@ -516,7 +795,7 @@ heapgettup(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -557,7 +836,7 @@ heapgettup(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -660,8 +939,10 @@ heapgettup(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -671,6 +952,22 @@ heapgettup(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -678,7 +975,7 @@ heapgettup(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
@@ -727,6 +1024,8 @@ heapgettup_pagemode(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineindex, given scan direction
@@ -745,7 +1044,25 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineindex = 0;
scan->rs_inited = true;
}
@@ -788,7 +1105,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -826,7 +1143,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -921,8 +1238,10 @@ heapgettup_pagemode(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -932,6 +1251,22 @@ heapgettup_pagemode(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -939,7 +1274,7 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
dp = (Page) BufferGetPage(scan->rs_cbuf);
lines = scan->rs_ntuples;
@@ -1394,6 +1729,23 @@ void
heap_rescan(HeapScanDesc scan,
ScanKey key)
{
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1418,6 +1770,23 @@ heap_endscan(HeapScanDesc scan)
{
/* Note: no locking manipulations needed */
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1435,6 +1804,10 @@ heap_endscan(HeapScanDesc scan)
if (scan->rs_strategy != NULL)
FreeAccessStrategy(scan->rs_strategy);
+ if (scan->rs_Unread_Pfetched_base) {
+ pfree(scan->rs_Unread_Pfetched_base);
+ }
+
if (scan->rs_temp_snap)
UnregisterSnapshot(scan->rs_snapshot);
@@ -1464,7 +1837,6 @@ heap_endscan(HeapScanDesc scan)
#define HEAPDEBUG_3
#endif /* !defined(HEAPDEBUGALL) */
-
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
@@ -6347,6 +6719,25 @@ heap_markpos(HeapScanDesc scan)
void
heap_restrpos(HeapScanDesc scan)
{
+
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* XXX no amrestrpos checking that ammarkpos called */
if (!ItemPointerIsValid(&scan->rs_mctid))
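
The same discard loop for prefetched-but-unread blocks appears verbatim in
heapgettup(), heapgettup_pagemode(), heap_rescan(), heap_endscan() and
heap_restrpos() above. A minimal sketch of factoring it into one static
helper - the helper name is hypothetical, the loop body is lifted from the
hunks above:

    #ifdef USE_PREFETCH
    /* hypothetical helper, not part of the patch: discard whatever this scan
    ** prefetched but never read
    */
    static void
    heap_discard_unread_prefetches(HeapScanDesc scan)
    {
        BlockNumber *base = scan->rs_Unread_Pfetched_base;
        unsigned int next = scan->rs_Unread_Pfetched_next;
        unsigned int count = scan->rs_Unread_Pfetched_count;

        if (scan->rs_pfchblock > 0 && scan->rs_cblock != InvalidBlockNumber)
        {
            while (count-- > 0)
            {
                DiscardBuffer(scan->rs_rd, MAIN_FORKNUM, base[next]);
                heap_unread_subtract(scan, base[next]);
                if (++next >= target_prefetch_pages)
                    next = 0;
            }
        }
    }
    #endif /* USE_PREFETCH */
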
--- src/backend/access/heap/syncscan.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/heap/syncscan.c 2014-05-28 16:45:43.254508653 -0400
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locatio
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, Block
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumbe
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumbe
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNu
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
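
On the caller side (the heapam.c hunks earlier), each sync-scan participant
passes its own prefetch position by address and gets back the reconciled
group-wide high-water mark. A condensed sketch of that exchange, assuming the
rs_pfchblock field used above (the wrapper function name is invented for the
example):

    static void
    report_position_with_hwm(HeapScanDesc scan, BlockNumber page)
    {
        if (scan->rs_syncscan)
        {
            BlockNumber prefetchHWM = scan->rs_pfchblock;   /* how far this backend has prefetched */

            ss_report_location(scan->rs_rd, page, &prefetchHWM);

            /* prefetchHWM now holds the larger of our own position and the
            ** HWM previously recorded for this relfilenode by any participant
            */
        }
    }
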
--- src/backend/access/index/indexam.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/index/indexam.c 2014-05-28 16:45:43.298508831 -0400
@@ -79,6 +79,55 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit);
+
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; also the size of the pfch_block_item_list */
+
+/* if specified block number is present in the prefetch array,
+** then either mark it as not to be discarded or evict it according to input param
+*/
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ /* no need to check for scan->pfch_next < prefetch_index_scans
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) == blocknumber) {
+ if (markit) {
+ /* mark it as not to be discarded */
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard &= ~PREFTCHRC_BUF_PIN_INCREASED;
+ } else {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)), (&(((scan->pfch_block_item_list)+pfchz)->pfch_blockid)));
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard = ((scan->pfch_block_item_list)+pfchz)->pfch_discard;
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx = pfchz; /* advance */
+ }
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -253,6 +302,11 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -277,6 +331,11 @@ index_beginscan_bitmap(Relation indexRel
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -311,6 +370,9 @@ index_beginscan_internal(Relation indexR
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -342,6 +404,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -373,10 +441,30 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (((scan->pfch_block_item_list)+pfchx)->pfch_discard) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)));
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -472,6 +560,12 @@ index_getnext_tid(IndexScanDesc scan, Sc
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -479,6 +573,11 @@ index_getnext_tid(IndexScanDesc scan, Sc
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -502,6 +601,10 @@ index_getnext_tid(IndexScanDesc scan, Sc
* enough information to do it efficiently in the general case.
* ----------------
*/
+#if defined(USE_PREFETCH) && defined(AVOID_CATALOG_MIGRATION_FOR_ASYNCIO)
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -509,16 +612,105 @@ index_fetch_heap(IndexScanDesc scan)
bool all_dead = false;
bool got_heap_tuple;
+
+
/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
if (!scan->xs_continue_hot)
{
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+
+ /* If the old block is different from the new block, then evict the old
+ ** block from the prefetched array. It is arguable we should leave it
+ ** in the array, because it is likely to remain in the buffer pool
+ ** for a while; but in that case, if we encounter the block
+ ** again, prefetching it again does no harm
+ ** (and note that, if it is not pinned, prefetching it will try to
+ ** pin it, since prefetch tries to bank a pin for a buffer in the buffer pool),
+ ** so evicting here should usually win.
+ */
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < prefetch_index_scans) /* ensure there is an entry */
+ ) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(prev_buf) , 0);
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+ /* If the new block had been prefetched and pinned,
+ ** then mark that it no longer needs to be discarded.
+ ** We do not evict the entry, though,
+ ** because we want to remember that it was recently prefetched.
+ */
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 1);
+#endif /* USE_PREFETCH */
+
+ scan->heap_tids_fetched++;
+
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found; /* did we find the "next" heap tid in current index page */
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+
+ if (scan->do_prefetch) {
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple; /* is incorrect but avoids adding function to catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ GET_SCAN_PROCEDURE(ampeeknexttuple); /* is correct but requires adding function to catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
+ if ( procedure /* does the index access method support peektuple? */
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ && procedure->fn_addr /* procedure->fn_addr is non-null only if in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ ) {
+ int iterations = 1; /* how many prefetch iterations shall we try -
+ ** 2 if the number of used entries in the prefetch list is < target_prefetch_pages,
+ ** else 1.
+ ** this should result in ramping up gradually and smoothly to target_prefetch_pages
+ */
+ /* note we trust InitIndexScan verified this scan is forwards only and so set that */
+ if (scan->pfch_used < target_prefetch_pages) {
+ iterations = 2;
+ }
+ do {
+ found = DatumGetBool(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ btpeeknexttuple(scan) /* pass scan as direct parameter since cant use fmgr because not in catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ FunctionCall1(procedure, PointerGetDatum(scan)) /* use fmgr to call it because in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ );
+ if (found) {
+ /* btpeeknexttuple set pfch_next to point to the item in block_item_list to be prefetched */
+ PrefetchBufferRc = PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((&((scan->pfch_block_item_list + scan->pfch_next))->pfch_blockid)) , 0);
+ /* elog(LOG,"index_fetch_heap prefetched rel %u blockNum %u"
+ ,scan->heapRelation->rd_node.relNode ,BlockIdGetBlockNumber(scan->pfch_block_item_list + scan->pfch_next));
+ */
+
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ (scan->pfch_block_item_list + scan->pfch_next)->pfch_discard = (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED);
+
+
+ }
+ } while (--iterations > 0);
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
--- src/backend/access/index/genam.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/index/genam.c 2014-05-28 16:45:43.322508927 -0400
@@ -77,6 +77,12 @@ RelationGetIndexScan(Relation indexRelat
scan = (IndexScanDesc) palloc(sizeof(IndexScanDescData));
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
+
scan->heapRelation = NULL; /* may be set later */
scan->indexRelation = indexRelation;
scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */
@@ -139,6 +145,19 @@ RelationGetIndexScan(Relation indexRelat
void
IndexScanEnd(IndexScanDesc scan)
{
+#ifdef USE_PREFETCH
+ if (scan->do_prefetch) {
+ if ( (struct pfch_block_item*)0 != scan->pfch_block_item_list ) {
+ pfree(scan->pfch_block_item_list);
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+ }
+ if ( (struct pfch_index_pagelist*)0 != scan->pfch_index_page_list ) {
+ pfree(scan->pfch_index_page_list);
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
if (scan->keyData != NULL)
pfree(scan->keyData);
if (scan->orderByData != NULL)
--- src/backend/access/nbtree/nbtsearch.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/nbtree/nbtsearch.c 2014-05-28 16:45:43.350509042 -0400
@@ -23,13 +23,16 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+extern unsigned int prefetch_btree_heaps; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+extern unsigned int prefetch_sequential_index_scans; /* boolean whether to prefetch sequential-access non-bitmap index scans */
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
+static Buffer _bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -226,7 +229,7 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
continue;
}
@@ -1005,7 +1008,7 @@ _bt_first(IndexScanDesc scan, ScanDirect
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1040,6 +1043,8 @@ _bt_next(IndexScanDesc scan, ScanDirecti
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1052,11 +1057,53 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on the next block, provided that :
+ ** EITHER . we were already reading non-sequentially, or this block is non-sequential
+ ** OR . the user explicitly asked to prefetch sequential patterns too
+ ** (which may be counterproductive otherwise)
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
else
{
@@ -1065,11 +1112,53 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on the next block, provided that :
+ ** EITHER . we were already reading non-sequentially, or this block is non-sequential
+ ** OR . the user explicitly asked to prefetch sequential patterns too
+ ** (which may be counterproductive otherwise)
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
/* OK, itemIndex says what to return */
@@ -1119,9 +1208,11 @@ _bt_readpage(IndexScanDesc scan, ScanDir
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1156,6 +1247,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1187,6 +1279,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1224,7 +1317,7 @@ _bt_saveitem(BTScanOpaque so, int itemIn
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1278,7 +1371,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list);
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1287,9 +1380,20 @@ _bt_steppage(IndexScanDesc scan, ScanDir
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if ( prefetch && so->currPos.moreRight
+ /* start prefetch on the next page, provided that :
+ ** EITHER . we are reading non-sequentially at this point
+ ** OR . the user explicitly asked to prefetch sequential patterns too
+ ** (which may be counterproductive otherwise)
+ */
+ && (prefetch_sequential_index_scans || opaque->btpo_next != (blkno+1))
+ ) {
+ _bt_prefetchbuf(rel, opaque->btpo_next , &scan->pfch_index_page_list);
+ }
break;
}
+ }
/* nope, keep going */
blkno = opaque->btpo_next;
}
@@ -1317,7 +1421,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ so->currPos.buf = _bt_walk_left(scan , rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -1332,14 +1436,58 @@ _bt_steppage(IndexScanDesc scan, ScanDir
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and increase prefetch window blindly
+ * downwards 2 blocks at a time. This only works in our favor
+ * for index-only scans, by merging read requests at the kernel,
+ * so we want to inflate target_prefetch_pages since merged
+ * back-sequential requests are about as expensive as a single one
+ */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev , &scan->pfch_index_page_list);
+ }
+ }
break;
}
}
}
+ }
return true;
}
@@ -1359,7 +1507,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
@@ -1387,7 +1535,7 @@ _bt_walk_left(Relation rel, Buffer buf)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1631,7 +1779,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDir
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
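
The heap-prefetch gate used in the _bt_next() hunks above,
(heap_tids_seen - heap_tids_seen/16) <= heap_tids_fetched, is an integer-only
test for "at least 15/16 (93.75%) of the TIDs seen so far needed a heap
fetch". A small standalone check of that arithmetic, purely illustrative:

    #include <assert.h>

    /* the gate fires iff fetched/seen >= 15/16 (~94%), using integer arithmetic only */
    static int
    worth_prefetching_heap(unsigned long seen, unsigned long fetched)
    {
        return seen > 256 && (seen - seen / 16) <= fetched;
    }

    int
    main(void)
    {
        assert(worth_prefetching_heap(1000, 940));   /* 94% -> prefetch */
        assert(!worth_prefetching_heap(1000, 900));  /* 90% -> don't */
        assert(!worth_prefetching_heap(100, 100));   /* too few samples yet */
        return 0;
    }
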
--- src/backend/access/nbtree/nbtinsert.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/nbtree/nbtinsert.c 2014-05-28 16:45:43.394509218 -0400
@@ -793,7 +793,7 @@ _bt_insertonpg(Relation rel,
{
Assert(!P_ISLEAF(lpageop));
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -972,7 +972,7 @@ _bt_split(Relation rel, Buffer buf, Buff
bool isleaf;
/* Acquire a new page to split into */
- rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1175,7 +1175,7 @@ _bt_split(Relation rel, Buffer buf, Buff
if (!P_RIGHTMOST(oopaque))
{
- sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
spage = BufferGetPage(sbuf);
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -1817,7 +1817,7 @@ _bt_finish_split(Relation rel, Buffer lb
Assert(P_INCOMPLETE_SPLIT(lpageop));
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
@@ -1829,7 +1829,7 @@ _bt_finish_split(Relation rel, Buffer lb
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1877,7 +1877,7 @@ _bt_getstackbuf(Relation rel, BTStack st
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2008,12 +2008,12 @@ _bt_newroot(Relation rel, Buffer lbuf, B
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
--- src/backend/access/nbtree/nbtpage.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/nbtree/nbtpage.c 2014-05-28 16:45:43.426509347 -0400
@@ -127,7 +127,7 @@ _bt_getroot(Relation rel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -153,7 +153,7 @@ _bt_getroot(Relation rel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -209,7 +209,7 @@ _bt_getroot(Relation rel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -350,7 +350,7 @@ _bt_gettrueroot(Relation rel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -436,7 +436,7 @@ _bt_getrootheight(Relation rel)
Page metapg;
BTPageOpaque metaopaque;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -562,6 +562,170 @@ _bt_log_reuse_page(Relation rel, BlockNu
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ * and keep track of prefetched and unread blocknums in pagelist.
+ * input parms :
+ * rel and blockno identify block to be prefetched as usual
+ * pfch_index_page_list_P points to the pointer anchoring the head of the index page list
+ * Since the pagelist is only an optimization,
+ * handle palloc failure by quietly skipping the bookkeeping.
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P)
+{
+
+ int rc = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_item* found_item = 0;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_plp = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_plp = *pfch_index_page_list_P;
+ }
+
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* prefetch an existing block of the relation
+ ** but first, check it has not recently already been prefetched and not yet read
+ */
+ found_item = _bt_find_block(blkno , pfch_index_plp);
+ if ((struct pfch_index_item*)0 == found_item) { /* not found */
+
+ rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno , 0);
+
+ /* add the pagenum to the list , indicating its discard status ;
+ ** since it is only an optimization, ignore failures such as exceeding the allowed space
+ */
+ _bt_add_block( blkno , pfch_index_page_list_P , (uint32)(rc & PREFTCHRC_BUF_PIN_INCREASED));
+
+ }
+ }
+ return;
+}
+
+/* _bt_find_block finds the item referencing specified Block in index page list if present
+** and returns the pointer to the pfch_index_item if found, or null if not
+*/
+struct pfch_index_item*
+_bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+
+ struct pfch_index_item* found_item = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ int ix, tx;
+
+ pfch_index_plp = pfch_index_page_list;
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ ix = 0;
+ tx = pfch_index_plp->pfch_index_item_count;
+ while ( (ix < tx)
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ found_item = &pfch_index_plp->pfch_indexid[ix];
+ }
+ ix++;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+
+ return found_item;
+}
+
+/* _bt_add_block adds the specified Block to the index page list
+** and returns 0 if successful, non-zero if not
+*/
+int
+_bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status)
+{
+ int rc = 1;
+ int ix;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_pagelist* pfch_index_page_list_anchor; /* pointer to first chunk if any */
+ /* allow expansion of the pagelist to 16 chunks,
+ ** which accommodates backwards-sequential index scans,
+ ** where the scanner increases target_prefetch_pages by a factor of up to 16 -
+ ** see the code in _bt_steppage.
+ ** note - this creates an undesirable weak dependency on that number in _bt_steppage,
+ ** but :
+ ** there is no disaster if the numbers disagree - just sub-optimal use of the list.
+ ** to implement a proper interface would require that chunks have a variable size,
+ ** which would require an extra size variable in each chunk.
+ */
+ int num_chunks = 16;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_page_list_anchor = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_page_list_anchor = *pfch_index_page_list_P;
+ }
+ pfch_index_plp = pfch_index_page_list_anchor; /* pointer to current chunk */
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ if (ix < target_prefetch_pages) {
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = (ix+1);
+ rc = 0;
+ goto stored_pagenum;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ num_chunks--; /* keep track of number of chunks */
+ }
+
+ /* we did not find any free space in existing chunks -
+ ** create new chunk if within our limit and we have a pfch_index_page_list
+ */
+ if ( (num_chunks > 0) && ((struct pfch_index_pagelist*)0 != pfch_index_page_list_anchor) ) {
+ pfch_index_plp = (struct pfch_index_pagelist*)palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ if ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ pfch_index_plp->pfch_index_pagelist_next = pfch_index_page_list_anchor; /* old head of list is next after this */
+ pfch_index_plp->pfch_indexid[0].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[0].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = 1;
+ *pfch_index_page_list_P = pfch_index_plp; /* new head of list is the new chunk */
+ rc = 0;
+ }
+ }
+
+ stored_pagenum:;
+ return rc;
+}
+
+/* _bt_subtract_block removes a block from the prefetched-but-unread pagelist if present */
+void
+_bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+ struct pfch_index_pagelist* pfch_index_plp = pfch_index_page_list;
+ if ( (blkno != P_NEW) && (blkno != P_NONE) ) {
+ int ix , jx;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ /* move the last item to the current (now deleted) position and decrement count */
+ jx = (pfch_index_plp->pfch_index_item_count-1); /* index of last item ... */
+ if (jx > ix) { /* ... is not the current one so move is required */
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = pfch_index_plp->pfch_indexid[jx].pfch_blocknum;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = pfch_index_plp->pfch_indexid[jx].pfch_discard;
+ ix = jx;
+ }
+ pfch_index_plp->pfch_index_item_count = ix;
+ goto done_subtract;
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+ }
+ done_subtract: return;
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
@@ -573,7 +737,7 @@ _bt_log_reuse_page(Relation rel, BlockNu
* _bt_checkpage to sanity-check the page (except in P_NEW case).
*/
Buffer
-_bt_getbuf(Relation rel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list)
{
Buffer buf;
@@ -581,6 +745,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
{
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
+
+ /* if the block is in the prefetched-but-unread pagelist, remove it */
+ _bt_subtract_block( blkno , pfch_index_page_list);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
}
@@ -702,6 +870,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
* bufmgr when one would do. However, now it's mainly just a notational
* convenience. The only case where it saves work over _bt_relbuf/_bt_getbuf
* is when the target page is the same one already in the buffer.
+ *
+ * if prefetching of index pages is changed to use this function,
+ * then it should be extended to take the index_page_list as parameter
+ * and call _bt_subtract_block in the same way that _bt_getbuf does.
*/
Buffer
_bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
@@ -712,6 +884,7 @@ _bt_relandgetbuf(Relation rel, Buffer ob
if (BufferIsValid(obuf))
LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
return buf;
@@ -965,7 +1138,7 @@ _bt_is_page_halfdead(Relation rel, Block
BTPageOpaque opaque;
bool result;
- buf = _bt_getbuf(rel, blk, BT_READ);
+ buf = _bt_getbuf(rel, blk, BT_READ , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1069,7 +1242,7 @@ _bt_lock_branch_parent(Relation rel, Blo
Page lpage;
BTPageOpaque lopaque;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
@@ -1265,7 +1438,7 @@ _bt_pagedel(Relation rel, Buffer buf)
BTPageOpaque lopaque;
Page lpage;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
@@ -1340,7 +1513,7 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ buf = _bt_getbuf(rel, rightsib, BT_WRITE, (struct pfch_index_pagelist*)0);
}
return ndeleted;
@@ -1593,7 +1766,7 @@ _bt_unlink_halfdead_page(Relation rel, B
target = topblkno;
/* fetch the block number of the topmost parent's left sibling */
- buf = _bt_getbuf(rel, topblkno, BT_READ);
+ buf = _bt_getbuf(rel, topblkno, BT_READ, (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
@@ -1632,7 +1805,7 @@ _bt_unlink_halfdead_page(Relation rel, B
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -1646,7 +1819,7 @@ _bt_unlink_halfdead_page(Relation rel, B
RelationGetRelationName(rel));
return false;
}
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
@@ -1701,7 +1874,7 @@ _bt_unlink_halfdead_page(Relation rel, B
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
@@ -1731,7 +1904,7 @@ _bt_unlink_halfdead_page(Relation rel, B
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
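
_bt_subtract_block() above removes an entry by copying the last item into the
vacated slot and shrinking the count, rather than shifting the tail: order is
not preserved, but removal is O(1) once the entry is found. A standalone model
of that removal trick, with simplified types rather than the patch's
structures:

    #include <assert.h>

    /* model of the remove-by-swap used in _bt_subtract_block() */
    static void
    remove_block(unsigned int *blocks, int *count, unsigned int victim)
    {
        for (int ix = 0; ix < *count; ix++)
        {
            if (blocks[ix] == victim)
            {
                blocks[ix] = blocks[*count - 1];    /* last entry fills the hole */
                (*count)--;
                return;
            }
        }
    }

    int
    main(void)
    {
        unsigned int blocks[4] = {10, 20, 30, 40};
        int          count = 4;

        remove_block(blocks, &count, 20);
        assert(count == 3 && blocks[1] == 40);      /* 40 took 20's slot */
        return 0;
    }
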
--- src/backend/access/nbtree/nbtree.c.orig 2014-05-28 08:29:09.242829343 -0400
+++ src/backend/access/nbtree/nbtree.c 2014-05-28 16:45:43.450509443 -0400
@@ -30,6 +30,18 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; also the size of the pfch_block_item_list */
+#endif /* USE_PREFETCH */
+
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+);
/* Working state for btbuild and its callback */
typedef struct
@@ -332,6 +344,74 @@ btgettuple(PG_FUNCTION_ARGS)
}
/*
+ * btpeeknexttuple() -- peek at the next tuple whose block differs from every blocknum in pfch_block_item_list,
+ * without reading a new index page
+ * and without causing any side-effects such as altering values in control blocks.
+ * If such a tuple is found, store its blocknum in the next element of pfch_block_item_list.
+ */
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+)
+{
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+ /*
+ * If this scan has already been initialized, we can peek within the
+ * current page. If it has not been initialized yet, bail out.
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop handles advancing till we find different data block or end of index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < prefetch_index_scans) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= prefetch_index_scans) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((&((scan->pfch_block_item_list + scan->pfch_next)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+ PG_RETURN_BOOL(res);
+}
+
+/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
Datum
@@ -425,6 +505,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -516,6 +602,23 @@ btendscan(PG_FUNCTION_ARGS)
{
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ struct pfch_index_pagelist* pfch_index_plp;
+ int ix;
+
+#ifdef USE_PREFETCH
+
+ /* discard all prefetched but unread index pages listed in the pagelist */
+ pfch_index_plp = scan->pfch_index_page_list;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_discard) {
+ DiscardBuffer( scan->indexRelation , MAIN_FORKNUM , pfch_index_plp->pfch_indexid[ix].pfch_blocknum);
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+#endif /* USE_PREFETCH */
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
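
btpeeknexttuple() above chooses where to record a newly peeked block by
filling unused slots first and then overwriting round-robin: pfch_used grows
until it reaches prefetch_index_scans, after which pfch_next simply wraps. A
standalone model of that slot-choice policy (RING_SIZE stands in for
prefetch_index_scans):

    #include <assert.h>

    #define RING_SIZE 4     /* stands in for prefetch_index_scans */

    /* model of btpeeknexttuple()'s slot choice: fill free slots, then wrap */
    static void
    choose_slot(unsigned int *used, unsigned int *next)
    {
        if (*used < RING_SIZE)
            *next = *used;              /* take the first unused slot */
        else if (++(*next) >= RING_SIZE)
            *next = 0;                  /* wrap and overwrite the oldest choice */

        if (*used <= *next)
            *used = *next + 1;
    }

    int
    main(void)
    {
        unsigned int used = 0, next = 0;

        for (int i = 0; i < 6; i++)
            choose_slot(&used, &next);

        assert(used == RING_SIZE);      /* never exceeds the ring size */
        assert(next == 1);              /* slots 0..3 filled, then 0, then 1 */
        return 0;
    }
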
--- src/backend/nodes/tidbitmap.c.orig 2014-05-28 08:29:09.278829325 -0400
+++ src/backend/nodes/tidbitmap.c 2014-05-28 16:45:43.474509540 -0400
@@ -44,6 +44,9 @@
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "utils/hsearch.h"
+#ifdef USE_PREFETCH
+extern int target_prefetch_pages;
+#endif /* USE_PREFETCH */
/*
* The maximum number of tuples per page is not large (typically 256 with
@@ -572,7 +575,12 @@ tbm_begin_iterate(TIDBitmap *tbm)
* needs of the TBMIterateResult sub-struct.
*/
iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)
+#ifdef USE_PREFETCH
+ /* space for remembering every prefetched but unread blockno */
+ + (target_prefetch_pages * sizeof(BlockNumber))
+#endif /* USE_PREFETCH */
+ );
iterator->tbm = tbm;
/*
@@ -1020,3 +1028,68 @@ tbm_comparator(const void *left, const v
return 1;
return 0;
}
+
+void
+tbm_zero(TBMIterator *iterator) /* zero list of prefetched and unread blocknos */
+{
+ /* locate the list of prefetched but unread blocknos immediately following the array of offsets
+ ** and note that tbm_begin_iterate allocates space for (1 + MAX_TUPLES_PER_PAGE) offsets -
+ ** 1 included in struct TBMIterator and MAX_TUPLES_PER_PAGE additional
+ */
+ iterator->output.Unread_Pfetched_base = ((BlockNumber *)(&(iterator->output.offsets[MAX_TUPLES_PER_PAGE+1])));
+ iterator->output.Unread_Pfetched_next = iterator->output.Unread_Pfetched_count = 0;
+}
+
+void
+tbm_add(TBMIterator *iterator, BlockNumber blockno) /* add this blockno to list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next + iterator->output.Unread_Pfetched_count++;
+
+ if (iterator->output.Unread_Pfetched_count > target_prefetch_pages) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_add overflowed list cannot add blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index -= target_prefetch_pages;
+ *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index) = blockno;
+}
+
+void
+tbm_subtract(TBMIterator *iterator, BlockNumber blockno) /* remove this blockno from list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next++;
+ BlockNumber nextUnreadPfetched;
+
+ /* make a weak check that the next blockno is the one to be removed;
+ ** in case of disagreement we ignore the caller's blockno and remove the next one anyway,
+ ** which is really what the caller wants
+ */
+ if ( iterator->output.Unread_Pfetched_count == 0 ) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract empty list cannot subtract blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index = 0;
+ nextUnreadPfetched = *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index);
+ if ( ( nextUnreadPfetched != blockno )
+ && ( nextUnreadPfetched != InvalidBlockNumber ) /* don't report it if the block in the list was InvalidBlockNumber */
+ ) {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract will subtract blockno %d not %d",
+ nextUnreadPfetched, blockno)));
+ }
+ if (iterator->output.Unread_Pfetched_next >= target_prefetch_pages)
+ iterator->output.Unread_Pfetched_next = 0;
+ iterator->output.Unread_Pfetched_count--;
+}
+
+TBMIterateResult *
+tbm_locate_IterateResult(TBMIterator *iterator)
+{
+ return &(iterator->output);
+}
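
The three new tidbitmap entry points are presumably driven by the bitmap heap
scan node, whose changes are not part of this excerpt: tbm_zero() resets the
prefetched-but-unread ring when iteration starts, tbm_add() records each block
handed to PrefetchBuffer(), and tbm_subtract() drops a block once it is
actually read. A sketch of that assumed call pattern (the surrounding function
is invented for the example):

    /* assumed consumer-side pattern; the executor changes are not shown in this excerpt */
    static void
    bitmap_prefetch_sketch(Relation heapRel, TIDBitmap *tbm, BlockNumber prefetch_blockno)
    {
        TBMIterator      *iterator = tbm_begin_iterate(tbm);
        TBMIterateResult *tbmres;

        tbm_zero(iterator);                     /* empty the prefetched-but-unread ring */

        /* when prefetching ahead of the scan ... */
        PrefetchBuffer(heapRel, MAIN_FORKNUM, prefetch_blockno, 0);
        tbm_add(iterator, prefetch_blockno);    /* remember it until it is read */

        /* when the scan actually reaches a block ... */
        tbmres = tbm_iterate(iterator);
        if (tbmres != NULL)
            tbm_subtract(iterator, tbmres->blockno);    /* no longer unread */

        tbm_end_iterate(iterator);
    }
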
--- src/backend/utils/misc/guc.c.orig 2014-05-28 08:29:09.406829256 -0400
+++ src/backend/utils/misc/guc.c 2014-05-28 16:45:43.550509846 -0400
@@ -2264,6 +2264,25 @@ static struct config_int ConfigureNamesI
},
{
+ {"max_async_io_prefetchers",
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ PGC_USERSET,
+#else
+ PGC_INTERNAL,
+#endif
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum number of background processes concurrently using asynchronous librt threads to prefetch pages into shared memory buffers."),
+ },
+ &max_async_io_prefetchers,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ -1, 0, 8192, /* boot val -1 indicates to initialize to something sensible during buf_init */
+#else
+ 0, 0, 0,
+#endif
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
--- src/backend/utils/mmgr/aset.c.orig 2014-05-28 08:29:09.406829256 -0400
+++ src/backend/utils/mmgr/aset.c 2014-05-28 16:45:43.610510088 -0400
@@ -733,6 +733,48 @@ AllocSetAlloc(MemoryContext context, Siz
*/
fidx = AllocSetFreeIndex(size);
chunk = set->freelist[fidx];
+#ifdef MEMORY_CONTEXT_CHECKING
+ /* an instance of a segfault caused by a rogue value in set->freelist[fidx]
+ ** has been seen - check for it using a crude sanity check based on neighbours :
+ ** if at least one neighbour is sufficiently close, then pass, else fail
+ */
+ if (chunk != 0) {
+ int frx, nrx; /* frx is index, nrx is index of failing neighbour for errmsg */
+ for (nrx = -1, frx = 0; (frx < ALLOCSET_NUM_FREELISTS); frx++) {
+ if ( (frx != fidx) /* not the chosen one */
+ && ( ( (unsigned long)(set->freelist[frx]) ) != 0 ) /* not empty */
+ ) {
+ if ( ( (unsigned long)chunk < ( ( (unsigned long)(set->freelist[frx]) ) / 2 ) )
+ && ( ( (unsigned long)(set->freelist[frx]) ) < 0x4000000 )
+ /*** || ( (unsigned long)chunk > ( ( (unsigned long)(set->freelist[frx]) ) * 2 ) ) ***/
+ ) {
+ nrx = frx;
+ } else {
+ nrx = -1;
+ break;
+ }
+ }
+ }
+
+ if (nrx >= 0) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d compared with neighbour %p whose chunksize %d"
+ , chunk , fidx , set->freelist[nrx] , set->freelist[nrx]->size);
+ chunk = NULL;
+ }
+ }
+#else /* if not MEMORY_CONTEXT_CHECKING make very simple-minded check*/
+ if ( (chunk != 0) && ( (unsigned long)chunk < 0x40000 ) ) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d"
+ , chunk , fidx);
+ chunk = NULL;
+ }
+#endif
if (chunk != NULL)
{
Assert(chunk->size >= size);
--- src/include/executor/instrument.h.orig 2014-05-28 08:29:09.454829232 -0400
+++ src/include/executor/instrument.h 2014-05-28 16:45:43.798510846 -0400
@@ -28,8 +28,18 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+
+ long aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_discrd; /* # of prefetches issued but later discarded without the block being used */
+ long aio_read_forgot; /* # of prefetches issued but then forgotten (scan never came back for the block) */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
} BufferUsage;
/* Flag bits included in InstrAlloc's instrument_options bitmask */
--- src/include/storage/bufmgr.h.orig 2014-05-28 08:29:09.462829227 -0400
+++ src/include/storage/bufmgr.h 2014-05-28 16:45:43.830510976 -0400
@@ -41,6 +41,7 @@ typedef enum
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
+ ,RBM_NOREAD_FOR_PREFETCH /* Don't read from disk, don't zero buffer, find buffer only */
} ReadBufferMode;
/* in globals.c ... this duplicates miscadmin.h */
@@ -57,6 +58,9 @@ extern int target_prefetch_pages;
extern PGDLLIMPORT char *BufferBlocks;
extern PGDLLIMPORT int32 *PrivateRefCount;
+/* in buf_async.c */
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
@@ -159,9 +163,15 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
- * prototypes for functions in bufmgr.c
+ * prototypes for external functions in bufmgr.c and buf_async.c
*/
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
+extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum , BufferAccessStrategy strategy);
+/* return code is an int bitmask : */
+#define PREFTCHRC_BUF_PIN_INCREASED 0x01 /* pin count on buffer has been increased by 1 */
+#define PREFTCHRC_BLK_ALREADY_PRESENT 0x02 /* block was already present in a buffer */
+
+extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
--- src/include/storage/smgr.h.orig 2014-05-28 08:29:09.462829227 -0400
+++ src/include/storage/smgr.h 2014-05-28 16:45:43.854511072 -0400
@@ -92,6 +92,12 @@ extern void smgrextend(SMgrRelation reln
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void smgrinitaio(int max_aio_threads, int max_aio_num);
+extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode);
+extern void smgrcompleteaio( SMgrRelation reln, char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
@@ -118,6 +124,11 @@ extern void mdextend(SMgrRelation reln,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void mdinitaio(int max_aio_threads, int max_aio_num);
+extern void mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode );
+extern void mdcompleteaio( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
--- src/include/storage/fd.h.orig 2014-05-28 08:29:09.462829227 -0400
+++ src/include/storage/fd.h 2014-05-28 16:45:43.882511185 -0400
@@ -69,6 +69,11 @@ extern File PathNameOpenFile(FileName fi
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void FileInitaio(int max_aio_threads, int max_aio_num );
+extern int FileStartaio(File file, off_t offset, int amount , char *aiocbp);
+extern int FileCompleteaio( char *aiocbp , int normal_wait );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern int FileRead(File file, char *buffer, int amount);
extern int FileWrite(File file, char *buffer, int amount);
extern int FileSync(File file);
--- src/include/storage/buf_internals.h.orig 2014-05-28 08:29:09.462829227 -0400
+++ src/include/storage/buf_internals.h 2014-05-28 16:45:43.906511281 -0400
@@ -22,7 +22,9 @@
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Flags for buffer descriptors
@@ -38,8 +40,23 @@
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1 << 7) /* must write for checkpoint */
-#define BM_PERMANENT (1 << 8) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1 << 8) /* permanent relation (not unlogged) */
+#define BM_AIO_IN_PROGRESS (1 << 9) /* aio in progress */
+#define BM_AIO_PREFETCH_PIN_BANKED (1 << 10) /* pinned when prefetch issued
+ ** and this pin is banked - i.e.
+ ** redeemable by the next use by same task
+ ** note that for any one buffer, a pin can be banked
+ ** by at most one process globally,
+ ** that is, only one process may bank a pin on the buffer
+ ** and it may do so only once (may not be stacked)
+ */
+
+/*********
+for asynchronous aio-read prefetching, two golden rules concerning buffer pinning and buffer-header flags must be observed:
+ R1. a buffer marked as BM_AIO_IN_PROGRESS must be pinned by at least one backend
+ R2. a buffer marked as BM_AIO_PREFETCH_PIN_BANKED must be pinned by the backend identified by
+ (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio ) : (-(buf->freeNext))
+*********/
typedef bits16 BufFlags;
@@ -140,17 +157,83 @@ typedef struct sbufdesc
BufFlags flags; /* see bit definitions above */
uint16 usage_count; /* usage counter for clock sweep code */
unsigned refcount; /* # of backends holding pins on buffer */
- int wait_backend_pid; /* backend PID of pin-count waiter */
+ int wait_backend_pid; /* if flags & BM_PIN_COUNT_WAITER
+ ** then backend PID of pin-count waiter
+ ** else not set
+ */
slock_t buf_hdr_lock; /* protects the above fields */
int buf_id; /* buffer's index number (from 0) */
- int freeNext; /* link in freelist chain */
+ int volatile freeNext; /* overloaded and much-abused field :
+ ** EITHER
+ ** if >= 0
+ ** then link in freelist chain
+ ** OR
+ ** if < 0
+ ** then EITHER
+ ** if flags & BM_AIO_IN_PROGRESS
+ ** then negative of (the index of the aiocb in the BufferAiocbs array + 3)
+ ** else if flags & BM_AIO_PREFETCH_PIN_BANKED
+ ** then -(pid of task that issued aio_read and pinned buffer)
+ ** else one of the special values -1 or -2 listed below
+ */
LWLock *io_in_progress_lock; /* to wait for I/O to complete */
LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
+/* structures for control blocks for our implementation of async io */
+
+/* if USE_AIO_ATOMIC_BUILTIN_COMP_SWAP is not defined, the following struct is not put into use at runtime
+** but it is easier to let the compiler find the definition but hide the reference to aiocb
+** which is the only type it would not understand
+*/
+
+struct BufferAiocb {
+ struct BufferAiocb volatile * volatile BAiocbnext; /* next free entry or value of BAIOCB_OCCUPIED means in use */
+ struct sbufdesc volatile * volatile BAiocbbufh; /* there can be at most one BufferDesc marked BM_AIO_IN_PROGRESS
+ ** and using this BufferAiocb -
+ ** if there is one, BAiocbbufh points to it, else BAiocbbufh is zero
+ ** NOTE BAiocbbufh should be zero for every BufferAiocb on the free list
+ */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct aiocb volatile BAiocbthis; /* the aio library's control block for one async io */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int volatile BAiocbDependentCount; /* count of tasks who depend on this BufferAiocb
+ ** in the sense that they are waiting for io completion.
+ ** only a Dependent may move the BufferAiocb onto the freelist
+ ** and only when that Dependent is the *only* Dependent (count == 1)
+ ** BAiocbDependentCount is protected by bufferheader spinlock
+ ** and must be updated only when that spinlock is held
+ */
+ pid_t volatile pidOfAio; /* pid of backend who issued an aio_read using this BAiocb -
+ ** this backend must have pinned the associated buffer.
+ */
+};
+
+#define BAIOCB_OCCUPIED 0x75f1 /* distinct indicator of a BufferAiocb.BAiocbnext that is NOT on free list */
+#define BAIOCB_FREE 0x7b9d /* distinct indicator of a BufferAiocb.BAiocbbufh that IS on free list */
+
+struct BAiocbAnchor { /* anchor for all control blocks pertaining to aio */
+ volatile struct BufferAiocb* BufferAiocbs; /* aiocbs ... */
+ volatile struct BufferAiocb* volatile FreeBAiocbs; /* ... and their free list */
+};
+
+/* values for BufCheckAsync input and retcode */
+#define BUF_INTENTION_WANT 1 /* wants the buffer, wait for in-progress aio and then pin */
+#define BUF_INTENTION_REJECT_KEEP_PIN -1 /* pin already held, do not unpin */
+#define BUF_INTENTION_REJECT_OBTAIN_PIN -2 /* obtain pin, caller wants it for same buffer */
+#define BUF_INTENTION_REJECT_FORGET -3 /* unpin and tell resource owner to forget */
+#define BUF_INTENTION_REJECT_NOADJUST -4 /* unpin and call ResourceOwnerForgetBuffer */
+#define BUF_INTENTION_REJECT_UNBANK -5 /* unpin only if pin banked by caller */
+
+#define BUF_INTENT_RC_CHANGED_TAG -5
+#define BUF_INTENT_RC_BADPAGE -4
+#define BUF_INTENT_RC_INVALID_AIO -3 /* invalid and aio was in progress */
+#define BUF_INTENT_RC_INVALID_NO_AIO -1 /* invalid and no aio was in progress */
+#define BUF_INTENT_RC_VALID 1
+
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
/*
@@ -159,6 +242,7 @@ typedef struct sbufdesc
*/
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+#define FREENEXT_BAIOCB_ORIGIN (-3)
/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
--- src/include/catalog/pg_am.h.orig 2014-05-28 08:29:09.446829236 -0400
+++ src/include/catalog/pg_am.h 2014-05-28 16:45:43.926511362 -0400
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
--- src/include/catalog/pg_proc.h.orig 2014-05-28 08:29:09.450829234 -0400
+++ src/include/catalog/pg_proc.h 2014-05-28 16:45:43.966511524 -0400
@@ -536,6 +536,12 @@ DESCR("convert float4 to int4");
DATA(insert OID = 330 ( btgettuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ btgettuple _null_ _null_ _null_ ));
DESCR("btree(internal)");
+
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+DATA(insert OID = 3251 ( btpeeknexttuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "2281" _null_ _null_ _null_ _null_ btpeeknexttuple _null_ _null_ _null_ ));
+DESCR("btree(internal)");
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
DATA(insert OID = 636 ( btgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ btgetbitmap _null_ _null_ _null_ ));
DESCR("btree(internal)");
DATA(insert OID = 331 ( btinsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ btinsert _null_ _null_ _null_ ));
--- src/include/pg_config_manual.h.orig 2014-05-28 08:29:09.458829229 -0400
+++ src/include/pg_config_manual.h 2014-05-28 16:45:43.994511636 -0400
@@ -138,9 +138,11 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * might in future be support for alternative low-level prefetch APIs.
+ * Update October 2013: now there is such a new prefetch capability,
+ * async_io into postgres buffers - see configuration parameter max_async_io_threads.)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
#define USE_PREFETCH
#endif
--- src/include/access/nbtree.h.orig 2014-05-28 08:29:09.442829238 -0400
+++ src/include/access/nbtree.h 2014-05-28 16:45:44.022511749 -0400
@@ -19,6 +19,7 @@
#include "access/sdir.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "access/relscan.h"
#include "catalog/pg_index.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
@@ -524,6 +525,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -603,6 +605,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -655,7 +666,11 @@ extern Buffer _bt_getroot(Relation rel,
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P);
+extern struct pfch_index_item* _bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+extern int _bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status);
+extern void _bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
--- src/include/access/heapam.h.orig 2014-05-28 08:29:09.442829238 -0400
+++ src/include/access/heapam.h 2014-05-28 16:45:44.046511845 -0400
@@ -175,7 +175,7 @@ extern void heap_page_prune_execute(Buff
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
--- src/include/access/relscan.h.orig 2014-05-28 08:29:09.446829236 -0400
+++ src/include/access/relscan.h 2014-05-28 16:45:44.066511925 -0400
@@ -44,6 +44,24 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+
+ /* Unread_Pfetched is a "mostly" circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ ** "mostly" means that there may be gaps caused by storing entries for blocks which do not need to be discarded -
+ ** these are indicated by blockno = InvalidBlockNumber, and these slots are reused when found.
+ */
+ BlockNumber *rs_Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int rs_Unread_Pfetched_next; /* where the next unread blockno probably is relative to start --
+ ** this is only a hint which may be temporarily stale.
+ */
+ unsigned int rs_Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
+
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -55,6 +73,27 @@ typedef struct HeapScanDescData
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
+/* pfch_index_items track prefetched and unread index pages - chunks of blocknumbers are chained in singly-linked list from scan->pfch_index_item_list */
+struct pfch_index_item { /* index-relation BlockIds which we will/have prefetched */
+ BlockNumber pfch_blocknum; /* Blocknum which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+struct pfch_block_item {
+ struct BlockIdData pfch_blockid; /* BlockId which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+/* pfch_index_page_items track prefetched and unread index pages -
+** chunks of blocknumbers are chained backwards (newest first, oldest last)
+** in singly-linked list from scan->pfch_index_item_list
+*/
+struct pfch_index_pagelist { /* index-relation BlockIds which we will/have prefetched */
+ struct pfch_index_pagelist* pfch_index_pagelist_next; /* pointer to next chunk if any */
+ unsigned int pfch_index_item_count; /* number of used entries in this chunk */
+ struct pfch_index_item pfch_indexid[1]; /* in-line list of Blocknums which we will/have prefetched and whether to be discarded */
+};
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -75,8 +114,15 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_page_list; /* array of index-relation BlockIds which we will/have prefetched */
+ struct pfch_block_item* pfch_block_item_list; /* array of heap-relation BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
@@ -91,6 +137,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
--- src/include/nodes/tidbitmap.h.orig 2014-05-28 08:29:09.458829229 -0400
+++ src/include/nodes/tidbitmap.h 2014-05-28 16:45:44.106512088 -0400
@@ -41,6 +41,16 @@ typedef struct
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
+#ifdef USE_PREFETCH
+ /* Unread_Pfetched is a circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ */
+ BlockNumber *Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
OffsetNumber offsets[1]; /* VARIABLE LENGTH ARRAY */
} TBMIterateResult; /* VARIABLE LENGTH STRUCT */
@@ -62,5 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
extern void tbm_end_iterate(TBMIterator *iterator);
-
+extern void tbm_zero(TBMIterator *iterator); /* zero list of prefetched and unread blocknos */
+extern void tbm_add(TBMIterator *iterator, BlockNumber blockno); /* add this blockno to list of prefetched and unread blocknos */
+extern void tbm_subtract(TBMIterator *iterator, BlockNumber blockno); /* remove this blockno from list of prefetched and unread blocknos */
+extern TBMIterateResult *tbm_locate_IterateResult(TBMIterator *iterator); /* locate the TBMIterateResult of an iterator */
#endif /* TIDBITMAP_H */
--- src/include/utils/rel.h.orig 2014-05-28 08:29:09.466829225 -0400
+++ src/include/utils/rel.h 2014-05-28 16:45:44.134512200 -0400
@@ -61,6 +61,7 @@ typedef struct RelationAmInfo
FmgrInfo ammarkpos;
FmgrInfo amrestrpos;
FmgrInfo amcanreturn;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
--- src/include/pg_config.h.in.orig 2014-05-28 08:29:09.458829229 -0400
+++ src/include/pg_config.h.in 2014-05-28 16:45:44.150512266 -0400
@@ -1,4 +1,4 @@
-/* src/include/pg_config.h.in. Generated from configure.in by autoheader. */
+/* src/include/pg_config.h.in. Generated from - by autoheader. */
/* Define to the type of arg 1 of 'accept' */
#undef ACCEPT_TYPE_ARG1
@@ -748,6 +748,10 @@
/* Define to the appropriate snprintf format for unsigned 64-bit ints. */
#undef UINT64_FORMAT
+/* Define to select librt-style async io and the gcc atomic compare_and_swap.
+ */
+#undef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch is attached.
It is based on clone of today's 9.4dev source.
I have noticed that this source is
(not surprisingly) quite a moving target at present,
meaning that this patch becomes stale quite quickly.
So although this copy is fine for reviewing,
it may quite probably soon not be correct
for the current source tree. As mentioned before, if anyone wishes to try this feature out
on 9.3.4, I will be making a patch for that soon
which I can supply on request.
Wow, that's a huge patch. I took a very brief look, focusing on the
basic design, ignoring the style & other minor things for now:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
How portable is POSIX aio nowadays? Googling around, it still seems that
on Linux, it's implemented using threads. Does the thread-emulation
implementation cause problems with the rest of the backend, which
assumes that there is only a single thread? In any case, I think we'll
want to encapsulate the AIO implementation behind some kind of an API,
to allow other implementations to co-exist.
Benchmarks?
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, May 27, 2014 at 3:17 PM, John Lumby <johnlumby@hotmail.com> wrote:
Below I am pasting the README we have written for this new functionality
which mentions some of the measurements, advantages (and disadvantages)
and we welcome all and any comments on this.
I think that this is likely to be a useful area to work on, but I
wonder: can you suggest a useful test-case or benchmark to show the
advantages of the patch you posted? You mention a testcase already,
but it's a little short on details. I think it's always a good idea to
start with that when pursuing a performance feature.
Have you thought about things like specialized prefetching for nested
loop joins?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, May 28, 2014 at 6:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
Have you thought about things like specialized prefetching for nested
loop joins?
Currently, such a thing would need some non-trivial changes to the
execution nodes, I believe.
For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.
Nested loop joins are considered a particularly compelling case for
prefetching, actually.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for looking at it!
Date: Thu, 29 May 2014 00:19:33 +0300
From: hlinnakangas@vmware.com
To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
CC: klaussfreire@gmail.com
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
This also plays very nicely with the syncscan where the cohorts run close together.
If anyone would like to advise whether this also works on MacOS/BSD or Windows,
that would be good, as I can't verify it myself.
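To make that concrete, here is a minimal sketch (not code from the patch; shared_aiocb and shared_buf are made-up names) of the cross-process completion check I am relying on: the issuing backend starts the read, and any backend can poll aio_error()/aio_return() on the same control block sitting in shared memory. Blocking waits are a separate question.

#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* illustration only: assume both of these point into shared memory */
extern struct aiocb *shared_aiocb;
extern char         *shared_buf;

/* issuing backend: start an async read of one 8kB block */
int
start_shared_read(int fd, off_t offset)
{
    memset(shared_aiocb, 0, sizeof(struct aiocb));
    shared_aiocb->aio_fildes = fd;
    shared_aiocb->aio_offset = offset;
    shared_aiocb->aio_buf    = shared_buf;
    shared_aiocb->aio_nbytes = 8192;
    shared_aiocb->aio_sigevent.sigev_notify = SIGEV_NONE;
    return aio_read(shared_aiocb);
}

/* any backend: check, without blocking, whether that read has finished */
int
shared_read_done(ssize_t *nread)
{
    int err = aio_error(shared_aiocb);

    if (err == EINPROGRESS)
        return 0;               /* still in flight */
    if (err != 0)
        return -1;              /* the read failed with errno err */
    *nread = aio_return(shared_aiocb);
    return 1;                   /* completed, *nread bytes were read */
}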
How portable is POSIX aio nowadays? Googling around, it still seems that
on Linux, it's implemented using threads. Does the thread-emulation
implementation cause problems with the rest of the backend, which
assumes that there is only a single thread? In any case, I think we'll
Good question, I am not aware of any dependency on a backend having only
a single thread. Can you please point me to such dependencies?
want to encapsulate the AIO implementation behind some kind of an API,
to allow other implementations to co-exist.
It is already - it follows the smgr (stg mgr) -> md (mag disk) -> fd (filesystem)
layering used for the existing filesystem ops including posix-fadvise.
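For example, the aio start call just passes down through those same layers; here is a sketch of the md level, modelled on the existing mdprefetch() and using only the prototypes shown in the header diffs above (the actual body in the patch may differ in detail):

#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
void
mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
           char *aiocbp, int *retcode)
{
    /* same segment/offset arithmetic as mdread() and mdprefetch() ... */
    MdfdVec    *v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
    off_t       seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));

    /* ... then hand the caller's aiocb to the fd layer, which issues the aio_read */
    *retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ, aiocbp);
}
#endif   /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */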
Benchmarks?
I will see if I can package mine up somehow.
Would be great if anyone else would like to benchmark it on theirs ...
- Heikki
I have pasted below the EXPLAIN of one of my benchmark queries
(the one I reference in the README).
Plenty of nested loop joins.
However I think I understand your question as to how effective it would be
if the outer is not sorted, and I will see if I can dig into that
if I get time (and it sounds as though Claudio is on it too).
The area of exactly what the best prefetch strategy should be for
each particular type of scan and context is a good one to work on.
John
Date: Wed, 28 May 2014 18:12:23 -0700
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
From: pg@heroku.com
To: klaussfreire@gmail.com
CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.
Nested loop joins are considered a particularly compelling case for
prefetching, actually.
--
Peter Geoghegan
____________________________________________________________________________________-
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=801294.81..801294.81 rows=2 width=532)
CTE deploy_zone_down
-> Recursive Union (cost=1061.25..2687.40 rows=11 width=573)
-> Nested Loop (cost=1061.25..1423.74 rows=1 width=41)
-> Nested Loop (cost=1061.11..1421.22 rows=14 width=49)
-> Bitmap Heap Scan on entity zone_tree (cost=1060.67..1175.80 rows=29 width=40)
Recheck Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text) AND ((discriminator)::text = 'ZONE'::text))
-> BitmapAnd (cost=1060.67..1060.67 rows=29 width=0)
-> Bitmap Index Scan on entity_name (cost=0.00..139.71 rows=5927 width=0)
Index Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text))
-> Bitmap Index Scan on entity_discriminatorx (cost=0.00..920.70 rows=49636 width=0)
Index Cond: ((discriminator)::text = 'ZONE'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv (cost=0.43..8.45 rows=1 width=17)
Index Cond: (owner_id = zone_tree.id)
-> Index Scan using metadata_field_pkey on metadata_field mf (cost=0.15..0.17 rows=1 width=8)
Index Cond: (id = mv.field_id)
Filter: ((name)::text = 'deployable'::text)
-> Nested Loop (cost=0.87..126.34 rows=1 width=573)
-> Nested Loop (cost=0.72..125.44 rows=5 width=581)
-> Nested Loop (cost=0.29..83.42 rows=10 width=572)
-> WorkTable Scan on deploy_zone_down dzd (cost=0.00..0.20 rows=10 width=540)
-> Index Scan using entity_discriminator_parent_zone on entity ch (cost=0.29..8.31 rows=1 width=40)
Index Cond: ((parent_id = dzd.zone_tree_id) AND ((discriminator)::text = 'ZONE'::text))
-> Index Scan using metadata_value_owner_id on metadata_value mv_1 (cost=0.43..4.19 rows=1 width=17)
Index Cond: (owner_id = ch.id)
-> Index Scan using metadata_field_pkey on metadata_field mf_1 (cost=0.15..0.17 rows=1 width=8)
Index Cond: (id = mv_1.field_id)
Filter: ((name)::text = 'deployable'::text)
CTE deploy_zone_tree
-> Recursive Union (cost=0.00..9336.89 rows=21 width=1105)
-> CTE Scan on deploy_zone_down dzd_1 (cost=0.00..0.22 rows=11 width=1105)
-> Nested Loop (cost=0.43..933.63 rows=1 width=594)
-> WorkTable Scan on deploy_zone_tree dzt (cost=0.00..2.20 rows=110 width=581)
-> Index Scan using entity_pkey on entity pt (cost=0.43..8.46 rows=1 width=21)
Index Cond: (id = dzt.zone_tree_ancestor_parent_id)
Filter: ((discriminator)::text = ANY ('{VIEW,ZONE}'::text[]))
CTE forward_host_ip
-> Nested Loop (cost=1.30..149.65 rows=24 width=88)
-> Nested Loop (cost=0.87..121.69 rows=48 width=56)
-> Nested Loop (cost=0.43..71.61 rows=99 width=48)
-> CTE Scan on deploy_zone_tree dzt_1 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_parent_id on entity host (cost=0.43..70.14 rows=99 width=40)
Index Cond: (parent_id = dzt_1.zone_tree_id)
Filter: ((discriminator)::text = 'HOST'::text)
-> Index Scan using entity_link_owner_id on entity_link link (cost=0.43..0.50 rows=1 width=16)
Index Cond: (owner_id = host.id)
Filter: ((link_type)::text = ANY ('{IP,IP6}'::text[]))
-> Index Scan using entity_pkey on entity address (cost=0.43..0.57 rows=1 width=40)
Index Cond: (id = link.entity_id)
Filter: ((discriminator)::text = ANY ('{IP4A,IP6A}'::text[]))
CTE association_view
-> Nested Loop (cost=0.87..26.29 rows=1 width=75)
-> Nested Loop (cost=0.43..17.82 rows=1 width=56)
-> CTE Scan on deploy_zone_tree dzt_2 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_discriminator_parent_rr on entity record (cost=0.43..17.34 rows=1 width=48)
Index Cond: ((parent_id = dzt_2.zone_tree_id) AND ((discriminator)::text = ANY ('{C,MX,SRV}'::text[])))
-> Index Scan using entity_pkey on entity assoc (cost=0.43..8.46 rows=1 width=27)
Index Cond: (id = record.association_id)
CTE simple_view
-> Nested Loop (cost=0.43..22.27 rows=1 width=48)
-> CTE Scan on deploy_zone_tree dzt_3 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_discriminator_parent_rr on entity record_1 (cost=0.43..21.79 rows=1 width=40)
Index Cond: ((parent_id = dzt_3.zone_tree_id) AND ((discriminator)::text = ANY ('{TXT,HINFO,GENRR,NAPTR}'::text[])))
CTE max_hist_id
-> Result (cost=0.48..0.49 rows=1 width=0)
InitPlan 6 (returns $19)
-> Limit (cost=0.43..0.48 rows=1 width=8)
-> Index Only Scan Backward using entity_history_history_id on entity_history xh (cost=0.43..444052.51 rows=10386347 width=8)
Index Cond: (history_id IS NOT NULL)
CTE relevant_history
-> Nested Loop (cost=0.43..199661.39 rows=3438689 width=28)
-> CTE Scan on max_hist_id xh_1 (cost=0.00..0.02 rows=1 width=8)
-> Index Scan using entity_history_history_id on entity_history eh (cost=0.43..156677.76 rows=3438689 width=20)
Index Cond: (history_id > xh_1.history_id)
Filter: (transaction_type = 'I'::bpchar)
CTE resource_records
-> Unique (cost=580178.30..584992.46 rows=160472 width=1063)
-> Sort (cost=580178.30..580579.48 rows=160472 width=1063)
Sort Key: fip.host_id, fip.host_discriminator, fip.host_parent_id, fip.view_id, ((fip.address_id)::text), fip.host_name, ((fip.address_long1)::text), (((fip.host_long1 & 1::bigint))::text), ((fip.address_parent_id)::text), ((((''::text || (COALESCE(mv_2.longnumber, (-1)::bigint))::text) || ','::text) || (fip.address_discriminator)::text)), rh.long1
-> Append (cost=203.82..417112.92 rows=160472 width=1063)
-> Hash Join (cost=203.82..91844.90 rows=137548 width=1136)
Hash Cond: ((rh.hist_discrim)::text = (fip.address_discriminator)::text)
Join Filter: (rh.history_delta > fip.host_id)
-> CTE Scan on relevant_history rh (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=203.52..203.52 rows=24 width=1128)
-> Nested Loop Left Join (cost=0.43..203.52 rows=24 width=1128)
-> CTE Scan on forward_host_ip fip (cost=0.00..0.48 rows=24 width=1120)
-> Index Scan using metadata_value_owner_id on metadata_value mv_2 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = fip.host_id)
-> Nested Loop (cost=0.43..77722.85 rows=5731 width=644)
Join Filter: (rh_1.history_delta > av.record_id)
-> Nested Loop Left Join (cost=0.43..8.48 rows=1 width=636)
-> CTE Scan on association_view av (cost=0.00..0.02 rows=1 width=628)
Filter: ((record_discriminator)::text = 'C'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv_3 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = av.record_id)
-> CTE Scan on relevant_history rh_1 (cost=0.00..77370.50 rows=17193 width=532)
Filter: ((hist_discrim)::text = 'C'::text)
-> Hash Join (cost=0.04..83402.53 rows=5731 width=636)
Hash Cond: ((rh_2.hist_discrim)::text = (av_1.record_discriminator)::text)
Join Filter: (rh_2.history_delta > av_1.record_id)
-> CTE Scan on relevant_history rh_2 (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=0.02..0.02 rows=1 width=628)
-> CTE Scan on association_view av_1 (cost=0.00..0.02 rows=1 width=628)
Filter: ((record_discriminator)::text = ANY ('{MX,SRV}'::text[]))
-> Nested Loop (cost=0.86..79164.06 rows=5731 width=618)
Join Filter: (rh_3.history_delta > sv.record_id)
-> Nested Loop Left Join (cost=0.86..16.94 rows=1 width=610)
-> Nested Loop Left Join (cost=0.43..8.48 rows=1 width=588)
-> CTE Scan on simple_view sv (cost=0.00..0.02 rows=1 width=580)
Filter: ((record_discriminator)::text = 'TXT'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv_4 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = sv.record_id)
-> Index Scan using metadata_value_owner_id on metadata_value txtvalue (cost=0.43..8.45 rows=1 width=38)
Index Cond: (owner_id = sv.record_id)
-> CTE Scan on relevant_history rh_3 (cost=0.00..77370.50 rows=17193 width=532)
Filter: ((hist_discrim)::text = 'TXT'::text)
-> Hash Join (cost=0.04..83373.87 rows=5731 width=588)
Hash Cond: ((rh_4.hist_discrim)::text = (sv_1.record_discriminator)::text)
Join Filter: (rh_4.history_delta > sv_1.record_id)
-> CTE Scan on relevant_history rh_4 (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=0.02..0.02 rows=1 width=580)
-> CTE Scan on simple_view sv_1 (cost=0.00..0.02 rows=1 width=580)
Filter: ((record_discriminator)::text = ANY ('{HINFO,GENRR,NAPTR}'::text[]))
-> Sort (cost=4417.98..4418.48 rows=200 width=532)
Sort Key: resource_records.discrim
-> HashAggregate (cost=4412.98..4415.98 rows=200 width=532)
Group Key: resource_records.discrim
-> CTE Scan on resource_records (cost=0.00..3209.44 rows=160472 width=532)
Planning time: 6.620 ms
(133 rows)
On 28/05/2014 22:12, "Peter Geoghegan" <pg@heroku.com> wrote:
On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire@gmail.com>
wrote:
For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.
Nested loop joins are considered a particularly compelling case for
prefetching, actually.
Of course. I only doubt it can be implemented without non-trivial changes
to all execution nodes.
I'll look into it
--
Peter Geoghegan
On 05/29/2014 04:12 PM, John Lumby wrote:
Thanks for looking at it!
Date: Thu, 29 May 2014 00:19:33 +0300
From: hlinnakangas@vmware.com
To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
CC: klaussfreire@gmail.com
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when the I/O is finished, the worker thread fills them in. aio_error()
and aio_return() just return the values of those fields, so calling
aio_error() or aio_return() do in fact happen to work from a different
process. aio_suspend(), however, is implemented by sleeping on a
process-local mutex, which does not work from a different process.
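To illustrate the distinction (this is not the attached test program, just a rough sketch of the same point; the file name and constants are arbitrary):

/* compile with: gcc aiotest.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    /* aiocb and buffer in memory shared across fork() */
    struct aiocb *cb = mmap(NULL, sizeof(struct aiocb) + 8192,
                            PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    cb->aio_fildes = open("/etc/passwd", O_RDONLY);
    cb->aio_buf = cb + 1;
    cb->aio_nbytes = 8192;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE;

    aio_read(cb);                   /* parent initiates the I/O ... */

    if (fork() == 0)
    {
        /*
         * ... child observes its completion.  aio_error()/aio_return()
         * just read the result fields that the parent's worker thread
         * stored into the (shared) aiocb, so this happens to work.  A
         * blocking aio_suspend() here, by contrast, sleeps on glibc
         * state local to the parent and is not woken by the parent's
         * worker thread.
         */
        while (aio_error(cb) == EINPROGRESS)
            usleep(1000);
        printf("child sees result: %zd\n", aio_return(cb));
        _exit(0);
    }
    wait(NULL);
    return 0;
}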
Even if it worked on Linux today, it would be a bad idea to rely on it
from a portability point of view. No, the only sane way to make this
work is that the process that initiates an I/O request is responsible
for completing it. If another process needs to wait for an async I/O to
complete, we must use some other means to do the waiting. Like the
io_in_progress_lock that we already have, for the same purpose.
- Heikki
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 04:12 PM, John Lumby wrote:
Thanks for looking at it!
Date: Thu, 29 May 2014 00:19:33 +0300
From: hlinnakangas@vmware.com
To: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
CC: klaussfreire@gmail.com
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO -
proposal and patch
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio
implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the
waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local mutex,
which does not work from a different process.
Even if it worked on Linux today, it would be a bad idea to rely on it from
a portability point of view. No, the only sane way to make this work is that
the process that initiates an I/O request is responsible for completing it.
If another process needs to wait for an async I/O to complete, we must use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
But calls to it are timed out after 10us, effectively turning the thing
into polling mode.
What is odd, now that I read carefully, is how come 256 waits of 10us
amount to anything. That's just 2.5ms, not enough IIUC for any normal
I/O to complete.
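Roughly this pattern, I mean (a sketch of how I read it, with the constants taken from my reading of the patch, not verified against it):

#include <aio.h>
#include <errno.h>
#include <time.h>

static int
wait_briefly(const struct aiocb *aiocbp)
{
    const struct aiocb *list[1] = { aiocbp };
    struct timespec timeout = { 0, 10000 };     /* 10 us */
    int         tries;

    for (tries = 0; tries < 256; tries++)
    {
        if (aio_error(aiocbp) != EINPROGRESS)
            break;                              /* completed (or failed) */
        aio_suspend(list, 1, &timeout);         /* gives up after 10 us */
    }
    /* 256 * 10us is only ~2.5ms altogether, far less than a typical disk read */
    return aio_error(aiocbp);
}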
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 05/29/2014 11:34 PM, Claudio Freire wrote:
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 04:12 PM, John Lumby wrote:
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio
implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the
waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local mutex,
which does not work from a different process.
Even if it worked on Linux today, it would be a bad idea to rely on it from
a portability point of view. No, the only sane way to make this work is that
the process that initiates an I/O request is responsible for completing it.
If another process needs to wait for an async I/O to complete, we must use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
But calls to it are timed out after 10us, effectively turning the thing
into polling mode.
We don't want polling... And even if we did, calling aio_suspend() in a
way that's known to be broken, in a loop, is a pretty crappy way of polling.
What is odd, now that I read carefully, is how come 256 waits of 10us
amount to anything. That's just 2.5ms, not enough IIUC for any normal
I/O to complete.
Wild guess: the buffer you're reading is already in OS cache.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 11:34 PM, Claudio Freire wrote:
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 04:12 PM, John Lumby wrote:
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio
implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the
waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling
aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local
mutex,
which does not work from a different process.
Even if it worked on Linux today, it would be a bad idea to rely on it
from
a portability point of view. No, the only sane way to make this work is
that
the process that initiates an I/O request is responsible for completing
it.
If another process needs to wait for an async I/O to complete, we must
use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
But calls to it are timed out after 10us, effectively turning the thing
into polling mode.
We don't want polling... And even if we did, calling aio_suspend() in a way
that's known to be broken, in a loop, is a pretty crappy way of polling.
Agreed. Just saying that that's why it works.
The PID of the process responsible for the aio should be there
somewhere, and other processes should explicitly synchronize (or
initiate their own I/O and let the OS merge them - which also works).
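Something along these lines, using the pidOfAio field the patch already keeps in each BufferAiocb (sketch only; BAiocb, reln, forkNum, blockNum and bufBlock are whatever the caller has at hand):

/* decide how to wait, based on who issued the aio */
if (BAiocb->pidOfAio == MyProcPid)
{
    /* we issued this aio_read ourselves, so waiting on it is safe */
    FileCompleteaio((char *) &BAiocb->BAiocbthis, true);
}
else
{
    /*
     * Somebody else issued it: don't rely on their glibc aio state.
     * Either wait on a lock that the issuer releases on completion
     * (like the existing io_in_progress_lock), or simply issue our own
     * synchronous read and let the kernel merge the two requests.
     */
    smgrread(reln, forkNum, blockNum, bufBlock);
}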
What is odd, now that I read carefully, is how come 256 waits of 10us
amount to anything. That's just 2.5ms, not enough IIUC for any normal
I/O to complete.
Wild guess: the buffer you're reading is already in OS cache.
My wild guess as well.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 11:34 PM, Claudio Freire wrote:
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 04:12 PM, John Lumby wrote:
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio
implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the
waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling
aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local
mutex,
which does not work from a different process.
Even if it worked on Linux today, it would be a bad idea to rely on it
from
a portability point of view. No, the only sane way to make this work is
that
the process that initiates an I/O request is responsible for completing
it.
If another process needs to wait for an async I/O to complete, we must
use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
But calls to it are timed out after 10us, effectively turning the thing
into polling mode.
We don't want polling... And even if we did, calling aio_suspend() in a way
that's known to be broken, in a loop, is a pretty crappy way of polling.
Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)
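The guard I mean is along these lines (a sketch of the idea only; the actual change in the attached patch may be shaped differently):

/* only enable the peek-based prefetch if this index AM has an
 * ampeeknexttuple entry in pg_am (currently only btree does)
 */
if (RegProcedureIsValid(scan->indexRelation->rd_am->ampeeknexttuple))
    scan->do_prefetch = true;
else
    scan->do_prefetch = false;  /* hash/gist/gin/spgist: skip the peek, no elog */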
Attachments:
postgresql-9.4.140529-async_io_prefetching.patch (text/x-patch; charset=US-ASCII)
diff --git a/config/c-library.m4 b/config/c-library.m4
index 8f45010..9aee904 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -367,3 +367,50 @@ if test "$pgac_cv_type_locale_t" = 'yes (in xlocale.h)'; then
AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
[Define to 1 if `locale_t' requires <xlocale.h>.])
fi])])# PGAC_HEADER_XLOCALE
+
+
+# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+# ---------------------------------------
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+#
+AC_DEFUN([PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP],
+[AC_MSG_CHECKING([whether have both librt-style async io and the gcc atomic compare_and_swap])
+AC_CACHE_VAL(pgac_cv_aio_atomic_builtin_comp_swap,
+pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+[AC_TRY_RUN([#include <stdio.h>
+#include <unistd.h>
+#include "aio.h"
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* return zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ return rc;
+}],
+[pgac_cv_aio_atomic_builtin_comp_swap=yes],
+[pgac_cv_aio_atomic_builtin_comp_swap=no],
+[pgac_cv_aio_atomic_builtin_comp_swap=cross])
+])dnl AC_CACHE_VAL
+AC_MSG_RESULT([$pgac_cv_aio_atomic_builtin_comp_swap])
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+])# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
diff --git a/configure b/configure
index 3663e50..5ad28a1 100755
--- a/configure
+++ b/configure
@@ -13840,6 +13840,69 @@ operating system; use --disable-thread-safety to disable thread safety." "$LINE
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether have both librt-style async io and the gcc atomic compare_and_swap" >&5
+$as_echo_n "checking whether have both librt-style async io and the gcc atomic compare_and_swap... " >&6; }
+if ${pgac_cv_aio_atomic_builtin_comp_swap+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+if test "$cross_compiling" = yes; then :
+ pgac_cv_aio_atomic_builtin_comp_swap=cross
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <stdio.h>
+#include <unistd.h>
+#include "aio.h"
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* return zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ return rc;
+}
+_ACEOF
+if ac_fn_c_try_run "$LINENO"; then :
+ pgac_cv_aio_atomic_builtin_comp_swap=yes
+else
+ pgac_cv_aio_atomic_builtin_comp_swap=no
+fi
+rm -f core *.core core.conftest.* gmon.out bb.out conftest$ac_exeext \
+ conftest.$ac_objext conftest.beam conftest.$ac_ext
+fi
+
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_aio_atomic_builtin_comp_swap" >&5
+$as_echo "$pgac_cv_aio_atomic_builtin_comp_swap" >&6; }
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+
+$as_echo "#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1" >>confdefs.h
+
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
diff --git a/configure.in b/configure.in
index 80df1d7..b0876bd 100644
--- a/configure.in
+++ b/configure.in
@@ -1771,6 +1771,12 @@ operating system; use --disable-thread-safety to disable thread safety.])
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+ AC_DEFINE(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP, 1, [Define to select librt-style async io and the gcc atomic compare_and_swap.])
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index df20e88..b52772f 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -159,7 +159,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
*/
for (block = first_block; block <= last_block; ++block)
{
- PrefetchBuffer(rel, forkNumber, block);
+ PrefetchBuffer(rel, forkNumber, block, 0);
++blocks_done;
}
#else
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 95a2767..ae52fa5 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
OBJS = pg_stat_statements.o
EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
+DATA = pg_stat_statements--1.3.sql pg_stat_statements--1.2--1.3.sql \
+ pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
ifdef USE_PGXS
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 32d16cc..b4bfb5a 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -117,6 +117,7 @@ typedef enum pgssVersion
PGSS_V1_0 = 0,
PGSS_V1_1,
PGSS_V1_2
+ ,PGSS_V1_3
} pgssVersion;
/*
@@ -148,6 +149,16 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ int64 aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb control block */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which in-progress aio cancelled and disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -274,7 +285,7 @@ void _PG_init(void);
void _PG_fini(void);
PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
-PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
PG_FUNCTION_INFO_V1(pg_stat_statements);
static void pgss_shmem_startup(void);
@@ -1026,7 +1037,25 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage_start.aio_read_noneed;
+ bufusage.aio_read_discrd =
+ pgBufferUsage.aio_read_discrd - bufusage_start.aio_read_discrd;
+ bufusage.aio_read_forgot =
+ pgBufferUsage.aio_read_forgot - bufusage_start.aio_read_forgot;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage_start.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage_start.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage_start.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage_start.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage_start.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -1041,6 +1070,7 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1224,6 +1254,16 @@ pgss_store(const char *query, uint32 queryId,
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_discrd += bufusage->aio_read_discrd;
+ e->counters.aio_read_forgot += bufusage->aio_read_forgot;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1257,7 +1297,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
#define PG_STAT_STATEMENTS_COLS_V1_0 14
#define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS_V1_2 19
-#define PG_STAT_STATEMENTS_COLS 19 /* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_3 27
+#define PG_STAT_STATEMENTS_COLS 27 /* maximum of above */
/*
* Retrieve statement statistics.
@@ -1270,6 +1311,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
* function. Unfortunately we weren't bright enough to do that for 1.1.
*/
Datum
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+{
+ bool showtext = PG_GETARG_BOOL(0);
+
+ pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
+
+ return (Datum) 0;
+}
+
+Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
bool showtext = PG_GETARG_BOOL(0);
@@ -1358,6 +1409,10 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
if (api_version != PGSS_V1_2)
elog(ERROR, "incorrect number of output arguments");
break;
+ case PG_STAT_STATEMENTS_COLS_V1_3:
+ if (api_version != PGSS_V1_3)
+ elog(ERROR, "incorrect number of output arguments");
+ break;
default:
elog(ERROR, "incorrect number of output arguments");
}
@@ -1534,11 +1589,24 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
+
+ if (api_version >= PGSS_V1_3)
+ {
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_discrd);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_forgot);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+ }
}
Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
+ api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
-1 /* fail if you forget to update this assert */ ));
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b77c32c..1d43917 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,28 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "executor/instrument.h"
+
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_heap_scans; /* boolean whether to prefetch non-bitmap heap scans */
+
+/* special values for scan->rs_prefetch_target indicating as follows : */
+#define PREFETCH_MAYBE 0xffffffff /* prefetch permitted but not yet in effect */
+#define PREFETCH_DISABLED 0xfffffffe /* prefetch disabled and not permitted */
+/* PREFETCH_WRAP_POINT indicates a prefetcher that has reached the point where the scan would wrap -
+** at this point the prefetcher runs on the spot until scan catches up.
+** This *must* be < maximum valid setting of target_prefetch_pages aka effective_io_concurrency.
+*/
+#define PREFETCH_WRAP_POINT 0x0fffffff
+
+#endif /* USE_PREFETCH */
+
/* GUC variable */
bool synchronize_seqscans = true;
@@ -115,6 +137,8 @@ static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
bool *copy);
+static void heap_unread_add(HeapScanDesc scan, BlockNumber blockno);
+static void heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -292,10 +316,149 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
* Currently, we don't have a stats counter for bitmap heap scans (but the
* underlying bitmap index scans will be counted).
*/
- if (!scan->rs_bitmapscan)
+#ifdef USE_PREFETCH
+ /* by default, no prefetching on any scan */
+ scan->rs_prefetch_target = PREFETCH_DISABLED; /* tentatively disable */
+ scan->rs_pfchblock = 0; /* scanner will reset this to be ahead of scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)0; /* list of prefetched but unread blocknos */
+ scan->rs_Unread_Pfetched_next = 0; /* next unread blockno */
+ scan->rs_Unread_Pfetched_count = 0; /* number of valid unread blocknos */
+#endif /* USE_PREFETCH */
+ if (!scan->rs_bitmapscan) {
+
pgstat_count_heap_scan(scan->rs_rd);
+#ifdef USE_PREFETCH
+ /* bitmap scans do their own prefetching -
+ ** for others, set up prefetching now
+ */
+ if ( prefetch_heap_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(scan->rs_rd))
+ ) {
+ /* prefetch_dbOid may be set to a database Oid to specify only prefetch in that db */
+ if ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ ) {
+ scan->rs_prefetch_target = PREFETCH_MAYBE; /* permitted but let the scan decide */
+ }
+ else {
+ }
+ }
+#endif /* USE_PREFETCH */
+ }
+}
+
+/* add this blockno to list of prefetched and unread blocknos
+** use the one identified by the (next+count|modulo circumference) index if it is unused,
+** else search for the first available slot if there is one,
+** else error.
+*/
+static void
+heap_unread_add(HeapScanDesc scan, BlockNumber blockno)
+{
+ BlockNumber *available_P; /* where to store new blockno */
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next
+ + scan->rs_Unread_Pfetched_count; /* index of next unused slot */
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if (blockno != InvalidBlockNumber) {
+
+ /* ensure there is some room somewhere */
+ if (scan->rs_Unread_Pfetched_count < target_prefetch_pages) {
+
+ /* try the "next+count" one */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages; /* modulo circumference */
+ }
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ goto store_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ /* before storing this blockno,
+ ** since the next pointer did not locate an unused slot,
+ ** set it to one which is more likely to be so for the next time
+ */
+ scan->rs_Unread_Pfetched_next = Unread_Pfetched_index;
+ goto store_blockno;
+ }
+ }
+ }
+ }
+
+ /* if we reach here, either there was no available slot
+ ** or we thought there was one and didn't find any --
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("heap_unread_add overflowed list cannot add blockno %d", blockno)));
+
+ }
+ return; /* blockno was InvalidBlockNumber: nothing to store */
+
+ store_blockno:
+ *available_P = blockno;
+ scan->rs_Unread_Pfetched_count++; /* update count */
+
+ return;
+}
+
+/* remove specified blockno from list of prefetched and unread blocknos.
+** Usually this will be found at the rs_Unread_Pfetched_next item -
+** else search for it. If not found, ignore it - no error results.
+*/
+static void
+heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno)
+{
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next; /* index of next unread blockno */
+ BlockNumber *candidate_P; /* location of callers blockno - maybe */
+ BlockNumber nextUnreadPfetched;
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if ( (blockno != InvalidBlockNumber)
+ && ( scan->rs_Unread_Pfetched_count > 0 ) /* if the list is not empty */
+ ) {
+
+ /* take modulo of the circumference.
+ ** actually rs_Unread_Pfetched_next should never exceed the circumference but check anyway.
+ */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages;
+}
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);
+ nextUnreadPfetched = *candidate_P;
+
+ if ( nextUnreadPfetched == blockno ) {
+ goto remove_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* candidate slot to examine */
+ if (*candidate_P == blockno) { /* found it */
+ goto remove_blockno;
+ }
+ }
+ }
+ return; /* not found anywhere: ignore it, as noted above */
+ remove_blockno:
+ *candidate_P = InvalidBlockNumber;
+
+ scan->rs_Unread_Pfetched_next = (Unread_Pfetched_index+1); /* update next pfchd unread */
+ if (scan->rs_Unread_Pfetched_next >= target_prefetch_pages) {
+ scan->rs_Unread_Pfetched_next = 0;
+ }
+ scan->rs_Unread_Pfetched_count--; /* update count */
+ }
+
+ return;
}
+
/*
* heapgetpage - subroutine for heapgettup()
*
@@ -304,7 +467,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
* which tuples on the page are visible.
*/
static void
-heapgetpage(HeapScanDesc scan, BlockNumber page)
+heapgetpage(HeapScanDesc scan, BlockNumber page , BlockNumber prefetchHWM)
{
Buffer buffer;
Snapshot snapshot;
@@ -314,6 +477,10 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
+#ifdef USE_PREFETCH
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+#endif /* USE_PREFETCH */
+
Assert(page < scan->rs_nblocks);
@@ -336,6 +503,98 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
+#ifdef USE_PREFETCH
+
+ heap_unread_subtract(scan, page);
+
+ /* maybe prefetch some pages starting with rs_pfchblock */
+ if (scan->rs_prefetch_target >= 0) { /* prefetching enabled on this scan ? */
+ int next_block_to_be_read = (page+1); /* next block to be read = lowest possible prefetchable block */
+ int num_to_pfch_this_time; /* eventually holds the number of blocks to prefetch now */
+ int prefetchable_range; /* size of the area ahead of the current prefetch position */
+
+ /* check if prefetcher reached wrap point and the scan has now wrapped */
+ if ( (page == 0) && (scan->rs_prefetch_target == PREFETCH_WRAP_POINT) ) {
+ scan->rs_prefetch_target = 1;
+ scan->rs_pfchblock = next_block_to_be_read;
+ } else
+ if (scan->rs_pfchblock < next_block_to_be_read) {
+ scan->rs_pfchblock = next_block_to_be_read; /* next block to be prefetched must be ahead of one we just read */
+ }
+
+ /* now we know where we would start prefetching -
+ ** next question - if this is a sync scan, ensure we do not prefetch behind the HWM
+ ** debatable whether to require strict inequality or >= - >= works better in practice
+ */
+ if ( (!scan->rs_syncscan) || (scan->rs_pfchblock >= prefetchHWM) ) {
+
+ /* now we know where we will start prefetching -
+ ** next question - how many?
+ ** apply two limits :
+ ** 1. target prefetch distance
+ ** 2. number of available blocks ahead of us
+ */
+
+ /* 1. target prefetch distance */
+ num_to_pfch_this_time = next_block_to_be_read + scan->rs_prefetch_target; /* page beyond prefetch target */
+ num_to_pfch_this_time -= scan->rs_pfchblock; /* convert to offset */
+
+ /* first do prefetching up to our current limit ...
+ ** highest page number that a scan (pre)-fetches is scan->rs_nblocks-1
+ ** note - prefetcher does not wrap a prefetch range -
+ ** instead just stop and then start again if and when main scan wraps
+ */
+ if (scan->rs_pfchblock <= scan->rs_startblock) { /* if on second leg towards startblock */
+ prefetchable_range = ((int)(scan->rs_startblock) - (int)(scan->rs_pfchblock));
+ }
+ else { /* on first leg towards nblocks */
+ prefetchable_range = ((int)(scan->rs_nblocks) - (int)(scan->rs_pfchblock));
+ }
+ if (prefetchable_range > 0) { /* if there's a range to prefetch */
+
+ /* 2. number of available blocks ahead of us */
+ if (num_to_pfch_this_time > prefetchable_range) {
+ num_to_pfch_this_time = prefetchable_range;
+ }
+ while (num_to_pfch_this_time-- > 0) {
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_pfchblock, scan->rs_strategy);
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ if (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) {
+ heap_unread_add(scan, scan->rs_pfchblock);
+ }
+ scan->rs_pfchblock++;
+ /* if syncscan and requested block was already in buffer pool,
+ ** this suggests that another scanner is ahead of us and we should advance
+ */
+ if ( (scan->rs_syncscan) && (PrefetchBufferRc & PREFTCHRC_BLK_ALREADY_PRESENT) ) {
+ scan->rs_pfchblock++;
+ num_to_pfch_this_time--;
+ }
+ }
+ }
+ else {
+ /* we must not modify scan->rs_pfchblock here
+ ** because it is needed for possible DiscardBuffer at end of scan ...
+ ** ... instead ...
+ */
+ scan->rs_prefetch_target = PREFETCH_WRAP_POINT; /* mark this prefetcher as waiting to wrap */
+ }
+
+ /* ... then adjust prefetching limit : by doubling on each iteration */
+ if (scan->rs_prefetch_target == 0) {
+ scan->rs_prefetch_target = 1;
+ }
+ else {
+ scan->rs_prefetch_target *= 2;
+ if (scan->rs_prefetch_target > target_prefetch_pages) {
+ scan->rs_prefetch_target = target_prefetch_pages;
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
+
if (!scan->rs_pageatatime)
return;
@@ -452,6 +711,8 @@ heapgettup(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineoff, given scan direction
@@ -470,7 +731,25 @@ heapgettup(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineoff = FirstOffsetNumber; /* first offnum */
scan->rs_inited = true;
}
@@ -516,7 +795,7 @@ heapgettup(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -557,7 +836,7 @@ heapgettup(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -660,8 +939,10 @@ heapgettup(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -671,6 +952,22 @@ heapgettup(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -678,7 +975,7 @@ heapgettup(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
@@ -727,6 +1024,8 @@ heapgettup_pagemode(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineindex, given scan direction
@@ -745,7 +1044,25 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineindex = 0;
scan->rs_inited = true;
}
@@ -788,7 +1105,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -826,7 +1143,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -921,8 +1238,10 @@ heapgettup_pagemode(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -932,6 +1251,22 @@ heapgettup_pagemode(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -939,7 +1274,7 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
dp = (Page) BufferGetPage(scan->rs_cbuf);
lines = scan->rs_ntuples;
@@ -1394,6 +1729,23 @@ void
heap_rescan(HeapScanDesc scan,
ScanKey key)
{
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1418,6 +1770,23 @@ heap_endscan(HeapScanDesc scan)
{
/* Note: no locking manipulations needed */
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1435,6 +1804,10 @@ heap_endscan(HeapScanDesc scan)
if (scan->rs_strategy != NULL)
FreeAccessStrategy(scan->rs_strategy);
+ if (scan->rs_Unread_Pfetched_base) {
+ pfree(scan->rs_Unread_Pfetched_base);
+ }
+
if (scan->rs_temp_snap)
UnregisterSnapshot(scan->rs_snapshot);
@@ -1464,7 +1837,6 @@ heap_endscan(HeapScanDesc scan)
#define HEAPDEBUG_3
#endif /* !defined(HEAPDEBUGALL) */
-
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
@@ -6347,6 +6719,25 @@ heap_markpos(HeapScanDesc scan)
void
heap_restrpos(HeapScanDesc scan)
{
+
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* XXX no amrestrpos checking that ammarkpos called */
if (!ItemPointerIsValid(&scan->rs_mctid))
diff --git a/src/backend/access/heap/syncscan.c b/src/backend/access/heap/syncscan.c
index 7ea1ead..59e5255 100644
--- a/src/backend/access/heap/syncscan.c
+++ b/src/backend/access/heap/syncscan.c
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locations;
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumber relnblocks)
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumber relnblocks)
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNumber location)
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 850008b..53af580 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -77,6 +77,12 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan = (IndexScanDesc) palloc(sizeof(IndexScanDescData));
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
+
scan->heapRelation = NULL; /* may be set later */
scan->indexRelation = indexRelation;
scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */
@@ -139,6 +145,19 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
void
IndexScanEnd(IndexScanDesc scan)
{
+#ifdef USE_PREFETCH
+ if (scan->do_prefetch) {
+ if ( (struct pfch_block_item*)0 != scan->pfch_block_item_list ) {
+ pfree(scan->pfch_block_item_list);
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+ }
+ if ( (struct pfch_index_pagelist*)0 != scan->pfch_index_page_list ) {
+ pfree(scan->pfch_index_page_list);
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
if (scan->keyData != NULL)
pfree(scan->keyData);
if (scan->orderByData != NULL)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..94f716c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -79,6 +79,55 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit);
+
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch index scans */
+
+/* if specified block number is present in the prefetch array,
+** then either mark it as not to be discarded or evict it according to input param
+*/
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ /* no need to check for scan->pfch_next < prefetch_index_scans
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) == blocknumber) {
+ if (markit) {
+ /* mark it as not to be discarded */
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard &= ~PREFTCHRC_BUF_PIN_INCREASED;
+ } else {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)), (&(((scan->pfch_block_item_list)+pfchz)->pfch_blockid)));
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard = ((scan->pfch_block_item_list)+pfchz)->pfch_discard;
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx = pfchz; /* advance */
+ }
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -253,6 +302,11 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -277,6 +331,11 @@ index_beginscan_bitmap(Relation indexRelation,
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -311,6 +370,9 @@ index_beginscan_internal(Relation indexRelation,
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -342,6 +404,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -373,10 +441,30 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (((scan->pfch_block_item_list)+pfchx)->pfch_discard) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)));
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -472,6 +560,12 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -479,6 +573,11 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -502,6 +601,10 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
* enough information to do it efficiently in the general case.
* ----------------
*/
+#if defined(USE_PREFETCH) && defined(AVOID_CATALOG_MIGRATION_FOR_ASYNCIO)
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -509,16 +612,109 @@ index_fetch_heap(IndexScanDesc scan)
bool all_dead = false;
bool got_heap_tuple;
+
+
/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
if (!scan->xs_continue_hot)
{
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+
+ /* If the old block is different from new block, then evict old
+ ** block from prefetched array. It is arguable we should leave it
+ ** in the array because it's likely to remain in the buffer pool
+ ** for a while, but in that case , if we encounter the block
+ ** again, prefetching it again does no harm.
+ ** (and note that, if it's not pinned, prefetching it will try to
+ ** pin it since prefetch tries to bank a pin for a buffer in the buffer pool).
+ ** therefore it should usually win.
+ */
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < prefetch_index_scans) /* ensure there is an entry */
+ ) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(prev_buf) , 0);
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+ /* If the new block had been prefetched and pinned, mark that it no
+ ** longer requires to be discarded. We don't evict the entry, because
+ ** we want to remember that it was recently prefetched. */
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 1);
+#endif /* USE_PREFETCH */
+
+ scan->heap_tids_fetched++;
+
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found; /* did we find the "next" heap tid in current index page */
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+
+ if (scan->do_prefetch) {
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple; /* is incorrect but avoids adding function to catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ if (RegProcedureIsValid(scan->indexRelation->rd_am->ampeeknexttuple)) {
+ GET_SCAN_PROCEDURE(ampeeknexttuple); /* is correct but requires adding function to catalog */
+ } else {
+ procedure = 0;
+ }
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
+ if ( procedure /* does the index access method support peektuple? */
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ && procedure->fn_addr /* procedure->fn_addr is non-null only if in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ ) {
+ int iterations = 1; /* how many iterations of prefetching shall we try -
+ ** if used entries in prefetch list is < target_prefetch_pages
+ ** then 2, else 1
+ ** this should result in gradually and smoothly increasing up to target_prefetch_pages
+ */
+ /* note we trust InitIndexScan verified this scan is forwards only and so set that */
+ if (scan->pfch_used < target_prefetch_pages) {
+ iterations = 2;
+ }
+ do {
+ found = DatumGetBool(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ btpeeknexttuple(scan) /* pass scan as direct parameter since we can't use fmgr because not in catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ FunctionCall1(procedure, PointerGetDatum(scan)) /* use fmgr to call it because in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ );
+ if (found) {
+ /* btpeeknexttuple set pfch_next to point to the item in block_item_list to be prefetched */
+ PrefetchBufferRc = PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((&((scan->pfch_block_item_list + scan->pfch_next))->pfch_blockid)) , 0);
+ /* elog(LOG,"index_fetch_heap prefetched rel %u blockNum %u"
+ ,scan->heapRelation->rd_node.relNode ,BlockIdGetBlockNumber(scan->pfch_block_item_list + scan->pfch_next));
+ */
+
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ (scan->pfch_block_item_list + scan->pfch_next)->pfch_discard = (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED);
+
+
+ }
+ } while (--iterations > 0);
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ecee5ac..e32a79f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -793,7 +793,7 @@ _bt_insertonpg(Relation rel,
{
Assert(!P_ISLEAF(lpageop));
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -972,7 +972,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
bool isleaf;
/* Acquire a new page to split into */
- rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1175,7 +1175,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
if (!P_RIGHTMOST(oopaque))
{
- sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
spage = BufferGetPage(sbuf);
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -1817,7 +1817,7 @@ _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
Assert(P_INCOMPLETE_SPLIT(lpageop));
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
@@ -1829,7 +1829,7 @@ _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1877,7 +1877,7 @@ _bt_getstackbuf(Relation rel, BTStack stack, int access)
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2008,12 +2008,12 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index bab5a49..afa0263 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -127,7 +127,7 @@ _bt_getroot(Relation rel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -153,7 +153,7 @@ _bt_getroot(Relation rel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -209,7 +209,7 @@ _bt_getroot(Relation rel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -350,7 +350,7 @@ _bt_gettrueroot(Relation rel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -436,7 +436,7 @@ _bt_getrootheight(Relation rel)
Page metapg;
BTPageOpaque metaopaque;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -562,6 +562,170 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ * and keep track of prefetched and unread blocknums in pagelist.
+ * input parms :
+ * rel and blockno identify block to be prefetched as usual
+ * pfch_index_page_list_P points to the pointer anchoring the head of the index page list
+ * Since the pagelist is a kind of optimization,
+ * handle palloc failure by quietly omitting the keeping track.
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P)
+{
+
+ int rc = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_item* found_item = 0;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_plp = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_plp = *pfch_index_page_list_P;
+ }
+
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* prefetch an existing block of the relation
+ ** but first, check it has not recently already been prefetched and not yet read
+ */
+ found_item = _bt_find_block(blkno , pfch_index_plp);
+ if ((struct pfch_index_item*)0 == found_item) { /* not found */
+
+ rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno , 0);
+
+ /* add the pagenum to the list , indicating its discard status
+ ** since it's only an optimization, ignore failure such as exceeded allowed space
+ */
+ _bt_add_block( blkno , pfch_index_page_list_P , (uint32)(rc & PREFTCHRC_BUF_PIN_INCREASED));
+
+ }
+ }
+ return;
+}
+
+/* _bt_find_block finds the item referencing specified Block in index page list if present
+** and returns the pointer to the pfch_index_item if found, or null if not
+*/
+struct pfch_index_item*
+_bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+
+ struct pfch_index_item* found_item = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ int ix, tx;
+
+ pfch_index_plp = pfch_index_page_list;
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ ix = 0;
+ tx = pfch_index_plp->pfch_index_item_count;
+ while ( (ix < tx)
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ found_item = &pfch_index_plp->pfch_indexid[ix];
+ }
+ ix++;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+
+ return found_item;
+}
+
+/* _bt_add_block adds the specified Block to the index page list
+** and returns 0 if successful, non-zero if not
+*/
+int
+_bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status)
+{
+ int rc = 1;
+ int ix;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_pagelist* pfch_index_page_list_anchor; /* pointer to first chunk if any */
+ /* allow expansion of pagelist to 16 chunks
+ ** which accommodates backwards-sequential index scans
+ ** where the scanner increases target_prefetch_pages by a factor of up to 16
+ ** see code in _bt_steppage
+ ** note - this creates an undesirable weak dependency on this number in _bt_steppage,
+ ** but :
+ ** there is no disaster if the numbers disagree - just sub-optimal use of the list
+ ** to implement a proper interface would require that chunks have a variable size
+ ** which would require an extra size variable in each chunk
+ */
+ int num_chunks = 16;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_page_list_anchor = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_page_list_anchor = *pfch_index_page_list_P;
+ }
+ pfch_index_plp = pfch_index_page_list_anchor; /* pointer to current chunk */
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ if (ix < target_prefetch_pages) {
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = (ix+1);
+ rc = 0;
+ goto stored_pagenum;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ num_chunks--; /* keep track of number of chunks */
+ }
+
+ /* we did not find any free space in existing chunks -
+ ** create new chunk if within our limit and we have a pfch_index_page_list
+ */
+ if ( (num_chunks > 0) && ((struct pfch_index_pagelist*)0 != pfch_index_page_list_anchor) ) {
+ pfch_index_plp = (struct pfch_index_pagelist*)palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ if ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ pfch_index_plp->pfch_index_pagelist_next = pfch_index_page_list_anchor; /* old head of list is next after this */
+ pfch_index_plp->pfch_indexid[0].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[0].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = 1;
+ *pfch_index_page_list_P = pfch_index_plp; /* new head of list is new chunk */
+ rc = 0;
+ }
+ }
+
+ stored_pagenum:;
+ return rc;
+}
+
+/* _bt_subtract_block removes a block from the prefetched-but-unread pagelist if present */
+void
+_bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+ struct pfch_index_pagelist* pfch_index_plp = pfch_index_page_list;
+ if ( (blkno != P_NEW) && (blkno != P_NONE) ) {
+ int ix , jx;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ /* move the last item to the current (now deleted) position and decrement count */
+ jx = (pfch_index_plp->pfch_index_item_count-1); /* index of last item ... */
+ if (jx > ix) { /* ... is not the current one so move is required */
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = pfch_index_plp->pfch_indexid[jx].pfch_blocknum;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = pfch_index_plp->pfch_indexid[jx].pfch_discard;
+ ix = jx;
+ }
+ pfch_index_plp->pfch_index_item_count = ix;
+ goto done_subtract;
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+ }
+ done_subtract: return;
+}
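
To make the chunked page-list bookkeeping above easier to follow, here is a minimal
standalone sketch of the same add/find-last/subtract scheme. The types, names and the
fixed CHUNK_CAPACITY are illustrative stand-ins only (in the patch the per-chunk capacity
is target_prefetch_pages and the chunk count is bounded at 16), not the patch's definitions:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int BlockNo;           /* stand-in for BlockNumber */
#define CHUNK_CAPACITY 8                /* stand-in for target_prefetch_pages */

struct item  { BlockNo blkno; int discard; };
struct chunk { struct chunk *next; int count; struct item items[CHUNK_CAPACITY]; };

/* append a block, allocating a new head chunk when the existing ones are full */
static int add_block(struct chunk **head, BlockNo blkno, int discard)
{
    struct chunk *c;
    for (c = *head; c != NULL; c = c->next)
        if (c->count < CHUNK_CAPACITY) {
            c->items[c->count].blkno = blkno;
            c->items[c->count].discard = discard;
            c->count++;
            return 0;
        }
    c = malloc(sizeof(struct chunk));
    if (c == NULL)
        return 1;
    c->next = *head;                     /* old head becomes the second chunk */
    c->count = 1;
    c->items[0].blkno = blkno;
    c->items[0].discard = discard;
    *head = c;                           /* assign through the pointer parameter */
    return 0;
}

/* remove a block if present: overwrite it with the chunk's last item */
static void subtract_block(struct chunk *head, BlockNo blkno)
{
    struct chunk *c;
    int ix;
    for (c = head; c != NULL; c = c->next)
        for (ix = c->count - 1; ix >= 0; ix--)
            if (c->items[ix].blkno == blkno) {
                c->items[ix] = c->items[c->count - 1];
                c->count--;
                return;
            }
}

int main(void)
{
    struct chunk *list = NULL;
    add_block(&list, 42, 0);
    add_block(&list, 43, 1);
    subtract_block(list, 42);
    printf("remaining in head chunk: %d\n", list->count);   /* prints 1 */
    return 0;
}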
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
@@ -573,7 +737,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
* _bt_checkpage to sanity-check the page (except in P_NEW case).
*/
Buffer
-_bt_getbuf(Relation rel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list)
{
Buffer buf;
@@ -581,6 +745,10 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
{
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
+
+ /* if the block is in the prefetched-but-unread pagelist, remove it */
+ _bt_subtract_block( blkno , pfch_index_page_list);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
}
@@ -702,6 +870,10 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* bufmgr when one would do. However, now it's mainly just a notational
* convenience. The only case where it saves work over _bt_relbuf/_bt_getbuf
* is when the target page is the same one already in the buffer.
+ *
+ * if prefetching of index pages is changed to use this function,
+ * then it should be extended to take the index_page_list as parameter
+ * and call _bt_subtract_block in the same way that _bt_getbuf does.
*/
Buffer
_bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
@@ -712,6 +884,7 @@ _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
if (BufferIsValid(obuf))
LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
return buf;
@@ -965,7 +1138,7 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
BTPageOpaque opaque;
bool result;
- buf = _bt_getbuf(rel, blk, BT_READ);
+ buf = _bt_getbuf(rel, blk, BT_READ , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1069,7 +1242,7 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
Page lpage;
BTPageOpaque lopaque;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
@@ -1265,7 +1438,7 @@ _bt_pagedel(Relation rel, Buffer buf)
BTPageOpaque lopaque;
Page lpage;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
@@ -1340,7 +1513,7 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ buf = _bt_getbuf(rel, rightsib, BT_WRITE, (struct pfch_index_pagelist*)0);
}
return ndeleted;
@@ -1593,7 +1766,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
target = topblkno;
/* fetch the block number of the topmost parent's left sibling */
- buf = _bt_getbuf(rel, topblkno, BT_READ);
+ buf = _bt_getbuf(rel, topblkno, BT_READ, (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
@@ -1632,7 +1805,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -1646,7 +1819,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
RelationGetRelationName(rel));
return false;
}
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
@@ -1701,7 +1874,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
@@ -1731,7 +1904,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36dc6c2..ceb5af9 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -30,6 +30,18 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; a non-zero value also gives the size of pfch_list */
+#endif /* USE_PREFETCH */
+
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+);
/* Working state for btbuild and its callback */
typedef struct
@@ -332,6 +344,74 @@ btgettuple(PG_FUNCTION_ARGS)
}
/*
+ * btpeeknexttuple() -- peek at the next tuple whose block number differs from every entry in pfch_block_item_list,
+ * without reading a new index page
+ * and without causing any side effects such as altering values in control blocks.
+ * If found, store its block number in the next element of pfch_block_item_list.
+ */
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+)
+{
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+ /*
+ * If we've already initialized this scan, we can just advance it in
+ * the appropriate direction. If we haven't done so yet, bail out
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop advances until we find a different data block or reach the end of the index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < prefetch_index_scans) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= prefetch_index_scans) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((&((scan->pfch_block_item_list + scan->pfch_next)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+ PG_RETURN_BOOL(res);
+}
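
The slot-selection rule used above (fill unused entries of pfch_block_item_list first,
then overwrite round-robin) can be modelled in isolation; the names below are illustrative
stand-ins, with list_size standing in for prefetch_index_scans and the initial next value
matching the patch's "ensure first entry is at index 0" initialization:

#include <stdio.h>

static unsigned int choose_slot(unsigned int *next, unsigned int *used,
                                unsigned int list_size)
{
    if (*used < list_size)
        *next = *used;                   /* take the first unused slot */
    else {
        (*next)++;                       /* otherwise overwrite round-robin */
        if (*next >= list_size)
            *next = 0;
    }
    if (*used <= *next)
        *used = *next + 1;
    return *next;
}

int main(void)
{
    unsigned int next = 4, used = 0, i;  /* next starts at list_size, as in the patch */
    for (i = 0; i < 6; i++)
        printf("slot %u\n", choose_slot(&next, &used, 4));   /* prints 0 1 2 3 0 1 */
    return 0;
}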
+
+/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
Datum
@@ -425,6 +505,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -516,6 +602,23 @@ btendscan(PG_FUNCTION_ARGS)
{
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ struct pfch_index_pagelist* pfch_index_plp;
+ int ix;
+
+#ifdef USE_PREFETCH
+
+ /* discard all prefetched but unread index pages listed in the pagelist */
+ pfch_index_plp = scan->pfch_index_page_list;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_discard) {
+ DiscardBuffer( scan->indexRelation , MAIN_FORKNUM , pfch_index_plp->pfch_indexid[ix].pfch_blocknum);
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+#endif /* USE_PREFETCH */
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 203b969..bee1f12 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -23,13 +23,16 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+extern unsigned int prefetch_btree_heaps; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+extern unsigned int prefetch_sequential_index_scans; /* boolean whether to prefetch sequential-access non-bitmap index scans */
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
+static Buffer _bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -226,7 +229,7 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
continue;
}
@@ -1005,7 +1008,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1040,6 +1043,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1052,11 +1057,53 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on this heap block, provided :
+ ** EITHER . we have been reading non-sequentially (previously or for this block)
+ ** OR . the user explicitly asked to prefetch sequential patterns,
+ ** since it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
else
{
@@ -1065,11 +1112,53 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on this heap block, provided :
+ ** EITHER . we have been reading non-sequentially (previously or for this block)
+ ** OR . the user explicitly asked to prefetch sequential patterns,
+ ** since it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
/* OK, itemIndex says what to return */
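
(Worked example of the guard used in both directions above: with integer arithmetic,
heap_tids_seen - heap_tids_seen/16 is 15/16 of the TIDs seen, so once more than 256 TIDs
have been seen, heap prefetching starts only while heap_tids_fetched is at least 15/16 of
heap_tids_seen - e.g. at least 960 out of 1024 - i.e. roughly the ~94% heap-fetch rate
mentioned in the comments.)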
@@ -1119,9 +1208,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1156,6 +1247,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1187,6 +1279,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1224,7 +1317,7 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1278,7 +1371,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list);
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1287,9 +1380,20 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if ( prefetch && so->currPos.moreRight
+ /* start prefetch on the next index page, provided :
+ ** EITHER . we are reading non-sequentially for this block
+ ** OR . the user explicitly asked to prefetch sequential patterns,
+ ** since it may be counterproductive otherwise
+ */
+ && (prefetch_sequential_index_scans || opaque->btpo_next != (blkno+1))
+ ) {
+ _bt_prefetchbuf(rel, opaque->btpo_next , &scan->pfch_index_page_list);
+ }
break;
}
+ }
/* nope, keep going */
blkno = opaque->btpo_next;
}
@@ -1317,7 +1421,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ so->currPos.buf = _bt_walk_left(scan , rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -1332,14 +1436,58 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and blindly widen the prefetch window
+ * downwards, 2 blocks at a time. This only works in our favor
+ * for index-only scans, by merging read requests in the kernel,
+ * so we want to inflate target_prefetch_pages since merged
+ * back-sequential requests are about as expensive as a single one
+ */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev , &scan->pfch_index_page_list);
+ }
+ }
break;
}
}
}
+ }
return true;
}
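
(For the window arithmetic above: back_prefetch_pages is min(16 * target_prefetch_pages, 64),
so with target_prefetch_pages = 2 the blind back-sequential window is 32 blocks, and with 4
or more it stays capped at 64. Each back-sequential leaf step then issues one prefetch below
the current run position, or two once the run length reaches the window, so the window fills
gradually rather than all at once.)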
@@ -1359,7 +1507,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
@@ -1387,7 +1535,7 @@ _bt_walk_left(Relation rel, Buffer buf)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1631,7 +1779,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a5118a4..c8c6fee 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -41,6 +41,14 @@ InstrAlloc(int n, int instrument_options)
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_discrd = 0;
+ instr[i].bufusage_start.aio_read_forgot = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -143,6 +151,16 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_discrd += add->aio_read_discrd - sub->aio_read_discrd;
+ dst->aio_read_forgot += add->aio_read_forgot - sub->aio_read_forgot;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9b1e975..f5e6fd8 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -34,6 +34,8 @@
* ExecEndBitmapHeapScan releases all storage.
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "access/relscan.h"
#include "access/transam.h"
@@ -47,6 +49,10 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_bitmap_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
@@ -111,10 +117,21 @@ BitmapHeapNext(BitmapHeapScanState *node)
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
- if (target_prefetch_pages > 0)
- {
+ if ( prefetch_bitmap_scans
+ && (target_prefetch_pages > 0)
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ /* sufficient number of blocks - at least twice the target_prefetch_pages */
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
node->prefetch_iterator = prefetch_iterator = tbm_begin_iterate(tbm);
node->prefetch_pages = 0;
+ if (prefetch_iterator) {
+ tbm_zero(prefetch_iterator); /* zero list of prefetched and unread blocknos */
+ }
node->prefetch_target = -1;
}
#endif /* USE_PREFETCH */
@@ -138,12 +155,14 @@ BitmapHeapNext(BitmapHeapScanState *node)
}
#ifdef USE_PREFETCH
+ if (prefetch_iterator) {
if (node->prefetch_pages > 0)
{
/* The main iterator has closed the distance by one page */
node->prefetch_pages--;
+ tbm_subtract(prefetch_iterator, tbmres->blockno); /* remove this blockno from list of prefetched and unread blocknos */
}
- else if (prefetch_iterator)
+ else
{
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -151,6 +170,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
+ }
#endif /* USE_PREFETCH */
/*
@@ -239,16 +259,26 @@ BitmapHeapNext(BitmapHeapScanState *node)
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ int PrefetchBufferRc; /* return value from PrefetchBuffer - refer to bufmgr.h */
+
if (tbmpre == NULL)
{
/* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = prefetch_iterator = NULL;
+ /* let ExecEndBitmapHeapScan terminate the prefetch_iterator
+ ** tbm_end_iterate(prefetch_iterator);
+ ** node->prefetch_iterator = NULL;
+ */
+ prefetch_iterator = NULL;
break;
}
node->prefetch_pages++;
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno , 0);
+ /* add this blockno to list of prefetched and unread blocknos
+ ** if pin count did not increase then indicate so in the Unread_Pfetched list
+ */
+ tbm_add(prefetch_iterator
+ ,( (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) ? tbmpre->blockno : InvalidBlockNumber ) );
}
}
#endif /* USE_PREFETCH */
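
The hunk above relies on a contract between the scan and PrefetchBuffer: record a block as
prefetched-but-unread only when the return code says a new pin was taken, and make sure every
recorded block is eventually either read or discarded (here via tbm_add/tbm_subtract and the
discard loop in ExecEndBitmapHeapScan). A minimal sketch of that caller-side discipline, with
prefetch_block / read_block / discard_block as illustrative stand-ins for PrefetchBuffer /
ReadBuffer / DiscardBuffer:

#include <stdio.h>

#define PIN_INCREASED 0x01   /* stand-in for PREFTCHRC_BUF_PIN_INCREASED */

static int  prefetch_block(unsigned int blkno) { (void) blkno; return PIN_INCREASED; }
static void read_block(unsigned int blkno)     { printf("read %u\n", blkno); }
static void discard_block(unsigned int blkno)  { printf("discard %u\n", blkno); }

int main(void)
{
    unsigned int pending[4];
    int          npending = 0;
    unsigned int blkno;

    /* prefetch a few blocks; remember only those whose pin count we raised */
    for (blkno = 10; blkno < 13; blkno++)
        if (prefetch_block(blkno) & PIN_INCREASED)
            pending[npending++] = blkno;

    /* consume one of them: reading it hands the pin back in the normal way */
    read_block(pending[--npending]);

    /* end of scan: anything still pending must be discarded to drop its pin */
    while (npending > 0)
        discard_block(pending[--npending]);
    return 0;
}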
@@ -482,12 +512,31 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
{
Relation relation;
HeapScanDesc scanDesc;
+ TBMIterator *prefetch_iterator;
/*
* extract information from the node
*/
relation = node->ss.ss_currentRelation;
scanDesc = node->ss.ss_currentScanDesc;
+ prefetch_iterator = node->prefetch_iterator;
+
+#ifdef USE_PREFETCH
+ /* before any other cleanup, discard any prefetched but unread buffers */
+ if (prefetch_iterator != NULL) {
+ TBMIterateResult *tbmpre = tbm_locate_IterateResult(prefetch_iterator);
+ BlockNumber *Unread_Pfetched_base = tbmpre->Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = tbmpre->Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = tbmpre->Unread_Pfetched_count;
+
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scanDesc->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* Free the exprcontext
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2b89dc6..c4fec3b 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; a non-zero value also gives the size of pfch_list */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,12 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+ /* note - at this point all scan control-block resources have been freed by IndexScanEnd, called by index_endscan */
+
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +619,33 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ indexstate->iss_ScanDesc->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_block_item_list = (struct pfch_block_item*)0;
+ if ( prefetch_index_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_index_page_list = palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ indexstate->iss_ScanDesc->pfch_block_item_list = palloc( prefetch_index_scans * sizeof(struct pfch_block_item) );
+ if ( ( (struct pfch_index_pagelist*)0 != indexstate->iss_ScanDesc->pfch_index_page_list )
+ && ( (struct pfch_block_item*)0 != indexstate->iss_ScanDesc->pfch_block_item_list )
+ ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = prefetch_index_scans; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_pagelist_next = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_item_count = 0;
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index a880c81..1e34d6a 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -44,6 +44,9 @@
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "utils/hsearch.h"
+#ifdef USE_PREFETCH
+extern int target_prefetch_pages;
+#endif /* USE_PREFETCH */
/*
* The maximum number of tuples per page is not large (typically 256 with
@@ -572,7 +575,12 @@ tbm_begin_iterate(TIDBitmap *tbm)
* needs of the TBMIterateResult sub-struct.
*/
iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)
+#ifdef USE_PREFETCH
+ /* space for remembering every prefetched but unread blockno */
+ + (target_prefetch_pages * sizeof(BlockNumber))
+#endif /* USE_PREFETCH */
+ );
iterator->tbm = tbm;
/*
@@ -1020,3 +1028,68 @@ tbm_comparator(const void *left, const void *right)
return 1;
return 0;
}
+
+void
+tbm_zero(TBMIterator *iterator) /* zero list of prefetched and unread blocknos */
+{
+ /* locate the list of prefetched but unread blocknos immediately following the array of offsets
+ ** and note that tbm_begin_iterate allocates space for (1 + MAX_TUPLES_PER_PAGE) offsets -
+ ** 1 included in struct TBMIterator and MAX_TUPLES_PER_PAGE additional
+ */
+ iterator->output.Unread_Pfetched_base = ((BlockNumber *)(&(iterator->output.offsets[MAX_TUPLES_PER_PAGE+1])));
+ iterator->output.Unread_Pfetched_next = iterator->output.Unread_Pfetched_count = 0;
+}
+
+void
+tbm_add(TBMIterator *iterator, BlockNumber blockno) /* add this blockno to list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next + iterator->output.Unread_Pfetched_count++;
+
+ if (iterator->output.Unread_Pfetched_count > target_prefetch_pages) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_add overflowed list cannot add blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index -= target_prefetch_pages;
+ *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index) = blockno;
+}
+
+void
+tbm_subtract(TBMIterator *iterator, BlockNumber blockno) /* remove this blockno from list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next++;
+ BlockNumber nextUnreadPfetched;
+
+ /* make a weak check that the next blockno is the one to be removed;
+ ** in case of disagreement we ignore the caller's blockno and remove the next one anyway,
+ ** which is really what the caller wants
+ */
+ if ( iterator->output.Unread_Pfetched_count == 0 ) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract empty list cannot subtract blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index = 0;
+ nextUnreadPfetched = *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index);
+ if ( ( nextUnreadPfetched != blockno )
+ && ( nextUnreadPfetched != InvalidBlockNumber ) /* don't report it if the block in the list was InvalidBlockNumber */
+ ) {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract will subtract blockno %d not %d",
+ nextUnreadPfetched, blockno)));
+ }
+ if (iterator->output.Unread_Pfetched_next >= target_prefetch_pages)
+ iterator->output.Unread_Pfetched_next = 0;
+ iterator->output.Unread_Pfetched_count--;
+}
+
+TBMIterateResult *
+tbm_locate_IterateResult(TBMIterator *iterator)
+{
+ return &(iterator->output);
+}
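
The three helpers above maintain a fixed-size FIFO ring (base / next / count over
target_prefetch_pages slots) of prefetched-but-unread block numbers. A minimal standalone
model of that ring, with illustrative names and a small fixed slot count standing in for
target_prefetch_pages:

#include <stdio.h>

#define RING_SLOTS 4          /* stand-in for target_prefetch_pages */

struct ring { unsigned int slot[RING_SLOTS]; unsigned int next, count; };

static void ring_zero(struct ring *r) { r->next = r->count = 0; }

static int ring_add(struct ring *r, unsigned int b)    /* enqueue at tail */
{
    unsigned int ix;
    if (r->count >= RING_SLOTS)
        return -1;                                      /* overflow: caller errors out */
    ix = r->next + r->count++;
    if (ix >= RING_SLOTS)
        ix -= RING_SLOTS;
    r->slot[ix] = b;
    return 0;
}

static unsigned int ring_subtract(struct ring *r)       /* dequeue from head */
{
    unsigned int b = r->slot[r->next++];
    if (r->next >= RING_SLOTS)
        r->next = 0;
    r->count--;
    return b;
}

int main(void)
{
    struct ring r;
    ring_zero(&r);
    ring_add(&r, 7);
    ring_add(&r, 9);
    printf("%u %u\n", ring_subtract(&r), ring_subtract(&r));   /* prints 7 9 */
    return 0;
}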
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a5d5c2d..a16467d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -123,6 +123,11 @@
#include "storage/spin.h"
#endif
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+void ReportFreeBAiocbs(void);
+int CountInuseBAiocbs(void);
+extern int hwmBufferAiocbs; /* high water mark of in-use BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Possible types of a backend. Beyond being the possible bkend_type values in
@@ -1493,9 +1498,15 @@ ServerLoop(void)
fd_set readmask;
int nSockets;
time_t now,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time,
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
last_touch_time;
last_touch_time = time(NULL);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time = time(NULL);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
nSockets = initMasks(&readmask);
@@ -1654,6 +1665,19 @@ ServerLoop(void)
last_touch_time = now;
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* maintain the hwm of used baiocbs every 10 seconds */
+ if ((now - count_baiocb_time) >= 10)
+ {
+ int inuseBufferAiocbs; /* current in-use BufferAiocbs in pool */
+ inuseBufferAiocbs = CountInuseBAiocbs();
+ if (inuseBufferAiocbs > hwmBufferAiocbs) {
+ hwmBufferAiocbs = inuseBufferAiocbs;
+ }
+ count_baiocb_time = now;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
@@ -3444,6 +3468,9 @@ PostmasterStateMachine(void)
signal_child(PgStatPID, SIGQUIT);
}
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ReportFreeBAiocbs();
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
}
}
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..96aa9e0 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o buf_async.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e03394c..658764d 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -13,15 +13,89 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
-
+#include <stdlib.h> /* for getenv() */
+#include <errno.h> /* for strtoul() */
BufferDesc *BufferDescriptors;
char *BufferBlocks;
-int32 *PrivateRefCount;
+int32 *PrivateRefCount; /* array of counts per buffer of how many times this task has pinned this buffer */
+
+volatile struct BAiocbAnchor *BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+
+int CountInuseBAiocbs(void); /* keep compiler happy */
+void ReportFreeBAiocbs(void); /* keep compiler happy */
+
+extern int MaxConnections; /* max number of client connections which postmaster will allow */
+int numBufferAiocbs = 0; /* total number of BufferAiocbs in pool (0 <=> no async io) */
+int hwmBufferAiocbs = 0; /* high water mark of in-use BufferAiocbs in pool
+ ** (not required to be accurate - maintained for us by the postmaster's ServerLoop)
+ */
+
+#ifdef USE_PREFETCH
+unsigned int prefetch_dbOid = 0; /* database oid of relations on which prefetching to be done - 0 means all */
+unsigned int prefetch_bitmap_scans = 1; /* boolean whether to prefetch bitmap heap scans */
+unsigned int prefetch_heap_scans = 0; /* boolean whether to prefetch non-bitmap heap scans */
+unsigned int prefetch_sequential_index_scans = 0; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+unsigned int prefetch_index_scans = 256; /* whether to prefetch non-bitmap index scans; a non-zero value also gives the size of pfch_list */
+unsigned int prefetch_btree_heaps = 1; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+#endif /* USE_PREFETCH */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int maxGetBAiocbTries = 1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = 1; /* max times we will try to release a BufferAiocb back to freelist */
+
+/* locking protocol for manipulating the BufferAiocbs and FreeBAiocbs list :
+** 1. ownership of a BufferAiocb :
+** to gain ownership of a BufferAiocb, a task must
+** EITHER remove it from FreeBAiocbs (it is now temporary owner and no other task can find it)
+** if decision is to attach it to a buffer descriptor header, then
+** . lock the buffer descriptor header
+** . check NOT flags & BM_AIO_IN_PROGRESS
+** . attach to buffer descriptor header
+** . increment the BufferAiocb.dependent_count
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to unlock
+** OR locate it by dereferencing the pointer in a buffer descriptor,
+** in which case :
+** . lock the buffer descriptor header
+** . check flags & BM_AIO_IN_PROGRESS
+** . increment the BufferAiocb.dependent_count
+** . if decision is to return to FreeBAiocbs,
+** then (with buffer descriptor header still locked)
+** . turn off BM_AIO_IN_PROGRESS
+** . IF the BufferAiocb.dependent_count == 1 (I am sole dependent)
+** . THEN
+** . . decrement the BufferAiocb.dependent_count
+** . return to FreeBAiocbs (see below)
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to either return to FreeBAiocbs or unlock
+** 2. adding and removing from FreeBAiocbs :
+** two alternative methods - controlled by conditional macro definition LOCK_BAIOCB_FOR_GET_REL
+** 2.1 LOCK_BAIOCB_FOR_GET_REL is defined - use a lock
+** . lock BufFreelistLock exclusive
+** . add / remove from FreeBAiocbs
+** . unlock BufFreelistLock exclusive
+** advantage of this method - never fails to add or remove
+** 2.2 LOCK_BAIOCB_FOR_GET_REL is not defined - use compare_and_swap
+** . retrieve the current Freelist pointer and validate
+** . compare_and_swap on/off the FreeBAiocb list
+** . if the compare_and_swap fails, retry (bounded by maxGetBAiocbTries / maxRelBAiocbTries)
+** advantage of this method - never waits
+** to avoid losing a FreeBAiocb, save it in a process-local cache and reuse it
+*/
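
A minimal sketch of the compare-and-swap variant (2.2), assuming the GCC __sync builtins
that the USE_AIO_ATOMIC_BUILTIN_COMP_SWAP name suggests. The node and function names are
illustrative stand-ins, and the real freelist must additionally validate the pointer it
reads and bound its retries via maxGetBAiocbTries / maxRelBAiocbTries:

#include <stdio.h>

struct node { struct node *next; };

static struct node *freelist;      /* shared head of the free list */

/* pop one node, or return NULL if the list is empty or the single CAS attempt loses a race;
 * a single attempt mirrors the "never waits" property noted above */
static struct node *freelist_pop(void)
{
    struct node *head = freelist;
    if (head == NULL)
        return NULL;
    if (__sync_bool_compare_and_swap(&freelist, head, head->next))
        return head;
    return NULL;                   /* lost the race; caller may retry or fall back */
}

/* push a node back; again a single attempt */
static int freelist_push(struct node *n)
{
    struct node *head = freelist;
    n->next = head;
    return __sync_bool_compare_and_swap(&freelist, head, n) ? 0 : -1;
}

int main(void)
{
    static struct node a, b;
    freelist_push(&a);
    freelist_push(&b);
    printf("popped %p then %p\n", (void *) freelist_pop(), (void *) freelist_pop());
    return 0;
}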
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdef'd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ struct BAiocbAnchor dummy_BAiocbAnchr = { (struct BufferAiocb*)0 , (struct BufferAiocb*)0 };
+int maxGetBAiocbTries = -1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = -1; /* max times we will try to release a BufferAiocb back to freelist */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Data Structures:
@@ -73,7 +147,14 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs
+ , foundAiocbs
+ ;
+#if defined(USE_PREFETCH) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+ char *envvarpointer = (char *)0; /* might point to an environment variable string */
+ char *charptr;
+#endif /* USE_PREFETCH || USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
BufferDescriptors = (BufferDesc *)
ShmemInitStruct("Buffer Descriptors",
@@ -83,6 +164,142 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+ if (max_async_io_prefetchers < 0) { /* negative value indicates to initialize to something sensible during buf_init */
+ max_async_io_prefetchers = MaxConnections/6; /* default allows for average of MaxConnections/6 concurrent prefetchers - reasonable ??? */
+ }
+
+ if ((target_prefetch_pages > 0) && (max_async_io_prefetchers > 0)) {
+ int ix;
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb * volatile FreeBAiocbs;
+
+ numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers); /* target_prefetch_pages per prefetcher */
+ BAiocbAnchr = (struct BAiocbAnchor *)
+ ShmemInitStruct("Buffer Aiocbs",
+ sizeof(struct BAiocbAnchor) + (numBufferAiocbs * sizeof(struct BufferAiocb)), &foundAiocbs);
+ if (BAiocbAnchr) {
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs = (struct BufferAiocb*)(((char *)BAiocbAnchr) + sizeof(struct BAiocbAnchor));
+ FreeBAiocbs = (struct BufferAiocb*)0;
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbnext = FreeBAiocbs; /* init the free list, last one -> 0 */
+ (BufferAiocbs+ix)->BAiocbbufh = (struct sbufdesc*)0;
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0;
+ (BufferAiocbs+ix)->pidOfAio = 0;
+ FreeBAiocbs = (BufferAiocbs+ix);
+
+ }
+ BAiocbAnchr->FreeBAiocbs = FreeBAiocbs;
+ envvarpointer = getenv("PG_MAX_GET_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxGetBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ envvarpointer = getenv("PG_MAX_REL_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxRelBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+
+ /* init the aio subsystem max number of threads and max number of requests
+ ** max number of threads <--> max_async_io_prefetchers
+ ** max number of requests <--> numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers)
+ ** there is no return code so we just hope.
+ */
+ smgrinitaio(max_async_io_prefetchers , numBufferAiocbs);
+
+ }
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdef'd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ BAiocbAnchr = &dummy_BAiocbAnchr;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
+#ifdef USE_PREFETCH
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BITMAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_ISCAN");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_index_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_index_scans = 1;
+ } else
+ if ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) ) {
+ prefetch_index_scans = strtol(envvarpointer, &charptr, 10);
+ if (charptr && (',' == *charptr)) { /* optional sequential prefetch in index scans */
+ charptr++; /* following the comma ... */
+ if ( ('Y' == *charptr) || ('y' == *charptr) || ('1' == *charptr) ) {
+ prefetch_sequential_index_scans = 1;
+ }
+ }
+ }
+ /* if prefetching for ISCAN, then we require the size of pfch_list to be at least target_prefetch_pages */
+ if ( (prefetch_index_scans > 0)
+ && (prefetch_index_scans < target_prefetch_pages)
+ ) {
+ prefetch_index_scans = target_prefetch_pages;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BTREE");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_btree_heaps = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_btree_heaps = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_HEAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_heap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_heap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_PREFETCH_DBOID");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ errno = 0; /* required in order to distinguish error from 0 */
+ prefetch_dbOid = (unsigned int)strtoul((const char *)envvarpointer, 0, 10);
+ if (errno) {
+ prefetch_dbOid = 0;
+ }
+ }
+ elog(LOG, "prefetching initialised with target_prefetch_pages= %d "
+ ", max_async_io_prefetchers= %d implying aio concurrency= %d "
+ ", prefetching_for_bitmap= %s "
+ ", prefetching_for_heap= %s "
+ ", prefetching_for_iscan= %d with sequential_index_page_prefetching= %s "
+ ", prefetching_for_btree= %s"
+ ,target_prefetch_pages ,max_async_io_prefetchers ,numBufferAiocbs
+ ,(prefetch_bitmap_scans ? "Y" : "N")
+ ,(prefetch_heap_scans ? "Y" : "N")
+ ,prefetch_index_scans
+ ,(prefetch_sequential_index_scans ? "Y" : "N")
+ ,(prefetch_btree_heaps ? "Y" : "N")
+ );
+#endif /* USE_PREFETCH */
+
+
if (foundDescs || foundBufs)
{
/* both should be present or neither */
@@ -176,3 +393,80 @@ BufferShmemSize(void)
return size;
}
+
+/* imprecise count of the number of in-use BAiocbs at any time.
+ * we scan the array read-only without latching, so the result may be unstable
+ * (but since the array is in well-known contiguous storage,
+ * we are not subject to segmentation violation).
+ * This function may be called at any time and just does its best;
+ * it returns the count of what it saw.
+ */
+int
+CountInuseBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ int count = 0;
+ int ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->BufferAiocbs; /* start of list */
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == (BAiocb+ix)->BAiocbnext) { /* not on freelist ? */
+ count++;
+ }
+ }
+ }
+ return count;
+}
+
+/*
+ * report how many free BAiocbs at shutdown
+ * DO NOT call this while backends are actively working!!
+ * this report is useful when compare_and_swap method used (see above)
+ * as it can be used to deduce how many BAiocbs were in process-local caches -
+ * (original_number_on_freelist_at_startup - this_reported_number_at_shutdown)
+ */
+void
+ReportFreeBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ volatile struct BufferAiocb *BufferAiocbs;
+ int count = 0;
+ int fx , ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->FreeBAiocbs; /* start of free list */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0; /* use this as marker for finding it on freelist */
+ }
+ for (fx = (numBufferAiocbs-1); ( (fx>=0) && ( BAiocb != (struct BufferAiocb*)0 ) ); fx--) {
+
+ /* check if it is a valid BufferAiocb */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((BufferAiocbs+ix) == BAiocb) { /* is it this one ? */
+ break;
+ }
+ }
+ if (ix >= 0) {
+ if (BAiocb->BAiocbDependentCount) { /* seen it already ? */
+ elog(LOG, "ReportFreeBAiocbs closed cycle on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ BAiocb->BAiocbDependentCount = 1; /* use this as marker for finding it on freelist */
+ count++;
+ BAiocb = BAiocb->BAiocbnext;
+ } else {
+ elog(LOG, "ReportFreeBAiocbs invalid item on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ }
+ }
+ elog(LOG, "ReportFreeBAiocbs AIO control block list : poolsize= %d in-use-hwm= %d final-free= %d" ,numBufferAiocbs , hwmBufferAiocbs , count);
+}
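
(Worked example of the deduction described in the header comment above: if the pool was
created with, say, numBufferAiocbs = 96, all initially on the freelist, and ReportFreeBAiocbs
later reports final-free = 90, then 96 - 90 = 6 BAiocbs were sitting in process-local caches
at shutdown rather than being genuinely lost.)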
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..e4ef7f9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -29,7 +29,7 @@
* buf_table.c -- manages the buffer lookup table
*/
#include "postgres.h"
-
+#include <sys/types.h>
#include <sys/file.h>
#include <unistd.h>
@@ -50,7 +50,6 @@
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
-
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -63,6 +62,8 @@
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+
#define DROP_RELS_BSEARCH_THRESHOLD 20
/* GUC variables */
@@ -78,26 +79,33 @@ bool track_io_timing = false;
*/
int target_prefetch_pages = 0;
-/* local state for StartBufferIO and related functions */
+/* local state for StartBufferIO and related functions
+** but ONLY for synchronous IO - not altered for aio
+*/
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+pid_t this_backend_pid = 0; /* pid of this backend */
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
-
-static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+extern int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+extern int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc, int intention
+ ,BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
-static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
-static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+ bool *hit , int index_for_aio);
+bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+void PinBuffer_Locked(volatile BufferDesc *buf);
+void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
-static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+static bool StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio );
+void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -106,24 +114,66 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
+ int *foundPtr , int index_for_aio );
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static int rnode_comparator(const void *p1, const void *p2);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
- * This is named by analogy to ReadBuffer but doesn't actually allocate a
- * buffer. Instead it tries to ensure that a future ReadBuffer for the given
- * block will not be delayed by the I/O. Prefetching is optional.
+ * This is named by analogy to ReadBuffer but allocates a buffer only if using asynchronous I/O.
+ * Its purpose is to try to ensure that a future ReadBuffer for the given block
+ * will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
+ *
+ * Originally the prefetch simply called posix_fadvise() to recommend read-ahead into kernel page cache.
+ * Extended to provide an alternative of issuing an asynchronous aio_read() to read into a buffer.
+ * This extension has an implication on how this bufmgr component manages concurrent requests
+ * for the same disk block.
+ *
+ * Synchronous IO (read()) does not provide a means for waiting on another task's read if in progress,
+ * and bufmgr implements its own scheme in StartBufferIO, WaitIO, and TerminateBufferIO.
+ *
+ * Asynchronous IO (aio_read()) provides a means for waiting on this or another task's read if in progress,
+ * namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+ * are called as part of asynchronous prefetching, their role is limited to maintaining the buffer desc flags,
+ * and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+ * a separate set of shared control blocks, the BufferAiocb list -
+ * refer to include/storage/buf_internals.h and storage/buffer/buf_init.c
+ *
+ * Another implication of asynchronous IO concerns buffer pinning.
+ * The buffer used for the prefetch is pinned before aio_read is issued.
+ * It is expected that the same task (and possibly others) will later ask to read the page
+ * and eventually release and unpin the buffer.
+ * However, if the task which issued the aio_read later decides not to read the page,
+ * and the return code indicated that the buffer's pin count was increased (PREFTCHRC_BUF_PIN_INCREASED, see below),
+ * then it *must* instead issue a DiscardBuffer() (see function later in this file)
+ * so that its pin is released.
+ * Therefore, each client which uses the PrefetchBuffer service must either always read all
+ * prefetched pages, or keep track of prefetched pages and discard unread ones at end of scan.
+ *
+ * return code: is an int bitmask defined in bufmgr.h
+ PREFTCHRC_BUF_PIN_INCREASED 0x01 pin count on buffer has been increased by 1
+ PREFTCHRC_BLK_ALREADY_PRESENT 0x02 block was already present in a buffer
+ *
+ * PREFTCHRC_BLK_ALREADY_PRESENT is a hint to caller that the prefetch may be unnecessary
*/
-void
-PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+int
+PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy)
{
+ Buffer buf_id; /* indicates buffer containing the requested block */
+ int PrefetchBufferRc = 0; /* return value as described above */
+ int PinCountOnEntry = 0; /* pin count on entry */
+ int PinCountdelta = 0; /* pin count delta increase */
+
+
#ifdef USE_PREFETCH
+
+ buf_id = -1;
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
@@ -145,8 +195,13 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
+ int BufStartAsyncrc = -1; /* retcode from BufStartAsync :
+ ** 0 if started successfully (which implies the buffer was newly pinned)
+ ** -1 if failed for some reason
+ ** 1+PrivateRefCount if the desired buffer was found in the buffer pool
+ ** (we also set it the same way below when we find the buffer in the pool ourselves)
+ */
LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
@@ -158,28 +213,119 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ if (buf_id >= 0) {
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ BufStartAsyncrc = 1 + PinCountOnEntry; /* indicate this backend's pin count - see above comment */
+ PrefetchBufferRc = PREFTCHRC_BLK_ALREADY_PRESENT; /* indicate buffer present */
+ } else {
+ PrefetchBufferRc = 0; /* indicate buffer not present */
+ }
LWLockRelease(newPartitionLock);
+ not_in_buffers:
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (buf_id < 0) {
+ /* try using async aio_read with a buffer */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BufStartAsyncrc = BufStartAsync( reln, forkNum, blockNum , strategy );
+ if (BufStartAsyncrc < 0) {
+ pgBufferUsage.aio_read_noblok++;
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP so try the alternative that does not read the block into a postgresql buffer */
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ }
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
+ if ( (buf_id >= 0) || (BufStartAsyncrc >= 1) ) {
+ /* The block *is* in buffers. */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ pgBufferUsage.aio_read_noneed++;
+#ifndef USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT /* jury is out on whether the following wins but it ought to ... */
+ /*
+ ** If this backend has already pinned it,
+ ** or another backend has banked a pin on it,
+ ** or there is an IO in progress,
+ ** or it is not marked valid,
+ ** then do nothing.
+ ** Otherwise pin it and mark the buffer's pin as banked by this backend.
+ ** Note - it may or may not be pinned by another backend -
+ ** it is ok for us to bank a pin on it
+ ** *provided* the other backend did not bank its pin.
+ ** The reason for this is that the banked-pin indicator is global -
+ ** it can identify at most one process.
+ */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ if (BufStartAsyncrc == 1) { /* not pinned by me */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ /* note - all we can say with certainty is that the buffer is not pinned by me
+ ** we cannot be sure that it is still in buffer pool
+ ** so must go through the entire locking and searching all over again ...
*/
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ /* since the block is now present,
+ ** save the current pin count to ensure final delta is calculated correctly
+ */
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ if ( PinCountOnEntry == 0) { /* paranoid check it's still not pinned by me */
+ volatile BufferDesc *buf_desc;
+
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ LockBufHdr(buf_desc);
+ if ( (buf_desc->flags & BM_VALID) /* buffer is valid */
+ && (!(buf_desc->flags & (BM_IO_IN_PROGRESS|BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))) /* buffer is not any of ... */
+ ) {
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* note - we can call PinBuffer_Locked with the BM_AIO_PREFETCH_PIN_BANKED flag set because it is not yet pinned by me */
+ buf_desc->freeNext = -(this_backend_pid); /* remember which pid banked it */
+ /* pgBufferUsage.aio_read_wasted--; overload counter - not wasted after all - only for debugging */
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ PinBuffer_Locked(buf_desc);
}
-#endif /* USE_PREFETCH */
+ else {
+ UnlockBufHdr(buf_desc);
+ }
+ }
+ }
+ LWLockRelease(newPartitionLock);
+ /* although unlikely, maybe it was evicted while we were puttering about */
+ if (buf_id < 0) {
+ pgBufferUsage.aio_read_noneed--; /* back out the accounting */
+ goto not_in_buffers; /* and try again */
+ }
+ }
+#endif /* USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT */
+
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ }
+
+ if (buf_id >= 0) {
+ PinCountdelta = PrivateRefCount[buf_id] - PinCountOnEntry; /* pin count delta increase */
+ if ( (PinCountdelta < 0) || (PinCountdelta > 1) ) {
+ elog(ERROR,
+ "PrefetchBuffer #%d : incremented pin count by %d on bufdesc %p refcount %u localpins %d\n"
+ ,(buf_id+1) , PinCountdelta , &BufferDescriptors[buf_id] ,BufferDescriptors[buf_id].refcount , PrivateRefCount[buf_id]);
}
+ } else
+ if (BufStartAsyncrc == 0) { /* aio started successfully (which implies buffer was newly pinned ) */
+ PinCountdelta = 1;
+ }
+
+ /* set final PrefetchBufferRc according to previous value */
+ PrefetchBufferRc |= PinCountdelta; /* set the PREFTCHRC_BUF_PIN_INCREASED bit */
+ }
+
+#endif /* USE_PREFETCH */
+ return PrefetchBufferRc; /* return value as described above */
+}
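
(Aside for reviewers, not part of the patch:) a minimal sketch of how a caller might honour the pin-tracking contract described in the header comment above, assuming the PREFTCHRC_* bits and the DiscardBuffer() function added by this patch; rel, blkno, pending[] and npending are illustrative names only.

	int		rc;

	rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno, NULL);
	if (rc & PREFTCHRC_BUF_PIN_INCREASED)
		pending[npending++] = blkno;	/* we now owe either a ReadBuffer or a DiscardBuffer for blkno */
	if (rc & PREFTCHRC_BLK_ALREADY_PRESENT)
		;								/* hint only: the prefetch was probably unnecessary */

Whenever a block in pending[] is actually read it is removed from the list; anything still listed at end of scan must be handed to DiscardBuffer() so that the banked pin is released.
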
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -252,7 +398,7 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
+ forkNum, blockNum, mode, strategy, &hit , 0);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
@@ -280,7 +426,7 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
Assert(InRecovery);
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
- mode, strategy, &hit);
+ mode, strategy, &hit , 0);
}
@@ -288,15 +434,18 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
+ * index_for_aio, if negative, is the negation of (index of the aiocb in the BufferAiocbs array + 3)
+ * which is passed through to StartBufferIO
*/
-static Buffer
+Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy, bool *hit , int index_for_aio )
{
volatile BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ int allocrc; /* retcode from BufferAlloc */
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -328,16 +477,40 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
else
{
+ allocrc = mode; /* pass mode to BufferAlloc since it must not wait for async io if RBM_NOREAD_FOR_PREFETCH */
/*
* lookup the buffer. IO_IN_PROGRESS is set if the requested block is
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
- if (found)
- pgBufferUsage.shared_blks_hit++;
+ strategy, &allocrc , index_for_aio );
+ if (allocrc < 0) {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ bufBlock = BufHdrGetBlock(bufHdr);
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ found = true;
+ }
+ else if (allocrc > 0) {
+ pgBufferUsage.shared_blks_hit++;
+ found = true;
+ }
+ else {
pgBufferUsage.shared_blks_read++;
+ found = false;
+ }
}
/* At this point we do NOT hold any locks. */
@@ -410,7 +583,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
Assert(bufHdr->flags & BM_VALID);
bufHdr->flags &= ~BM_VALID;
UnlockBufHdr(bufHdr);
- } while (!StartBufferIO(bufHdr, true));
+ } while (!StartBufferIO(bufHdr, true, 0));
}
}
@@ -430,6 +603,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (mode != RBM_NOREAD_FOR_PREFETCH) {
if (isExtend)
{
/* new buffers are zero-filled */
@@ -499,6 +673,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
VacuumPageMiss++;
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageMiss;
+ }
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -520,21 +695,39 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
- * The returned buffer is pinned and is already marked as holding the
- * desired page. If it already did have the desired page, *foundPtr is
- * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be used for any StartBufferIO performed by this routine.
+ * In this case, if block not found in buffer pool and we allocate a new buffer,
+ * then we must maintain the spinlock on the buffer and pass it back to caller.
+ *
+ * foundPtr is input and output :
+ * . input - indicates the read-buffer mode ( see bufmgr.h )
+ * . output - indicates the status of the buffer - see below
+ *
+ * Except for the case of RBM_NOREAD_FOR_PREFETCH and buffer is found,
+ * the returned buffer is pinned and is already marked as holding the
+ * desired page.
+ * If it already did have the desired page and page content is valid,
+ * *foundPtr is set to 1
+ * If it already did have the desired page and mode is RBM_NOREAD_FOR_PREFETCH
+ * and StartBufferIO returned false
+ * (meaning it could not initialise the buffer for aio)
+ * *foundPtr is set to 2
+ * If it already did have the desired page but page content is invalid,
+ * *foundPtr is set to -1
+ * this can happen only if the buffer was read by an async read
+ * and the aio is still in progress or pinned by the issuer of the startaio.
+ * Otherwise, *foundPtr is set to 0 and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
- *
- * No locks are held either at entry or exit.
+ * No locks are held either at entry or exit EXCEPT for case noted above
+ * of passing an empty buffer back to async io caller ( index_for_aio set ).
*/
static volatile BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ int *foundPtr , int index_for_aio )
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
@@ -546,6 +739,13 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ int IntentionBufferrc; /* retcode from BufCheckAsync */
+ bool StartBufferIOrc; /* retcode from StartBufferIO */
+ ReadBufferMode mode;
+
+
+ mode = *foundPtr;
+ *foundPtr = 0;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -560,21 +760,53 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (buf_id >= 0)
{
/*
- * Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * Found it.
*/
+ *foundPtr = 1;
buf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
+ /* If prefetch mode, then return immediately indicating found,
+ ** and NOTE that in this case only, we did not pin the buffer.
+ ** In theory we might check whether the buffer is valid, has io in progress, etc,
+ ** but in practice it is simpler to abandon the prefetch if the buffer exists.
+ */
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ /* release the mapping lock and return */
LWLockRelease(newPartitionLock);
-
- *foundPtr = TRUE;
-
- if (!valid)
- {
+ } else {
+ /* note that the current request is for the same tag as the one associated with the aio -
+ ** so simply complete the aio and we have our buffer.
+ ** If an aio was started on this buffer,
+ ** check whether it is complete and wait for it if not.
+ ** And, if an aio had been started, then the task
+ ** which issued the start aio already pinned it for this read,
+ ** so if that task was me and the aio was successful,
+ ** pass the current pin to this read without dropping and re-acquiring it.
+ ** This is all done by BufCheckAsync.
+ */
+ IntentionBufferrc = BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_WANT , strategy , index_for_aio , false , newPartitionLock );
+
+ /* check to see if the correct data has been loaded into the buffer. */
+ valid = (IntentionBufferrc == BUF_INTENT_RC_VALID);
+
+ /* check for serious IO errors */
+ if (!valid) {
+ if ( (IntentionBufferrc != BUF_INTENT_RC_INVALID_NO_AIO)
+ && (IntentionBufferrc != BUF_INTENT_RC_INVALID_AIO)
+ ) {
+ *foundPtr = -1; /* inform caller of serious error */
+ }
+ else
+ if (IntentionBufferrc == BUF_INTENT_RC_INVALID_AIO) {
+ goto proceed_with_not_found; /* yes, I know, a goto ... think of it as a break out of the if */
+ }
+ }
+
+ /* BufCheckAsync pinned the buffer */
+ /* so can now release the mapping lock */
+ LWLockRelease(newPartitionLock);
+
+ if (!valid) {
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
@@ -582,19 +814,21 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(buf, true, index_for_aio))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ }
}
}
return buf;
}
+ proceed_with_not_found:
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
@@ -619,8 +853,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf->flags;
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ /* If an aio was started on this buffer,
+ ** check complete and cancel it if not.
+ */
+ BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_REJECT_OBTAIN_PIN , 0 , index_for_aio, true , 0 );
/* Now it's safe to release the freelist lock */
if (lock_held)
@@ -791,13 +1027,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc)
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ } else
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
}
}
@@ -860,10 +1101,17 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
- if (StartBufferIO(buf, true))
- *foundPtr = FALSE;
- else
- *foundPtr = TRUE;
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc) {
+ *foundPtr = 0;
+ } else {
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
+ } else {
+ *foundPtr = 1;
+ }
+ }
return buf;
}
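
(Aside:) to make the extended *foundPtr protocol easier to follow, here is how a caller might interpret the values documented above. This is only a sketch of the pattern, not the patch's actual BufStartAsync/ReadBuffer_common code; allocrc is an illustrative name.

	int		allocrc = mode;		/* input: the ReadBufferMode, e.g. RBM_NOREAD_FOR_PREFETCH */
	volatile BufferDesc *bufHdr;

	bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
						 strategy, &allocrc, index_for_aio);
	switch (allocrc)
	{
		case 0:		/* buffer allocated and marked IO_IN_PROGRESS: caller performs the read */
			break;
		case 1:		/* page already present and valid (in prefetch mode it was NOT pinned) */
			break;
		case 2:		/* prefetch mode only: StartBufferIO declined, buffer must not be used */
			break;
		case -1:	/* page present but content invalid (async read still in progress or owned elsewhere) */
			break;
	}
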
@@ -970,6 +1218,10 @@ retry:
/*
* Insert the buffer at the head of the list of free buffers.
*/
+ /* avoid confusing freelist with strange-looking freeNext */
+ if (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN) { /* means was used for aiocb index */
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ }
StrategyFreeBuffer(buf);
}
@@ -1022,6 +1274,56 @@ MarkBufferDirty(Buffer buffer)
UnlockBufHdr(bufHdr);
}
+/* return the blocknum of the block in a buffer if it is valid
+** if a shared buffer, it must be pinned
+*/
+BlockNumber
+BlocknumOfBuffer(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ BlockNumber rc = 0;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc = bufHdr->tag.blockNum;
+ }
+
+ return rc;
+}
+
+/* report whether specified buffer contains same or different block
+** if a shared buffer, it must be pinned
+*/
+bool
+BlocknotinBuffer(Buffer buffer,
+ Relation relation,
+ BlockNumber blockNum)
+{
+ volatile BufferDesc *bufHdr;
+ bool rc = false;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc =
+ ( (bufHdr->tag.blockNum != blockNum)
+ || (!(RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) ))
+ || (bufHdr->tag.forkNum != MAIN_FORKNUM)
+ );
+ }
+
+ return rc;
+}
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
@@ -1040,18 +1342,18 @@ ReleaseAndReadBuffer(Buffer buffer,
Relation relation,
BlockNumber blockNum)
{
- ForkNumber forkNum = MAIN_FORKNUM;
volatile BufferDesc *bufHdr;
+ bool isDifferentBlock; /* requesting different block from that already in buffer ? */
if (BufferIsValid(buffer))
{
+ /* if a shared buff, we have pin, so it's ok to examine tag without spinlock */
+ isDifferentBlock = BlocknotinBuffer(buffer,relation,blockNum); /* requesting different block from that already in buffer ? */
if (BufferIsLocal(buffer))
{
Assert(LocalRefCount[-buffer - 1] > 0);
bufHdr = &LocalBufferDescriptors[-buffer - 1];
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ if (!isDifferentBlock)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
LocalRefCount[-buffer - 1]--;
@@ -1060,12 +1362,12 @@ ReleaseAndReadBuffer(Buffer buffer,
{
Assert(PrivateRefCount[buffer - 1] > 0);
bufHdr = &BufferDescriptors[buffer - 1];
- /* we have pin, so it's ok to examine tag without spinlock */
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ BufCheckAsync(0 , relation , bufHdr , ( isDifferentBlock ? BUF_INTENTION_REJECT_FORGET
+ : BUF_INTENTION_REJECT_KEEP_PIN )
+ , 0 , 0 , false , 0 ); /* end any IO and maybe unpin */
+ if (!isDifferentBlock) {
return buffer;
- UnpinBuffer(bufHdr, true);
+ }
}
}
@@ -1090,11 +1392,12 @@ ReleaseAndReadBuffer(Buffer buffer,
* Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows
* some callers to avoid an extra spinlock cycle.
*/
-static bool
+bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
int b = buf->buf_id;
bool result;
+ bool pin_already_banked_by_me = 0; /* buffer is already pinned by me and redeemable */
if (PrivateRefCount[b] == 0)
{
@@ -1116,12 +1419,34 @@ PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
else
{
/* If we previously pinned the buffer, it must surely be valid */
+ /* Errr - is that really true ??? I don't think so :
+ ** what if I pin, start an IO which is still in progress, then mistakenly pin again
result = true;
+ */
+ LockBufHdr(buf);
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ result = (buf->flags & BM_VALID) != 0;
+ UnlockBufHdr(buf);
}
+
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
return result;
}
@@ -1138,19 +1463,36 @@ PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
* to save a spin lock/unlock cycle, because we need to pin a buffer before
* its state can change under us.
*/
-static void
+void
PinBuffer_Locked(volatile BufferDesc *buf)
{
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (PrivateRefCount[b] == 0)
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (PrivateRefCount[b] == 0) {
buf->refcount++;
+ }
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer_Locked : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
}
+}
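
(Review note:) the banked-pin ownership test above is repeated verbatim in PinBuffer, PinBuffer_Locked, UnpinBuffer and IncrBufferRefCount; it could perhaps be factored into a small helper along the following lines. This is only a sketch using the fields introduced by the patch (BAiocbAnchr, pidOfAio, the BM_AIO_* flags); the name BufPinBankedByMe is hypothetical. The caller must hold the buffer header spinlock, and PinBuffer additionally requires PrivateRefCount[b] > 0.

	/* does this backend hold a redeemable ("banked") prefetch pin on buf? */
	static inline bool
	BufPinBankedByMe(volatile BufferDesc *buf)
	{
		pid_t	owner;

		if (!(buf->flags & BM_AIO_PREFETCH_PIN_BANKED))
			return false;
		if (buf->flags & BM_AIO_IN_PROGRESS)
			owner = ((BAiocbAnchr->BufferAiocbs) +
					 (FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio;
		else
			owner = -(buf->freeNext);
		return owner == this_backend_pid;
	}
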
/*
* UnpinBuffer -- make buffer available for replacement.
@@ -1160,29 +1502,68 @@ PinBuffer_Locked(volatile BufferDesc *buf)
* Most but not all callers want CurrentResourceOwner to be adjusted.
* Those that don't should pass fixOwner = FALSE.
*/
-static void
+void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
+
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (fixOwner)
+ if (fixOwner) {
ResourceOwnerForgetBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
{
+
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
LockBufHdr(buf);
+ /* this backend has released its last pin - the buffer should not have a pin banked by me,
+ ** and if AIO is in progress then there should be a pin from another backend
+ */
+ pin_already_banked_by_me = ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+ ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext))
+ ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ /* this is a strange situation - caller had a banked pin (which callers are supposed not to know about)
+ ** but either discovered it had it or has over-counted how many pins it has
+ */
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the pin although it is now of no use since about to release */
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+
+ /* temporarily suppress logging error to avoid performance degradation -
+ ** either this task really does not need the buffer in which case the error is harmless
+ ** or a more severe error will be detected later (possibly immediately below)
+ elog(LOG, "UnpinBuffer : released last this-backend pin on buffer %d rel=%s, blockNum=%u, but had banked pin flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ */
+ }
+
/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;
+ if ( (buf->refcount == 0) && (buf->flags & BM_AIO_IN_PROGRESS) ) {
+
+ elog(ERROR, "UnpinBuffer : released last any-backend pin on buffer %d rel=%s, blockNum=%u, but AIO in progress flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ }
+
+
/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
@@ -1657,6 +2038,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
+
/*
* Check whether buffer needs writing.
*
@@ -1789,6 +2171,8 @@ PrintBufferLeakWarning(Buffer buffer)
char *path;
BackendId backend;
+
+
Assert(BufferIsValid(buffer));
if (BufferIsLocal(buffer))
{
@@ -1799,12 +2183,28 @@ PrintBufferLeakWarning(Buffer buffer)
else
{
buf = &BufferDescriptors[buffer - 1];
+#ifdef USE_PREFETCH
+ /* If reason that this buffer is pinned
+ ** is that it was prefetched with async_io
+ ** and never read or discarded, then omit the
+ ** warning, because this is expected in some
+ ** cases when a scan is closed abnormally.
+ ** Note that the buffer will be released soon by our caller.
+ */
+ if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED) {
+ pgBufferUsage.aio_read_forgot++; /* account for it */
+ return;
+ }
+#endif /* USE_PREFETCH */
loccount = PrivateRefCount[buffer - 1];
backend = InvalidBackendId;
}
+/* #if defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
/* theoretically we should lock the bufhdr here */
path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+
+
elog(WARNING,
"buffer refcount leak: [%03d] "
"(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1812,6 +2212,7 @@ PrintBufferLeakWarning(Buffer buffer)
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
pfree(path);
+/* #endif defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
}
/*
@@ -1928,7 +2329,7 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, 0))
return;
/* Setup error traceback support for ereport() */
@@ -2512,6 +2913,70 @@ FlushDatabaseBuffers(Oid dbid)
}
}
+#ifdef USE_PREFETCH
+/*
+ * DiscardBuffer -- discard shared buffer used for a previously
+ * prefetched but unread block of a relation
+ *
+ * If the buffer is found and pinned with a banked pin, then :
+ * . if AIO in progress, terminate AIO without waiting
+ * . if AIO had already completed successfully,
+ * then mark buffer valid (in case someone else wants it)
+ * . redeem the banked pin and unpin it.
+ *
+ * This function is similar in purpose to ReleaseBuffer (below)
+ * but sufficiently different that it is a separate function.
+ * Two important differences are :
+ * . caller identifies buffer by blocknumber, not buffer number
+ * . we unpin buffer *only* if the pin is banked,
+ * *never* if pinned but not banked.
+ * This is essential as caller may perform a sequence of
+ * SCAN1 . PrefetchBuffer (and remember block was prefetched)
+ * SCAN2 . ReadBuffer (but fails to connect this read to the prefetch by SCAN1)
+ * SCAN1 . DiscardBuffer (SCAN1 terminates early)
+ * SCAN2 . access tuples in buffer
+ * Clearly the Discard *must not* unpin the buffer since SCAN2 needs it!
+ *
+ *
+ * caller may pass InvalidBlockNumber as blockNum to mean do nothing
+ */
+void
+DiscardBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
+ Buffer buf_id;
+ volatile BufferDesc *buf_desc;
+
+ if (!SmgrIsTemp(reln->rd_smgr)) {
+ Assert(RelationIsValid(reln));
+ if (BlockNumberIsValid(blockNum)) {
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ BufCheckAsync(0 , reln, buf_desc , BUF_INTENTION_REJECT_UNBANK , 0 , 0 , false , 0); /* end the IO and unpin if banked */
+ pgBufferUsage.aio_read_discrd++; /* account for it */
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/*
* ReleaseBuffer -- release the pin on a buffer
*/
@@ -2520,26 +2985,23 @@ ReleaseBuffer(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
if (BufferIsLocal(buffer))
{
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]--;
return;
}
-
- bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- PrivateRefCount[buffer - 1]--;
else
- UnpinBuffer(bufHdr, false);
+ {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ BufCheckAsync(0 , 0 , bufHdr , BUF_INTENTION_REJECT_NOADJUST , 0 , 0 , false , 0 );
+ }
}
/*
@@ -2565,14 +3027,41 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
+ volatile BufferDesc *buf; /* descriptor for a shared buffer */
+
Assert(BufferIsPinned(buffer));
+
+ if (!(BufferIsLocal(buffer))) {
+ buf = &BufferDescriptors[buffer - 1];
+ LockBufHdr(buf);
+ pin_already_banked_by_me =
+ ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ }
+
+ if (!pin_already_banked_by_me) {
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
+ }
+
if (BufferIsLocal(buffer))
LocalRefCount[-buffer - 1]++;
- else
+ else {
+ if (pin_already_banked_by_me) {
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[buffer - 1]++;
}
+ }
+}
/*
* MarkBufferDirtyHint
@@ -2994,61 +3483,138 @@ WaitIO(volatile BufferDesc *buf)
*
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
- * has already started I/O on this buffer then we will block on the
+ * has already started synchronous I/O on this buffer then we will block on the
* io_in_progress lock until he's done.
*
+ * If an async io is in progress and we are doing synchronous io,
+ * then ReadBuffer waits via a call to smgrcompleteaio,
+ * and so we treat this request as if no io were in progress.
+ *
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
* so we can always tell if the work is already done.
*
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be attached to the buffer header for use with async io
+ *
* Returns TRUE if we successfully marked the buffer as I/O busy,
* FALSE if someone else already did the work.
*/
static bool
-StartBufferIO(volatile BufferDesc *buf, bool forInput)
+StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio )
{
+#ifdef USE_PREFETCH
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+#endif /* USE_PREFETCH */
+
+ if (!index_for_aio)
Assert(!InProgressBuf);
for (;;)
{
+ if (!index_for_aio) {
/*
* Grab the io_in_progress lock so that other processes can wait for
* me to finish the I/O.
*/
LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
+ }
LockBufHdr(buf);
- if (!(buf->flags & BM_IO_IN_PROGRESS))
+ /* the following test is intended to distinguish between :
+ ** . buffer which :
+ ** . has io in progress
+ ** AND is not associated with a current aio
+ ** . not the above
+ ** Here, "recent" means an aio marked by buf->freeNext <= FREENEXT_BAIOCB_ORIGIN but no longer in progress -
+ ** this situation arises when the aio has just been cancelled and this process now wishes to recycle the buffer.
+ ** In this case, the first such would-be recycler (i.e. me) must :
+ ** . avoid waiting for the cancelled aio to complete
+ ** . if not myself doing async read, then assume responsibility for posting other future readbuffers.
+ */
+ if ( (buf->flags & BM_AIO_IN_PROGRESS)
+ || (!(buf->flags & BM_IO_IN_PROGRESS))
+ )
break;
/*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
+ * The only way BM_IO_IN_PROGRESS without AIO in progress could be set when the io_in_progress
* lock isn't held is if the process doing the I/O is recovering from
* an error (see AbortBufferIO). If that's the case, we must wait for
* him to get unwedged.
*/
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
WaitIO(buf);
}
- /* Once we get here, there is definitely no I/O active on this buffer */
-
+#ifdef USE_PREFETCH
+ /* Once we get here, there is definitely no synchronous I/O active on this buffer
+ ** but if being asked to attach a BufferAiocb to the buf header,
+ ** then we must also check if there is any async io currently
+ ** in progress or pinned started by a different task.
+ */
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext);
+ if ( (buf->flags & (BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))
+ && (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN)
+ && (BAiocb->pidOfAio != this_backend_pid)
+ ) {
+ /* someone else already doing async I/O */
+ UnlockBufHdr(buf);
+ return false;
+ }
+ }
+#endif /* USE_PREFETCH */
if (forInput ? (buf->flags & BM_VALID) : !(buf->flags & BM_DIRTY))
{
/* someone else already did the I/O */
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
return false;
}
buf->flags |= BM_IO_IN_PROGRESS;
+#ifdef USE_PREFETCH
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - index_for_aio);
+ /* insist that no other buffer is using this BufferAiocb for async IO */
+ if (BAiocb->BAiocbbufh == (struct sbufdesc *)0) {
+ BAiocb->BAiocbbufh = buf;
+ }
+ if (BAiocb->BAiocbbufh != buf) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block %p to be used by %p already in use by %p"
+ ,BAiocb ,buf , BAiocb->BAiocbbufh)));
+ }
+ /* note - there is no need to register self as a dependent of BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ buf->flags |= BM_AIO_IN_PROGRESS;
+ buf->freeNext = index_for_aio;
+ /* at this point, this buffer appears to have an in-progress aio_read,
+ ** and any other task which is able to look inside the buffer might try waiting on that aio -
+ ** except we have not yet issued the aio! So we must keep the buffer header locked
+ ** from here all the way back to the BufStartAsync caller
+ */
+ } else {
+#endif /* USE_PREFETCH */
+
UnlockBufHdr(buf);
InProgressBuf = buf;
IsForInput = forInput;
+#ifdef USE_PREFETCH
+ }
+#endif /* USE_PREFETCH */
return true;
}
@@ -3058,7 +3624,7 @@ StartBufferIO(volatile BufferDesc *buf, bool forInput)
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
+ * if no async IO is in progress, then we hold the buffer's io_in_progress_lock
* The buffer is Pinned
*
* If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
@@ -3070,26 +3636,32 @@ StartBufferIO(volatile BufferDesc *buf, bool forInput)
* BM_IO_ERROR in a failure case. For successful completion it could
* be 0, or BM_VALID if we just finished reading in the page.
*/
-static void
+void
TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits)
{
- Assert(buf == InProgressBuf);
+ int flags_on_entry;
LockBufHdr(buf);
+ flags_on_entry = buf->flags;
+
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) )
+ Assert( buf == InProgressBuf );
+
Assert(buf->flags & BM_IO_IN_PROGRESS);
- buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
+ buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) ) {
InProgressBuf = NULL;
-
LWLockRelease(buf->io_in_progress_lock);
}
+}
/*
* AbortBufferIO: Clean up any active buffer I/O after an error.
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1f69c9e..4133506 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -77,6 +77,9 @@
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* We must leave some file descriptors free for system(), the dynamic loader,
@@ -1239,6 +1242,10 @@ FileClose(File file)
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
* to read into.
+ * Also note that a new, different implementation of asynchronous prefetch
+ * using librt, not libaio, is provided by the two functions following this one,
+ * FileStartaio and FileCompleteaio. These also require a buffer to be provided
+ * to read into, which the new async_io support provides.
*/
int
FilePrefetch(File file, off_t offset, int amount)
@@ -1266,6 +1273,139 @@ FilePrefetch(File file, off_t offset, int amount)
#endif
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * FileInitaio - initialize the aio subsystem max number of threads and max number of requests
+ * input parms
+ * max_aio_threads; maximum number of threads
+ * max_aio_num; maximum number of concurrent aio read requests
+ *
+ * on linux, the man page for the librt implementation of aio_init() says :
+ * This function is a GNU extension.
+ * If your posix aio does not have it, then add the following line to
+ * src/include/pg_config_manual.h
+ * #define DONT_HAVE_AIO_INIT
+ * to render it as a no-op
+ */
+void
+FileInitaio(int max_aio_threads, int max_aio_num )
+{
+#ifndef DONT_HAVE_AIO_INIT
+ struct aioinit aioinit_struct; /* structure to pass to aio_init */
+
+ aioinit_struct.aio_threads = max_aio_threads; /* maximum number of threads */
+ aioinit_struct.aio_num = max_aio_num; /* maximum number of concurrent aio read requests */
+ aioinit_struct.aio_idle_time = 1; /* we don't want to alter this but aio_init does not ignore it so set it to the default */
+ aio_init(&aioinit_struct);
+#endif /* ndef DONT_HAVE_AIO_INIT */
+ return;
+}
+
+/*
+ * FileStartaio - initiate asynchronous read of a given range of the file.
+ * The logical seek position is unaffected.
+ *
+ * use standard posix aio (librt)
+ * ASSUME BufferAiocb.aio_buf already set to -> buffer by caller
+ * return 0 if successfully started, else non-zero
+ */
+int
+FileStartaio(File file, off_t offset, int amount , char *aiocbp )
+{
+ int returnCode;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartaio: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode >= 0) {
+
+ my_aiocbp->aio_fildes = VfdCache[file].fd;
+ my_aiocbp->aio_lio_opcode = LIO_READ;
+ my_aiocbp->aio_nbytes = amount;
+ my_aiocbp->aio_offset = offset;
+ returnCode = aio_read(my_aiocbp);
+ }
+
+ return returnCode;
+}
+
+/*
+ * FileCompleteaio - complete asynchronous aio read
+ * normal_wait indicates whether to cancel or wait -
+ * 0 <=> cancel
+ * 1 <=> wait
+ *
+ * use standard posix aio (librt)
+ * return 0 if successful and did not have to wait,
+ * 1 if successful and had to wait,
+ * else 0xff
+ */
+int
+FileCompleteaio( char *aiocbp , int normal_wait )
+{
+ int returnCode;
+ int aio_errno;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+ const struct aiocb *const cblist[1];
+ int fd;
+ struct timespec my_timeout = { 0 , 10000 };
+ int max_polls;
+
+ fd = my_aiocbp->aio_fildes;
+ cblist[0] = my_aiocbp;
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* note that aio_error returns 0 if op already completed successfully */
+
+ /* first handle normal case of waiting for op to complete */
+ if (normal_wait) {
+ while (aio_errno == EINPROGRESS) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , &my_timeout);
+ while ((returnCode < 0) && (EAGAIN == errno) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , &my_timeout);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* now return_code is from aio_error */
+ if (returnCode == 0) {
+ returnCode = 1; /* successful but had to wait */
+ }
+ }
+ if (aio_errno) {
+ elog(LOG, "FileCompleteaio: %d %d", fd, returnCode);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ } else {
+ if (aio_errno == EINPROGRESS) {
+ do {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ while ((returnCode == AIO_NOTCANCELED) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ } while (aio_errno == EINPROGRESS);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ if (returnCode != 0)
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+ DO_DB(elog(LOG, "FileCompleteaio: %d %d",
+ fd, returnCode));
+
+ return returnCode;
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
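
(Aside:) for reviewers unfamiliar with the librt calls, the intended pairing of the two functions above is roughly as follows. This is illustrative only - in the patch the aiocb lives inside a BufferAiocb and aio_buf is set by bufmgr before the call; file, offset and page_buffer are assumed names.

	struct aiocb	my_aiocb;
	int				rc;

	memset(&my_aiocb, 0, sizeof(my_aiocb));
	my_aiocb.aio_buf = page_buffer;				/* caller supplies the destination buffer */

	rc = FileStartaio(file, offset, BLCKSZ, (char *) &my_aiocb);
	if (rc == 0)
	{
		/* ... useful work can proceed while the read is in flight ... */
		rc = FileCompleteaio((char *) &my_aiocb, 1);	/* 1 = wait for completion */
		/* rc: 0 = done without waiting, 1 = done after waiting, 0xff = failed */
	}
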
+
int
FileRead(File file, char *buffer, int amount)
{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 266b0da..3f13ded 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -52,6 +52,7 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
+extern pid_t this_backend_pid; /* pid of this backend */
/* GUC variables */
int DeadlockTimeout = 1000;
@@ -361,6 +362,7 @@ InitProcess(void)
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
+ this_backend_pid = getpid(); /* pid of this backend */
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3c1c81a..e0ca942 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -647,6 +647,62 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * mdinitaio() -- init the aio subsystem max number of threads and max number of requests
+ */
+void
+mdinitaio(int max_aio_threads, int max_aio_num)
+{
+ FileInitaio( max_aio_threads, max_aio_num );
+}
+
+/*
+ * mdstartaio() -- start aio read of the specified block of a relation
+ */
+void
+mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+#ifdef USE_PREFETCH
+ off_t seekpos;
+ MdfdVec *v;
+ int local_retcode;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+
+ seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ local_retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ , aiocbp);
+ if (retcode) {
+ *retcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+
+
+/*
+ * mdcompleteaio() -- complete aio read of the specified block of a relation
+ * on entry, *inoutcode should indicate :
+ * . non-0 <=> check if complete and wait if not
+ * . 0 <=> cancel io immediately
+ */
+void
+mdcompleteaio( char *aiocbp , int *inoutcode )
+{
+#ifdef USE_PREFETCH
+ int local_retcode;
+
+ local_retcode = FileCompleteaio(aiocbp, (inoutcode ? *inoutcode : 0));
+ if (inoutcode) {
+ *inoutcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
/*
* mdread() -- Read the specified block from a relation.
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d16f559..7f3e6ff 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,6 +49,12 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ void (*smgr_initaio) (int max_aio_threads, int max_aio_num);
+ void (*smgr_startaio) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode );
+ void (*smgr_completeaio) ( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
@@ -66,7 +72,11 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+ mdprefetch
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ,mdinitaio, mdstartaio, mdcompleteaio
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ , mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
mdpreckpt, mdsync, mdpostckpt
}
};
@@ -612,6 +622,35 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
(*(smgrsw[reln->smgr_which].smgr_prefetch)) (reln, forknum, blocknum);
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * smgrinitaio() -- initialize the aio subsystem max number of threads and max number of requests
+ */
+void
+smgrinitaio(int max_aio_threads, int max_aio_num)
+{
+ (*(smgrsw[0].smgr_initaio)) ( max_aio_threads, max_aio_num );
+}
+
+/*
+ * smgrstartaio() -- Initiate aio read of the specified block of a relation.
+ */
+void
+smgrstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_startaio)) (reln, forknum, blocknum , aiocbp , retcode );
+}
+
+/*
+ * smgrcompleteaio() -- Complete aio read of the specified block of a relation.
+ */
+void
+smgrcompleteaio(SMgrRelation reln, char *aiocbp , int *inoutcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_completeaio)) ( aiocbp , inoutcode );
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
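
(Aside:) the in/out convention of smgrcompleteaio's last argument mirrors mdcompleteaio above: non-zero on entry means wait for the read to finish, zero means cancel it, and on return it holds FileCompleteaio's result. A sketch of the expected usage, with page_still_wanted and aiocbp as assumed names:

	int		rc;

	smgrstartaio(reln, forkNum, blockNum, aiocbp, &rc);
	if (rc == 0)					/* the aio_read was issued */
	{
		if (page_still_wanted)
		{
			rc = 1;					/* non-zero on entry => wait for completion */
			smgrcompleteaio(reln, aiocbp, &rc);
			/* rc is now 0 or 1 on success, 0xff on failure */
		}
		else
		{
			rc = 0;					/* zero on entry => cancel the read */
			smgrcompleteaio(reln, aiocbp, &rc);
		}
	}
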
+
/*
* smgrread() -- read a particular block from a relation into the supplied
* buffer.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1d094f0..d36b77d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2264,6 +2264,25 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"max_async_io_prefetchers",
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ PGC_USERSET,
+#else
+ PGC_INTERNAL,
+#endif
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum number of background processes concurrently using asynchronous librt threads to prefetch pages into shared memory buffers."),
+ },
+ &max_async_io_prefetchers,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ -1, 0, 8192, /* boot val -1 indicates to initialize to something sensible during buf_init */
+#else
+ 0, 0, 0,
+#endif
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 743455e..e392390 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -733,6 +733,48 @@ AllocSetAlloc(MemoryContext context, Size size)
*/
fidx = AllocSetFreeIndex(size);
chunk = set->freelist[fidx];
+#ifdef MEMORY_CONTEXT_CHECKING
+ /* an instance of a segfault caused by a rogue value in set->freelist[fidx]
+ ** has been seen - check for it using a crude sanity check based on neighbouring freelists :
+ ** if at least one neighbour is sufficiently close, then pass, else fail
+ */
+ if (chunk != 0) {
+ int frx, nrx; /* frx is index, nrx is index of failing neighbour for errmsg */
+ for (nrx = -1, frx = 0; (frx < ALLOCSET_NUM_FREELISTS); frx++) {
+ if ( (frx != fidx) /* not the chosen one */
+ && ( ( (unsigned long)(set->freelist[frx]) ) != 0 ) /* not empty */
+ ) {
+ if ( ( (unsigned long)chunk < ( ( (unsigned long)(set->freelist[frx]) ) / 2 ) )
+ && ( ( (unsigned long)(set->freelist[frx]) ) < 0x4000000 )
+ /*** || ( (unsigned long)chunk > ( ( (unsigned long)(set->freelist[frx]) ) * 2 ) ) ***/
+ ) {
+ nrx = frx;
+ } else {
+ nrx = -1;
+ break;
+ }
+ }
+ }
+
+ if (nrx >= 0) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d compared with neighbour %p whose chunksize %d"
+ , chunk , fidx , set->freelist[nrx] , set->freelist[nrx]->size);
+ chunk = NULL;
+ }
+ }
+#else /* if not MEMORY_CONTEXT_CHECKING make very simple-minded check*/
+ if ( (chunk != 0) && ( (unsigned long)chunk < 0x40000 ) ) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d"
+ , chunk , fidx);
+ chunk = NULL;
+ }
+#endif
if (chunk != NULL)
{
Assert(chunk->size >= size);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..3ce5618 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -175,7 +175,7 @@ extern void heap_page_prune_execute(Buffer buffer,
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f281759..34cf496 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -19,6 +19,7 @@
#include "access/sdir.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "access/relscan.h"
#include "catalog/pg_index.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
@@ -524,6 +525,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -603,6 +605,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -655,7 +666,11 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P);
+extern struct pfch_index_item* _bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+extern int _bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status);
+extern void _bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..cbf3100 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -44,6 +44,24 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+
+ /* Unread_Pfetched is a "mostly" circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ ** "mostly" means that there may be gaps caused by storing entries for blocks which do not need to be discarded -
+ ** these are indicated by blockno = InvalidBlockNumber, and these slots are reused when found.
+ */
+ BlockNumber *rs_Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int rs_Unread_Pfetched_next; /* where the next unread blockno probably is relative to start --
+ ** this is only a hint which may be temporarily stale.
+ */
+ unsigned int rs_Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
+
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -55,6 +73,27 @@ typedef struct HeapScanDescData
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
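
(Aside:) since DiscardBuffer() treats InvalidBlockNumber as a no-op, and only ever releases a banked pin, an end-of-scan teardown over this "mostly circular" list can simply walk every slot; gap slots and already-read blocks are harmless. A sketch, assuming the list is target_prefetch_pages entries long (names other than the struct fields are illustrative):

	int		i;

	for (i = 0; i < target_prefetch_pages; i++)
		DiscardBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_Unread_Pfetched_base[i]);
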
+/* pfch_index_items track prefetched and unread index pages - chunks of blocknumbers are chained in a singly-linked list from scan->pfch_index_page_list */
+struct pfch_index_item { /* index-relation BlockIds which we will/have prefetched */
+ BlockNumber pfch_blocknum; /* Blocknum which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+struct pfch_block_item {
+ struct BlockIdData pfch_blockid; /* BlockId which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+/* pfch_index_page_items track prefetched and unread index pages -
+** chunks of blocknumbers are chained backwards (newest first, oldest last)
+** in singly-linked list from scan->pfch_index_item_list
+*/
+struct pfch_index_pagelist { /* index-relation BlockIds which we will/have prefetched */
+ struct pfch_index_pagelist* pfch_index_pagelist_next; /* pointer to next chunk if any */
+ unsigned int pfch_index_item_count; /* number of used entries in this chunk */
+ struct pfch_index_item pfch_indexid[1]; /* in-line list of Blocknums which we will/have prefetched and whether to be discarded */
+};
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -75,8 +114,15 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_page_list; /* array of index-relation BlockIds which we will/have prefetched */
+ struct pfch_block_item* pfch_block_item_list; /* array of heap-relation BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
@@ -91,6 +137,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
diff --git a/src/include/catalog/pg_am.h b/src/include/catalog/pg_am.h
index 759ea70..dc461b6 100644
--- a/src/include/catalog/pg_am.h
+++ b/src/include/catalog/pg_am.h
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 72170af..f61c05a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -536,6 +536,12 @@ DESCR("convert float4 to int4");
DATA(insert OID = 330 ( btgettuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ btgettuple _null_ _null_ _null_ ));
DESCR("btree(internal)");
+
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+DATA(insert OID = 3251 ( btpeeknexttuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "2281" _null_ _null_ _null_ _null_ btpeeknexttuple _null_ _null_ _null_ ));
+DESCR("btree(internal)");
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
DATA(insert OID = 636 ( btgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ btgetbitmap _null_ _null_ _null_ ));
DESCR("btree(internal)");
DATA(insert OID = 331 ( btinsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ btinsert _null_ _null_ _null_ ));
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 8ec1033..8c86781 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -28,8 +28,18 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+
+ long aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ long aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
} BufferUsage;
/* Flag bits included in InstrAlloc's instrument_options bitmask */
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 444d4d8..4998dbc 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -41,6 +41,16 @@ typedef struct
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
+#ifdef USE_PREFETCH
+ /* Unread_Pfetched is a circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ */
+ BlockNumber *Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
OffsetNumber offsets[1]; /* VARIABLE LENGTH ARRAY */
} TBMIterateResult; /* VARIABLE LENGTH STRUCT */
@@ -62,5 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
extern void tbm_end_iterate(TBMIterator *iterator);
-
+extern void tbm_zero(TBMIterator *iterator); /* zero list of prefetched and unread blocknos */
+extern void tbm_add(TBMIterator *iterator, BlockNumber blockno); /* add this blockno to list of prefetched and unread blocknos */
+extern void tbm_subtract(TBMIterator *iterator, BlockNumber blockno); /* remove this blockno from list of prefetched and unread blocknos */
+extern TBMIterateResult *tbm_locate_IterateResult(TBMIterator *iterator); /* locate the TBMIterateResult of an iterator */
#endif /* TIDBITMAP_H */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5ff9e41..01cfa81 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -1,4 +1,4 @@
-/* src/include/pg_config.h.in. Generated from configure.in by autoheader. */
+/* src/include/pg_config.h.in. Generated from - by autoheader. */
/* Define to the type of arg 1 of 'accept' */
#undef ACCEPT_TYPE_ARG1
@@ -748,6 +748,10 @@
/* Define to the appropriate snprintf format for unsigned 64-bit ints. */
#undef UINT64_FORMAT
+/* Define to select librt-style async io and the gcc atomic compare_and_swap.
+ */
+#undef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index d1f99fb..6040a5d 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -138,9 +138,11 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * might in future be support for alternative low-level prefetch APIs --
+ * -- update October 2013 -- now there is such a new prefetch capability --
+ * async_io into postgres buffers - configuration parameter max_async_io_threads)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
#define USE_PREFETCH
#endif
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..1482c5a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,7 +22,9 @@
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Flags for buffer descriptors
@@ -38,8 +40,23 @@
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1 << 7) /* must write for checkpoint */
-#define BM_PERMANENT (1 << 8) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1 << 8) /* permanent relation (not unlogged) */
+#define BM_AIO_IN_PROGRESS (1 << 9) /* aio in progress */
+#define BM_AIO_PREFETCH_PIN_BANKED (1 << 10) /* pinned when prefetch issued
+ ** and this pin is banked - i.e.
+ ** redeemable by the next use by same task
+ ** note that for any one buffer, a pin can be banked
+ ** by at most one process globally,
+ ** that is, only one process may bank a pin on the buffer
+ ** and it may do so only once (may not be stacked)
+ */
+
+/*********
+for asynchronous aio-read prefetching, two golden rules concerning buffer pinning and buffer-header flags must be observed:
+ R1. a buffer marked as BM_AIO_IN_PROGRESS must be pinned by at least one backend
+ R2. a buffer marked as BM_AIO_PREFETCH_PIN_BANKED must be pinned by the backend identified by
+ (buf->flags & BM_AIO_IN_PROGRESS) ? (((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio) : (-(buf->freeNext))
+*********/
typedef bits16 BufFlags;
@@ -140,17 +157,83 @@ typedef struct sbufdesc
BufFlags flags; /* see bit definitions above */
uint16 usage_count; /* usage counter for clock sweep code */
unsigned refcount; /* # of backends holding pins on buffer */
- int wait_backend_pid; /* backend PID of pin-count waiter */
+ int wait_backend_pid; /* if flags & BM_PIN_COUNT_WAITER
+ ** then backend PID of pin-count waiter
+ ** else not set
+ */
slock_t buf_hdr_lock; /* protects the above fields */
int buf_id; /* buffer's index number (from 0) */
- int freeNext; /* link in freelist chain */
+ int volatile freeNext; /* overloaded and much-abused field :
+ ** EITHER
+ ** if >= 0
+ ** then link in freelist chain
+ ** OR
+ ** if < 0
+ ** then EITHER
+ ** if flags & BM_AIO_IN_PROGRESS
+ ** then negative of (the index of the aiocb in the BufferAiocbs array + 3)
+ ** else if flags & BM_AIO_PREFETCH_PIN_BANKED
+ ** then -(pid of task that issued aio_read and pinned buffer)
+ ** else one of the special values -1 or -2 listed below
+ */
LWLock *io_in_progress_lock; /* to wait for I/O to complete */
LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
+/* structures for control blocks for our implementation of async io */
+
+/* if USE_AIO_ATOMIC_BUILTIN_COMP_SWAP is not defined, the following struct is not put into use at runtime
+** but it is easier to let the compiler find the definition but hide the reference to aiocb
+** which is the only type it would not understand
+*/
+
+struct BufferAiocb {
+ struct BufferAiocb volatile * volatile BAiocbnext; /* next free entry or value of BAIOCB_OCCUPIED means in use */
+ struct sbufdesc volatile * volatile BAiocbbufh; /* there can be at most one BufferDesc marked BM_AIO_IN_PROGRESS
+ ** and using this BufferAiocb -
+ ** if there is one, BAiocbbufh points to it, else BAiocbbufh is zero
+ ** NOTE BAiocbbufh should be zero for every BufferAiocb on the free list
+ */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct aiocb volatile BAiocbthis; /* the aio library's control block for one async io */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int volatile BAiocbDependentCount; /* count of tasks who depend on this BufferAiocb
+ ** in the sense that they are waiting for io completion.
+ ** only a Dependent may move the BufferAiocb onto the freelist
+ ** and only when that Dependent is the *only* Dependent (count == 1)
+ ** BAiocbDependentCount is protected by bufferheader spinlock
+ ** and must be updated only when that spinlock is held
+ */
+ pid_t volatile pidOfAio; /* pid of backend who issued an aio_read using this BAiocb -
+ ** this backend must have pinned the associated buffer.
+ */
+};
+
+#define BAIOCB_OCCUPIED 0x75f1 /* distinct indicator of a BufferAiocb.BAiocbnext that is NOT on free list */
+#define BAIOCB_FREE 0x7b9d /* distinct indicator of a BufferAiocb.BAiocbbufh that IS on free list */
+
+struct BAiocbAnchor { /* anchor for all control blocks pertaining to aio */
+ volatile struct BufferAiocb* BufferAiocbs; /* aiocbs ... */
+ volatile struct BufferAiocb* volatile FreeBAiocbs; /* ... and their free list */
+};
+
+/* values for BufCheckAsync input and retcode */
+#define BUF_INTENTION_WANT 1 /* wants the buffer, wait for in-progress aio and then pin */
+#define BUF_INTENTION_REJECT_KEEP_PIN -1 /* pin already held, do not unpin */
+#define BUF_INTENTION_REJECT_OBTAIN_PIN -2 /* obtain pin, caller wants it for same buffer */
+#define BUF_INTENTION_REJECT_FORGET -3 /* unpin and tell resource owner to forget */
+#define BUF_INTENTION_REJECT_NOADJUST -4 /* unpin and call ResourceOwnerForgetBuffer */
+#define BUF_INTENTION_REJECT_UNBANK -5 /* unpin only if pin banked by caller */
+
+#define BUF_INTENT_RC_CHANGED_TAG -5
+#define BUF_INTENT_RC_BADPAGE -4
+#define BUF_INTENT_RC_INVALID_AIO -3 /* invalid and aio was in progress */
+#define BUF_INTENT_RC_INVALID_NO_AIO -1 /* invalid and no aio was in progress */
+#define BUF_INTENT_RC_VALID 1
+
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
/*
@@ -159,6 +242,7 @@ typedef struct sbufdesc
*/
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+#define FREENEXT_BAIOCB_ORIGIN (-3)
/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..c652841 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -41,6 +41,7 @@ typedef enum
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
+ ,RBM_NOREAD_FOR_PREFETCH /* Don't read from disk, don't zero buffer, find buffer only */
} ReadBufferMode;
/* in globals.c ... this duplicates miscadmin.h */
@@ -57,6 +58,9 @@ extern int target_prefetch_pages;
extern PGDLLIMPORT char *BufferBlocks;
extern PGDLLIMPORT int32 *PrivateRefCount;
+/* in buf_async.c */;
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
@@ -159,9 +163,15 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
- * prototypes for functions in bufmgr.c
+ * prototypes for external functions in bufmgr.c and buf_async.c
*/
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
+extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum , BufferAccessStrategy strategy);
+/* return code is an int bitmask : */
+#define PREFTCHRC_BUF_PIN_INCREASED 0x01 /* pin count on buffer has been increased by 1 */
+#define PREFTCHRC_BLK_ALREADY_PRESENT 0x02 /* block was already present in a buffer */
+
+extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a6df8fb..d394c6a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -69,6 +69,11 @@ extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void FileInitaio(int max_aio_threads, int max_aio_num );
+extern int FileStartaio(File file, off_t offset, int amount , char *aiocbp);
+extern int FileCompleteaio( char *aiocbp , int normal_wait );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern int FileRead(File file, char *buffer, int amount);
extern int FileWrite(File file, char *buffer, int amount);
extern int FileSync(File file);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index ba7c909..5e8d645 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,12 @@ extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void smgrinitaio(int max_aio_threads, int max_aio_num);
+extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode);
+extern void smgrcompleteaio( SMgrRelation reln, char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
@@ -118,6 +124,11 @@ extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void mdinitaio(int max_aio_threads, int max_aio_num);
+extern void mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode );
+extern void mdcompleteaio( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index af4f53f..720d3f3 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -61,6 +61,7 @@ typedef struct RelationAmInfo
FmgrInfo ammarkpos;
FmgrInfo amrestrpos;
FmgrInfo amcanreturn;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
Claudio Freire <klaussfreire@gmail.com> writes:
Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)
"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
regards, tom lane
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Claudio Freire <klaussfreire@gmail.com> writes:
Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)
"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
It's not really the tuple, just the tid
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Claudio Freire <klaussfreire@gmail.com> writes:
Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)
"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
It's not really the tuple, just the tid
And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.
Though, your comment does illustrate the need to document that on
ampeeknexttuple, for future users.
Date: Thu, 29 May 2014 18:00:28 -0300
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
From: klaussfreire@gmail.com
To: hlinnakangas@vmware.com
CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 11:34 PM, Claudio Freire wrote:
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 04:12 PM, John Lumby wrote:
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio
implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case,. since librt takes care of the
waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
Thanks for checking and thanks for coming up with that test program.
However, yes, it really does work -- always (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However, the rule is, call aio_suspend *after* aio_error
and *only* if aio_error returns EINPROGRESS.
See the code changes to fd.c function FileCompleteaio()
to see how we have done it. And I am attaching corrected version
of your test program which runs just fine.
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling
aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local
mutex,
which does not work from a different process.
Even if it worked on Linux today, it would be a bad idea to rely on it
from
a portability point of view. No, the only sane way to make this work is
that
the process that initiates an I/O request is responsible for completing
it.
If another process needs to wait for an async I/O to complete, we must
use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
But calls to it are timeouted by 10us, effectively turning the thing
into polling mode.
We don't want polling... And even if we did, calling aio_suspend() in a way
that's known to be broken, in a loop, is a pretty crappy way of polling.
Well, as mentioned earlier, it is not broken. Whether it is efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can verify that.
Where are you (Claudio) seeing 10us?
Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)
Attachments:
Claudio Freire <klaussfreire@gmail.com> writes:
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
It's not really the tuple, just the tid
And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.
Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
guaranteeing that the *previous* tid will still mean something by the time
you arrive at its heap page. I presume that the ampeeknexttuple call is
issued before trying to visit the heap (otherwise you're not actually
getting much I/O overlap), so I think there's a real risk here.
Having said that, it's probably OK as long as this mode is only invoked
for user queries (with MVCC snapshots) and not for system indexscans.
regards, tom lane
From: tgl@sss.pgh.pa.us
To: klaussfreire@gmail.com
CC: hlinnakangas@vmware.com; johnlumby@hotmail.com; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
Date: Thu, 29 May 2014 17:56:57 -0400
Claudio Freire <klaussfreire@gmail.com> writes:
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
It's not really the tuple, just the tid
And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.
Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
I think we are ok as peeknexttuple (yes bad name, sorry, can change it ...
never advances to another page :
* btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list
* without reading a new index page
* and without causing any side-effects such as altering values in control blocks
* if found, store blocknum in next element of pfch_list
guaranteeing that the *previous* tid will still mean something by the time
you arrive at its heap page. I presume that the ampeeknexttuple call is
issued before trying to visit the heap (otherwise you're not actually
getting much I/O overlap), so I think there's a real risk here.
Having said that, it's probably OK as long as this mode is only invoked
for user queries (with MVCC snapshots) and not for system indexscans.
regards, tom lane
On Thu, May 29, 2014 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Claudio Freire <klaussfreire@gmail.com> writes:
On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).
It's not really the tuple, just the tid
And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.
Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
guaranteeing that the *previous* tid will still mean something by the time
No, no... that's exactly why a new regproc is needed, because for
prefetching, we need to get the next tid that satisfies some
conditions *without* walking the index.
This, in nbtree, only looks through the tid array to find the suitable
tid, or just return false if the array is exhausted.
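
For illustration, here is a minimal, standalone sketch of that kind of lookup. The types and names are toys invented for this sketch (it is not the patch's actual btpeeknexttuple() and uses none of PostgreSQL's data structures); it only shows the behaviour described above: scan the already-fetched in-memory array, skip blocks already in pfch_list, never step to another index page.

/* compile with: cc peek_sketch.c */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/*
 * Look through an in-memory array of heap block numbers (one per index
 * tuple already read from the current index page) for the next block not
 * yet present in pfch_list.  If found, append it to pfch_list and return
 * true; if the array is exhausted, return false -- never read a new page.
 */
static bool
peek_next_block(const BlockNumber *tid_blocks, int ntids,
                BlockNumber *pfch_list, int *pfch_used, int pfch_size)
{
    for (int i = 0; i < ntids; i++)
    {
        bool seen = false;

        for (int j = 0; j < *pfch_used; j++)
        {
            if (pfch_list[j] == tid_blocks[i])
            {
                seen = true;
                break;
            }
        }
        if (!seen)
        {
            if (*pfch_used >= pfch_size)
                return false;           /* prefetch list is full */
            pfch_list[(*pfch_used)++] = tid_blocks[i];
            return true;
        }
    }
    return false;                       /* nothing new on this page */
}

int
main(void)
{
    BlockNumber tids[] = {10, 10, 42, 42, 7};
    BlockNumber pfch[4];
    int used = 0;

    while (peek_next_block(tids, 5, pfch, &used, 4))
        printf("would prefetch heap block %u\n", pfch[used - 1]);
    return 0;
}

The real callback additionally records, for each entry, whether the prefetched block should be discarded when the scan is closed (the pfch_discard flag in the relscan.h hunk earlier in the patch).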
Having said that, it's probably OK as long as this mode is only invoked
for user queries (with MVCC snapshots) and not for system indexscans.
I think system index scans will also invoke this. There's no rule
excluding that possibility.
On Thu, May 29, 2014 at 7:11 PM, John Lumby <johnlumby@hotmail.com> wrote:
Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
I think we are ok as peeknexttuple (yes bad name, sorry, can change it
...
never advances to another page :
* btpeeknexttuple() -- peek at the next tuple different from any
blocknum in pfch_list
* without reading a new index page
* and without causing any side-effects such as
altering values in control blocks
* if found, store blocknum in next element of pfch_list
Yeah, I was getting to that conclusion myself too.
We could call it amprefetchnextheap, since it does just prefetch, and
is good for nothing *but* prefetch.
Hi,
On 2014-05-29 17:53:51 -0400, John Lumby wrote:
to see how we have done it. And I am attaching corrected version
of your test program which runs just fine.
It's perfectly fine to not be up to the coding style at this point, but
trying to adhere to it to some degree will make code review later less
painfull...
* comments with **
* line length
* tabs vs spaces
* ...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Date: Thu, 29 May 2014 18:00:28 -0300
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
From: klaussfreire@gmail.com
To: hlinnakangas@vmware.com
CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
Even if it worked on Linux today, it would be a bad idea to rely on it
from
a portability point of view. No, the only sane way to make this work is
that
the process that initiates an I/O request is responsible for completing
it.
I meant to add - it is really a significant benefit that a bkend
can wait on the aio of a different bkend's original prefeetching aio_read.
Remember that we check completion only when the bkend decides it really
wants the block in a buffer, i.e ReadBuffer and friends,
which might be a very long time after it had issued the prefetch request,
or even never (see below). We don't want other bkends which want that
block to have to wait for the originator to get around to reading it.
*Especially* since the originator may *never* read it if it quits its scan early
leaving prefetched but unread blocks behind. (Which is also taken
care of in the patch).
If another process needs to wait for an async I/O to complete, we must
use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
On Thu, May 29, 2014 at 6:53 PM, John Lumby <johnlumby@hotmail.com> wrote:
Well, as mentioned earlier, it is not broken. Whether it is efficient
I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can
verify that.
Where are you (Claudio) seeing 10us?
fd.c, in FileCompleteaio, sets timeout to:
my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
Which is 10k ns, which is 10 us.
It loops 256 times at most, so it's polling 256 times with a 10 us
timeout. Sounds wasteful.
I'd:
1) If it's the same process, wait for the full timeout (no looping).
If you have to loop (EAGAIN or EINTR - which I just noticed you don't
check for), that's ok.
2) If it's not, just fall through, don't wait, issue the I/O. The
kernel will merge the requests.
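
A minimal, self-contained sketch of that suggested control flow might look like the following. The single-process demo, the file name and the i_issued_it flag are inventions for the sketch (it is not the patch's FileCompleteaio()); it only shows the two cases above: the originator blocks without a timeout, looping only on EINTR, while anyone else returns at once and simply re-reads. Build with -lrt.

/* compile with: cc wait_sketch.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char buf[8192];

/*
 * Wait policy for a prefetch read described by *cb: if this backend issued
 * the aio_read itself, block until it finishes, looping only on EINTR;
 * if another backend issued it, do not wait at all, so the caller just
 * performs its own synchronous read and lets the kernel merge the requests.
 */
static int
wait_for_prefetch(struct aiocb *cb, int i_issued_it)
{
    const struct aiocb *list[1] = {cb};

    if (!i_issued_it)
        return aio_error(cb);   /* possibly still EINPROGRESS: caller re-reads */

    while (aio_error(cb) == EINPROGRESS)
    {
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
            break;              /* real failure: caller sees it via aio_error */
    }
    return aio_error(cb);
}

int
main(void)
{
    struct aiocb cb;
    int fd = open("/etc/hosts", O_RDONLY);      /* any readable file will do */

    if (fd < 0)
        return 1;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);

    if (aio_read(&cb) != 0)
        return 1;
    if (wait_for_prefetch(&cb, 1) == 0)
        printf("read %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}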
On Thu, May 29, 2014 at 7:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
1) If it's the same process, wait for the full timeout (no looping).
If you have to loop (EAGAIN or EINTR - which I just noticed you don't
check for), that's ok.
Sorry, meant to say just looping on EINTR.
About the style guidelines, no, I just copy the style of surrounding
code usually.
On 05/30/2014 12:53 AM, John Lumby wrote:
Date: Thu, 29 May 2014 18:00:28 -0300
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
From: klaussfreire@gmail.com
To: hlinnakangas@vmware.com
CC: johnlumby@hotmail.com; pgsql-hackers@postgresql.org
On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/29/2014 11:34 PM, Claudio Freire wrote:
On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 05/28/2014 04:12 PM, John Lumby wrote:
On 05/28/2014 11:52 PM, John Lumby wrote:
The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.
It works on linux. Actually this ability allows the asyncio
implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case, since librt takes care of the
waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()
[checks]. No, it doesn't work. See attached test program.
Thanks for checking and thanks for coming up with that test program.
However, yes, it really does work -- always (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However, the rule is, call aio_suspend *after* aio_error
and *only* if aio_error returns EINPROGRESS.
I see no such rule in the man pages of any of the functions involved.
And it wouldn't matter anyway; the behavior is exactly the same if you
aio_error() first.
See the code changes to fd.c function FileCompleteaio()
to see how we have done it. And I am attaching corrected version
of your test program which runs just fine.
As Claudio mentioned earlier, the way FileCompleteaio() uses aio_suspend
is just a complicated way of polling. You might as well replace the
aio_suspend() calls with pg_usleep().
It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when
the I/O is finished, the worker thread fills them in. aio_error() and
aio_return() just return the values of those fields, so calling
aio_error()
or aio_return() do in fact happen to work from a different process.
aio_suspend(), however, is implemented by sleeping on a process-local
mutex,
which does not work from a different process.Even if it worked on Linux today, it would be a bad idea to rely on it
from
a portability point of view. No, the only sane way to make this work is
that
the process that initiates an I/O request is responsible for completing
it.
If another process needs to wait for an async I/O to complete, we must
use
some other means to do the waiting. Like the io_in_progress_lock that we
already have, for the same purpose.
But calls to it are timeouted by 10us, effectively turning the thing
into polling mode.
We don't want polling... And even if we did, calling aio_suspend() in a way
that's known to be broken, in a loop, is a pretty crappy way of polling.
Well, as mentioned earlier, it is not broken. Whether it is efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can verify that.
I don't see the point of pursuing this design further. Surely we don't
want to use polling here, and you're relying on undefined behavior
anyway. I'm pretty sure aio_return/aio_error won't work from a different
process on all platforms, even if it happens to work on Linux. Even on
Linux, it might stop working if the underlying implementation changes
from the glibc pthread emulation to something kernel-based.
- Heikki
On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
We don't want polling... And even if we did, calling aio_suspend() in a
way
that's known to be broken, in a loop, is a pretty crappy way of polling.
Well, as mentioned earlier, it is not broken. Whether it is
efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can
verify that.
I don't see the point of pursuing this design further. Surely we don't want
to use polling here, and you're relying on undefined behavior anyway. I'm
pretty sure aio_return/aio_error won't work from a different process on all
platforms, even if it happens to work on Linux. Even on Linux, it might stop
working if the underlying implementation changes from the glibc pthread
emulation to something kernel-based.
I'll try to do some measuring of performance with:
a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
a-c will be the fastest, d might take some while.
I'll let you know of the results as I get them.
On 05/30/14 09:36, Claudio Freire wrote:
On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I see no such rule in the man pages of any of the functions
involved. And it wouldn't matter anyway; the behavior is exactly the
same if you aio_error() first.
You are completely correct and I was wrong.
For the case of different-process,
It is not the order of the aio calls that makes it work/not work,
it is whether it polls with a timeout.
I think I was confusing this with a rule concerning order for
calling aio_return.
We don't want polling... And even if we did, calling aio_suspend() in a
way
that's known to be broken, in a loop, is a pretty crappy way of polling.
I have made a change to the complete_aio functions
to use polling *only* if the caller is not the originator of the aio_read.
If it is the originator, it now calls aio_suspend with no timeout
(i.e. wait till complete).
The loop also now checks for the EINTR case which someone
pointed out.
In my test runs, with a debug message in FileCompleteaio to tell me
whether caller is or is not the originator of the aio_read,
I see > 99.8% of calls are from originator and only < 0.2% are not.
e.g. (samples from two backends)
different : 10
same : 11726
different : 38
same : 12105
new patch based on today 140531 is attached,
This improves one of my benchmarks by about 10% throughput,
and now shows an overall 23% improvement relative to existing code with
posix_fadvise.
So hopefully this addresses your performance concern.
If you look at the new patch, you'll see that for the different-pid case,
I still call aio_suspend with a timeout.
As you or Claudio pointed out earlier, it could just as well sleep
for the same timeout,
but the small advantage of calling aio_suspend is if the io completed
just between
the aio_error returning EINPROGRESS and the aio_suspend call.
Also it makes the code simpler. In fact this change is quite small,
just a few lines
in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c
Based on this, I think it is not necessary to get rid of the polling
altogether
(and in any case, as far as I can see, very difficult).
Well, as mentioned earlier, it is not broken. Whether it is
efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can
verify that.
I don't see the point of pursuing this design further. Surely we don't want
to use polling here, and you're relying on undefined behavior anyway. I'm
pretty sure aio_return/aio_error won't work from a different process on all
platforms, even if it happens to work on Linux. Even on Linux, it might stop
working if the underlying implementation changes from the glibc pthread
emulation to something kernel-based.
Good point. I have included the guts of your little test program
(modified to do polling) into the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answer your concern about robustness,
as at least now if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.
I'll try to do some measuring of performance with:
a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
a-c will be the fastest, d might take some while.
I'll let you know of the results as I get them.
Claudio, I am not quite sure if what I am submitting now is
quite the same as any of yours. As I promised before, but have
not yet done, I will package one or two of my benchmarks and
send them in.
Attachments:
postgresql-9.4.140531.async_io_prefetching.patch (text/x-patch)
--- configure.in.orig 2014-05-31 17:19:07.689208337 -0400
+++ configure.in 2014-05-31 19:53:08.552072242 -0400
@@ -1771,6 +1771,12 @@ operating system; use --disable-thread-
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+ AC_DEFINE(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP, 1, [Define to select librt-style async io and the gcc atomic compare_and_swap.])
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
--- contrib/pg_prewarm/pg_prewarm.c.orig 2014-05-31 17:19:07.705208380 -0400
+++ contrib/pg_prewarm/pg_prewarm.c 2014-05-31 19:53:08.620072414 -0400
@@ -159,7 +159,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
*/
for (block = first_block; block <= last_block; ++block)
{
- PrefetchBuffer(rel, forkNumber, block);
+ PrefetchBuffer(rel, forkNumber, block, 0);
++blocks_done;
}
#else
--- contrib/pg_stat_statements/pg_stat_statements--1.3.sql.orig 2014-05-31 17:21:19.249556268 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.3.sql 2014-05-31 19:53:08.652072495 -0400
@@ -0,0 +1,52 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_stat_statements VERSION '1.3'" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_stat_statements_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+-- Register a view on the function for ease of use.
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
+
+-- Don't want this to be available to non-superusers.
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
--- contrib/pg_stat_statements/Makefile.orig 2014-05-31 17:19:07.705208380 -0400
+++ contrib/pg_stat_statements/Makefile 2014-05-31 19:53:08.664072525 -0400
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
OBJS = pg_stat_statements.o
EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
+DATA = pg_stat_statements--1.3.sql pg_stat_statements--1.2--1.3.sql \
+ pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
ifdef USE_PGXS
--- contrib/pg_stat_statements/pg_stat_statements.c.orig 2014-05-31 17:19:07.705208380 -0400
+++ contrib/pg_stat_statements/pg_stat_statements.c 2014-05-31 19:53:08.700072616 -0400
@@ -117,6 +117,7 @@ typedef enum pgssVersion
PGSS_V1_0 = 0,
PGSS_V1_1,
PGSS_V1_2
+ ,PGSS_V1_3
} pgssVersion;
/*
@@ -148,6 +149,16 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ int64 aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb control block */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which in-progress aio cancelled and disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -274,7 +285,7 @@ void _PG_init(void);
void _PG_fini(void);
PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
-PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
PG_FUNCTION_INFO_V1(pg_stat_statements);
static void pgss_shmem_startup(void);
@@ -1026,7 +1037,25 @@ pgss_ProcessUtility(Node *parsetree, con
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage_start.aio_read_noneed;
+ bufusage.aio_read_discrd =
+ pgBufferUsage.aio_read_discrd - bufusage_start.aio_read_discrd;
+ bufusage.aio_read_forgot =
+ pgBufferUsage.aio_read_forgot - bufusage_start.aio_read_forgot;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage_start.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage_start.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage_start.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage_start.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage_start.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -1041,6 +1070,7 @@ pgss_ProcessUtility(Node *parsetree, con
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1224,6 +1254,16 @@ pgss_store(const char *query, uint32 que
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_discrd += bufusage->aio_read_discrd;
+ e->counters.aio_read_forgot += bufusage->aio_read_forgot;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1257,7 +1297,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
#define PG_STAT_STATEMENTS_COLS_V1_0 14
#define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS_V1_2 19
-#define PG_STAT_STATEMENTS_COLS 19 /* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_3 27
+#define PG_STAT_STATEMENTS_COLS 27 /* maximum of above */
/*
* Retrieve statement statistics.
@@ -1270,6 +1311,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
* function. Unfortunately we weren't bright enough to do that for 1.1.
*/
Datum
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+{
+ bool showtext = PG_GETARG_BOOL(0);
+
+ pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
+
+ return (Datum) 0;
+}
+
+Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
bool showtext = PG_GETARG_BOOL(0);
@@ -1358,6 +1409,10 @@ pg_stat_statements_internal(FunctionCall
if (api_version != PGSS_V1_2)
elog(ERROR, "incorrect number of output arguments");
break;
+ case PG_STAT_STATEMENTS_COLS_V1_3:
+ if (api_version != PGSS_V1_3)
+ elog(ERROR, "incorrect number of output arguments");
+ break;
default:
elog(ERROR, "incorrect number of output arguments");
}
@@ -1534,11 +1589,24 @@ pg_stat_statements_internal(FunctionCall
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
+
+ if (api_version >= PGSS_V1_3)
+ {
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_discrd);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_forgot);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+ }
}
Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
+ api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
-1 /* fail if you forget to update this assert */ ));
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql.orig 2014-05-31 17:21:19.249556268 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql 2014-05-31 19:53:08.724072677 -0400
@@ -0,0 +1,51 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
--- postgresql-prefetching-asyncio.README.orig 2014-05-31 17:21:19.249556268 -0400
+++ postgresql-prefetching-asyncio.README 2014-05-31 19:53:08.768072788 -0400
@@ -0,0 +1,544 @@
+Postgresql -- Extended Prefetching using Asynchronous IO
+============================================================
+
+Postgresql currently (9.3.4) provides a limited prefetching capability
+using posix_fadvise to give hints to the Operating System kernel
+about which pages it expects to read in the near future.
+This capability is used only during the heap-scan phase of bitmap-index scans.
+It is controlled via the effective_io_concurrency configuration parameter.
+
+This capability is now extended in two ways :
+ . use asynchronous IO into Postgresql shared buffers as an
+ alternative to posix_fadvise
+ . Implement prefetching in other types of scan :
+ . non-bitmap (i.e. simple) index scans - index pages
+ currently only for B-tree indexes.
+ (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
+ . non-bitmap (i.e. simple) index scans - heap pages
+ currently only for B-tree indexes.
+ . simple heap scans
+
+Posix asynchronous IO is chosen as the function library for asynchronous IO,
+since this is well supported and also fits very well with the model of
+the prefetching process, particularly as regards checking for completion
+of an asynchronous read. On linux, Posix asynchronous IO is provided
+in the librt library. librt uses independently-schedulable threads to
+achieve the asynchronicity, rather than kernel functionality.
+
+In this implementation, use of asynchronous IO is limited to prefetching
+while performing one of the three types of scan
+ . B-tree bitmap index scan - heap pages (as already exists)
+ . B-tree non-bitmap (i.e. simple) index scans - index and heap pages
+ . simple heap scans
+on permanent relations. It is not used on temporary tables nor for writes.
+
+The advantages of Posix asynchronous IO into shared buffers
+compared to posix_fadvise are :
+ . Beneficial for non-sequential access patterns as well as sequential
+ . No restriction on the kinds of IO which can be used
+ (other kinds of asynchronous IO impose restrictions such as
+ buffer alignment, use of non-buffered IO).
+ . Does not interfere with standard linux kernel read-ahead functionality.
+ (It has been stated in
+ www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
+ that :
+ "the kernel stops doing read-ahead when a call to posix_fadvise comes.
+ I noticed the performance hit, and checked the kernel's code.
+ It effectively changes the prediction mode from sequential to fadvise,
+ negating the (assumed) kernel's prefetch logic")
+ . When the read request is issued after a prefetch has completed,
+   there is no delay associated with a kernel call to copy the page from
+   kernel page buffers into the Postgresql shared buffer,
+   since it is already there.
+ Also, in a memory-constrained environment, there is a greater
+ probability that the prefetched page will "stick" in memory
+ since the linux kernel victimizes the filesystem page cache in preference
+ to swapping out user process pages.
+ . Statistics on prefetch success can be gathered (see "Statistics" below)
+ which helps the administrator to tune the prefetching settings.
+
+These benefits are most likely to be obtained in a system whose usage profile
+(e.g. from iostat) shows:
+ . high IO wait from mostly-read activity
+ . disk access pattern is not entirely sequential
+ (so kernel readahead can't predict it but postgresql can)
+ . sufficient spare idle CPU to run the librt pthreads
+ or, stated another way, the CPU subsystem is relatively powerful
+ compared to the disk subsystem.
+In such ideal conditions, and with a workload with plenty of index scans,
+around 10% - 20% improvement in throughput has been achieved.
+In an admittedly extreme environment measured by this author, with a workload
+consisting of 8 client applications each running similar complex queries
+(same query structure but different predicates and constants),
+including 2 Bitmap Index Scans and 17 non-bitmap index scans,
+on a dual-core Intel laptop (4 hyperthreads) with the database on a single
+USB3-attached 500GB disk drive, and no part of the database in filesystem buffers
+initially, (filesystem freshly mounted), comparing unpatched build
+using posix_fadvise with effective_io_concurrency 4 against same build patched
+with async IO and effective_io_concurrency 4 and max_async_io_prefetchers 32,
+elapsed time repeatably improved from around 640-670 seconds to around 530-550 seconds,
+a 17% - 18% improvement.
+
+The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
+ . probably higher CPU utilization:
+ Firstly, the extra work performed by the librt threads adds CPU
+ overhead, and secondly, if the asynchronous prefetching is effective,
+ then it will deliver better (greater) overlap of CPU with IO, which
+ will reduce elapsed times and hence increase CPU utilization percentage
+ still more (during that shorter elapsed time).
+ . more context switching, because of the additional threads.
+
+
+Statistics:
+___________
+
+A number of additional statistics relating to effectiveness of asynchronous IO
+are provided as an extension of the existing pg_stat_statements loadable module.
+Refer to the appendix "Additional Supplied Modules" in the current
+PostgreSQL Documentation for details of this module.
+
+The following additional statistics are provided for asynchronous IO prefetching:
+
+ . aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
+ . aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
+ . aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
+ . aio_read_noblok : number of prefetches for which no available BufferAiocb control block
+ . aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
+ . aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
+ . aio_read_waited : number of aio reads for which disk block used but had to wait for it
+ . aio_read_ontime : number of aio reads for which disk block used and ready on time when requested
+
+Some of these are (hopefully) self-explanatory. Some additional notes:
+
+ . aio_read_discrd and aio_read_forgot :
+ prefetch was wasted work since the buffer was not subsequently read
+ The discrd case indicates that the scanner realized this and discarded the buffer,
+ whereas the forgot case indicates that the scanner did not realize it,
+ which should not normally occur.
+ A high number in either suggests lowering effective_io_concurrency.
+
+ . aio_read_noblok :
+ Any significant number in relation to all the other numbers indicates that
+ max_async_io_prefetchers should be increased.
+
+ . aio_read_waited :
+ The page was prefetched but the asynchronous read had not completed by the time the
+   scanner requested to read it.  This causes extra overhead in waiting and indicates that
+   prefetching is not providing much, if any, benefit.
+ The disk subsystem may be underpowered/overloaded in relation to the available CPU power.
+
+ . aio_read_ontime :
+ The page was prefetched and the asynchronous read had completed by the time the
+   scanner requested to read it.  This is the optimal behaviour.  If this number is large
+ in relation to all the other numbers except (possibly) aio_read_noneed,
+ then prefetching is working well.
+
+To create the extension with support for these additional statistics, use the following syntax:
+ CREATE EXTENSION pg_stat_statements VERSION '1.3'
+or, if you run the new code against an existing database which already has the extension
+( see installation and migration below ), you can
+ ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'
+
+A suggested set of commands for displaying these statistics might be :
+
+ /* OPTIONALLY */ DROP extension pg_stat_statements;
+ CREATE extension pg_stat_statements VERSION '1.3';
+ /* run your workload */
+ select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
+ , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
+ from pg_stat_statements where shared_blks_read > 0;
+
+
+Installation and Build Configuration:
+_____________________________________
+
+1. First - a prerequisite:
+# as well as requiring all the usual package build tools such as gcc , make etc,
+# as described in the instructions for building postgresql,
+# the following is required :
+ gnu autoconf at version 2.69 :
+# run the following command
+autoconf -V
+# it *must* return
+autoconf (GNU Autoconf) 2.69
+
+2. If you don't have it or it is a different version,
+then you must obtain version 2.69 (which is the current version)
+from your distribution provider or from the gnu software download site.
+
+3. Also you must have the source tree for postgresql version 9.4 (development version).
+# all the following commands assume your current working directory is the top of the source tree.
+
+4. cd to top of source tree :
+# check it appears to be a postgresql source tree
+ls -ld configure.in src
+# should show both the file and the directory
+grep PostgreSQL COPYRIGHT
+# should show PostgreSQL Database Management System
+
+5. Apply the patch :
+patch -b -p0 -i <patch_file_path>
+# should report no errors, 45 files patched (see list at bottom of this README)
+# and all hunks applied
+# check the patch was applied to configure.in
+ls -ld configure.in.orig configure.in
+# should show both files
+
+6. Rebuild the configure script with the patched configure.in :
+mv configure configure.orig;
+autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
+ls -lrt configure.orig configure;
+
+7. run the new configure script :
+# if you have run configure before,
+# then you may first want to save existing config.status and config.log if they exist,
+# and then specify same configure flags and options as you specified before.
+# the patch does not alter or extend the set of configure options
+# if unsure, run ./configure --help
+# if still unsure, run ./configure
+./configure <other configure options as desired>
+
+
+
+8. now check that configure decided that this environment supports asynchronous IO :
+grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
+# it should show
+#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
+# if not, apparently your environment does not support asynch IO -
+# the config.log will show how it came to that conclusion,
+# also check for :
+# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
+# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
+# do not proceed without this define being set.
+
+9. Do you want to use the new code on an existing cluster
+   that was created using the same code base but without the patch?
+   If so, run this nasty-looking command
+   (cut-and-paste it into a terminal window or a shell-script file);
+   see the Migration note below for an explanation.
+   Otherwise continue to step 10.
+###############################################################################################
+  fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx -lt 200 ]]; do {
+ bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
+ }; done;
+ if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
+ elif [[ -a $fl ]]; then {
+ mv $fl $bkfl && {
+ sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
+ str="diff -w $bkfl $fl";echo "$str"; eval "$str";
+ };
+ };
+ else echo "ooopppss $fl is missing";
+ fi;
+###############################################################################################
+# it should report something like
+diff -w Makefile.global.bak0 Makefile.global
+222c222
+< CFLAGS = XXXX
+---
+> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+# where XXXX is some set of flags
+
+
+10. now run the rest of the build process as usual -
+ follow instructions in file INSTALL if that file exists,
+ else e.g. run
+make && make install
+
+If the build fails with the following error:
+undefined reference to `aio_init'
+Then edit the following file
+src/include/pg_config_manual.h
+and add the following line at the bottom:
+
+#define DONT_HAVE_AIO_INIT
+
+and then run
+make clean && make && make install
+See notes to section Runtime Configuration below for more information on this.
+
+
+
+Migration , Runtime Configuration, and Use:
+___________________________________________
+
+
+Database Migration:
+___________________
+
+The new prefetching code for non-bitmap index scans introduces a new btree-index
+function named btpeeknexttuple. The correct way to add such a function involves
+also adding it to the catalog as an internal function in pg_proc.
+However, this results in the newly built code considering an existing database to be
+incompatible, i.e. requiring a backup under the old code and a restore under the new.
+This is normal behaviour for migration to a new version of postgresql, and is
+also a valid way of migrating a database for use with this asynchronous IO feature,
+but in this case it may be inconvenient.
+
+As an alternative, the new code may be compiled with the macro define
+AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+which does what it says by not altering the catalog. The patched build can then
+be run against an existing database cluster initdb'd using the unpatched build.
+
+There are no known ill-effects of so doing, but :
+ . in any case, it is strongly suggested to make a backup of any precious database
+ before accessing it with a patched build
+ . be aware that if this asynchronous IO feature is eventually released as part of postgresql,
+ migration will probably be required anyway.
+
+This option to avoid catalog migration is intended as a convenience for a quick test,
+and also makes it easier to obtain performance comparisons on the same database.
+
+
+
+Runtime Configuration:
+______________________
+
+One new configuration parameter may be set in postgresql.conf or
+in any of the other ways described in the postgresql documentation :
+
+max_async_io_prefetchers
+ Maximum number of background processes concurrently using asynchronous
+ librt threads to prefetch pages into shared memory buffers
+
+This number can be thought of as the maximum number
+of librt threads concurrently active, each working on a list of
+from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).
+
+In practice, this number simply controls how many prefetch requests in total
+may be active concurrently :
+ max_async_io_prefetchers * target_prefetch_pages ( see note 1)
+
+The default is max_connections/6
+(recall that the default for max_connections is 100).
+
+
+note 1 : target_prefetch_pages is a number derived from effective_io_concurrency,
+         approximately n * ln(n) where n is effective_io_concurrency
+
+note 2 : provided that the gnu extension to Posix AIO which provides the
+aio_init() function is present, aio_init() is called
+to set the librt maximum number of threads to max_async_io_prefetchers,
+and to set the maximum number of concurrent aio read requests to the product
+         max_async_io_prefetchers  *  target_prefetch_pages
+
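+A minimal illustrative sketch (not part of the patch; the function name
+set_librt_limits is invented for the example) of the kind of aio_init()
+call which note 2 describes, assuming the glibc extension is available :
+
+    #define _GNU_SOURCE
+    #include <aio.h>
+
+    static void
+    set_librt_limits(int max_async_io_prefetchers, int target_prefetch_pages)
+    {
+        struct aioinit init = { 0 };
+
+        /* one librt worker thread per concurrently-prefetching backend */
+        init.aio_threads = max_async_io_prefetchers;
+        /* upper bound on simultaneously outstanding aio_read requests */
+        init.aio_num = max_async_io_prefetchers * target_prefetch_pages;
+        aio_init(&init);
+    }
+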
+
+As well as this regular configuration parameter,
+there are several other parameters that can be set via environment variable.
+The reason why they are environment variables rather than regular configuration parameters
+is that they are not expected to need setting, but they may be useful :
+ variable name values default meaning
+ PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
+ PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
+ also numeric size of list of prefetched blocks
+ also whether to prefetch forward-sequential-pattern index pages
+ PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
+ PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans
+
+
+The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
+It can be set to Y or N to control prefetching of non-bitmap index scans;
+but in addition it can be set to an integer, which both implies Y
+and also sets the size of a list used to remember prefetched but unread heap pages.
+This list is an optimization used to avoid re-prefetching and to maximise the potential
+set of prefetchable blocks indexed by one index page.
+If set to an integer, the integer may be followed by either ,Y or ,N
+to specify whether to prefetch index pages which are being accessed forward-sequentially.
+It has been found that prefetching is not of great benefit for this access pattern,
+and so it is not the default, but also does no harm (provided sufficient CPU capacity).
+
+
+
+Usage :
+______
+
+
+There are no changes in usage other than as noted under Configuration and Statistics.
+However, in order to assess benefit from this feature, it will be useful to
+understand the query access plans of your workload using EXPLAIN. Before doing that,
+make sure that statistics are up to date using ANALYZE.
+
+
+
+Internals:
+__________
+
+
+Internal changes span two areas and the interface between them :
+
+ . buffer manager layer
+ . programming interface for scanner to call buffer manager
+ . scanner layer
+
+ . buffer manager layer
+ ____________________
+
+ changes comprise :
+ . allocating, pinning , unpinning buffers
+ this is complex and discussed briefly below in "Buffer Management"
+ . acquiring and releasing a BufferAiocb, the control block
+ associated with a single aio_read, and checking for its completion
+ a new file, backend/storage/buffer/buf_async.c, provides three new functions,
+ BufStartAsync BufReleaseAsync BufCheckAsync
+ which handle this.
+ . calling librt asynch io functions
+ this follows the example of all other filesystem interfaces
+ and is straightforward.
+ two new functions are provided in fd.c:
+ FileStartaio FileCompleteaio
+ and corresponding interfaces in smgr.c
+
+ . programming interface for scanner to call buffer manager
+ ________________________________________________________
+ . calling interface for existing function PrefetchBuffer is modified :
+ . one new argument, BufferAccessStrategy strategy
+ . now returns an int return code which indicates :
+ whether pin count on buffer has been increased by 1
+ whether block was already present in a buffer
+ . new function DiscardBuffer
+ . discard buffer used for a previously prefetched page
+            which the scanner decides it does not want to read.
+          .  same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
+          .  note - this has a similar purpose to the existing function ReleaseBuffer,
+             but differs in that ReleaseBuffer takes a buffer_descriptor as argument
+             for a buffer which has already been read.
+
+ . scanner layer
+ _____________
+    common to all scanners is that the scanner which wishes to prefetch must do three things
+    (a minimal sketch of this calling pattern appears at the end of this section):
+ . decide which pages to prefetch and call PrefetchBuffer to prefetch them
+ nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
+ . remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
+ removing each page from this list if and when it subsequently reads the page.
+ . at end of scan, call DiscardBuffer for every remembered (i.e. prefetched not unread) page
+ how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
+ . bitmap index scan - heap pages
+ . non-bitmap (i.e. simple) index scans - index pages
+ . non-bitmap (i.e. simple) index scans - heap pages
+ . simple heap scans
+ The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
+ . counted in aio_read_forgot (see "Statistics" above)
+ . may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
+ (the buffer is released at commit)
+ This does sometimes happen ...
+
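+    A minimal illustrative sketch of this calling pattern (not code from the patch;
+    the PrefetchTracker structure and the two helper functions are invented for the
+    example, while PrefetchBuffer, DiscardBuffer and PREFTCHRC_BUF_PIN_INCREASED are
+    the interfaces described above) :
+
+    #include "postgres.h"
+    #include "storage/bufmgr.h"
+
+    /* hypothetical per-scan bookkeeping of prefetched-but-unread heap blocks */
+    typedef struct PrefetchTracker
+    {
+        BlockNumber pending[64];        /* sized to at least target_prefetch_pages */
+        int         npending;
+    } PrefetchTracker;
+
+    static void
+    prefetch_one_block(Relation rel, BlockNumber blkno,
+                       BufferAccessStrategy strategy, PrefetchTracker *pt)
+    {
+        /* remember the block only if PrefetchBuffer actually added a pin */
+        int rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno, strategy);
+
+        if (rc & PREFTCHRC_BUF_PIN_INCREASED)
+            pt->pending[pt->npending++] = blkno;
+    }
+
+    /* when the scan reads a block it must be removed from pt->pending;
+    ** at end of scan every block still in pt->pending must be discarded
+    ** so that the prefetch pin is released
+    */
+    static void
+    discard_unread_prefetches(Relation rel, PrefetchTracker *pt)
+    {
+        while (pt->npending > 0)
+            DiscardBuffer(rel, MAIN_FORKNUM, pt->pending[--pt->npending]);
+    }
+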
+
+
+Buffer Management
+_________________
+
+With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
+but also every other part of the buffer manager must know about the possibility that a buffer may be in
+an async_io_in_progress state and be prepared to determine the possible completion.
+That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
+Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
+namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
+and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+a separate set of shared control blocks, the BufferAiocb list -
+refer to include/storage/buf_internals.h
+Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
+Read the commentary for this function for more details.
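+For example, waiting for an asynchronous read which may have been started by another
+backend reduces, in outline, to the standard Posix calls shown below
+(an illustrative sketch only, not the patch's code; in the patch the aiocb is
+located via the shared BufferAiocb) :
+
+    #include <aio.h>
+    #include <errno.h>
+    #include <time.h>
+
+    /* wait until the aio_read tracked by aiocbp has completed;
+    ** returns 0 on successful completion, otherwise an errno value
+    */
+    static int
+    wait_for_aio(struct aiocb *aiocbp)
+    {
+        const struct aiocb * const list[1] = { aiocbp };
+        int rv = aio_error(aiocbp);
+
+        while (rv == EINPROGRESS)
+        {
+            struct timespec timeout = { 0, 10000 };     /* 10-microsecond poll interval */
+
+            (void) aio_suspend(list, 1, &timeout);
+            rv = aio_error(aiocbp);
+        }
+        return rv;
+    }
+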
+
+Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
+and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
+When a backend BK2 requests ReadBuffer of a page for which an asynch read is in progress,
+the buffer manager has to determine which backend BK1 pinned this buffer during the previous PrefetchBuffer,
+and, for example, must not pin it a second time if BK2 is the same backend as BK1.
+Information concerning which backend initiated the prefetch is held in the BufferAiocb.
+
+The trickiest case concerns the scenario in which :
+ . BK1 initiates prefetch and acquires a pin
+ . BK2 possibly waits for completion and then reads the buffer, and perhaps later on
+ releases it by ReleaseBuffer.
+ . Since the asynchronous IO is no longer in progress, there is no longer any
+ BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
+ "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
+ . The solution to this problem is to invent the concept of a "banked" pin,
+    which is a pin obtained when the prefetch was issued, identified as being in "banked" status only if and when
+    the associated asynchronous IO terminates, and redeemable by the next use by the same task,
+ either by ReadBuffer or DiscardBuffer.
+ The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
+ is stored in the buffer descriptor.
+ This is done without increasing size of the buffer descriptor, which is important since
+ there may be a very large number of these. This does overload the relevant field in the descriptor.
+ Refer to include/storage/buf_internals.h for more details
+ and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
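+    As a much-simplified illustration of the redemption step (not the patch's actual code;
+    locking and the other states of the overloaded freeNext field are omitted),
+    the check made on the next use by a backend is conceptually :
+
+    /* if this buffer carries a pin banked by me, redeem it rather than pin again */
+    if ((buf_desc->flags & BM_AIO_PREFETCH_PIN_BANKED) &&
+        buf_desc->freeNext == -(this_backend_pid))
+    {
+        buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;   /* pin is redeemed ...          */
+        buf_desc->freeNext = FREENEXT_NOT_IN_LIST;        /* ... overloaded field reset   */
+        /* do not pin again - the banked pin now counts as this backend's ordinary pin */
+    }
+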
+
+______________________________________________________________________________
+The following 45 files are changed in this feature (output of the patch command) :
+
+patching file configure.in
+patching file contrib/pg_prewarm/pg_prewarm.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
+patching file contrib/pg_stat_statements/Makefile
+patching file contrib/pg_stat_statements/pg_stat_statements.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
+patching file postgresql-prefetching-asyncio.README
+patching file config/c-library.m4
+patching file src/backend/postmaster/postmaster.c
+patching file src/backend/executor/nodeBitmapHeapscan.c
+patching file src/backend/executor/nodeIndexscan.c
+patching file src/backend/executor/instrument.c
+patching file src/backend/storage/buffer/Makefile
+patching file src/backend/storage/buffer/bufmgr.c
+patching file src/backend/storage/buffer/buf_async.c
+patching file src/backend/storage/buffer/buf_init.c
+patching file src/backend/storage/smgr/md.c
+patching file src/backend/storage/smgr/smgr.c
+patching file src/backend/storage/file/fd.c
+patching file src/backend/storage/lmgr/proc.c
+patching file src/backend/access/heap/heapam.c
+patching file src/backend/access/heap/syncscan.c
+patching file src/backend/access/index/indexam.c
+patching file src/backend/access/index/genam.c
+patching file src/backend/access/nbtree/nbtsearch.c
+patching file src/backend/access/nbtree/nbtinsert.c
+patching file src/backend/access/nbtree/nbtpage.c
+patching file src/backend/access/nbtree/nbtree.c
+patching file src/backend/nodes/tidbitmap.c
+patching file src/backend/utils/misc/guc.c
+patching file src/backend/utils/mmgr/aset.c
+patching file src/include/executor/instrument.h
+patching file src/include/storage/bufmgr.h
+patching file src/include/storage/smgr.h
+patching file src/include/storage/fd.h
+patching file src/include/storage/buf_internals.h
+patching file src/include/catalog/pg_am.h
+patching file src/include/catalog/pg_proc.h
+patching file src/include/pg_config_manual.h
+patching file src/include/access/nbtree.h
+patching file src/include/access/heapam.h
+patching file src/include/access/relscan.h
+patching file src/include/nodes/tidbitmap.h
+patching file src/include/utils/rel.h
+patching file src/include/pg_config.h.in
+
+
+Future Possibilities:
+____________________
+
+There are several possible extensions of this feature :
+ . Extend prefetching of index scans to types of index
+ other than B-tree.
+ This should be fairly straightforward, but requires some
+ good base of benchmarkable workloads to prove the value.
+ . Investigate why asynchronous IO prefetching does not greatly
+ improve sequential relation heap scans and possibly find how to
+ achieve a benefit.
+  .  Build knowledge of asynchronous IO prefetching into the
+ Query Planner costing.
+ This is far from straightforward. The Postgresql Query Planner's
+ costing model is based on resource consumption rather than elapsed time.
+ Use of asynchronous IO prefetching is intended to improve elapsed time
+     at the expense of (probably) higher resource consumption.
+ Although Costing understands about the reduced cost of reading buffered
+ blocks, it does not take asynchronicity or overlap of CPU with disk
+ into account. A naive approach might be to try to tweak the Query
+ Planner's Cost Constant configuration parameters
+ such as seq_page_cost , random_page_cost
+ but this is hazardous as explained in the Documentation.
+
+
+
+John Lumby, johnlumby(at)hotmail(dot)com
--- config/c-library.m4.orig 2014-05-31 17:19:07.685208326 -0400
+++ config/c-library.m4 2014-05-31 19:53:08.788072839 -0400
@@ -367,3 +367,152 @@ if test "$pgac_cv_type_locale_t" = 'yes
AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
[Define to 1 if `locale_t' requires <xlocale.h>.])
fi])])# PGAC_HEADER_XLOCALE
+
+
+# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+# ---------------------------------------
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of both,
+# including verifying that aio_error can retrieve completion status
+# of aio_read issued by a different process
+#
+AC_DEFUN([PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP],
+[AC_MSG_CHECKING([whether have both librt-style async io and the gcc atomic compare_and_swap])
+AC_CACHE_VAL(pgac_cv_aio_atomic_builtin_comp_swap,
+pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+[AC_TRY_RUN([#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include "aio.h"
+#include <errno.h>
+
+char *shmem;
+
+/* returns rc of aio_read or -1 if some error */
+int
+processA(void)
+{
+ int fd , rc;
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ char *buf = shmem + sizeof(struct aiocb);
+
+ rc = fd = open("configure", O_RDONLY );
+ if (fd != -1) {
+
+ memset(aiocbp, 0, sizeof(struct aiocb));
+ aiocbp->aio_fildes = fd;
+ aiocbp->aio_offset = 0;
+ aiocbp->aio_buf = buf;
+ aiocbp->aio_nbytes = 8;
+ aiocbp->aio_reqprio = 0;
+ aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+ rc = aio_read(aiocbp);
+ }
+ return rc;
+}
+
+/* returns result of aio_error - 0 if io completed successfully */
+int
+processB(void)
+{
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ const struct aiocb * const pl[1] = { aiocbp };
+ int rv;
+ int returnCode;
+ struct timespec my_timeout = { 0 , 10000 };
+ int max_iters , max_polls;
+
+ rv = aio_error(aiocbp);
+ max_iters = 100;
+ while ( (max_iters-- > 0) && (rv == EINPROGRESS) ) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ while ((returnCode < 0) && (EAGAIN == errno) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ }
+ rv = aio_error(aiocbp);
+ }
+
+ return rv;
+}
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ int pidB;
+ int child_status;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* first test -- __sync_bool_compare_and_swap
+ ** set zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ if (rc == 0) {
+ /* second test -- process A start aio_read
+ ** and process B checks completion by polling
+ */
+ rc = -1; /* pessimistic */
+
+ shmem = mmap(NULL, sizeof(struct aiocb) + 2048,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
+ -1, 0);
+ if (shmem != MAP_FAILED) {
+
+ /*
+ * Start the I/O request in parent process, then fork and try to wait
+ * for it to finish from the child process.
+ */
+ rc = processA();
+ if (rc >= 0) {
+
+ rc = pidB = fork();
+ if (pidB != -1) {
+ if (pidB != 0) {
+ /* parent */
+ wait (&child_status);
+ if (WIFEXITED(child_status)) {
+ rc = WEXITSTATUS(child_status);
+ }
+ } else {
+ /* child */
+ rc = processB();
+ exit(rc);
+ }
+ }
+ }
+ }
+ }
+
+ return rc;
+}],
+[pgac_cv_aio_atomic_builtin_comp_swap=yes],
+[pgac_cv_aio_atomic_builtin_comp_swap=no],
+[pgac_cv_aio_atomic_builtin_comp_swap=cross])
+])dnl AC_CACHE_VAL
+AC_MSG_RESULT([$pgac_cv_aio_atomic_builtin_comp_swap])
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+])# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
--- src/backend/postmaster/postmaster.c.orig 2014-05-31 17:19:07.865208803 -0400
+++ src/backend/postmaster/postmaster.c 2014-05-31 19:53:08.828072940 -0400
@@ -123,6 +123,11 @@
#include "storage/spin.h"
#endif
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+void ReportFreeBAiocbs(void);
+int CountInuseBAiocbs(void);
+extern int hwmBufferAiocbs; /* high water mark of in-use BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Possible types of a backend. Beyond being the possible bkend_type values in
@@ -1493,9 +1498,15 @@ ServerLoop(void)
fd_set readmask;
int nSockets;
time_t now,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time,
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
last_touch_time;
last_touch_time = time(NULL);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time = time(NULL);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
nSockets = initMasks(&readmask);
@@ -1654,6 +1665,19 @@ ServerLoop(void)
last_touch_time = now;
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* maintain the hwm of used baiocbs every 10 seconds */
+ if ((now - count_baiocb_time) >= 10)
+ {
+ int inuseBufferAiocbs; /* current in-use BufferAiocbs in pool */
+ inuseBufferAiocbs = CountInuseBAiocbs();
+ if (inuseBufferAiocbs > hwmBufferAiocbs) {
+ hwmBufferAiocbs = inuseBufferAiocbs;
+ }
+ count_baiocb_time = now;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
@@ -3444,6 +3468,9 @@ PostmasterStateMachine(void)
signal_child(PgStatPID, SIGQUIT);
}
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ReportFreeBAiocbs();
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
}
}
--- src/backend/executor/nodeBitmapHeapscan.c.orig 2014-05-31 17:19:07.809208656 -0400
+++ src/backend/executor/nodeBitmapHeapscan.c 2014-05-31 19:53:08.876073061 -0400
@@ -34,6 +34,8 @@
* ExecEndBitmapHeapScan releases all storage.
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "access/relscan.h"
#include "access/transam.h"
@@ -47,6 +49,10 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_bitmap_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
@@ -111,10 +117,21 @@ BitmapHeapNext(BitmapHeapScanState *node
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
- if (target_prefetch_pages > 0)
- {
+ if ( prefetch_bitmap_scans
+ && (target_prefetch_pages > 0)
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ /* sufficient number of blocks - at least twice the target_prefetch_pages */
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
node->prefetch_iterator = prefetch_iterator = tbm_begin_iterate(tbm);
node->prefetch_pages = 0;
+ if (prefetch_iterator) {
+ tbm_zero(prefetch_iterator); /* zero list of prefetched and unread blocknos */
+ }
node->prefetch_target = -1;
}
#endif /* USE_PREFETCH */
@@ -138,12 +155,14 @@ BitmapHeapNext(BitmapHeapScanState *node
}
#ifdef USE_PREFETCH
+ if (prefetch_iterator) {
if (node->prefetch_pages > 0)
{
/* The main iterator has closed the distance by one page */
node->prefetch_pages--;
+ tbm_subtract(prefetch_iterator, tbmres->blockno); /* remove this blockno from list of prefetched and unread blocknos */
}
- else if (prefetch_iterator)
+ else
{
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -151,6 +170,7 @@ BitmapHeapNext(BitmapHeapScanState *node
if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
+ }
#endif /* USE_PREFETCH */
/*
@@ -239,16 +259,26 @@ BitmapHeapNext(BitmapHeapScanState *node
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ int PrefetchBufferRc; /* return value from PrefetchBuffer - refer to bufmgr.h */
+
if (tbmpre == NULL)
{
/* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = prefetch_iterator = NULL;
+ /* let ExecEndBitmapHeapScan terminate the prefetch_iterator
+ ** tbm_end_iterate(prefetch_iterator);
+ ** node->prefetch_iterator = NULL;
+ */
+ prefetch_iterator = NULL;
break;
}
node->prefetch_pages++;
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno , 0);
+ /* add this blockno to list of prefetched and unread blocknos
+ ** if pin count did not increase then indicate so in the Unread_Pfetched list
+ */
+ tbm_add(prefetch_iterator
+ ,( (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) ? tbmpre->blockno : InvalidBlockNumber ) );
}
}
#endif /* USE_PREFETCH */
@@ -482,12 +512,31 @@ ExecEndBitmapHeapScan(BitmapHeapScanStat
{
Relation relation;
HeapScanDesc scanDesc;
+ TBMIterator *prefetch_iterator;
/*
* extract information from the node
*/
relation = node->ss.ss_currentRelation;
scanDesc = node->ss.ss_currentScanDesc;
+ prefetch_iterator = node->prefetch_iterator;
+
+#ifdef USE_PREFETCH
+ /* before any other cleanup, discard any prefetched but unread buffers */
+ if (prefetch_iterator != NULL) {
+ TBMIterateResult *tbmpre = tbm_locate_IterateResult(prefetch_iterator);
+ BlockNumber *Unread_Pfetched_base = tbmpre->Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = tbmpre->Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = tbmpre->Unread_Pfetched_count;
+
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scanDesc->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* Free the exprcontext
--- src/backend/executor/nodeIndexscan.c.orig 2014-05-31 17:19:07.813208666 -0400
+++ src/backend/executor/nodeIndexscan.c 2014-05-31 19:53:08.900073122 -0400
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,12 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+ /* note - at this point all scan controlblock resources have been freed by IndexScanEnd called by index_endscan */
+
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +619,33 @@ ExecInitIndexScan(IndexScan *node, EStat
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ indexstate->iss_ScanDesc->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_block_item_list = (struct pfch_block_item*)0;
+ if ( prefetch_index_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_index_page_list = palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ indexstate->iss_ScanDesc->pfch_block_item_list = palloc( prefetch_index_scans * sizeof(struct pfch_block_item) );
+ if ( ( (struct pfch_index_pagelist*)0 != indexstate->iss_ScanDesc->pfch_index_page_list )
+ && ( (struct pfch_block_item*)0 != indexstate->iss_ScanDesc->pfch_block_item_list )
+ ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = prefetch_index_scans; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_pagelist_next = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_item_count = 0;
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
--- src/backend/executor/instrument.c.orig 2014-05-31 17:19:07.809208656 -0400
+++ src/backend/executor/instrument.c 2014-05-31 19:53:08.932073203 -0400
@@ -41,6 +41,14 @@ InstrAlloc(int n, int instrument_options
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_discrd = 0;
+ instr[i].bufusage_start.aio_read_forgot = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -143,6 +151,16 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_discrd += add->aio_read_discrd - sub->aio_read_discrd;
+ dst->aio_read_forgot += add->aio_read_forgot - sub->aio_read_forgot;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
--- src/backend/storage/buffer/Makefile.orig 2014-05-31 17:19:07.873208825 -0400
+++ src/backend/storage/buffer/Makefile 2014-05-31 19:53:08.952073254 -0400
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o buf_async.o
include $(top_srcdir)/src/backend/common.mk
--- src/backend/storage/buffer/bufmgr.c.orig 2014-05-31 17:19:07.873208825 -0400
+++ src/backend/storage/buffer/bufmgr.c 2014-05-31 19:53:08.996073365 -0400
@@ -29,7 +29,7 @@
* buf_table.c -- manages the buffer lookup table
*/
#include "postgres.h"
-
+#include <sys/types.h>
#include <sys/file.h>
#include <unistd.h>
@@ -50,7 +50,6 @@
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
-
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -63,6 +62,8 @@
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+
#define DROP_RELS_BSEARCH_THRESHOLD 20
/* GUC variables */
@@ -78,26 +79,33 @@ bool track_io_timing = false;
*/
int target_prefetch_pages = 0;
-/* local state for StartBufferIO and related functions */
+/* local state for StartBufferIO and related functions
+** but ONLY for synchronous IO - not altered for aio
+*/
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+pid_t this_backend_pid = 0; /* pid of this backend */
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
-
-static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+extern int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+extern int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc, int intention
+ ,BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
-static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
-static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+ bool *hit , int index_for_aio);
+bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+void PinBuffer_Locked(volatile BufferDesc *buf);
+void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
-static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+static bool StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio );
+void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -106,24 +114,66 @@ static volatile BufferDesc *BufferAlloc(
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
+ int *foundPtr , int index_for_aio );
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static int rnode_comparator(const void *p1, const void *p2);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
- * This is named by analogy to ReadBuffer but doesn't actually allocate a
- * buffer. Instead it tries to ensure that a future ReadBuffer for the given
- * block will not be delayed by the I/O. Prefetching is optional.
+ * This is named by analogy to ReadBuffer but allocates a buffer only if using asynchronous I/O.
+ * Its purpose is to try to ensure that a future ReadBuffer for the given block
+ * will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
- */
-void
-PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
-{
+ *
+ * Originally the prefetch simply called posix_fadvise() to recommend read-ahead into kernel page cache.
+ * Extended to provide an alternative of issuing an asynchronous aio_read() to read into a buffer.
+ * This extension has an implication on how this bufmgr component manages concurrent requests
+ * for the same disk block.
+ *
+ * Synchronous IO (read()) does not provide a means for waiting on another task's read if in progress,
+ * and bufmgr implements its own scheme in StartBufferIO, WaitIO, and TerminateBufferIO.
+ *
+ * Asynchronous IO (aio_read()) provides a means for waiting on this or another task's read if in progress,
+ * namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+ * are called as part of asynchronous prefetching, their role is limited to maintaining the buffer desc flags,
+ * and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+ * a separate set of shared control blocks, the BufferAiocb list -
+ * refer to include/storage/buf_internals.h and storage/buffer/buf_init.c
+ *
+ * Another implication of asynchronous IO concerns buffer pinning.
+ * The buffer used for the prefetch is pinned before aio_read is issued.
+ * It is expected that the same task (and possibly others) will later ask to read the page
+ * and eventually release and unpin the buffer.
+ * However, if the task which issued the aio_read later decides not to read the page,
+ * and return code indicates delta_pin_count > 0 (see below)
+ * it *must* instead issue a DiscardBuffer() (see function later in this file)
+ * so that its pin is released.
+ * Therefore, each client which uses the PrefetchBuffer service must either always read all
+ * prefetched pages, or keep track of prefetched pages and discard unread ones at end of scan.
+ *
+ * return code: is an int bitmask defined in bufmgr.h
+ PREFTCHRC_BUF_PIN_INCREASED 0x01 pin count on buffer has been increased by 1
+ PREFTCHRC_BLK_ALREADY_PRESENT 0x02 block was already present in a buffer
+ *
+ * PREFTCHRC_BLK_ALREADY_PRESENT is a hint to caller that the prefetch may be unnecessary
+ */
+int
+PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy)
+{
+ Buffer buf_id; /* indicates buffer containing the requested block */
+ int PrefetchBufferRc = 0; /* return value as described above */
+ int PinCountOnEntry = 0; /* pin count on entry */
+ int PinCountdelta = 0; /* pin count delta increase */
+
+
#ifdef USE_PREFETCH
+
+ buf_id = -1;
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
@@ -146,7 +196,12 @@ PrefetchBuffer(Relation reln, ForkNumber
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
+ int BufStartAsyncrc = -1; /* retcode from BufStartAsync :
+ ** 0 if started successfully (which implies buffer was newly pinned )
+ ** -1 if failed for some reason
+ ** 1+PrivateRefCount if we found desired buffer in buffer pool
+ ** and we set it likewise if we find buffer in buffer pool
+ */
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
@@ -158,28 +213,119 @@ PrefetchBuffer(Relation reln, ForkNumber
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ if (buf_id >= 0) {
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ BufStartAsyncrc = 1 + PinCountOnEntry; /* indicate this backends pin count - see above comment */
+ PrefetchBufferRc = PREFTCHRC_BLK_ALREADY_PRESENT; /* indicate buffer present */
+ } else {
+ PrefetchBufferRc = 0; /* indicate buffer not present */
+ }
LWLockRelease(newPartitionLock);
+ not_in_buffers:
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (buf_id < 0) {
+ /* try using async aio_read with a buffer */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BufStartAsyncrc = BufStartAsync( reln, forkNum, blockNum , strategy );
+ if (BufStartAsyncrc < 0) {
+ pgBufferUsage.aio_read_noblok++;
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP so try the alternative that does not read the block into a postgresql buffer */
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ }
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
+ if ( (buf_id >= 0) || (BufStartAsyncrc >= 1) ) {
+ /* The block *is* in buffers. */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ pgBufferUsage.aio_read_noneed++;
+#ifndef USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT /* jury is out on whether the following wins but it ought to ... */
+ /*
+ ** If this backend already had pinned it,
+ ** or another backend had banked a pin on it,
+ ** or there is an IO in progress,
+ ** or it is not marked valid,
+ ** then do nothing.
+ ** Otherwise pin it and mark the buffer's pin as banked by this backend.
+ ** Note - it may or not be pinned by another backend -
+ ** it is ok for us to bank a pin on it
+ ** *provided* the other backend did not bank its pin.
+ ** The reason for this is that the banked-pin indicator is global -
+ ** it can identify at most one process.
+ */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ if (BufStartAsyncrc == 1) { /* not pinned by me */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ /* note - all we can say with certainty is that the buffer is not pinned by me
+ ** we cannot be sure that it is still in buffer pool
+ ** so must go through the entire locking and searching all over again ...
*/
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ /* since the block is now present,
+ ** save the current pin count to ensure final delta is calculated correctly
+ */
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ if ( PinCountOnEntry == 0) { /* paranoid check it's still not pinned by me */
+ volatile BufferDesc *buf_desc;
+
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ LockBufHdr(buf_desc);
+ if ( (buf_desc->flags & BM_VALID) /* buffer is valid */
+ && (!(buf_desc->flags & (BM_IO_IN_PROGRESS|BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))) /* buffer is not any of ... */
+ ) {
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* note - we can call PinBuffer_Locked with the BM_AIO_PREFETCH_PIN_BANKED flag set because it is not yet pinned by me */
+ buf_desc->freeNext = -(this_backend_pid); /* remember which pid banked it */
+ /* pgBufferUsage.aio_read_wasted--; overload counter - not wasted after all - only for debugging */
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ PinBuffer_Locked(buf_desc);
}
-#endif /* USE_PREFETCH */
+ else {
+ UnlockBufHdr(buf_desc);
+ }
+ }
+ }
+ LWLockRelease(newPartitionLock);
+ /* although unlikely, maybe it was evicted while we were puttering about */
+ if (buf_id < 0) {
+ pgBufferUsage.aio_read_noneed--; /* back out the accounting */
+ goto not_in_buffers; /* and try again */
+ }
+ }
+#endif /* USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT */
+
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ }
+
+ if (buf_id >= 0) {
+ PinCountdelta = PrivateRefCount[buf_id] - PinCountOnEntry; /* pin count delta increase */
+ if ( (PinCountdelta < 0) || (PinCountdelta > 1) ) {
+ elog(ERROR,
+ "PrefetchBuffer #%d : incremented pin count by %d on bufdesc %p refcount %u localpins %d\n"
+ ,(buf_id+1) , PinCountdelta , &BufferDescriptors[buf_id] ,BufferDescriptors[buf_id].refcount , PrivateRefCount[buf_id]);
}
+ } else
+ if (BufStartAsyncrc == 0) { /* aio started successfully (which implies buffer was newly pinned ) */
+ PinCountdelta = 1;
+ }
+
+ /* set final PrefetchBufferRc according to previous value */
+ PrefetchBufferRc |= PinCountdelta; /* set the PREFTCHRC_BUF_PIN_INCREASED bit */
+ }
+
+#endif /* USE_PREFETCH */
+ return PrefetchBufferRc; /* return value as described above */
+}
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -252,7 +398,7 @@ ReadBufferExtended(Relation reln, ForkNu
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
+ forkNum, blockNum, mode, strategy, &hit , 0);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
@@ -280,7 +426,7 @@ ReadBufferWithoutRelcache(RelFileNode rn
Assert(InRecovery);
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
- mode, strategy, &hit);
+ mode, strategy, &hit , 0);
}
@@ -288,15 +434,18 @@ ReadBufferWithoutRelcache(RelFileNode rn
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
+ * index_for_aio , if -ve , is negative of ( index of the aiocb in the BufferAiocbs array + 3 )
+ * which is passed through to StartBufferIO
*/
-static Buffer
+Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy, bool *hit , int index_for_aio )
{
volatile BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ int allocrc; /* retcode from BufferAlloc */
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -328,16 +477,40 @@ ReadBuffer_common(SMgrRelation smgr, cha
}
else
{
+ allocrc = mode; /* pass mode to BufferAlloc since it must not wait for async io if RBM_NOREAD_FOR_PREFETCH */
/*
* lookup the buffer. IO_IN_PROGRESS is set if the requested block is
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
- if (found)
- pgBufferUsage.shared_blks_hit++;
+ strategy, &allocrc , index_for_aio );
+ if (allocrc < 0) {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ bufBlock = BufHdrGetBlock(bufHdr);
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ found = true;
+ }
+ else if (allocrc > 0) {
+ pgBufferUsage.shared_blks_hit++;
+ found = true;
+ }
+ else {
pgBufferUsage.shared_blks_read++;
+ found = false;
+ }
}
/* At this point we do NOT hold any locks. */
@@ -410,7 +583,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
Assert(bufHdr->flags & BM_VALID);
bufHdr->flags &= ~BM_VALID;
UnlockBufHdr(bufHdr);
- } while (!StartBufferIO(bufHdr, true));
+ } while (!StartBufferIO(bufHdr, true, 0));
}
}
@@ -430,6 +603,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (mode != RBM_NOREAD_FOR_PREFETCH) {
if (isExtend)
{
/* new buffers are zero-filled */
@@ -499,6 +673,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
VacuumPageMiss++;
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageMiss;
+ }
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -520,21 +695,39 @@ ReadBuffer_common(SMgrRelation smgr, cha
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
- * The returned buffer is pinned and is already marked as holding the
- * desired page. If it already did have the desired page, *foundPtr is
- * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be used for any StartBufferIO performed by this routine.
+ * In this case, if block not found in buffer pool and we allocate a new buffer,
+ * then we must maintain the spinlock on the buffer and pass it back to caller.
+ *
+ * foundPtr is input and output :
+ * . input - indicates the read-buffer mode ( see bufmgr.h )
+ * . output - indicates the status of the buffer - see below
+ *
+ * Except for the case where mode is RBM_NOREAD_FOR_PREFETCH and the buffer is found,
+ * the returned buffer is pinned and is already marked as holding the
+ * desired page.
+ * If it already did have the desired page and the page content is valid,
+ * *foundPtr is set to 1.
+ * If it already did have the desired page and mode is RBM_NOREAD_FOR_PREFETCH
+ * and StartBufferIO returned false
+ * (meaning it could not initialise the buffer for aio),
+ * *foundPtr is set to 2.
+ * If it already did have the desired page but the page content is invalid,
+ * *foundPtr is set to -1;
+ * this can happen only if the buffer was read by an async read
+ * and the aio is still in progress or remains pinned by the issuer of the startaio.
+ * Otherwise, *foundPtr is set to 0 and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
- *
- * No locks are held either at entry or exit.
+ * No locks are held either at entry or exit EXCEPT for case noted above
+ * of passing an empty buffer back to async io caller ( index_for_aio set ).
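+ *
+ * Quick summary of *foundPtr on return (the same information as above) :
+ *    1   page present and contents valid
+ *    2   page present, mode was RBM_NOREAD_FOR_PREFETCH and StartBufferIO could not
+ *        initialise it for aio - the buffer must not be used for the prefetch
+ *   -1   page present but contents not valid (async read still in progress, or
+ *        still pinned by the backend that issued the startaio)
+ *    0   buffer allocated and marked IO_IN_PROGRESS - the caller must perform the read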
*/
static volatile BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ int *foundPtr , int index_for_aio )
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
@@ -546,6 +739,13 @@ BufferAlloc(SMgrRelation smgr, char relp
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ int IntentionBufferrc; /* retcode from BufCheckAsync */
+ bool StartBufferIOrc; /* retcode from StartBufferIO */
+ ReadBufferMode mode;
+
+
+ mode = *foundPtr;
+ *foundPtr = 0;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -560,21 +760,53 @@ BufferAlloc(SMgrRelation smgr, char relp
if (buf_id >= 0)
{
/*
- * Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * Found it.
*/
+ *foundPtr = 1;
buf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
+ /* If prefetch mode, then return immediately indicating found,
+ ** and NOTE that in this case only, we did not pin the buffer.
+ ** In theory we might check whether the buffer is valid, has io in progress, etc.,
+ ** but in practice it is simpler to abandon the prefetch if the buffer exists.
+ */
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ /* release the mapping lock and return */
LWLockRelease(newPartitionLock);
+ } else {
+ /* note that the current request is for the same tag as the one associated with the aio -
+ ** so simply complete the aio and we have our buffer.
+ ** If an aio was started on this buffer,
+ ** check whether it has completed and wait for it if not.
+ ** Also, if an aio had been started, then the task
+ ** which issued the start aio already pinned the buffer for this read,
+ ** so if that task was me and the aio was successful,
+ ** pass the current pin to this read without dropping and re-acquiring it.
+ ** All of this is done by BufCheckAsync.
+ */
+ IntentionBufferrc = BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_WANT , strategy , index_for_aio , false , newPartitionLock );
- *foundPtr = TRUE;
+ /* check to see if the correct data has been loaded into the buffer. */
+ valid = (IntentionBufferrc == BUF_INTENT_RC_VALID);
- if (!valid)
- {
+ /* check for serious IO errors */
+ if (!valid) {
+ if ( (IntentionBufferrc != BUF_INTENT_RC_INVALID_NO_AIO)
+ && (IntentionBufferrc != BUF_INTENT_RC_INVALID_AIO)
+ ) {
+ *foundPtr = -1; /* inform caller of serious error */
+ }
+ else
+ if (IntentionBufferrc == BUF_INTENT_RC_INVALID_AIO) {
+ goto proceed_with_not_found; /* yes, I know, a goto ... think of it as a break out of the if */
+ }
+ }
+
+ /* BufCheckAsync pinned the buffer */
+ /* so can now release the mapping lock */
+ LWLockRelease(newPartitionLock);
+
+ if (!valid) {
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
@@ -582,19 +814,21 @@ BufferAlloc(SMgrRelation smgr, char relp
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(buf, true, index_for_aio))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ }
}
}
return buf;
}
+ proceed_with_not_found:
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
@@ -619,8 +853,10 @@ BufferAlloc(SMgrRelation smgr, char relp
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf->flags;
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ /* If an aio was started on this buffer,
+ ** check complete and cancel it if not.
+ */
+ BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_REJECT_OBTAIN_PIN , 0 , index_for_aio, true , 0 );
/* Now it's safe to release the freelist lock */
if (lock_held)
@@ -791,13 +1027,18 @@ BufferAlloc(SMgrRelation smgr, char relp
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc)
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ } else
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
}
}
@@ -860,10 +1101,17 @@ BufferAlloc(SMgrRelation smgr, char relp
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
- if (StartBufferIO(buf, true))
- *foundPtr = FALSE;
- else
- *foundPtr = TRUE;
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc) {
+ *foundPtr = 0;
+ } else {
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
+ } else {
+ *foundPtr = 1;
+ }
+ }
return buf;
}
@@ -970,6 +1218,10 @@ retry:
/*
* Insert the buffer at the head of the list of free buffers.
*/
+ /* avoid confusing freelist with strange-looking freeNext */
+ if (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN) { /* means was used for aiocb index */
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ }
StrategyFreeBuffer(buf);
}
@@ -1022,6 +1274,56 @@ MarkBufferDirty(Buffer buffer)
UnlockBufHdr(bufHdr);
}
+/* return the block number of the block in a buffer, or 0 if the buffer is not valid
+** if a shared buffer, it must be pinned
+*/
+BlockNumber
+BlocknumOfBuffer(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ BlockNumber rc = 0;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc = bufHdr->tag.blockNum;
+ }
+
+ return rc;
+}
+
+/* report whether the specified buffer does NOT contain the given block of the relation's main fork
+** (returns true if the buffer holds a different block, false if it holds the requested one)
+** if a shared buffer, it must be pinned
+*/
+bool
+BlocknotinBuffer(Buffer buffer,
+ Relation relation,
+ BlockNumber blockNum)
+{
+ volatile BufferDesc *bufHdr;
+ bool rc = false;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc =
+ ( (bufHdr->tag.blockNum != blockNum)
+ || (!(RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) ))
+ || (bufHdr->tag.forkNum != MAIN_FORKNUM)
+ );
+ }
+
+ return rc;
+}
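+
+/* Illustrative use of the helper above (this mirrors how ReleaseAndReadBuffer, below,
+** uses BlocknotinBuffer; the surrounding variables are just for the example) :
+**
+**     isDifferentBlock = BlocknotinBuffer(buffer, relation, blockNum);
+**     if (!isDifferentBlock)
+**         return buffer;         (same block of the main fork - reuse the buffer as-is)
+**
+** Note that BlocknotinBuffer compares against MAIN_FORKNUM only.
+*/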
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
@@ -1040,18 +1342,18 @@ ReleaseAndReadBuffer(Buffer buffer,
Relation relation,
BlockNumber blockNum)
{
- ForkNumber forkNum = MAIN_FORKNUM;
volatile BufferDesc *bufHdr;
+ bool isDifferentBlock; /* requesting different block from that already in buffer ? */
if (BufferIsValid(buffer))
{
+ /* if a shared buffer, we have a pin, so it's ok to examine the tag without the spinlock */
+ isDifferentBlock = BlocknotinBuffer(buffer,relation,blockNum); /* requesting different block from that already in buffer ? */
if (BufferIsLocal(buffer))
{
Assert(LocalRefCount[-buffer - 1] > 0);
bufHdr = &LocalBufferDescriptors[-buffer - 1];
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ if (!isDifferentBlock)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
LocalRefCount[-buffer - 1]--;
@@ -1060,12 +1362,12 @@ ReleaseAndReadBuffer(Buffer buffer,
{
Assert(PrivateRefCount[buffer - 1] > 0);
bufHdr = &BufferDescriptors[buffer - 1];
- /* we have pin, so it's ok to examine tag without spinlock */
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ BufCheckAsync(0 , relation , bufHdr , ( isDifferentBlock ? BUF_INTENTION_REJECT_FORGET
+ : BUF_INTENTION_REJECT_KEEP_PIN )
+ , 0 , 0 , false , 0 ); /* end any IO and maybe unpin */
+ if (!isDifferentBlock) {
return buffer;
- UnpinBuffer(bufHdr, true);
+ }
}
}
@@ -1090,11 +1392,12 @@ ReleaseAndReadBuffer(Buffer buffer,
* Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows
* some callers to avoid an extra spinlock cycle.
*/
-static bool
+bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
int b = buf->buf_id;
bool result;
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
if (PrivateRefCount[b] == 0)
{
@@ -1116,12 +1419,34 @@ PinBuffer(volatile BufferDesc *buf, Buff
else
{
/* If we previously pinned the buffer, it must surely be valid */
+ /* With async prefetch that is no longer necessarily true :
+ ** I may already hold a (banked) pin while an IO on the buffer is still in progress, so re-check BM_VALID below.
result = true;
+ */
+ LockBufHdr(buf);
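+ /* A pin is "banked" for me when the banked flag is set and the pid recorded for the
+ ** banked pin is mine : while an aio is in progress that pid is held in the associated
+ ** BufferAiocb (located via freeNext), otherwise it is encoded directly in freeNext as
+ ** a negative pid.  The same test recurs in PinBuffer_Locked, UnpinBuffer,
+ ** IncrBufferRefCount and BufCheckAsync.
+ */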
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
}
+ }
+ result = (buf->flags & BM_VALID) != 0;
+ UnlockBufHdr(buf);
+ }
+
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
return result;
}
@@ -1138,19 +1463,36 @@ PinBuffer(volatile BufferDesc *buf, Buff
* to save a spin lock/unlock cycle, because we need to pin a buffer before
* its state can change under us.
*/
-static void
+void
PinBuffer_Locked(volatile BufferDesc *buf)
{
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (PrivateRefCount[b] == 0)
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (PrivateRefCount[b] == 0) {
buf->refcount++;
+ }
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer_Locked : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
}
+}
/*
* UnpinBuffer -- make buffer available for replacement.
@@ -1160,29 +1502,68 @@ PinBuffer_Locked(volatile BufferDesc *bu
* Most but not all callers want CurrentResourceOwner to be adjusted.
* Those that don't should pass fixOwner = FALSE.
*/
-static void
+void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
+
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (fixOwner)
+ if (fixOwner) {
ResourceOwnerForgetBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
{
+
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
LockBufHdr(buf);
+ /* this backend is releasing its last pin - the buffer should not have a pin banked by me,
+ ** and if AIO is in progress then there should be a pin held by another backend
+ */
+ pin_already_banked_by_me = ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+ ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext))
+ ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ /* this is an anomalous situation - the caller held a banked pin (which callers are not supposed to know about)
+ ** but has either discovered it or has over-counted how many pins it holds
+ */
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the pin although it is now of no use since about to release */
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+
+ /* temporarily suppress logging this error to avoid performance degradation -
+ ** either this task really does not need the buffer, in which case the anomaly is harmless,
+ ** or a more severe error will be detected later (possibly immediately below)
+ elog(LOG, "UnpinBuffer : released last this-backend pin on buffer %d rel=%s, blockNum=%u, but had banked pin flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ */
+ }
+
/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;
+ if ( (buf->refcount == 0) && (buf->flags & BM_AIO_IN_PROGRESS) ) {
+
+ elog(ERROR, "UnpinBuffer : released last any-backend pin on buffer %d rel=%s, blockNum=%u, but AIO in progress flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ }
+
+
/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
@@ -1657,6 +2038,7 @@ SyncOneBuffer(int buf_id, bool skip_rece
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
+
/*
* Check whether buffer needs writing.
*
@@ -1789,6 +2171,8 @@ PrintBufferLeakWarning(Buffer buffer)
char *path;
BackendId backend;
+
+
Assert(BufferIsValid(buffer));
if (BufferIsLocal(buffer))
{
@@ -1799,12 +2183,28 @@ PrintBufferLeakWarning(Buffer buffer)
else
{
buf = &BufferDescriptors[buffer - 1];
+#ifdef USE_PREFETCH
+ /* If the reason that this buffer is pinned
+ ** is that it was prefetched with async_io
+ ** and never read or discarded, then omit the
+ ** warning, because this is expected in some
+ ** cases when a scan is closed abnormally.
+ ** Note that the buffer will be released soon by our caller.
+ */
+ if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED) {
+ pgBufferUsage.aio_read_forgot++; /* account for it */
+ return;
+ }
+#endif /* USE_PREFETCH */
loccount = PrivateRefCount[buffer - 1];
backend = InvalidBackendId;
}
+/* #if defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
/* theoretically we should lock the bufhdr here */
path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+
+
elog(WARNING,
"buffer refcount leak: [%03d] "
"(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1812,6 +2212,7 @@ PrintBufferLeakWarning(Buffer buffer)
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
pfree(path);
+/* #endif defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
}
/*
@@ -1928,7 +2329,7 @@ FlushBuffer(volatile BufferDesc *buf, SM
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, 0))
return;
/* Setup error traceback support for ereport() */
@@ -2512,6 +2913,70 @@ FlushDatabaseBuffers(Oid dbid)
}
}
+#ifdef USE_PREFETCH
+/*
+ * DiscardBuffer -- discard shared buffer used for a previously
+ * prefetched but unread block of a relation
+ *
+ * If the buffer is found and pinned with a banked pin, then :
+ * . if AIO in progress, terminate AIO without waiting
+ * . if AIO had already completed successfully,
+ * then mark buffer valid (in case someone else wants it)
+ * . redeem the banked pin and unpin it.
+ *
+ * This function is similar in purpose to ReleaseBuffer (below)
+ * but sufficiently different that it is a separate function.
+ * Two important differences are :
+ * . caller identifies buffer by blocknumber, not buffer number
+ * . we unpin buffer *only* if the pin is banked,
+ * *never* if pinned but not banked.
+ * This is essential as caller may perform a sequence of
+ * SCAN1 . PrefetchBuffer (and remember block was prefetched)
+ * SCAN2 . ReadBuffer (but fails to connect this read to the prefetch by SCAN1)
+ * SCAN1 . DiscardBuffer (SCAN1 terminates early)
+ * SCAN2 . access tuples in buffer
+ * Clearly the Discard *must not* unpin the buffer since SCAN2 needs it!
+ *
+ *
+ * caller may pass InvalidBlockNumber as blockNum to mean do nothing
+ */
+void
+DiscardBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLockId newPartitionLock; /* buffer partition lock for it */
+ Buffer buf_id;
+ volatile BufferDesc *buf_desc;
+
+ if (!SmgrIsTemp(reln->rd_smgr)) {
+ Assert(RelationIsValid(reln));
+ if (BlockNumberIsValid(blockNum)) {
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ BufCheckAsync(0 , reln, buf_desc , BUF_INTENTION_REJECT_UNBANK , 0 , 0 , false , 0); /* end the IO and unpin if banked */
+ pgBufferUsage.aio_read_discrd++; /* account for it */
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/*
* ReleaseBuffer -- release the pin on a buffer
*/
@@ -2520,26 +2985,23 @@ ReleaseBuffer(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
if (BufferIsLocal(buffer))
{
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]--;
return;
}
-
- bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- PrivateRefCount[buffer - 1]--;
else
- UnpinBuffer(bufHdr, false);
+ {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ BufCheckAsync(0 , 0 , bufHdr , BUF_INTENTION_REJECT_NOADJUST , 0 , 0 , false , 0 );
+ }
}
/*
@@ -2565,14 +3027,41 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
+ volatile BufferDesc *buf; /* descriptor for a shared buffer */
+
Assert(BufferIsPinned(buffer));
+
+ if (!(BufferIsLocal(buffer))) {
+ buf = &BufferDescriptors[buffer - 1];
+ LockBufHdr(buf);
+ pin_already_banked_by_me =
+ ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ }
+
+ if (!pin_already_banked_by_me) {
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
+ }
+
if (BufferIsLocal(buffer))
LocalRefCount[-buffer - 1]++;
- else
+ else {
+ if (pin_already_banked_by_me) {
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[buffer - 1]++;
}
+ }
+}
/*
* MarkBufferDirtyHint
@@ -2994,61 +3483,138 @@ WaitIO(volatile BufferDesc *buf)
*
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
- * has already started I/O on this buffer then we will block on the
+ * has already started synchronous I/O on this buffer then we will block on the
* io_in_progress lock until he's done.
*
+ * If an async io is in progress and we are doing synchronous io,
+ * then ReadBuffer waits via a call to smgrcompleteaio,
+ * and so we treat this request as if no io were in progress.
+ *
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
* so we can always tell if the work is already done.
*
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be attached to the buffer header for use with async io
+ *
* Returns TRUE if we successfully marked the buffer as I/O busy,
* FALSE if someone else already did the work.
*/
static bool
-StartBufferIO(volatile BufferDesc *buf, bool forInput)
+StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio )
{
+#ifdef USE_PREFETCH
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+#endif /* USE_PREFETCH */
+
+ if (!index_for_aio)
Assert(!InProgressBuf);
for (;;)
{
+ if (!index_for_aio) {
/*
* Grab the io_in_progress lock so that other processes can wait for
* me to finish the I/O.
*/
LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
+ }
LockBufHdr(buf);
- if (!(buf->flags & BM_IO_IN_PROGRESS))
+ /* the following test is intended to distinguish between :
+ ** . a buffer which :
+ ** . has io in progress
+ ** AND is not associated with a current or recent aio
+ ** . anything else
+ ** Here, "recent" means an aio marked by buf->freeNext <= FREENEXT_BAIOCB_ORIGIN but no longer in progress -
+ ** this situation arises when the aio has just been cancelled and this process now wishes to recycle the buffer.
+ ** In this case, the first such would-be recycler (i.e. me) must :
+ ** . avoid waiting for the cancelled aio to complete
+ ** . if not myself doing async read, then assume responsibility for posting other future readbuffers.
+ */
+ if ( (buf->flags & BM_AIO_IN_PROGRESS)
+ || (!(buf->flags & BM_IO_IN_PROGRESS))
+ )
break;
/*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
+ * The only way BM_IO_IN_PROGRESS (with no AIO in progress) could be set while the io_in_progress
* lock isn't held is if the process doing the I/O is recovering from
* an error (see AbortBufferIO). If that's the case, we must wait for
* him to get unwedged.
*/
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
WaitIO(buf);
}
- /* Once we get here, there is definitely no I/O active on this buffer */
-
+#ifdef USE_PREFETCH
+ /* Once we get here, there is definitely no synchronous I/O active on this buffer,
+ ** but if we are being asked to attach a BufferAiocb to the buf header,
+ ** then we must also check whether there is any async io, currently
+ ** in progress or with its pin still banked, that was started by a different task.
+ */
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext);
+ if ( (buf->flags & (BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))
+ && (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN)
+ && (BAiocb->pidOfAio != this_backend_pid)
+ ) {
+ /* someone else already doing async I/O */
+ UnlockBufHdr(buf);
+ return false;
+ }
+ }
+#endif /* USE_PREFETCH */
if (forInput ? (buf->flags & BM_VALID) : !(buf->flags & BM_DIRTY))
{
/* someone else already did the I/O */
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
return false;
}
buf->flags |= BM_IO_IN_PROGRESS;
+#ifdef USE_PREFETCH
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - index_for_aio);
+ /* insist that no other buffer is using this BufferAiocb for async IO */
+ if (BAiocb->BAiocbbufh == (struct sbufdesc *)0) {
+ BAiocb->BAiocbbufh = buf;
+ }
+ if (BAiocb->BAiocbbufh != buf) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block %p to be used by %p already in use by %p"
+ ,BAiocb ,buf , BAiocb->BAiocbbufh)));
+ }
+ /* note - there is no need to register self as a dependent of BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ buf->flags |= BM_AIO_IN_PROGRESS;
+ buf->freeNext = index_for_aio;
+ /* at this point, this buffer appears to have an in-progress aio_read,
+ ** and any other task which is able to look inside the buffer might try waiting on that aio -
+ ** except we have not yet issued the aio! So we must keep the buffer header locked
+ ** from here all the way back to the BufStartAsync caller
+ */
+ } else {
+#endif /* USE_PREFETCH */
+
UnlockBufHdr(buf);
InProgressBuf = buf;
IsForInput = forInput;
+#ifdef USE_PREFETCH
+ }
+#endif /* USE_PREFETCH */
return true;
}
@@ -3058,7 +3624,7 @@ StartBufferIO(volatile BufferDesc *buf,
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
+ * if no async IO is in progress, then we hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
@@ -3070,26 +3636,32 @@ StartBufferIO(volatile BufferDesc *buf,
* BM_IO_ERROR in a failure case. For successful completion it could
* be 0, or BM_VALID if we just finished reading in the page.
*/
-static void
+void
TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits)
{
- Assert(buf == InProgressBuf);
+ int flags_on_entry;
LockBufHdr(buf);
+ flags_on_entry = buf->flags;
+
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) )
+ Assert( buf == InProgressBuf );
+
Assert(buf->flags & BM_IO_IN_PROGRESS);
- buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
+ buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) ) {
InProgressBuf = NULL;
-
LWLockRelease(buf->io_in_progress_lock);
}
+}
/*
* AbortBufferIO: Clean up any active buffer I/O after an error.
--- src/backend/storage/buffer/buf_async.c.orig 2014-05-31 17:21:19.261556301 -0400
+++ src/backend/storage/buffer/buf_async.c 2014-05-31 19:53:09.036073466 -0400
@@ -0,0 +1,921 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_async.c
+ * buffer manager asynchronous disk read routines
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/buffer/buf_async.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * Principal entry points:
+ *
+ * BufStartAsync() -- start an asynchronous read of a block into a buffer and
+ * pin it so that no one can destroy it while this process is using it.
+ *
+ * BufCheckAsync() -- check completion of an asynchronous read
+ * and either claim buffer or discard it
+ *
+ * Private helper
+ *
+ * BufReleaseAsync() -- release the BAiocb resources used for an asynchronous read
+ *
+ * See also these files:
+ * bufmgr.c -- main buffer manager functions
+ * buf_init.c -- initialisation of resources
+ */
+#include "postgres.h"
+#include <sys/types.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "catalog/catalog.h"
+#include "common/relpath.h"
+#include "executor/instrument.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "storage/standby.h"
+#include "utils/rel.h"
+#include "utils/resowner_private.h"
+
+/*
+ * GUC parameters
+ */
+int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
+
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+extern int maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+extern int maxRelBAiocbTries; /* max times we will try to release a BufferAiocb back to freelist */
+extern pid_t this_backend_pid; /* pid of this backend */
+
+extern bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+extern void PinBuffer_Locked(volatile BufferDesc *buf);
+extern Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+ ForkNumber forkNum, BlockNumber blockNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy,
+ bool *hit , int index_for_aio);
+extern void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+extern void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+ int set_flag_bits);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+int BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc
+ ,int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
+static struct BufferAiocb volatile * cachedBAiocb = (struct BufferAiocb*)0; /* one cached BufferAiocb for use with aio */
+
+#ifdef USE_PREFETCH
+/* BufReleaseAsync releases a BufferAiocb and returns 0 if successful else non-zero
+** it *must* be called :
+** EITHER with a valid BAiocb->BAiocbbufh -> buf_desc
+** and that buf_desc must be spin-locked
+** OR with BAiocb->BAiocbbufh == 0
+*/
+static int
+BufReleaseAsync(struct BufferAiocb volatile * BAiocb)
+{
+ int LockTries; /* max times we will try to release the BufferAiocb */
+ volatile struct BufferAiocb *BufferAiocbs;
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+
+ int failed = 1; /* by end of this function, non-zero will indicate if we failed to return the BAiocb */
+
+
+ if ( ( BAiocb == (struct BufferAiocb*)0 )
+ || ( BAiocb == (struct BufferAiocb*)BAIOCB_OCCUPIED )
+ || ( ((unsigned long)BAiocb) & 0x1 )
+ ) {
+ elog(ERROR,
+ "AIO control block corruption on release of aiocb %p - invalid BAiocb"
+ ,BAiocb);
+ }
+ else
+ if ( (0 == BAiocb->BAiocbDependentCount) /* no dependents */
+ && ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) /* not already on freelist */
+ ) {
+
+ if ((struct sbufdesc*)0 != BAiocb->BAiocbbufh) { /* if a buffer was attached */
+ volatile BufferDesc *buf_desc = BAiocb->BAiocbbufh;
+
+ /* spinlock held so instead of TerminateBufferIO(buf, false , 0); ... */
+ if (buf_desc->flags & BM_AIO_PREFETCH_PIN_BANKED) { /* if a pid banked the pin */
+ buf_desc->freeNext = -(BAiocb->pidOfAio); /* then remember which pid */
+ }
+ else if (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* disconnect BufferAiocb from buf_desc */
+ }
+ buf_desc->flags &= ~BM_AIO_IN_PROGRESS;
+ }
+
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* disconnect buf_desc from BufferAiocb */
+ BAiocb->pidOfAio = 0; /* clean */
+ LockTries = maxRelBAiocbTries; /* max times we will try to release the BufferAiocb */
+ do {
+ register long long int dividend , remainder;
+
+ /* retrieve old value of FreeBAiocbs */
+ BAiocb->BAiocbnext = oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* this is a volatile value unprotected by any lock, so we must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+ ** To do so, verify by direct division that its address offset from the first control block
+ ** is an integral multiple of the control block size
+ ** and that the resulting index lies within the range [ 0 , (numBufferAiocbs-1) ]
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ failed = (int)remainder;
+ if (!failed) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ failed = ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) );
+ if (!failed) {
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, BAiocb)) {
+ LockTries = 0; /* end the do loop */
+
+ goto cheering; /* can't simply break because then failed would be set incorrectly */
+ }
+ }
+ }
+ failed = -1; /* if we reach here, this attempt failed (covers the case of a failed compare-and-swap) */
+
+ cheering: ;
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while (LockTries-- > 0);
+
+ if (failed) {
+#ifdef LOG_RELBAIOCB_DEPLETION
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p unreleased after tries= %d\n"
+ ,BAiocb,maxRelBAiocbTries);
+#endif /* LOG_RELBAIOCB_DEPLETION */
+ }
+
+ }
+ else
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p either has dependents= %d or is already on freelist %p or has no buf_header %p\n"
+ ,BAiocb , BAiocb->BAiocbDependentCount , BAiocb->BAiocbnext , BAiocb->BAiocbbufh);
+ return failed;
+}
+
+/* try using asynchronous aio_read to prefetch into a buffer
+** return code :
+** 0 if started successfully
+** -1 if failed for some reason
+** 1+PrivateRefCount if we found desired buffer in buffer pool
+**
+** There is a harmless race condition here :
+** two different backends may both arrive here simultaneously
+** to prefetch the same buffer. This is not unlikely when a syncscan is in progress.
+** . One will acquire the buffer and issue the smgrstartaio
+** . The other will find the buffer on return from ReadBuffer_common with hit = true
+** Only the first task has a pin on the buffer since ReadBuffer_common knows not to get a pin
+** on a found buffer in prefetch mode.
+** Therefore - the second task must simply abandon the prefetch if it finds the buffer in the buffer pool.
+**
+** if we fail to acquire a BAiocb because of concurrent theft from the freelist by another backend,
+** retry up to maxGetBAiocbTries times provided that there actually was at least one BAiocb on the freelist.
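+**
+** The presumed call site is PrefetchBuffer (earlier in this patch); roughly :
+**     BufStartAsyncrc = BufStartAsync(reln, forkNum, blockNum, strategy);
+** where a return of 0 means the aio was started and the buffer is newly pinned (banked),
+** and a return > 0 means the block was already present so the prefetch was abandoned.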
+*/
+int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy) {
+
+ int retcode = -1;
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+ int smgrstartaio_rc = -1; /* retcode from smgrstartaio */
+ bool do_unpin_buffer = false; /* unpin must be deferred until after buffer descriptor is unlocked */
+ Buffer buf_id;
+ bool hit = false;
+ volatile BufferDesc *buf_desc = (BufferDesc *)0;
+
+ int LockTries; /* max times we will try to get a free BufferAiocb */
+
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+ struct BufferAiocb volatile * newFreeBAiocb; /* new value of FreeBAiocbs */
+
+
+ /* return immediately if no async io resources */
+ if (numBufferAiocbs > 0) {
+ buf_id = (Buffer)0;
+
+ if ( (struct BAiocbAnchor *)0 != BAiocbAnchr ) {
+
+ volatile struct BufferAiocb *BufferAiocbs;
+
+ if ((struct BufferAiocb*)0 != cachedBAiocb) { /* any cached BufferAiocb ? */
+ BAiocb = cachedBAiocb; /* yes use it */
+ cachedBAiocb = BAiocb->BAiocbnext;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+ } else {
+
+ LockTries = maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+ do {
+ register long long int dividend = -1 , remainder;
+ /* check if we have a free BufferAiocb */
+
+ oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* check if we have a free BufferAiocb */
+
+ /* BAiocbAnchr->FreeBAiocbs is a volatile value unprotected by any lock,
+ ** and use of compare-and-swap to add and remove items from the list has
+ ** two potential pitfalls, both relating to the fact that we must
+ ** access data de-referenced from this pointer before the compare-and-swap.
+ ** 1) The value we load may be corrupt, e.g. mixture of bytes from
+ ** two different values, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+ ** to do so, verify by direct division that its address offset from
+ ** first control block is an integral multiple of the control block size
+ ** that lies within the range [ 0 , (numBufferAiocbs-1) ]
+ ** Thus we completely prevent this pitfall.
+ ** 2) The content of the item's next pointer may have changed between the
+ ** time we de-reference it and the time of the compare-and-swap.
+ ** Thus even though the compare-and-swap succeeds, we might set the
+ ** new head of the freelist to an invalid value (either a free item
+ ** that is not the first in the free chain - resulting only in
+ ** loss of the orphaned free items, or, much worse, an in-use item).
+ ** In practice this is extremely unlikely because of the implied huge delay
+ ** in this window interval in this (current) process. Here are two scenarios:
+ ** legend:
+ ** P0 - this (current) process, P1, P2 , ... other processes
+ ** content of freelist shown as BAiocbAnchr->FreeBAiocbs -> first item -> 2nd item ...
+ ** @[X] means address of X
+ ** | timeline of window of exposure to problems
+ ** successive lines in chronological order content of freelist
+ ** 2.1 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 IS IN USE !! CORRUPT !!
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had become in-use during the window.
+ ** 2.2 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P3 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I2] F -> I2 -> I3 ...
+ ** | P3 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I3] F -> I2 -> I3 ...
+ ** | P3 swap-remove I2, place I3 at head of list F -> I3 ...
+ ** | P2 complete aio, replace I1 at head of list F -> I1 -> I3 ...
+ ** | P3 complete aio, replace I2 at head of list F -> I2 -> I1 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I1 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 -> I3 ... ! I2 is orphaned !
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had moved further down the free list during the window.
+ ** Unfortunately, we cannot prevent this pitfall but we can detect it (after the fact),
+ ** by checking that the next pointer of the item we have just removed for our use still points to the same item.
+ ** This test is not subject to any timing or uncertainty since :
+ ** . The fact that the compare-and-swap succeeded implies that the item we removed
+ ** was defintely on the freelist (at the head) when it was removed,
+ ** and therefore cannot be in use, and therefore its next pointer is no longer volatile.
+ ** . Although pointers of the anchor and items on the freelist are volatile,
+ ** the addresses of items never change - they are in an allocated array and never move.
+ ** E.g. in the above two scenarios, the test is that I0.next still -> I1,
+ ** and this is true if and only if the second item on the freelist is
+ ** still the same at the end of the window as it was at the start of the window.
+ ** Note that we do not insist that it did not change during the window,
+ ** only that it is still the correct new head of freelist.
+ ** If this test fails, we abort immediately as the subsystem is damaged and cannot be repaired.
+ ** Note that at least one aio must have been issued *and* completed during the window
+ ** for this to occur, and since the window is just one single machine instruction,
+ ** it is very unlikely in practice.
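+ **
+ ** Recap of the acquire path in pseudo-code (an informal restatement of the code below,
+ ** not additional logic) :
+ **     old = FreeBAiocbs                                   (unvalidated snapshot)
+ **     check old is the address of some BufferAiocbs[i]    (defeats pitfall 1)
+ **     new = old->BAiocbnext                               (may be stale)
+ **     if compare_and_swap( &FreeBAiocbs , old , new ) succeeded :
+ **         re-check that old->BAiocbnext == new            (detects pitfall 2 after the fact)
+ **         and validate that new is null or the address of some BufferAiocbs[j]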
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ if (remainder == 0) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ }
+ if ( (remainder == 0)
+ && ( (dividend >= 0 ) && ( dividend < numBufferAiocbs) )
+ )
+ {
+ newFreeBAiocb = oldFreeBAiocb->BAiocbnext; /* tentative new value is second on free list */
+ /* Here we are in the exposure window referred to in the above comments,
+ ** so moving along rapidly ...
+ */
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, newFreeBAiocb)) { /* did we get it ? */
+ /* We have successfully swapped head of freelist pointed to by oldFreeBAiocb off the list;
+ ** Here we check that the item we just placed at head of freelist, pointed to by newFreeBAiocb,
+ ** is the right one
+ **
+ ** also check that the BAiocb we have acquired was not in use
+ ** i.e. that scenario 2.1 above did not occur just before our compare-and-swap
+ ** The test is that the BAiocb is not in use.
+ **
+ ** in one hypothetical case,
+ ** we can be certain that there is no corruption -
+ ** the case where newFreeBAiocb == 0 and oldFreeBAiocb->BAiocbnext != BAIOCB_OCCUPIED -
+ ** i.e. we have set the freelist to empty but we have a baiocb chained from ours.
+ ** in this case our comp_swap removed all BAiocbs from the list (including ours)
+ ** so the others chained from ours are either orphaned (no harm done)
+ ** or in use by another backend and will eventually be returned (fine).
+ */
+ if ((struct BufferAiocb *)0 == newFreeBAiocb) {
+ if ((struct BufferAiocb *)BAIOCB_OCCUPIED == oldFreeBAiocb->BAiocbnext) {
+ goto baiocb_corruption;
+ } else if ((struct BufferAiocb *)0 != oldFreeBAiocb->BAiocbnext) {
+ elog(LOG,
+ "AIO control block inconsistency on acquiring aiocb %p - its next free %p may be orphaned (no corruption has occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext);
+ }
+ } else {
+ /* case of newFreeBAiocb not null - so must check more carefully ... */
+ remainder = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ dividend = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+
+ if ( (newFreeBAiocb != oldFreeBAiocb->BAiocbnext)
+ || (remainder != 0)
+ || ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) )
+ ) {
+ goto baiocb_corruption;
+ }
+ }
+ BAiocb = oldFreeBAiocb;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+
+ LockTries = 0; /* end the do loop */
+
+ }
+ }
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while ( ((struct BufferAiocb*)0 == BAiocb) /* did not get a BAiocb */
+ && ((struct BufferAiocb*)0 != oldFreeBAiocb) /* there was a free BAiocb */
+ && (LockTries-- > 0) /* told to retry */
+ );
+ }
+ }
+
+ if ( BAiocb != (struct BufferAiocb*)0 ) {
+ /* try an async io */
+ BAiocb->BAiocbthis.aio_fildes = -1; /* necessary to ensure any thief realizes aio not yet started */
+ BAiocb->pidOfAio = this_backend_pid;
+
+ /* now try to acquire a buffer :
+ ** note - ReadBuffer_common returns hit=true if the block is found in the buffer pool,
+ ** in which case there is no need to prefetch.
+ ** otherwise ReadBuffer_common pins returned buffer and calls StartBufferIO
+ ** and StartBufferIO :
+ ** . sets buf_desc->freeNext to negative of ( index of the aiocb in the BufferAiocbs array + 3 )
+ ** . sets BAiocb->BAiocbbufh -> buf_desc
+ ** and in this case the buffer spinlock is held.
+ ** This is essential as no other task must issue any intention with respect
+ ** to the buffer until we have started the aio_read.
+ ** Also note that ReadBuffer_common handles enlarging the ResourceOwner buffer list as needed
+ ** so I don't need to do that here
+ */
+ buf_id = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
+ forkNum, blockNum
+ ,RBM_NOREAD_FOR_PREFETCH /* tells ReadBuffer not to do any read, just alloc buf */
+ ,strategy , &hit , (FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))));
+ buf_desc = &BufferDescriptors[buf_id-1]; /* find buffer descriptor */
+
+ /* normally hit will be false as presumably it was not in the pool
+ ** when our caller looked - but it could be there now ...
+ */
+ if (hit) {
+ /* see earlier comments - we must abandon the prefetch */
+ retcode = 1 + PrivateRefCount[buf_id];
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ } else
+ if ( (buf_id > 0) && ((BufferDesc *)0 != buf_desc) && (buf_desc == BAiocb->BAiocbbufh) ) {
+ /* buff descriptor header lock should be held.
+ ** However, just to be safe , now validate that
+ ** we are still the owner and no other task already stole it.
+ */
+
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* ensure no banked pin */
+ /* there should not be any other pid waiting on this buffer
+ ** so check both of BM_VALID and BM_PIN_COUNT_WAITER are not set
+ */
+ if ( ( !(buf_desc->flags & (BM_VALID|BM_PIN_COUNT_WAITER) ) )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) /* it is still mine */
+ && (-1 == BAiocb->BAiocbthis.aio_fildes) /* no thief stole it */
+ && (0 == BAiocb->BAiocbDependentCount) /* no dependent */
+ ) {
+ /* we have an empty buffer for our use */
+
+ BAiocb->BAiocbthis.aio_buf = (void *)(BufHdrGetBlock(buf_desc)); /* Location of actual buffer. */
+
+ /* note - there is no need to register self as a dependent of BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ /* smgrstartaio retcode is returned in smgrstartaio_rc -
+ ** it indicates whether started or not
+ */
+ smgrstartaio(reln->rd_smgr, forkNum, blockNum , (char *)&(BAiocb->BAiocbthis) , &smgrstartaio_rc );
+
+ if (smgrstartaio_rc == 0) {
+ retcode = 0;
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* we did not register self as a dependent of BAiocb so there is no need to unregister */
+ } else {
+ /* failed - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ /* spinlock held so instead of TerminateBufferIO(buf_desc, false , 0); ... */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS | BM_AIO_PREFETCH_PIN_BANKED | BM_VALID);
+ /* we did not register self as a dependent of BAiocb so there is no need to unregister */
+
+ /* return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+
+ pgBufferUsage.aio_read_failed++;
+ smgrstartaio_rc = 1; /* to distinguish from aio not even attempted */
+ }
+ }
+ else {
+ /* buffer was stolen or in use by other task - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ }
+
+ UnlockBufHdr(buf_desc);
+ if (do_unpin_buffer) {
+ if (smgrstartaio_rc >= 0) { /* if aio was attempted */
+ TerminateBufferIO(buf_desc, false , 0);
+ }
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+ else {
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ }
+
+ if ((struct sbufdesc*)0 == BAiocb->BAiocbbufh) { /* we did not associate a buffer */
+ /* so return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+ }
+ }
+
+ return retcode;
+
+ baiocb_corruption:;
+ elog(PANIC,
+ "AIO control block corruption on acquiring aiocb %p - its next free %p conflicts with new freelist pointer %p which may be invalid (corruption may have occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext , newFreeBAiocb);
+}
+#endif /* USE_PREFETCH */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+/*
+ * BufCheckAsync -- act upon caller's intention regarding a shared buffer,
+ * primarily in connection with any async io in progress on the buffer.
+ * intention has two main classes and some subvalues within those :
+ *   class   subvalue
+ *   +ve     1                  want   <=> caller wants the buffer,
+ *                                         wait for in-progress aio and then always pin
+ *   -ve     -1, -2, -3, ...    reject <=> caller does not want the buffer,
+ *           (see below)                   if there are no dependents, then cancel the aio
+ *                                         and then optionally unpin
+ * Used when there may have been a previous fetch or prefetch.
+ *
+ * buffer is assumed to be an existing member of the shared buffer pool
+ * as returned by BufTableLookup.
+ * if AIO in progress, then :
+ * . terminate AIO, waiting for completion if +ve intention, else without waiting
+ * . if the AIO had already completed successfully, then mark buffer valid
+ * . pin/unpin as requested
+ *
+ * +ve intention indicates that buffer must be pinned :
+ * if the strategy parameter is null, then use the PinBuffer_Locked optimization
+ * to pin and unlock in one operation. But always update buffer usage count.
+ *
+ * -ve intention indicates whether and how to unpin :
+ * BUF_INTENTION_REJECT_KEEP_PIN -1 pin already held, do not unpin, (caller wants to keep it)
+ * BUF_INTENTION_REJECT_OBTAIN_PIN -2 obtain pin, caller wants it for same buffer
+ * BUF_INTENTION_REJECT_FORGET -3 unpin and tell resource owner to forget
+ * BUF_INTENTION_REJECT_NOADJUST -4 unpin and call ResourceOwnerForgetBuffer myself
+ * instead of telling UnpinBuffer to adjust CurrentResource owner
+ * (quirky simulation of ReleaseBuffer logic)
+ * BUF_INTENTION_REJECT_UNBANK -5 unpin only if pin banked by caller
+ * The behaviour for the -ve case is based on that of ReleaseBuffer, adding handling of async io.
+ *
+ * pin/unpin action must take account of whether this backend holds a "disposable" pin on the particular buffer.
+ * A "disposable" pin is a pin acquired by buffer manager without caller knowing, such as :
+ * when required to safeguard an async AIO - pin can be held across multiple bufmgr calls
+ * when required to safeguard waiting for an async AIO - pin acquired and released within this function
+ * if a disposable pin is held, then :
+ * if a new pin is requested, the disposable pin must be retained (redeemed) and any flags relating to it unset
+ * if an unpin is requested, then :
+ * if either no AIO in progress or this backend did not initiate the AIO
+ * then the disposable pin must be dropped (redeemed) and any flags relating to it unset
+ * else log warning and do nothing
+ * i.e. in either case, there is no longer a disposable pin after this function has completed.
+ * Note that if intention is BUF_INTENTION_REJECT_UNBANK,
+ * then caller expects there to be a disposable banked pin
+ * and if there isn't one, we do nothing
+ * for all other intentions, if there is no disposable pin, we pin/unpin normally.
+ *
+ * index_for_aio indicates the BAiocb to be used for next aio (see PrefetchBuffer)
+ * BufFreelistLockHeld indicates whether freelistlock is held
+ * spinLockHeld indicates whether buffer header spinlock is held
+ * PartitionLock is the buffer partition lock to be used
+ *
+ * return code (meaningful ONLY if intention is +ve) indicates validity of buffer :
+ * -1 buffer is invalid and failed PageHeaderIsValid check
+ * 0 buffer is not valid
+ * 1 buffer is valid
+ * 2 buffer is valid but tag changed - (so content does not match the relation block that caller expects)
+ */
+int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, BufferDesc volatile * buf_desc, int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock )
+{
+
+ int retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ bool valid = false;
+
+ BufferTag origTag = buf_desc->tag; /* original identity of selected buffer */
+
+#ifdef USE_PREFETCH
+ int smgrcompleteaio_rc; /* retcode from smgrcompleteaio */
+ SMgrRelation smgr = caller_smgr;
+ int aio_successful = -1; /* did the aio_read succeed ? -1 = no aio, 0 unsuccessful , 1 successful */
+ BufFlags flags_on_entry; /* for debugging - can be printed in gdb */
+ int freeNext_on_entry; /* for debugging - can be printed in gdb */
+ int BAiocbDependentCount_after_aio_finished = -1; /* for debugging - can be printed in gdb */
+ bool disposable_pin = false; /* this backend had a disposable pin on entry or pins the buffer while waiting for aio_read to complete */
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
+ int local_intention;
+#endif /* USE_PREFETCH */
+
+
+
+#ifdef USE_PREFETCH
+ if (!spinLockHeld) {
+ /* lock buffer header */
+ LockBufHdr(buf_desc);
+ }
+
+ flags_on_entry = buf_desc->flags;
+ freeNext_on_entry = buf_desc->freeNext;
+ pin_already_banked_by_me =
+ ( (flags_on_entry & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (flags_on_entry & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - freeNext_on_entry))->pidOfAio )
+ : (-(freeNext_on_entry)) ) == this_backend_pid )
+ );
+
+ if (pin_already_banked_by_me) {
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* but do we actually have a pin ?? */
+ /* this is an anomalous situation - somehow our disposable pin was lost without us noticing
+ ** if AIO is in progress and we started it,
+ ** then this is disastrous - two backends might both issue IO on same buffer
+ ** otherwise, it is harmless, and simply means we have no disposable pin,
+ ** but we must update flags to "notice" the fact now
+ */
+ if (flags_on_entry & BM_AIO_IN_PROGRESS) {
+ elog(ERROR, "BufCheckAsync : AIO control block issuer of aio_read lost pin with BM_AIO_IN_PROGRESS on buffer %d rel=%s, blockNum=%u, flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ } else {
+ elog(LOG, "BufCheckAsync : AIO control block issuer of aio_read lost pin on buffer %d rel=%s, blockNum=%u, with flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* since AIO not in progress, disconnect the buffer from banked pin */
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ pin_already_banked_by_me = false;
+ }
+ } else {
+ disposable_pin = true;
+ }
+ }
+
+ /* the case of BUF_INTENTION_REJECT_UNBANK is handled specially :
+ ** if this backend has a banked pin, then proceed just as for BUF_INTENTION_REJECT_FORGET
+ ** else the call is a no-op -- unlock buf header and return immediately
+ */
+ local_intention = intention;
+ if (intention == BUF_INTENTION_REJECT_UNBANK) {
+ if (pin_already_banked_by_me) {
+ local_intention = BUF_INTENTION_REJECT_FORGET;
+ } else {
+ goto unlock_buf_header; /* code following the unlock will do nothing since local_intention still set to BUF_INTENTION_REJECT_UNBANK */
+ }
+ }
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* we do not expect that BM_AIO_IN_PROGRESS is set without freeNext identifying the BAiocb */
+ if ( (buf_desc->flags & BM_AIO_IN_PROGRESS) && (buf_desc->freeNext == FREENEXT_NOT_IN_LIST) ) {
+
+ elog(ERROR, "BufCheckAsync : found BM_AIO_IN_PROGRESS without a BAiocb on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ }
+ /* check whether aio in progress */
+ if ( ( (struct BAiocbAnchor *)0 != BAiocbAnchr )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) /* has a valid BAiocb */
+ && ((FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext) < numBufferAiocbs) /* double-check */
+ ) { /* this is aio */
+ struct BufferAiocb volatile * BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext); /* BufferAiocb associated with this aio */
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) { /* ensure BAiocb is occupied */
+ aio_successful = 0; /* tentatively the aio_read did not succeed */
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+
+ if (smgr == NULL) {
+ if (caller_reln == NULL) {
+ smgr = smgropen(buf_desc->tag.rnode, InvalidBackendId);
+ } else {
+ smgr = caller_reln->rd_smgr;
+ }
+ }
+
+ /* assert that this AIO is not using the same BufferAiocb as the one caller asked us to use */
+ if ((index_for_aio < 0) && (index_for_aio == buf_desc->freeNext)) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block index %d to be used by %p already in use by %p"
+ ,index_for_aio, buf_desc, BAiocb->BAiocbbufh)));
+ }
+
+			/* Call smgrcompleteaio only if either we want the buffer or there are no dependents.
+			** In the remaining case (reject while dependents remain),
+			** one of the dependents will do it.
+ */
+ if ( (local_intention > 0) || (0 == BAiocb->BAiocbDependentCount) ) {
+ if (local_intention > 0) {
+ /* wait for in-progress aio and then pin
+ ** OR if I did not issue the aio and do not have a pin
+ ** then pin now before waiting to ensure the buffer does not become unpinned while I wait
+ ** we may potentially wait for io to complete
+ ** so release buf header lock so that others may also wait here
+ */
+ BAiocb->BAiocbDependentCount++; /* register self as dependent */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* if this buffer not pinned by me */
+ disposable_pin = true; /* this backend has pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ } else {
+ UnlockBufHdr(buf_desc);
+ }
+ LWLockRelease(PartitionLock);
+
+ smgrcompleteaio_rc = 1 /* tell smgrcompleteaio to wait */
+ + ( BAiocb->pidOfAio == this_backend_pid ); /* and whether I initiated the aio */
+ } else {
+ smgrcompleteaio_rc = 0; /* tell smgrcompleteaio to cancel */
+ }
+
+ smgrcompleteaio( smgr , (char *)&(BAiocb->BAiocbthis) , &smgrcompleteaio_rc );
+ if ( (smgrcompleteaio_rc == 0) || (smgrcompleteaio_rc == 1) ) {
+ aio_successful = 1;
+ }
+
+ /* statistics */
+ if (local_intention > 0) {
+ if (smgrcompleteaio_rc == 0) {
+ /* completed successfully and did not have to wait */
+ pgBufferUsage.aio_read_ontime++;
+ } else if (smgrcompleteaio_rc == 1) {
+ /* completed successfully and did have to wait */
+ pgBufferUsage.aio_read_waited++;
+ } else {
+ /* bad news - read failed and so buffer not usable
+ ** the buffer is still pinned so unpin and proceed with "not found" case
+ */
+ pgBufferUsage.aio_read_failed++;
+ }
+
+ /* regain locks and handle the validity of the buffer and intention regarding it */
+ LWLockAcquire(PartitionLock, LW_SHARED);
+ LockBufHdr(buf_desc);
+ BAiocb->BAiocbDependentCount--; /* unregister self as dependent */
+ } else {
+ pgBufferUsage.aio_read_wasted++; /* regardless of whether aio_successful */
+ }
+
+
+ if (local_intention > 0) {
+ /* verify the buffer is still ours and has same identity
+ ** There is one slightly tricky point here -
+ ** if there are other dependents, then each of them will perform this same check
+ ** this is unavoidable as the correct setting of retcode and the BM_VALID flag
+ ** is required by each dependent, so we may not leave it to the last one to do it.
+			** It should not do any harm, and it is easier to let them all do it than to try to avoid it.
+ */
+ if ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) { /* it is still mine */
+
+ if (aio_successful) {
+ /* validate page header. If valid, then mark the buffer as valid */
+ if (PageIsVerified((Page)(BufHdrGetBlock(buf_desc)) , ((BAiocb->BAiocbthis).aio_offset/BLCKSZ))) {
+ buf_desc->flags |= BM_VALID;
+ if (BUFFERTAGS_EQUAL(origTag , buf_desc->tag)) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else {
+ retcode = BUF_INTENT_RC_CHANGED_TAG;
+ }
+ } else {
+ retcode = BUF_INTENT_RC_BADPAGE;
+ }
+ }
+ }
+ }
+
+ BAiocbDependentCount_after_aio_finished = BAiocb->BAiocbDependentCount;
+
+ /* if no dependents, then disconnect the BAiocb and update buffer header */
+ if (BAiocbDependentCount_after_aio_finished == 0 ) {
+
+
+ /* return the BufferAiocb to the free list */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+
+ }
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ /* note whether buffer is valid before unlocking spinlock */
+ valid = ((buf_desc->flags & BM_VALID) != 0);
+
+ /* if there was a disposable pin on entry to this function (i.e. marked in buffer flags)
+ ** then unmark it - refer to prologue comments talking about :
+ ** if a disposable pin is held, then :
+ ** ...
+ ** i.e. in either case, there is no longer a disposable pin after this function has completed.
+ */
+ if (pin_already_banked_by_me) {
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* if AIO not in progress, then disconnect the buffer from BAiocb and/or banked pin */
+ if (!(buf_desc->flags & BM_AIO_IN_PROGRESS)) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ }
+ /********** for debugging *****************
+ else elog(LOG, "BufCheckAsync : found BM_AIO_IN_PROGRESS when redeeming banked pin on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ ********** for debugging *****************/
+ }
+
+ /* If we are to obtain new pin, then use pin optimization - pin and unlock.
+ ** However, if the caller is the same backend who issued the aio_read,
+ ** then he ought to have obtained the pin at that time and must not acquire
+ ** a "second" one since this is logically the same read - he would have obtained
+ ** a single pin if using synchronous read and we emulate that behaviour.
+	** It's important to understand that the caller is not aware that he already obtained a pin -
+ ** because calling PrefetchBuffer did not imply a pin -
+ ** so we must track that via the pidOfAio field in the BAiocb.
+ ** And to add one further complication :
+ ** we assume that although PrefetchBuffer pinned the buffer,
+ ** it did not increment the usage count.
+ ** (because it called PinBuffer_Locked which does not do that)
+ ** so in this case, we must increment the usage count without double-pinning.
+	** yes, it's ugly - and there's a goto!
+ */
+ if ( (local_intention > 0)
+ || (local_intention == BUF_INTENTION_REJECT_OBTAIN_PIN)
+ ) {
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* here we really want a version of PinBuffer_Locked which updates usage count ... */
+ if ( (PrivateRefCount[buf_desc->buf_id] == 0) /* if this buffer not previously pinned by me */
+ || pin_already_banked_by_me /* or I had a disposable pin on entry */
+ ) {
+ if (strategy == NULL)
+ {
+ if (buf_desc->usage_count < BM_MAX_USAGE_COUNT)
+ buf_desc->usage_count++;
+ }
+ else
+ {
+ if (buf_desc->usage_count == 0)
+ buf_desc->usage_count = 1;
+ }
+ }
+
+		/* now pin the buffer unless we already hold a disposable pin */
+ if (!disposable_pin) { /* this backend neither banked pin for aio nor pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ goto unlocked_it;
+ }
+ else
+ /* if this task previously issued the aio or pinned the buffer while waiting for aio_read to complete
+ ** and aio was unsuccessful, then release the pin
+ */
+ if ( disposable_pin
+ && (aio_successful == 0) /* aio_read failed ? */
+ ) {
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+
+ unlock_buf_header:
+ UnlockBufHdr(buf_desc);
+ unlocked_it:
+#endif /* USE_PREFETCH */
+
+ /* now do any requested pin (if not done immediately above) or unpin/forget */
+ if (local_intention == BUF_INTENTION_REJECT_KEEP_PIN) {
+ /* the caller is supposed to hold a pin already so there should be nothing to do ... */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) {
+ elog(LOG, "request to keep pin on unpinned buffer %d",buf_desc->buf_id);
+
+ valid = PinBuffer(buf_desc, strategy);
+ }
+ }
+ else
+ if ( ( (local_intention == BUF_INTENTION_REJECT_FORGET)
+ || (local_intention == BUF_INTENTION_REJECT_NOADJUST)
+ )
+ && (PrivateRefCount[buf_desc->buf_id] > 0) /* if this buffer was previously pinned by me ... */
+ ) {
+
+ if (local_intention == BUF_INTENTION_REJECT_FORGET) {
+ UnpinBuffer(buf_desc, true); /* ... then release the pin */
+ } else
+ if (local_intention == BUF_INTENTION_REJECT_NOADJUST) {
+ /* following code moved from ReleaseBuffer :
+ ** not sure why we can't simply UnpinBuffer(buf_desc, true)
+ ** but better leave it the way it was
+ */
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf_desc));
+ if (PrivateRefCount[buf_desc->buf_id] > 1) {
+ PrivateRefCount[buf_desc->buf_id]--;
+ } else {
+ UnpinBuffer(buf_desc, false);
+ }
+ }
+ }
+
+ /* if retcode has not been set to one of the unusual conditions
+ ** namely failed header validity or tag changed
+ ** then the setting of valid takes precedence
+ ** over whatever retcode may be currently set to.
+ */
+ if ( ( (retcode == BUF_INTENT_RC_INVALID_NO_AIO) || (retcode == BUF_INTENT_RC_INVALID_AIO) ) && valid) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else
+ if ((retcode == BUF_INTENT_RC_VALID) && (!valid)) {
+ if (aio_successful == -1) { /* aio not attempted */
+ retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ } else {
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+ }
+ }
+
+ return retcode;
+}
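
For orientation, here is a minimal caller-side sketch of how the four return-code meanings documented in the prologue above might be consumed. The enum members and the next_step() helper are hypothetical stand-ins (the real constants are the BUF_INTENT_RC_* family); nothing below is part of the patch.

#include <stdio.h>

/* hypothetical stand-ins for the patch's BUF_INTENT_RC_* constants */
enum buf_check_rc
{
	RC_BADPAGE = -1,				/* invalid: failed the page-header check */
	RC_INVALID = 0,					/* not valid (with or without aio having run) */
	RC_VALID = 1,					/* valid */
	RC_CHANGED_TAG = 2				/* valid, but no longer the block the caller asked for */
};

/* what a ReadBuffer-like caller might do next for each outcome */
static const char *
next_step(enum buf_check_rc rc)
{
	switch (rc)
	{
		case RC_VALID:
			return "use the pinned buffer as-is";
		case RC_CHANGED_TAG:
			return "retry the buffer lookup - the tag changed underneath us";
		case RC_BADPAGE:
			return "report the page-verification failure";
		case RC_INVALID:
		default:
			return "fall back to a synchronous read of the block";
	}
}

int
main(void)
{
	int			rc;

	for (rc = RC_BADPAGE; rc <= RC_CHANGED_TAG; rc++)
		printf("rc=%d -> %s\n", rc, next_step((enum buf_check_rc) rc));
	return 0;
}
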
--- src/backend/storage/buffer/buf_init.c.orig 2014-05-31 17:19:07.873208825 -0400
+++ src/backend/storage/buffer/buf_init.c 2014-05-31 19:53:09.060073526 -0400
@@ -13,15 +13,89 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
-
+#include <stdlib.h> /* for getenv() */
+#include <errno.h> /* for strtoul() */
BufferDesc *BufferDescriptors;
char *BufferBlocks;
-int32 *PrivateRefCount;
+int32 *PrivateRefCount; /* array of counts per buffer of how many times this task has pinned this buffer */
+
+volatile struct BAiocbAnchor *BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+int CountInuseBAiocbs(void); /* keep compiler happy */
+void ReportFreeBAiocbs(void); /* keep compiler happy */
+
+extern int MaxConnections; /* max number of client connections which postmaster will allow */
+int numBufferAiocbs = 0; /* total number of BufferAiocbs in pool (0 <=> no async io) */
+int hwmBufferAiocbs = 0; /* high water mark of in-use BufferAiocbs in pool
+				** (not required to be accurate; maintained on our behalf by the postmaster)
+ */
+
+#ifdef USE_PREFETCH
+unsigned int prefetch_dbOid = 0; /* database oid of relations on which prefetching to be done - 0 means all */
+unsigned int prefetch_bitmap_scans = 1; /* boolean whether to prefetch bitmap heap scans */
+unsigned int prefetch_heap_scans = 0; /* boolean whether to prefetch non-bitmap heap scans */
+unsigned int prefetch_sequential_index_scans = 0; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+unsigned int	prefetch_index_scans = 256;	/* whether to prefetch non-bitmap index scans; non-zero values also give the size of pfch_list */
+unsigned int prefetch_btree_heaps = 1; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+#endif /* USE_PREFETCH */
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int maxGetBAiocbTries = 1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = 1; /* max times we will try to release a BufferAiocb back to freelist */
+
+/* locking protocol for manipulating the BufferAiocbs and FreeBAiocbs list :
+** 1. ownership of a BufferAiocb :
+** to gain ownership of a BufferAiocb, a task must
+** EITHER remove it from FreeBAiocbs (it is now temporary owner and no other task can find it)
+** if decision is to attach it to a buffer descriptor header, then
+** . lock the buffer descriptor header
+** . check NOT flags & BM_AIO_IN_PROGRESS
+** . attach to buffer descriptor header
+** . increment the BufferAiocb.dependent_count
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to unlock
+**        OR     locate it by dereferencing the pointer in a buffer descriptor,
+** in which case :
+** . lock the buffer descriptor header
+** . check flags & BM_AIO_IN_PROGRESS
+** . increment the BufferAiocb.dependent_count
+** . if decision is to return to FreeBAiocbs,
+** then (with buffer descriptor header still locked)
+** . turn off BM_AIO_IN_PROGRESS
+** . IF the BufferAiocb.dependent_count == 1 (I am sole dependent)
+** . THEN
+** . . decrement the BufferAiocb.dependent_count
+** . return to FreeBAiocbs (see below)
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to either return to FreeBAiocbs or unlock
+** 2. adding and removing from FreeBAiocbs :
+** two alternative methods - controlled by conditional macro definition LOCK_BAIOCB_FOR_GET_REL
+** 2.1 LOCK_BAIOCB_FOR_GET_REL is defined - use a lock
+** . lock BufFreelistLock exclusive
+** . add / remove from FreeBAiocbs
+** . unlock BufFreelistLock exclusive
+** advantage of this method - never fails to add or remove
+** 2.2 LOCK_BAIOCB_FOR_GET_REL is not defined - use compare_and_swap
+** . retrieve the current Freelist pointer and validate
+** . compare_and_swap on/off the FreeBAiocb list
+**          (no unlock needed - the compare_and_swap itself is atomic)
+**          advantage of this method - never waits
+**          to avoid losing a free BAiocb, save it in a process-local cache and reuse it
+*/
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ struct BufferAiocb dummy_BAiocbAnchr = { (struct BufferAiocb*)0 , (struct BufferAiocb*)0 };
+int maxGetBAiocbTries = -1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = -1; /* max times we will try to release a BufferAiocb back to freelist */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Data Structures:
@@ -73,7 +147,14 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs
+ , foundAiocbs
+ ;
+#if defined(USE_PREFETCH) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+ char *envvarpointer = (char *)0; /* might point to an environment variable string */
+ char *charptr;
+#endif /* USE_PREFETCH */
+
BufferDescriptors = (BufferDesc *)
ShmemInitStruct("Buffer Descriptors",
@@ -83,6 +164,142 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+ if (max_async_io_prefetchers < 0) { /* negative value indicates to initialize to something sensible during buf_init */
+ max_async_io_prefetchers = MaxConnections/6; /* default allows for average of MaxConnections/6 concurrent prefetchers - reasonable ??? */
+ }
+
+ if ((target_prefetch_pages > 0) && (max_async_io_prefetchers > 0)) {
+ int ix;
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb * volatile FreeBAiocbs;
+
+ numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers); /* target_prefetch_pages per prefetcher */
+ BAiocbAnchr = (struct BAiocbAnchor *)
+ ShmemInitStruct("Buffer Aiocbs",
+ sizeof(struct BAiocbAnchor) + (numBufferAiocbs * sizeof(struct BufferAiocb)), &foundAiocbs);
+ if (BAiocbAnchr) {
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs = (struct BufferAiocb*)(((char *)BAiocbAnchr) + sizeof(struct BAiocbAnchor));
+ FreeBAiocbs = (struct BufferAiocb*)0;
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbnext = FreeBAiocbs; /* init the free list, last one -> 0 */
+ (BufferAiocbs+ix)->BAiocbbufh = (struct sbufdesc*)0;
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0;
+ (BufferAiocbs+ix)->pidOfAio = 0;
+ FreeBAiocbs = (BufferAiocbs+ix);
+
+ }
+ BAiocbAnchr->FreeBAiocbs = FreeBAiocbs;
+ envvarpointer = getenv("PG_MAX_GET_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxGetBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ envvarpointer = getenv("PG_MAX_REL_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxRelBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+
+ /* init the aio subsystem max number of threads and max number of requests
+ ** max number of threads <--> max_async_io_prefetchers
+ ** max number of requests <--> numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers)
+ ** there is no return code so we just hope.
+ */
+ smgrinitaio(max_async_io_prefetchers , numBufferAiocbs);
+
+ }
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ BAiocbAnchr = &dummy_BAiocbAnchr;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
+#ifdef USE_PREFETCH
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BITMAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_ISCAN");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_index_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_index_scans = 1;
+ } else
+ if ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) ) {
+ prefetch_index_scans = strtol(envvarpointer, &charptr, 10);
+ if (charptr && (',' == *charptr)) { /* optional sequential prefetch in index scans */
+ charptr++; /* following the comma ... */
+ if ( ('Y' == *charptr) || ('y' == *charptr) || ('1' == *charptr) ) {
+ prefetch_sequential_index_scans = 1;
+ }
+ }
+ }
+		/* if prefetching for ISCAN, then we require size of pfch_list to be at least target_prefetch_pages */
+ if ( (prefetch_index_scans > 0)
+ && (prefetch_index_scans < target_prefetch_pages)
+ ) {
+ prefetch_index_scans = target_prefetch_pages;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BTREE");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_btree_heaps = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_btree_heaps = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_HEAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_heap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_heap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_PREFETCH_DBOID");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ errno = 0; /* required in order to distinguish error from 0 */
+ prefetch_dbOid = (unsigned int)strtoul((const char *)envvarpointer, 0, 10);
+ if (errno) {
+ prefetch_dbOid = 0;
+ }
+ }
+ elog(LOG, "prefetching initialised with target_prefetch_pages= %d "
+ ", max_async_io_prefetchers= %d implying aio concurrency= %d "
+ ", prefetching_for_bitmap= %s "
+ ", prefetching_for_heap= %s "
+ ", prefetching_for_iscan= %d with sequential_index_page_prefetching= %s "
+ ", prefetching_for_btree= %s"
+ ,target_prefetch_pages ,max_async_io_prefetchers ,numBufferAiocbs
+ ,(prefetch_bitmap_scans ? "Y" : "N")
+ ,(prefetch_heap_scans ? "Y" : "N")
+ ,prefetch_index_scans
+ ,(prefetch_sequential_index_scans ? "Y" : "N")
+ ,(prefetch_btree_heaps ? "Y" : "N")
+ );
+#endif /* USE_PREFETCH */
+
+
if (foundDescs || foundBufs)
{
/* both should be present or neither */
@@ -176,3 +393,80 @@ BufferShmemSize(void)
return size;
}
+
+/* imprecise count of number of in-use BAiocbs at any time
+ * we scan the array read-only without latching so are subject to unstable result
+ * (but since the array is in well-known contiguous storage,
+ * we are not subject to segmentation violation)
+ * This function may be called at any time and just does its best;
+ * it returns whatever count it arrived at.
+ */
+int
+CountInuseBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ int count = 0;
+ int ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->BufferAiocbs; /* start of list */
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == (BAiocb+ix)->BAiocbnext) { /* not on freelist ? */
+ count++;
+ }
+ }
+ }
+ return count;
+}
+
+/*
+ * report how many free BAiocbs at shutdown
+ * DO NOT call this while backends are actively working!!
+ * this report is useful when the compare_and_swap method is used (see above)
+ * as it can be used to deduce how many BAiocbs were in process-local caches -
+ * (original_number_on_freelist_at_startup - this_reported_number_at_shutdown)
+ */
+void
+ReportFreeBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ volatile struct BufferAiocb *BufferAiocbs;
+ int count = 0;
+ int fx , ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->FreeBAiocbs; /* start of free list */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0; /* use this as marker for finding it on freelist */
+ }
+ for (fx = (numBufferAiocbs-1); ( (fx>=0) && ( BAiocb != (struct BufferAiocb*)0 ) ); fx--) {
+
+ /* check if it is a valid BufferAiocb */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((BufferAiocbs+ix) == BAiocb) { /* is it this one ? */
+ break;
+ }
+ }
+ if (ix >= 0) {
+ if (BAiocb->BAiocbDependentCount) { /* seen it already ? */
+ elog(LOG, "ReportFreeBAiocbs closed cycle on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ BAiocb->BAiocbDependentCount = 1; /* use this as marker for finding it on freelist */
+ count++;
+ BAiocb = BAiocb->BAiocbnext;
+ } else {
+ elog(LOG, "ReportFreeBAiocbs invalid item on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ }
+ }
+ elog(LOG, "ReportFreeBAiocbs AIO control block list : poolsize= %d in-use-hwm= %d final-free= %d" ,numBufferAiocbs , hwmBufferAiocbs , count);
+}
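
The locking-protocol comment above (method 2.2) relies on a compare-and-swap freelist. Below is a minimal standalone sketch of that discipline, assuming the GCC __sync_bool_compare_and_swap builtin, which is what the USE_AIO_ATOMIC_BUILTIN_COMP_SWAP name appears to refer to; the struct layout and function names here are illustrative only, not the patch's own code.

#include <stdio.h>
#include <stddef.h>

struct BufferAiocb
{
	struct BufferAiocb *BAiocbnext;		/* freelist link */
	/* ... the real control block also carries the aiocb, owner pid, etc. ... */
};

/* pop one element, or return NULL if the list is empty */
static struct BufferAiocb *
freelist_pop(struct BufferAiocb * volatile *head)
{
	struct BufferAiocb *old;

	do
	{
		old = *head;
		if (old == NULL)
			return NULL;
		/* swap in the successor only if nobody moved the head under us */
	} while (!__sync_bool_compare_and_swap(head, old, old->BAiocbnext));

	return old;
}

/* push one element back onto the freelist */
static void
freelist_push(struct BufferAiocb * volatile *head, struct BufferAiocb *item)
{
	struct BufferAiocb *old;

	do
	{
		old = *head;
		item->BAiocbnext = old;
	} while (!__sync_bool_compare_and_swap(head, old, item));
	/* the patch instead bounds its retries (maxRelBAiocbTries) and, on
	 * failure, parks the element in a process-local cache for reuse */
}

int
main(void)
{
	struct BufferAiocb pool[3];
	struct BufferAiocb * volatile freelist = NULL;
	int			i;

	for (i = 0; i < 3; i++)
		freelist_push(&freelist, &pool[i]);
	while (freelist_pop(&freelist) != NULL)
		printf("popped one BufferAiocb\n");
	return 0;
}
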
--- src/backend/storage/smgr/md.c.orig 2014-05-31 17:19:07.881208848 -0400
+++ src/backend/storage/smgr/md.c 2014-05-31 19:53:09.084073587 -0400
@@ -647,6 +647,62 @@ mdprefetch(SMgrRelation reln, ForkNumber
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * mdinitaio() -- init the aio subsystem max number of threads and max number of requests
+ */
+void
+mdinitaio(int max_aio_threads, int max_aio_num)
+{
+ FileInitaio( max_aio_threads, max_aio_num );
+}
+
+/*
+ * mdstartaio() -- start aio read of the specified block of a relation
+ */
+void
+mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+#ifdef USE_PREFETCH
+ off_t seekpos;
+ MdfdVec *v;
+ int local_retcode;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+
+ seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ local_retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ , aiocbp);
+ if (retcode) {
+ *retcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+
+
+/*
+ * mdcompleteaio() -- complete aio read of the specified block of a relation
+ * on entry, *inoutcode should indicate :
+ * . non-0 <=> check if complete and wait if not
+ * . 0 <=> cancel io immediately
+ */
+void
+mdcompleteaio( char *aiocbp , int *inoutcode )
+{
+#ifdef USE_PREFETCH
+ int local_retcode;
+
+ local_retcode = FileCompleteaio(aiocbp, (inoutcode ? *inoutcode : 0));
+ if (inoutcode) {
+ *inoutcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
/*
* mdread() -- Read the specified block from a relation.
*/
--- src/backend/storage/smgr/smgr.c.orig 2014-05-31 17:19:07.881208848 -0400
+++ src/backend/storage/smgr/smgr.c 2014-05-31 19:53:09.108073648 -0400
@@ -49,6 +49,12 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ void (*smgr_initaio) (int max_aio_threads, int max_aio_num);
+ void (*smgr_startaio) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode );
+ void (*smgr_completeaio) ( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
@@ -66,7 +72,11 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+ mdprefetch
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ,mdinitaio, mdstartaio, mdcompleteaio
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ , mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
mdpreckpt, mdsync, mdpostckpt
}
};
@@ -612,6 +622,35 @@ smgrprefetch(SMgrRelation reln, ForkNumb
(*(smgrsw[reln->smgr_which].smgr_prefetch)) (reln, forknum, blocknum);
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * smgrinitaio() -- initialize the aio subsystem max number of threads and max number of requests
+ */
+void
+smgrinitaio(int max_aio_threads, int max_aio_num)
+{
+ (*(smgrsw[0].smgr_initaio)) ( max_aio_threads, max_aio_num );
+}
+
+/*
+ * smgrstartaio() -- Initiate aio read of the specified block of a relation.
+ */
+void
+smgrstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_startaio)) (reln, forknum, blocknum , aiocbp , retcode );
+}
+
+/*
+ * smgrcompleteaio() -- Complete aio read of the specified block of a relation.
+ */
+void
+smgrcompleteaio(SMgrRelation reln, char *aiocbp , int *inoutcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_completeaio)) ( aiocbp , inoutcode );
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* smgrread() -- read a particular block from a relation into the supplied
* buffer.
--- src/backend/storage/file/fd.c.orig 2014-05-31 17:19:07.877208836 -0400
+++ src/backend/storage/file/fd.c 2014-05-31 19:53:09.136073719 -0400
@@ -77,6 +77,9 @@
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* We must leave some file descriptors free for system(), the dynamic loader,
@@ -1239,6 +1242,10 @@ FileClose(File file)
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
* to read into.
+ * Also note that a new, different implementation of asynchronous prefetch
+ * using librt, not libaio, is provided by the two functions following this one,
+ * FileStartaio and FileCompleteaio. These also require a buffer to be provided
+ * to read into, which the new async_io support provides.
*/
int
FilePrefetch(File file, off_t offset, int amount)
@@ -1266,6 +1273,145 @@ FilePrefetch(File file, off_t offset, in
#endif
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * FileInitaio - initialize the aio subsystem max number of threads and max number of requests
+ * input parms
+ * max_aio_threads; maximum number of threads
+ * max_aio_num; maximum number of concurrent aio read requests
+ *
+ * on linux, the man page for the librt implementation of aio_init() says :
+ * This function is a GNU extension.
+ * If your posix aio does not have it, then add the following line to
+ * src/include/pg_config_manual.h
+ * #define DONT_HAVE_AIO_INIT
+ * to render it as a no-op
+ */
+void
+FileInitaio(int max_aio_threads, int max_aio_num )
+{
+#ifndef DONT_HAVE_AIO_INIT
+ struct aioinit aioinit_struct; /* structure to pass to aio_init */
+
+ aioinit_struct.aio_threads = max_aio_threads; /* maximum number of threads */
+ aioinit_struct.aio_num = max_aio_num; /* maximum number of concurrent aio read requests */
+	aioinit_struct.aio_idle_time = 1;	/* we don't want to alter this, but aio_init does not ignore it, so set it to the default */
+ aio_init(&aioinit_struct);
+#endif /* ndef DONT_HAVE_AIO_INIT */
+ return;
+}
+
+/*
+ * FileStartaio - initiate asynchronous read of a given range of the file.
+ * The logical seek position is unaffected.
+ *
+ * use standard posix aio (librt)
+ * ASSUME the caller has already set BufferAiocb.aio_buf to point at the buffer
+ * return 0 if successfully started, else non-zero
+ */
+int
+FileStartaio(File file, off_t offset, int amount , char *aiocbp )
+{
+ int returnCode;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartaio: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode >= 0) {
+
+ my_aiocbp->aio_fildes = VfdCache[file].fd;
+ my_aiocbp->aio_lio_opcode = LIO_READ;
+ my_aiocbp->aio_nbytes = amount;
+ my_aiocbp->aio_offset = offset;
+ returnCode = aio_read(my_aiocbp);
+ }
+
+ return returnCode;
+}
+
+/*
+ * FileCompleteaio - complete asynchronous aio read
+ * normal_wait indicates whether to cancel or wait -
+ * 0 <=> cancel
+ * 1 <=> wait by polling the aiocb
+ * 2 <=> wait by suspending on the aiocb
+ *
+ * use standard posix aio (librt)
+ *	return	0 if successful and did not have to wait,
+ *		1 if successful and had to wait,
+ * else x'ff'
+ */
+int
+FileCompleteaio( char *aiocbp , int normal_wait )
+{
+ int returnCode;
+ int aio_errno;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+	const struct aiocb *cblist[1];
+ int fd;
+ struct timespec my_timeout = { 0 , 10000 };
+ struct timespec *suspend_timeout_P; /* the timeout actually used depending on normal_wait */
+ int max_polls;
+
+ fd = my_aiocbp->aio_fildes;
+ cblist[0] = my_aiocbp;
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* note that aio_error returns 0 if op already completed successfully */
+
+ /* first handle normal case of waiting for op to complete */
+ if (normal_wait) {
+ /* if told not to poll, then specify no timeout */
+ suspend_timeout_P = (normal_wait == 1 ? &my_timeout : (struct timespec *)0);
+ while (aio_errno == EINPROGRESS) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ while ( (returnCode < 0) && (max_polls-- > 0)
+ && ((EAGAIN == errno) || (EINTR == errno))
+ ) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* now return_code is from aio_error */
+ if (returnCode == 0) {
+ returnCode = 1; /* successful but had to wait */
+ }
+ }
+ if (aio_errno) {
+ elog(LOG, "FileCompleteaio: %d %d", fd, returnCode);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ } else {
+ if (aio_errno == EINPROGRESS) {
+ do {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ while ((returnCode == AIO_NOTCANCELED) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ } while (aio_errno == EINPROGRESS);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ if (returnCode != 0)
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+ DO_DB(elog(LOG, "FileCompleteaio: %d %d",
+ fd, returnCode));
+
+ return returnCode;
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
int
FileRead(File file, char *buffer, int amount)
{
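
For readers unfamiliar with the librt interface used by FileStartaio/FileCompleteaio above, here is a minimal self-contained sketch of the same aio_read / aio_error / aio_suspend sequence, plus aio_return to collect the byte count (not shown in the excerpt above). The file name is just a placeholder; link with -lrt.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char		buf[8192];
	struct aiocb cb;
	const struct aiocb *cblist[1];
	int			fd = open("/etc/hostname", O_RDONLY);	/* placeholder file */

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;				/* as in FileStartaio */
	cb.aio_buf = buf;				/* caller-provided buffer to read into */
	cb.aio_nbytes = sizeof(buf);
	cb.aio_offset = 0;

	if (aio_read(&cb) != 0)			/* start the asynchronous read */
	{
		perror("aio_read");
		return 1;
	}

	cblist[0] = &cb;
	while (aio_error(&cb) == EINPROGRESS)	/* as in FileCompleteaio's wait loop */
		aio_suspend(cblist, 1, NULL);		/* block until the request settles */

	if (aio_error(&cb) == 0)
		printf("read %zd bytes\n", aio_return(&cb));	/* collect the result */
	else
		perror("aio read");

	close(fd);
	return 0;
}
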
--- src/backend/storage/lmgr/proc.c.orig 2014-05-31 17:19:07.881208848 -0400
+++ src/backend/storage/lmgr/proc.c 2014-05-31 19:53:09.168073800 -0400
@@ -52,6 +52,7 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
+extern pid_t this_backend_pid; /* pid of this backend */
/* GUC variables */
int DeadlockTimeout = 1000;
@@ -361,6 +362,7 @@ InitProcess(void)
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
+ this_backend_pid = getpid(); /* pid of this backend */
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
--- src/backend/access/heap/heapam.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/heap/heapam.c 2014-05-31 19:53:09.228073952 -0400
@@ -71,6 +71,28 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "executor/instrument.h"
+
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_heap_scans; /* boolean whether to prefetch non-bitmap heap scans */
+
+/* special values for scan->rs_prefetch_target indicating as follows : */
+#define PREFETCH_MAYBE 0xffffffff /* prefetch permitted but not yet in effect */
+#define PREFETCH_DISABLED 0xfffffffe /* prefetch disabled and not permitted */
+/* PREFETCH_WRAP_POINT indicates a prefetcher that has reached the point where the scan would wrap -
+** at this point the prefetcher runs on the spot until the scan catches up.
+** This *must* be < maximum valid setting of target_prefetch_pages aka effective_io_concurrency.
+*/
+#define PREFETCH_WRAP_POINT 0x0fffffff
+
+#endif /* USE_PREFETCH */
+
/* GUC variable */
bool synchronize_seqscans = true;
@@ -115,6 +137,8 @@ static XLogRecPtr log_heap_new_cid(Relat
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
bool *copy);
+static void heap_unread_add(HeapScanDesc scan, BlockNumber blockno);
+static void heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -292,10 +316,149 @@ initscan(HeapScanDesc scan, ScanKey key,
* Currently, we don't have a stats counter for bitmap heap scans (but the
* underlying bitmap index scans will be counted).
*/
- if (!scan->rs_bitmapscan)
+#ifdef USE_PREFETCH
+ /* by default, no prefetching on any scan */
+ scan->rs_prefetch_target = PREFETCH_DISABLED; /* tentatively disable */
+ scan->rs_pfchblock = 0; /* scanner will reset this to be ahead of scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)0; /* list of prefetched but unread blocknos */
+ scan->rs_Unread_Pfetched_next = 0; /* next unread blockno */
+ scan->rs_Unread_Pfetched_count = 0; /* number of valid unread blocknos */
+#endif /* USE_PREFETCH */
+ if (!scan->rs_bitmapscan) {
+
pgstat_count_heap_scan(scan->rs_rd);
+#ifdef USE_PREFETCH
+ /* bitmap scans do their own prefetching -
+ ** for others, set up prefetching now
+ */
+ if ( prefetch_heap_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(scan->rs_rd))
+ ) {
+ /* prefetch_dbOid may be set to a database Oid to specify only prefetch in that db */
+ if ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ ) {
+ scan->rs_prefetch_target = PREFETCH_MAYBE; /* permitted but let the scan decide */
+ }
+ else {
+ }
+ }
+#endif /* USE_PREFETCH */
+ }
+}
+
+/* add this blockno to list of prefetched and unread blocknos
+** use the one identified by the ((next+count) modulo circumference) index if it is unused,
+** else search for the first available slot if there is one,
+** else error.
+*/
+static void
+heap_unread_add(HeapScanDesc scan, BlockNumber blockno)
+{
+ BlockNumber *available_P; /* where to store new blockno */
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next
+ + scan->rs_Unread_Pfetched_count; /* index of next unused slot */
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if (blockno != InvalidBlockNumber) {
+
+ /* ensure there is some room somewhere */
+ if (scan->rs_Unread_Pfetched_count < target_prefetch_pages) {
+
+ /* try the "next+count" one */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages; /* modulo circumference */
+ }
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ goto store_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ /* before storing this blockno,
+ ** since the next pointer did not locate an unused slot,
+ ** set it to one which is more likely to be so for the next time
+ */
+ scan->rs_Unread_Pfetched_next = Unread_Pfetched_index;
+ goto store_blockno;
+ }
+ }
+ }
+ }
+
+ /* if we reach here, either there was no available slot
+		** or we thought there was one and didn't find any.
+ */
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("heap_unread_add overflowed list cannot add blockno %d", blockno)));
+
+ return;
}
+ store_blockno:
+ *available_P = blockno;
+ scan->rs_Unread_Pfetched_count++; /* update count */
+
+ return;
+}
+
+/* remove specified blockno from list of prefetched and unread blocknos.
+** Usually this will be found at the rs_Unread_Pfetched_next item -
+** else search for it.  If not found, ignore it - no error results.
+*/
+static void
+heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno)
+{
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next; /* index of next unread blockno */
+	BlockNumber *candidate_P;			/* possible location of caller's blockno */
+ BlockNumber nextUnreadPfetched;
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if ( (blockno != InvalidBlockNumber)
+ && ( scan->rs_Unread_Pfetched_count > 0 ) /* if the list is not empty */
+ ) {
+
+ /* take modulo of the circumference.
+ ** actually rs_Unread_Pfetched_next should never exceed the circumference but check anyway.
+ */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages;
+ }
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);
+ nextUnreadPfetched = *candidate_P;
+
+ if ( nextUnreadPfetched == blockno ) {
+ goto remove_blockno;
+ } else {
+		/* slow-search the entire list */
+		for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+			candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);	/* candidate location of caller's blockno */
+			if (*candidate_P == blockno) {	/* found it */
+				goto remove_blockno;
+			}
+		}
+		return;		/* not found - ignore it, as promised above */
+	}
+
+ remove_blockno:
+ *candidate_P = InvalidBlockNumber;
+
+ scan->rs_Unread_Pfetched_next = (Unread_Pfetched_index+1); /* update next pfchd unread */
+ if (scan->rs_Unread_Pfetched_next >= target_prefetch_pages) {
+ scan->rs_Unread_Pfetched_next = 0;
+ }
+ scan->rs_Unread_Pfetched_count--; /* update count */
+ }
+
+ return;
+}
+
+
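A standalone sketch (hypothetical names and sizes) of the ring bookkeeping that heap_unread_add()/heap_unread_subtract() maintain over the rs_Unread_Pfetched_* fields: a fixed array of target_prefetch_pages slots, InvalidBlockNumber marking an empty slot, "next" pointing at the oldest unread entry and "count" tracking occupancy.

#include <stdio.h>

#define SLOTS 4						/* stands in for target_prefetch_pages */
#define INVALID_BLOCK 0xFFFFFFFFu	/* stands in for InvalidBlockNumber */

typedef unsigned int BlockNo;

struct unread_ring
{
	BlockNo		slot[SLOTS];
	unsigned int next;				/* index of the oldest unread entry */
	unsigned int count;				/* number of occupied slots */
};

static void
ring_add(struct unread_ring *r, BlockNo blk)
{
	unsigned int ix = (r->next + r->count) % SLOTS;

	if (r->count >= SLOTS)
		return;						/* full - the patch reports a NOTICE here */
	if (r->slot[ix] != INVALID_BLOCK)	/* preferred slot taken: slow-search */
		for (ix = 0; ix < SLOTS && r->slot[ix] != INVALID_BLOCK; ix++)
			;
	r->slot[ix] = blk;
	r->count++;
}

static void
ring_remove(struct unread_ring *r, BlockNo blk)
{
	unsigned int ix = r->next % SLOTS;

	if (r->count == 0)
		return;
	if (r->slot[ix] != blk)			/* not the oldest entry: slow-search */
	{
		for (ix = 0; ix < SLOTS && r->slot[ix] != blk; ix++)
			;
		if (ix == SLOTS)
			return;					/* not found - silently ignored */
	}
	r->slot[ix] = INVALID_BLOCK;
	r->next = (ix + 1) % SLOTS;
	r->count--;
}

int
main(void)
{
	struct unread_ring r;
	BlockNo		b;
	unsigned int ix;

	for (ix = 0; ix < SLOTS; ix++)
		r.slot[ix] = INVALID_BLOCK;
	r.next = r.count = 0;

	for (b = 10; b < 13; b++)
		ring_add(&r, b);			/* blocks 10..12 prefetched */
	ring_remove(&r, 10);			/* the scan consumed block 10 */
	printf("unread prefetched blocks remaining: %u\n", r.count);
	return 0;
}
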
/*
* heapgetpage - subroutine for heapgettup()
*
@@ -304,7 +467,7 @@ initscan(HeapScanDesc scan, ScanKey key,
* which tuples on the page are visible.
*/
static void
-heapgetpage(HeapScanDesc scan, BlockNumber page)
+heapgetpage(HeapScanDesc scan, BlockNumber page , BlockNumber prefetchHWM)
{
Buffer buffer;
Snapshot snapshot;
@@ -314,6 +477,10 @@ heapgetpage(HeapScanDesc scan, BlockNumb
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
+#ifdef USE_PREFETCH
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+#endif /* USE_PREFETCH */
+
Assert(page < scan->rs_nblocks);
@@ -336,6 +503,98 @@ heapgetpage(HeapScanDesc scan, BlockNumb
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
+#ifdef USE_PREFETCH
+
+ heap_unread_subtract(scan, page);
+
+ /* maybe prefetch some pages starting with rs_pfchblock */
+ if (scan->rs_prefetch_target >= 0) { /* prefetching enabled on this scan ? */
+ int next_block_to_be_read = (page+1); /* next block to be read = lowest possible prefetchable block */
+ int num_to_pfch_this_time; /* eventually holds the number of blocks to prefetch now */
+ int prefetchable_range; /* size of the area ahead of the current prefetch position */
+
+ /* check if prefetcher reached wrap point and the scan has now wrapped */
+ if ( (page == 0) && (scan->rs_prefetch_target == PREFETCH_WRAP_POINT) ) {
+ scan->rs_prefetch_target = 1;
+ scan->rs_pfchblock = next_block_to_be_read;
+ } else
+ if (scan->rs_pfchblock < next_block_to_be_read) {
+ scan->rs_pfchblock = next_block_to_be_read; /* next block to be prefetched must be ahead of one we just read */
+ }
+
+ /* now we know where we would start prefetching -
+ ** next question - if this is a sync scan, ensure we do not prefetch behind the HWM
+ ** debatable whether to require strict inequality or >= - >= works better in practice
+ */
+ if ( (!scan->rs_syncscan) || (scan->rs_pfchblock >= prefetchHWM) ) {
+
+ /* now we know where we will start prefetching -
+ ** next question - how many?
+ ** apply two limits :
+ ** 1. target prefetch distance
+ ** 2. number of available blocks ahead of us
+ */
+
+ /* 1. target prefetch distance */
+ num_to_pfch_this_time = next_block_to_be_read + scan->rs_prefetch_target; /* page beyond prefetch target */
+ num_to_pfch_this_time -= scan->rs_pfchblock; /* convert to offset */
+
+ /* first do prefetching up to our current limit ...
+ ** highest page number that a scan (pre)-fetches is scan->rs_nblocks-1
+ ** note - prefetcher does not wrap a prefetch range -
+			** instead it just stops, and then starts again if and when the main scan wraps
+ */
+ if (scan->rs_pfchblock <= scan->rs_startblock) { /* if on second leg towards startblock */
+ prefetchable_range = ((int)(scan->rs_startblock) - (int)(scan->rs_pfchblock));
+ }
+ else { /* on first leg towards nblocks */
+ prefetchable_range = ((int)(scan->rs_nblocks) - (int)(scan->rs_pfchblock));
+ }
+		if (prefetchable_range > 0) {	/* if there's a range to prefetch */
+
+ /* 2. number of available blocks ahead of us */
+ if (num_to_pfch_this_time > prefetchable_range) {
+ num_to_pfch_this_time = prefetchable_range;
+ }
+ while (num_to_pfch_this_time-- > 0) {
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_pfchblock, scan->rs_strategy);
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ if (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) {
+ heap_unread_add(scan, scan->rs_pfchblock);
+ }
+ scan->rs_pfchblock++;
+ /* if syncscan and requested block was already in buffer pool,
+ ** this suggests that another scanner is ahead of us and we should advance
+ */
+ if ( (scan->rs_syncscan) && (PrefetchBufferRc & PREFTCHRC_BLK_ALREADY_PRESENT) ) {
+ scan->rs_pfchblock++;
+ num_to_pfch_this_time--;
+ }
+ }
+ }
+ else {
+ /* we must not modify scan->rs_pfchblock here
+ ** because it is needed for possible DiscardBuffer at end of scan ...
+ ** ... instead ...
+ */
+ scan->rs_prefetch_target = PREFETCH_WRAP_POINT; /* mark this prefetcher as waiting to wrap */
+ }
+
+ /* ... then adjust prefetching limit : by doubling on each iteration */
+ if (scan->rs_prefetch_target == 0) {
+ scan->rs_prefetch_target = 1;
+ }
+ else {
+ scan->rs_prefetch_target *= 2;
+ if (scan->rs_prefetch_target > target_prefetch_pages) {
+ scan->rs_prefetch_target = target_prefetch_pages;
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
+
if (!scan->rs_pageatatime)
return;
@@ -452,6 +711,8 @@ heapgettup(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineoff, given scan direction
@@ -470,7 +731,25 @@ heapgettup(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineoff = FirstOffsetNumber; /* first offnum */
scan->rs_inited = true;
}
@@ -516,7 +795,7 @@ heapgettup(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -557,7 +836,7 @@ heapgettup(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -660,8 +939,10 @@ heapgettup(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -671,6 +952,22 @@ heapgettup(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -678,7 +975,7 @@ heapgettup(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
@@ -727,6 +1024,8 @@ heapgettup_pagemode(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineindex, given scan direction
@@ -745,7 +1044,25 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineindex = 0;
scan->rs_inited = true;
}
@@ -788,7 +1105,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -826,7 +1143,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -921,8 +1238,10 @@ heapgettup_pagemode(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -932,6 +1251,22 @@ heapgettup_pagemode(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -939,7 +1274,7 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
dp = (Page) BufferGetPage(scan->rs_cbuf);
lines = scan->rs_ntuples;
@@ -1394,6 +1729,23 @@ void
heap_rescan(HeapScanDesc scan,
ScanKey key)
{
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1418,6 +1770,23 @@ heap_endscan(HeapScanDesc scan)
{
/* Note: no locking manipulations needed */
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1435,6 +1804,10 @@ heap_endscan(HeapScanDesc scan)
if (scan->rs_strategy != NULL)
FreeAccessStrategy(scan->rs_strategy);
+ if (scan->rs_Unread_Pfetched_base) {
+ pfree(scan->rs_Unread_Pfetched_base);
+ }
+
if (scan->rs_temp_snap)
UnregisterSnapshot(scan->rs_snapshot);
@@ -1464,7 +1837,6 @@ heap_endscan(HeapScanDesc scan)
#define HEAPDEBUG_3
#endif /* !defined(HEAPDEBUGALL) */
-
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
@@ -6347,6 +6719,25 @@ heap_markpos(HeapScanDesc scan)
void
heap_restrpos(HeapScanDesc scan)
{
+
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* XXX no amrestrpos checking that ammarkpos called */
if (!ItemPointerIsValid(&scan->rs_mctid))
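For orientation, the repeated cleanup loops above (heap_rescan, heap_endscan, heap_restrpos) all drain the same ring of prefetched-but-unread block numbers kept in rs_Unread_Pfetched_base/next/count. The following standalone sketch is only an illustration of that ring, with simplified names and DiscardBuffer() replaced by a printf; it is not code from the patch.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int BlockNumber;

typedef struct PrefetchRing
{
    BlockNumber *base;          /* array of "capacity" slots (palloc'd in the patch) */
    unsigned int next;          /* index of the oldest unread entry */
    unsigned int count;         /* number of unread entries */
    unsigned int capacity;      /* plays the role of target_prefetch_pages */
} PrefetchRing;

static void
ring_add(PrefetchRing *r, BlockNumber blkno)
{
    unsigned int pos = r->next + r->count;

    if (r->count >= r->capacity)
        return;                 /* full: stop tracking, it is only an optimization */
    if (pos >= r->capacity)
        pos -= r->capacity;     /* wrap around */
    r->base[pos] = blkno;
    r->count++;
}

static void
ring_drain(PrefetchRing *r)
{
    /* mirrors the cleanup loops above: discard each unread block, oldest first */
    while (r->count > 0)
    {
        printf("discard prefetched block %u\n", r->base[r->next]);
        if (++r->next >= r->capacity)
            r->next = 0;
        r->count--;
    }
}

int
main(void)
{
    PrefetchRing r = {NULL, 0, 0, 4};

    r.base = malloc(sizeof(BlockNumber) * r.capacity);
    ring_add(&r, 10);
    ring_add(&r, 11);
    ring_add(&r, 12);
    ring_drain(&r);             /* prints blocks 10, 11, 12 in order */
    free(r.base);
    return 0;
}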
--- src/backend/access/heap/syncscan.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/heap/syncscan.c 2014-05-31 19:53:09.252074012 -0400
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locatio
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, Block
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumbe
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumbe
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNu
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
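The prefetchHWM handling added to ss_search() above boils down to a reconcile-or-report-back step on a shared high-water mark. Here is a minimal standalone sketch of that behaviour, where shared_hwm stands in for item->location.prefetchHWM and all other names are illustrative:

#include <stdio.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

static BlockNumber shared_hwm = InvalidBlockNumber;  /* stand-in for item->location.prefetchHWM */

static void
reconcile_hwm(BlockNumber *caller_hwm)
{
    if (caller_hwm == NULL)
        return;                         /* caller did not ask for prefetch info */

    if (shared_hwm == InvalidBlockNumber || shared_hwm < *caller_hwm)
        shared_hwm = *caller_hwm;       /* caller is ahead: publish its HWM */
    else
        *caller_hwm = shared_hwm;       /* shared copy is ahead: report it back */
}

int
main(void)
{
    BlockNumber mine = 100;

    reconcile_hwm(&mine);               /* publishes 100 */
    mine = 50;
    reconcile_hwm(&mine);               /* shared copy is ahead, so mine becomes 100 */
    printf("caller HWM %u, shared HWM %u\n", mine, shared_hwm);
    return 0;
}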
--- src/backend/access/index/indexam.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/index/indexam.c 2014-05-31 19:53:09.276074073 -0400
@@ -79,6 +79,55 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit);
+
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch bitmap heap scans */
+
+/* if specified block number is present in the prefetch array,
+** then either mark it as not to be discarded or evict it according to input param
+*/
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ /* no need to check for scan->pfch_next < prefetch_index_scans
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) == blocknumber) {
+ if (markit) {
+ /* mark it as not to be discarded */
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard &= ~PREFTCHRC_BUF_PIN_INCREASED;
+ } else {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)), (&(((scan->pfch_block_item_list)+pfchz)->pfch_blockid)));
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard = ((scan->pfch_block_item_list)+pfchz)->pfch_discard;
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx = pfchz; /* advance */
+ }
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -253,6 +302,11 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -277,6 +331,11 @@ index_beginscan_bitmap(Relation indexRel
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -311,6 +370,9 @@ index_beginscan_internal(Relation indexR
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -342,6 +404,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -373,10 +441,30 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (((scan->pfch_block_item_list)+pfchx)->pfch_discard) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)));
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -472,6 +560,12 @@ index_getnext_tid(IndexScanDesc scan, Sc
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -479,6 +573,11 @@ index_getnext_tid(IndexScanDesc scan, Sc
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -502,6 +601,10 @@ index_getnext_tid(IndexScanDesc scan, Sc
* enough information to do it efficiently in the general case.
* ----------------
*/
+#if defined(USE_PREFETCH) && defined(AVOID_CATALOG_MIGRATION_FOR_ASYNCIO)
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -509,16 +612,109 @@ index_fetch_heap(IndexScanDesc scan)
bool all_dead = false;
bool got_heap_tuple;
+
+
/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
if (!scan->xs_continue_hot)
{
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+
+ /* If the old block is different from the new block, then evict the old
+ ** block from the prefetched array. It is arguable that we should leave it
+ ** in the array because it's likely to remain in the buffer pool
+ ** for a while, but in that case, if we encounter the block
+ ** again, prefetching it again does no harm
+ ** (and note that, if it's not pinned, prefetching it will try to
+ ** pin it, since prefetch tries to bank a pin for a buffer in the buffer pool),
+ ** so evicting it here should usually be a win.
+ */
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < prefetch_index_scans) /* ensure there is an entry */
+ ) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(prev_buf) , 0);
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+ /* If the new block had been prefetched and pinned,
+ ** then mark that it no longer needs to be discarded.
+ ** Of course, we don't evict the entry,
+ ** because we want to remember that it was recently prefetched.
+ */
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 1);
+#endif /* USE_PREFETCH */
+
+ scan->heap_tids_fetched++;
+
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found; /* did we find the "next" heap tid in current index page */
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+
+ if (scan->do_prefetch) {
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple; /* is incorrect but avoids adding function to catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ if (RegProcedureIsValid(scan->indexRelation->rd_am->ampeeknexttuple)) {
+ GET_SCAN_PROCEDURE(ampeeknexttuple); /* is correct but requires adding function to catalog */
+ } else {
+ procedure = 0;
+ }
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
+ if ( procedure /* does the index access method support peektuple? */
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ && procedure->fn_addr /* procedure->fn_addr is non-null only if in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ ) {
+ int iterations = 1; /* how many iterations of prefetching we shall try -
+ ** 2 if the number of used entries in the prefetch list is < target_prefetch_pages,
+ ** else 1;
+ ** this should ramp the prefetch depth up gradually and smoothly to target_prefetch_pages
+ */
+ /* note we trust InitIndexScan verified this scan is forwards-only and set do_prefetch accordingly */
+ if (scan->pfch_used < target_prefetch_pages) {
+ iterations = 2;
+ }
+ do {
+ found = DatumGetBool(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ btpeeknexttuple(scan) /* pass scan as a direct parameter since we can't use fmgr because it's not in the catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ FunctionCall1(procedure, PointerGetDatum(scan)) /* use fmgr to call it because in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ );
+ if (found) {
+ /* btpeeknexttuple set pfch_next to point to the item in block_item_list to be prefetched */
+ PrefetchBufferRc = PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((&((scan->pfch_block_item_list + scan->pfch_next))->pfch_blockid)) , 0);
+ /* elog(LOG,"index_fetch_heap prefetched rel %u blockNum %u"
+ ,scan->heapRelation->rd_node.relNode ,BlockIdGetBlockNumber(scan->pfch_block_item_list + scan->pfch_next));
+ */
+
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ (scan->pfch_block_item_list + scan->pfch_next)->pfch_discard = (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED);
+
+
+ }
+ } while (--iterations > 0);
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
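To make the pfch_block_item_list bookkeeping above easier to follow, here is a simplified standalone sketch of the mark-or-evict operation; the PfchItem/PfchList types and the fixed 8-entry capacity are inventions for the example, not the patch's definitions:

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int BlockNumber;

typedef struct PfchItem
{
    BlockNumber blkno;
    bool        discard;        /* a pin was banked; discard if never read */
} PfchItem;

typedef struct PfchList
{
    PfchItem     items[8];
    unsigned int used;          /* number of valid entries */
    unsigned int next;          /* slot most recently prefetched */
} PfchList;

static void
mark_or_evict(PfchList *l, BlockNumber blkno, bool markit)
{
    unsigned int i;

    for (i = 0; i < l->used; i++)
    {
        if (l->items[i].blkno != blkno)
            continue;

        if (markit)
        {
            l->items[i].discard = false;    /* block was read: no discard needed */
            return;
        }

        /* evict: shift everything after slot i one place to the left */
        l->used--;
        for (; i < l->used; i++)
        {
            l->items[i] = l->items[i + 1];
            if (l->next == i + 1)
                l->next = i;                /* keep "next" pointing at the same entry */
        }
        return;
    }
}

int
main(void)
{
    PfchList l = {{{10, true}, {11, true}, {12, true}}, 3, 2};

    mark_or_evict(&l, 11, true);    /* block 11 was read: keep entry, clear discard */
    mark_or_evict(&l, 10, false);   /* evict block 10: 11 and 12 shift left, next becomes 1 */
    printf("used=%u next=%u first=%u\n", l.used, l.next, l.items[0].blkno);
    return 0;
}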
--- src/backend/access/index/genam.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/index/genam.c 2014-05-31 19:53:09.296074123 -0400
@@ -77,6 +77,12 @@ RelationGetIndexScan(Relation indexRelat
scan = (IndexScanDesc) palloc(sizeof(IndexScanDescData));
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
+
scan->heapRelation = NULL; /* may be set later */
scan->indexRelation = indexRelation;
scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */
@@ -139,6 +145,19 @@ RelationGetIndexScan(Relation indexRelat
void
IndexScanEnd(IndexScanDesc scan)
{
+#ifdef USE_PREFETCH
+ if (scan->do_prefetch) {
+ if ( (struct pfch_block_item*)0 != scan->pfch_block_item_list ) {
+ pfree(scan->pfch_block_item_list);
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+ }
+ if ( (struct pfch_index_pagelist*)0 != scan->pfch_index_page_list ) {
+ pfree(scan->pfch_index_page_list);
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
if (scan->keyData != NULL)
pfree(scan->keyData);
if (scan->orderByData != NULL)
--- src/backend/access/nbtree/nbtsearch.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/nbtree/nbtsearch.c 2014-05-31 19:53:09.324074194 -0400
@@ -23,13 +23,16 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+extern unsigned int prefetch_btree_heaps; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+extern unsigned int prefetch_sequential_index_scans; /* boolean whether to prefetch sequential-access non-bitmap index scans */
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
+static Buffer _bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -226,7 +229,7 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
continue;
}
@@ -1005,7 +1008,7 @@ _bt_first(IndexScanDesc scan, ScanDirect
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1040,6 +1043,8 @@ _bt_next(IndexScanDesc scan, ScanDirecti
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1052,11 +1057,53 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on the next page, provided that:
+ ** EITHER . we have been reading non-sequentially, previously or for this block
+ ** OR . the user explicitly asked to prefetch sequential patterns too,
+ ** as it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
else
{
@@ -1065,11 +1112,53 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on the next page, provided that:
+ ** EITHER . we have been reading non-sequentially, previously or for this block
+ ** OR . the user explicitly asked to prefetch sequential patterns too,
+ ** as it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
/* OK, itemIndex says what to return */
@@ -1119,9 +1208,11 @@ _bt_readpage(IndexScanDesc scan, ScanDir
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we save it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1156,6 +1247,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1187,6 +1279,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1224,7 +1317,7 @@ _bt_saveitem(BTScanOpaque so, int itemIn
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1278,7 +1371,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list);
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1287,9 +1380,20 @@ _bt_steppage(IndexScanDesc scan, ScanDir
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if ( prefetch && so->currPos.moreRight
+ /* start prefetch on the next page, provided that:
+ ** EITHER . we're reading non-sequentially for this block
+ ** OR . the user explicitly asked to prefetch sequential patterns too,
+ ** as it may be counterproductive otherwise
+ */
+ && (prefetch_sequential_index_scans || opaque->btpo_next != (blkno+1))
+ ) {
+ _bt_prefetchbuf(rel, opaque->btpo_next , &scan->pfch_index_page_list);
+ }
break;
}
+ }
/* nope, keep going */
blkno = opaque->btpo_next;
}
@@ -1317,7 +1421,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ so->currPos.buf = _bt_walk_left(scan , rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -1332,14 +1436,58 @@ _bt_steppage(IndexScanDesc scan, ScanDir
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and increase prefetch window blindly
+ * downwards 2 blocks at a time. This only works in our favor
+ * for index-only scans, by merging read requests in the kernel,
+ * so we want to inflate target_prefetch_pages since merged
+ * back-sequential requests are about as expensive as a single one
+ */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev , &scan->pfch_index_page_list);
+ }
+ }
break;
}
}
}
+ }
return true;
}
@@ -1359,7 +1507,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
@@ -1387,7 +1535,7 @@ _bt_walk_left(Relation rel, Buffer buf)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1631,7 +1779,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDir
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
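The heap-prefetch trigger used twice in the _bt_next() changes above reduces to a small test on the TID counters: prefetch only after at least 256 TIDs have been seen and at least 15/16 of them (about 94%) resulted in a heap fetch. A standalone sketch of just that test, with the counters passed in as plain integers:

#include <stdio.h>
#include <stdbool.h>

static bool
want_heap_prefetch(unsigned long tids_seen, unsigned long tids_fetched)
{
    if (tids_seen <= 256)
        return false;           /* not enough statistics yet */

    /* seen - seen/16 <= fetched  <=>  fetched/seen >= 15/16 (~93.75%) */
    return (tids_seen - tids_seen / 16) <= tids_fetched;
}

int
main(void)
{
    printf("%d\n", want_heap_prefetch(1000, 950));  /* 95%: prefetch */
    printf("%d\n", want_heap_prefetch(1000, 900));  /* 90%: do not */
    printf("%d\n", want_heap_prefetch(100, 100));   /* too few samples: do not */
    return 0;
}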
--- src/backend/access/nbtree/nbtinsert.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/nbtree/nbtinsert.c 2014-05-31 19:53:09.356074275 -0400
@@ -793,7 +793,7 @@ _bt_insertonpg(Relation rel,
{
Assert(!P_ISLEAF(lpageop));
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -972,7 +972,7 @@ _bt_split(Relation rel, Buffer buf, Buff
bool isleaf;
/* Acquire a new page to split into */
- rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1175,7 +1175,7 @@ _bt_split(Relation rel, Buffer buf, Buff
if (!P_RIGHTMOST(oopaque))
{
- sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
spage = BufferGetPage(sbuf);
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -1817,7 +1817,7 @@ _bt_finish_split(Relation rel, Buffer lb
Assert(P_INCOMPLETE_SPLIT(lpageop));
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
@@ -1829,7 +1829,7 @@ _bt_finish_split(Relation rel, Buffer lb
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1877,7 +1877,7 @@ _bt_getstackbuf(Relation rel, BTStack st
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2008,12 +2008,12 @@ _bt_newroot(Relation rel, Buffer lbuf, B
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
--- src/backend/access/nbtree/nbtpage.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/nbtree/nbtpage.c 2014-05-31 19:53:09.384074346 -0400
@@ -127,7 +127,7 @@ _bt_getroot(Relation rel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -153,7 +153,7 @@ _bt_getroot(Relation rel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -209,7 +209,7 @@ _bt_getroot(Relation rel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -350,7 +350,7 @@ _bt_gettrueroot(Relation rel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -436,7 +436,7 @@ _bt_getrootheight(Relation rel)
Page metapg;
BTPageOpaque metaopaque;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -562,6 +562,170 @@ _bt_log_reuse_page(Relation rel, BlockNu
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ * and keep track of prefetched and unread blocknums in pagelist.
+ * input parms :
+ * rel and blockno identify block to be prefetched as usual
+ * pfch_index_page_list_P points to the pointer anchoring the head of the index page list
+ * Since the pagelist is a kind of optimization,
+ * handle palloc failure by quietly omitting the keeping track.
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P)
+{
+
+ int rc = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_item* found_item = 0;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_plp = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_plp = *pfch_index_page_list_P;
+ }
+
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* prefetch an existing block of the relation
+ ** but first, check it has not recently already been prefetched and not yet read
+ */
+ found_item = _bt_find_block(blkno , pfch_index_plp);
+ if ((struct pfch_index_item*)0 == found_item) { /* not found */
+
+ rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno , 0);
+
+ /* add the pagenum to the list, indicating its discard status;
+ ** since it's only an optimization, ignore failures such as exceeding the allowed space
+ */
+ _bt_add_block( blkno , pfch_index_page_list_P , (uint32)(rc & PREFTCHRC_BUF_PIN_INCREASED));
+
+ }
+ }
+ return;
+}
+
+/* _bt_find_block finds the item referencing specified Block in index page list if present
+** and returns the pointer to the pfch_index_item if found, or null if not
+*/
+struct pfch_index_item*
+_bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+
+ struct pfch_index_item* found_item = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ int ix, tx;
+
+ pfch_index_plp = pfch_index_page_list;
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ ix = 0;
+ tx = pfch_index_plp->pfch_index_item_count;
+ while ( (ix < tx)
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ found_item = &pfch_index_plp->pfch_indexid[ix];
+ }
+ ix++;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+
+ return found_item;
+}
+
+/* _bt_add_block adds the specified Block to the index page list
+** and returns 0 if successful, non-zero if not
+*/
+int
+_bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status)
+{
+ int rc = 1;
+ int ix;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_pagelist* pfch_index_page_list_anchor; /* pointer to first chunk if any */
+ /* allow expansion of the pagelist to 16 chunks,
+ ** which accommodates backwards-sequential index scans
+ ** where the scanner increases target_prefetch_pages by a factor of up to 16
+ ** (see the code in _bt_steppage).
+ ** Note: this creates an undesirable weak dependency on that number in _bt_steppage,
+ ** but there is no disaster if the numbers disagree - just sub-optimal use of the list.
+ ** Implementing a proper interface would require chunks of variable size,
+ ** which would require an extra size variable in each chunk.
+ */
+ int num_chunks = 16;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_page_list_anchor = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_page_list_anchor = *pfch_index_page_list_P;
+ }
+ pfch_index_plp = pfch_index_page_list_anchor; /* pointer to current chunk */
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ if (ix < target_prefetch_pages) {
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = (ix+1);
+ rc = 0;
+ goto stored_pagenum;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ num_chunks--; /* keep track of number of chunks */
+ }
+
+ /* we did not find any free space in existing chunks -
+ ** create new chunk if within our limit and we have a pfch_index_page_list
+ */
+ if ( (num_chunks > 0) && ((struct pfch_index_pagelist*)0 != pfch_index_page_list_anchor) ) {
+ pfch_index_plp = (struct pfch_index_pagelist*)palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ if ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ pfch_index_plp->pfch_index_pagelist_next = pfch_index_page_list_anchor; /* old head of list is next after this */
+ pfch_index_plp->pfch_indexid[0].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[0].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = 1;
+ *pfch_index_page_list_P = pfch_index_plp; /* new head of list is the new chunk */
+ rc = 0;
+ }
+ }
+
+ stored_pagenum:;
+ return rc;
+}
+
+/* _bt_subtract_block removes a block from the prefetched-but-unread pagelist if present */
+void
+_bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+ struct pfch_index_pagelist* pfch_index_plp = pfch_index_page_list;
+ if ( (blkno != P_NEW) && (blkno != P_NONE) ) {
+ int ix , jx;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ /* move the last item to the current (now deleted) position and decrement the count */
+ jx = (pfch_index_plp->pfch_index_item_count-1); /* index of last item ... */
+ if (jx > ix) { /* ... is not the current one so move is required */
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = pfch_index_plp->pfch_indexid[jx].pfch_blocknum;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = pfch_index_plp->pfch_indexid[jx].pfch_discard;
+ ix = jx;
+ }
+ pfch_index_plp->pfch_index_item_count = ix;
+ goto done_subtract;
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+ }
+ done_subtract: return;
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
@@ -573,7 +737,7 @@ _bt_log_reuse_page(Relation rel, BlockNu
* _bt_checkpage to sanity-check the page (except in P_NEW case).
*/
Buffer
-_bt_getbuf(Relation rel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list)
{
Buffer buf;
@@ -581,6 +745,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
{
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
+
+ /* if the block is in the prefetched-but-unread pagelist, remove it */
+ _bt_subtract_block( blkno , pfch_index_page_list);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
}
@@ -702,6 +870,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
* bufmgr when one would do. However, now it's mainly just a notational
* convenience. The only case where it saves work over _bt_relbuf/_bt_getbuf
* is when the target page is the same one already in the buffer.
+ *
+ * if prefetching of index pages is changed to use this function,
+ * then it should be extended to take the index_page_list as parameter
+ * and call _bt_subtract_block in the same way that _bt_getbuf does.
*/
Buffer
_bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
@@ -712,6 +884,7 @@ _bt_relandgetbuf(Relation rel, Buffer ob
if (BufferIsValid(obuf))
LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
return buf;
@@ -965,7 +1138,7 @@ _bt_is_page_halfdead(Relation rel, Block
BTPageOpaque opaque;
bool result;
- buf = _bt_getbuf(rel, blk, BT_READ);
+ buf = _bt_getbuf(rel, blk, BT_READ , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1069,7 +1242,7 @@ _bt_lock_branch_parent(Relation rel, Blo
Page lpage;
BTPageOpaque lopaque;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
@@ -1265,7 +1438,7 @@ _bt_pagedel(Relation rel, Buffer buf)
BTPageOpaque lopaque;
Page lpage;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
@@ -1340,7 +1513,7 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ buf = _bt_getbuf(rel, rightsib, BT_WRITE, (struct pfch_index_pagelist*)0);
}
return ndeleted;
@@ -1593,7 +1766,7 @@ _bt_unlink_halfdead_page(Relation rel, B
target = topblkno;
/* fetch the block number of the topmost parent's left sibling */
- buf = _bt_getbuf(rel, topblkno, BT_READ);
+ buf = _bt_getbuf(rel, topblkno, BT_READ, (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
@@ -1632,7 +1805,7 @@ _bt_unlink_halfdead_page(Relation rel, B
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -1646,7 +1819,7 @@ _bt_unlink_halfdead_page(Relation rel, B
RelationGetRelationName(rel));
return false;
}
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
@@ -1701,7 +1874,7 @@ _bt_unlink_halfdead_page(Relation rel, B
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
@@ -1731,7 +1904,7 @@ _bt_unlink_halfdead_page(Relation rel, B
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
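One detail of the new pagelist code above that is easy to misread is the deletion step in _bt_subtract_block(): because the per-chunk array is unordered, a removed entry is simply overwritten by the last entry and the count decremented. A tiny standalone sketch of that idiom (the Chunk type and sizes are made up for the example):

#include <stdio.h>

typedef unsigned int BlockNumber;

typedef struct Chunk
{
    BlockNumber  blknos[16];
    unsigned int count;
} Chunk;

static void
chunk_remove(Chunk *c, BlockNumber blkno)
{
    unsigned int i;

    for (i = 0; i < c->count; i++)
    {
        if (c->blknos[i] == blkno)
        {
            c->count--;
            c->blknos[i] = c->blknos[c->count];     /* move the last entry into the hole */
            return;
        }
    }
}

int
main(void)
{
    Chunk c = {{5, 9, 7, 3}, 4};

    chunk_remove(&c, 9);
    printf("count=%u slot1=%u\n", c.count, c.blknos[1]);    /* count=3 slot1=3 */
    return 0;
}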
--- src/backend/access/nbtree/nbtree.c.orig 2014-05-31 17:19:07.785208591 -0400
+++ src/backend/access/nbtree/nbtree.c 2014-05-31 19:53:09.416074427 -0400
@@ -30,6 +30,18 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch non-bitmap index scans also numeric size of pfch_list */
+#endif /* USE_PREFETCH */
+
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+);
/* Working state for btbuild and its callback */
typedef struct
@@ -332,6 +344,74 @@ btgettuple(PG_FUNCTION_ARGS)
}
/*
+ * btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_block_item_list
+ * without reading a new index page
+ * and without causing any side-effects such as altering values in control blocks
+ * if found, store blocknum in next element of pfch_block_item_list
+ */
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+)
+{
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+ /*
+ * If we've already initialized this scan, we can just advance it in
+ * the appropriate direction. If we haven't done so yet, bail out
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop handles advancing till we find different data block or end of index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < prefetch_index_scans) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= prefetch_index_scans) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((&((scan->pfch_block_item_list + scan->pfch_next)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+ PG_RETURN_BOOL(res);
+}
+
+/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
Datum
@@ -425,6 +505,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -516,6 +602,23 @@ btendscan(PG_FUNCTION_ARGS)
{
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ struct pfch_index_pagelist* pfch_index_plp;
+ int ix;
+
+#ifdef USE_PREFETCH
+
+ /* discard all prefetched but unread index pages listed in the pagelist */
+ pfch_index_plp = scan->pfch_index_page_list;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_discard) {
+ DiscardBuffer( scan->indexRelation , MAIN_FORKNUM , pfch_index_plp->pfch_indexid[ix].pfch_blocknum);
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+#endif /* USE_PREFETCH */
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
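The essence of btpeeknexttuple() above is a side-effect-free look-ahead: walk the TIDs already copied from the current index page and report the first heap block that is not yet in the prefetch list. A simplified standalone sketch of that search, with the page's heap block numbers and the prefetch list passed as plain arrays (all names invented for the example):

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int BlockNumber;

static bool
peek_next_block(const BlockNumber *page_blocks, int cur, int last,
                const BlockNumber *pfch, int pfch_used,
                BlockNumber *out)
{
    int i, j;

    for (i = cur + 1; i <= last; i++)
    {
        bool already = false;

        for (j = 0; j < pfch_used; j++)
        {
            if (pfch[j] == page_blocks[i])
            {
                already = true;
                break;
            }
        }
        if (!already)
        {
            *out = page_blocks[i];      /* candidate to hand to PrefetchBuffer() */
            return true;
        }
    }
    return false;                       /* nothing new on this index page */
}

int
main(void)
{
    BlockNumber tids[] = {10, 10, 11, 12};
    BlockNumber pfch[] = {10, 11};
    BlockNumber next;

    if (peek_next_block(tids, 0, 3, pfch, 2, &next))
        printf("prefetch block %u\n", next);    /* prints 12 */
    return 0;
}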
--- src/backend/nodes/tidbitmap.c.orig 2014-05-31 17:19:07.817208676 -0400
+++ src/backend/nodes/tidbitmap.c 2014-05-31 19:53:09.440074487 -0400
@@ -44,6 +44,9 @@
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "utils/hsearch.h"
+#ifdef USE_PREFETCH
+extern int target_prefetch_pages;
+#endif /* USE_PREFETCH */
/*
* The maximum number of tuples per page is not large (typically 256 with
@@ -572,7 +575,12 @@ tbm_begin_iterate(TIDBitmap *tbm)
* needs of the TBMIterateResult sub-struct.
*/
iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)
+#ifdef USE_PREFETCH
+ /* space for remembering every prefetched but unread blockno */
+ + (target_prefetch_pages * sizeof(BlockNumber))
+#endif /* USE_PREFETCH */
+ );
iterator->tbm = tbm;
/*
@@ -1020,3 +1028,68 @@ tbm_comparator(const void *left, const v
return 1;
return 0;
}
+
+void
+tbm_zero(TBMIterator *iterator) /* zero list of prefetched and unread blocknos */
+{
+ /* locate the list of prefetched but unread blocknos immediately following the array of offsets
+ ** and note that tbm_begin_iterate allocates space for (1 + MAX_TUPLES_PER_PAGE) offsets -
+ ** 1 included in struct TBMIterator and MAX_TUPLES_PER_PAGE additional
+ */
+ iterator->output.Unread_Pfetched_base = ((BlockNumber *)(&(iterator->output.offsets[MAX_TUPLES_PER_PAGE+1])));
+ iterator->output.Unread_Pfetched_next = iterator->output.Unread_Pfetched_count = 0;
+}
+
+void
+tbm_add(TBMIterator *iterator, BlockNumber blockno) /* add this blockno to list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next + iterator->output.Unread_Pfetched_count++;
+
+ if (iterator->output.Unread_Pfetched_count > target_prefetch_pages) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_add overflowed list cannot add blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index -= target_prefetch_pages;
+ *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index) = blockno;
+}
+
+void
+tbm_subtract(TBMIterator *iterator, BlockNumber blockno) /* remove this blockno from list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next++;
+ BlockNumber nextUnreadPfetched;
+
+ /* make a weak check that the next blockno is the one to be removed,
+ ** although in case of disagreement we ignore the caller's blockno and remove the next one anyway,
+ ** which is really what the caller wants
+ */
+ if ( iterator->output.Unread_Pfetched_count == 0 ) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract empty list cannot subtract blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index = 0;
+ nextUnreadPfetched = *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index);
+ if ( ( nextUnreadPfetched != blockno )
+ && ( nextUnreadPfetched != InvalidBlockNumber ) /* don't report it if the block in the list was InvalidBlockNumber */
+ ) {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract will subtract blockno %d not %d",
+ nextUnreadPfetched, blockno)));
+ }
+ if (iterator->output.Unread_Pfetched_next >= target_prefetch_pages)
+ iterator->output.Unread_Pfetched_next = 0;
+ iterator->output.Unread_Pfetched_count--;
+}
+
+TBMIterateResult *
+tbm_locate_IterateResult(TBMIterator *iterator)
+{
+ return &(iterator->output);
+}
--- src/backend/utils/misc/guc.c.orig 2014-05-31 17:19:07.949209027 -0400
+++ src/backend/utils/misc/guc.c 2014-05-31 19:53:09.484074599 -0400
@@ -2264,6 +2264,25 @@ static struct config_int ConfigureNamesI
},
{
+ {"max_async_io_prefetchers",
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ PGC_USERSET,
+#else
+ PGC_INTERNAL,
+#endif
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum number of background processes concurrently using asynchronous librt threads to prefetch pages into shared memory buffers."),
+ },
+ &max_async_io_prefetchers,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ -1, 0, 8192, /* boot val -1 indicates to initialize to something sensible during buf_init */
+#else
+ 0, 0, 0,
+#endif
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
--- src/backend/utils/mmgr/aset.c.orig 2014-05-31 17:19:07.949209027 -0400
+++ src/backend/utils/mmgr/aset.c 2014-05-31 19:53:09.496074629 -0400
@@ -733,6 +733,48 @@ AllocSetAlloc(MemoryContext context, Siz
*/
fidx = AllocSetFreeIndex(size);
chunk = set->freelist[fidx];
+#ifdef MEMORY_CONTEXT_CHECKING
+ /* an instance of a segfault caused by a rogue value in set->freelist[fidx]
+ ** has been seen - check for it using a crude sanity check based on neighbours:
+ ** if at least one neighbour is sufficiently close, pass; else fail
+ */
+ if (chunk != 0) {
+ int frx, nrx; /* frx is index, nrx is index of failing neighbour for errmsg */
+ for (nrx = -1, frx = 0; (frx < ALLOCSET_NUM_FREELISTS); frx++) {
+ if ( (frx != fidx) /* not the chosen one */
+ && ( ( (unsigned long)(set->freelist[frx]) ) != 0 ) /* not empty */
+ ) {
+ if ( ( (unsigned long)chunk < ( ( (unsigned long)(set->freelist[frx]) ) / 2 ) )
+ && ( ( (unsigned long)(set->freelist[frx]) ) < 0x4000000 )
+ /*** || ( (unsigned long)chunk > ( ( (unsigned long)(set->freelist[frx]) ) * 2 ) ) ***/
+ ) {
+ nrx = frx;
+ } else {
+ nrx = -1;
+ break;
+ }
+ }
+ }
+
+ if (nrx >= 0) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d compared with neighbour %p whose chunksize %d"
+ , chunk , fidx , set->freelist[nrx] , set->freelist[nrx]->size);
+ chunk = NULL;
+ }
+ }
+#else /* if not MEMORY_CONTEXT_CHECKING make very simple-minded check*/
+ if ( (chunk != 0) && ( (unsigned long)chunk < 0x40000 ) ) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d"
+ , chunk , fidx);
+ chunk = NULL;
+ }
+#endif
if (chunk != NULL)
{
Assert(chunk->size >= size);
--- src/include/executor/instrument.h.orig 2014-05-31 17:19:07.997209154 -0400
+++ src/include/executor/instrument.h 2014-05-31 19:53:09.536074730 -0400
@@ -28,8 +28,18 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+
+ long aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_discrd; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_forgot; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
} BufferUsage;
/* Flag bits included in InstrAlloc's instrument_options bitmask */
--- src/include/storage/bufmgr.h.orig 2014-05-31 17:19:08.005209176 -0400
+++ src/include/storage/bufmgr.h 2014-05-31 19:53:09.572074821 -0400
@@ -41,6 +41,7 @@ typedef enum
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
+ ,RBM_NOREAD_FOR_PREFETCH /* Don't read from disk, don't zero buffer, find buffer only */
} ReadBufferMode;
/* in globals.c ... this duplicates miscadmin.h */
@@ -57,6 +58,9 @@ extern int target_prefetch_pages;
extern PGDLLIMPORT char *BufferBlocks;
extern PGDLLIMPORT int32 *PrivateRefCount;
+/* in buf_async.c */
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
@@ -159,9 +163,15 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
- * prototypes for functions in bufmgr.c
+ * prototypes for external functions in bufmgr.c and buf_async.c
*/
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
+extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum , BufferAccessStrategy strategy);
+/* return code is an int bitmask : */
+#define PREFTCHRC_BUF_PIN_INCREASED 0x01 /* pin count on buffer has been increased by 1 */
+#define PREFTCHRC_BLK_ALREADY_PRESENT 0x02 /* block was already present in a buffer */
+
+extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
--- src/include/storage/smgr.h.orig 2014-05-31 17:19:08.009209186 -0400
+++ src/include/storage/smgr.h 2014-05-31 19:53:09.604074902 -0400
@@ -92,6 +92,12 @@ extern void smgrextend(SMgrRelation reln
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void smgrinitaio(int max_aio_threads, int max_aio_num);
+extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode);
+extern void smgrcompleteaio( SMgrRelation reln, char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
@@ -118,6 +124,11 @@ extern void mdextend(SMgrRelation reln,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void mdinitaio(int max_aio_threads, int max_aio_num);
+extern void mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode );
+extern void mdcompleteaio( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
--- src/include/storage/fd.h.orig 2014-05-31 17:19:08.009209186 -0400
+++ src/include/storage/fd.h 2014-05-31 19:53:09.632074973 -0400
@@ -69,6 +69,11 @@ extern File PathNameOpenFile(FileName fi
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void FileInitaio(int max_aio_threads, int max_aio_num );
+extern int FileStartaio(File file, off_t offset, int amount , char *aiocbp);
+extern int FileCompleteaio( char *aiocbp , int normal_wait );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern int FileRead(File file, char *buffer, int amount);
extern int FileWrite(File file, char *buffer, int amount);
extern int FileSync(File file);
--- src/include/storage/buf_internals.h.orig 2014-05-31 17:19:08.005209176 -0400
+++ src/include/storage/buf_internals.h 2014-05-31 19:53:09.656075033 -0400
@@ -22,7 +22,9 @@
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Flags for buffer descriptors
@@ -38,8 +40,23 @@
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1 << 7) /* must write for checkpoint */
-#define BM_PERMANENT (1 << 8) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1 << 8) /* permanent relation (not unlogged) */
+#define BM_AIO_IN_PROGRESS (1 << 9) /* aio in progress */
+#define BM_AIO_PREFETCH_PIN_BANKED (1 << 10) /* pinned when prefetch issued
+ ** and this pin is banked - i.e.
+ ** redeemable by the next use by same task
+ ** note that for any one buffer, a pin can be banked
+ ** by at most one process globally,
+ ** that is, only one process may bank a pin on the buffer
+ ** and it may do so only once (may not be stacked)
+ */
+
+/*********
+for asynchronous aio-read prefetching, two golden rules concerning buffer pinning and buffer-header flags must be observed:
+ R1. a buffer marked as BM_AIO_IN_PROGRESS must be pinned by at least one backend
+ R2. a buffer marked as BM_AIO_PREFETCH_PIN_BANKED must be pinned by the backend identified by
+ (buf->flags & BM_AIO_IN_PROGRESS) ? (((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio) : (-(buf->freeNext))
+*********/
typedef bits16 BufFlags;
@@ -140,17 +157,83 @@ typedef struct sbufdesc
BufFlags flags; /* see bit definitions above */
uint16 usage_count; /* usage counter for clock sweep code */
unsigned refcount; /* # of backends holding pins on buffer */
- int wait_backend_pid; /* backend PID of pin-count waiter */
+ int wait_backend_pid; /* if flags & BM_PIN_COUNT_WAITER
+ ** then backend PID of pin-count waiter
+ ** else not set
+ */
slock_t buf_hdr_lock; /* protects the above fields */
int buf_id; /* buffer's index number (from 0) */
- int freeNext; /* link in freelist chain */
+ int volatile freeNext; /* overloaded and much-abused field :
+ ** EITHER
+ ** if >= 0
+ ** then link in freelist chain
+ ** OR
+ ** if < 0
+ ** then EITHER
+ ** if flags & BM_AIO_IN_PROGRESS
+ ** then negative of (the index of the aiocb in the BufferAiocbs array + 3)
+ ** else if flags & BM_AIO_PREFETCH_PIN_BANKED
+ ** then -(pid of task that issued aio_read and pinned buffer)
+ ** else one of the special values -1 or -2 listed below
+ */
LWLock *io_in_progress_lock; /* to wait for I/O to complete */
LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
+/* structures for control blocks for our implementation of async io */
+
+/* if USE_AIO_ATOMIC_BUILTIN_COMP_SWAP is not defined, the following struct is not put into use at runtime
+** but it is easier to let the compiler find the definition but hide the reference to aiocb
+** which is the only type it would not understand
+*/
+
+struct BufferAiocb {
+ struct BufferAiocb volatile * volatile BAiocbnext; /* next free entry or value of BAIOCB_OCCUPIED means in use */
+ struct sbufdesc volatile * volatile BAiocbbufh; /* there can be at most one BufferDesc marked BM_AIO_IN_PROGRESS
+ ** and using this BufferAiocb -
+ ** if there is one, BAiocbbufh points to it, else BAiocbbufh is zero
+ ** NOTE BAiocbbufh should be zero for every BufferAiocb on the free list
+ */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct aiocb volatile BAiocbthis; /* the aio library's control block for one async io */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int volatile BAiocbDependentCount; /* count of tasks who depend on this BufferAiocb
+ ** in the sense that they are waiting for io completion.
+ ** only a Dependent may move the BufferAiocb onto the freelist
+ ** and only when that Dependent is the *only* Dependent (count == 1)
+ ** BAiocbDependentCount is protected by bufferheader spinlock
+ ** and must be updated only when that spinlock is held
+ */
+ pid_t volatile pidOfAio; /* pid of backend who issued an aio_read using this BAiocb -
+ ** this backend must have pinned the associated buffer.
+ */
+};
+
+#define BAIOCB_OCCUPIED 0x75f1 /* distinct indicator of a BufferAiocb.BAiocbnext that is NOT on free list */
+#define BAIOCB_FREE 0x7b9d /* distinct indicator of a BufferAiocb.BAiocbbufh that IS on free list */
+
+struct BAiocbAnchor { /* anchor for all control blocks pertaining to aio */
+ volatile struct BufferAiocb* BufferAiocbs; /* aiocbs ... */
+ volatile struct BufferAiocb* volatile FreeBAiocbs; /* ... and their free list */
+};
+
+/* values for BufCheckAsync input and retcode */
+#define BUF_INTENTION_WANT 1 /* wants the buffer, wait for in-progress aio and then pin */
+#define BUF_INTENTION_REJECT_KEEP_PIN -1 /* pin already held, do not unpin */
+#define BUF_INTENTION_REJECT_OBTAIN_PIN -2 /* obtain pin, caller wants it for same buffer */
+#define BUF_INTENTION_REJECT_FORGET -3 /* unpin and tell resource owner to forget */
+#define BUF_INTENTION_REJECT_NOADJUST -4 /* unpin and call ResourceOwnerForgetBuffer */
+#define BUF_INTENTION_REJECT_UNBANK -5 /* unpin only if pin banked by caller */
+
+#define BUF_INTENT_RC_CHANGED_TAG -5
+#define BUF_INTENT_RC_BADPAGE -4
+#define BUF_INTENT_RC_INVALID_AIO -3 /* invalid and aio was in progress */
+#define BUF_INTENT_RC_INVALID_NO_AIO -1 /* invalid and no aio was in progress */
+#define BUF_INTENT_RC_VALID 1
+
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
/*
@@ -159,6 +242,7 @@ typedef struct sbufdesc
*/
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+#define FREENEXT_BAIOCB_ORIGIN (-3)
/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
--- src/include/catalog/pg_am.h.orig 2014-05-31 17:19:07.989209134 -0400
+++ src/include/catalog/pg_am.h 2014-05-31 19:53:09.676075084 -0400
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
--- src/include/catalog/pg_proc.h.orig 2014-05-31 17:19:07.993209143 -0400
+++ src/include/catalog/pg_proc.h 2014-05-31 19:53:09.724075206 -0400
@@ -536,6 +536,12 @@ DESCR("convert float4 to int4");
DATA(insert OID = 330 ( btgettuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ btgettuple _null_ _null_ _null_ ));
DESCR("btree(internal)");
+
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+DATA(insert OID = 3251 ( btpeeknexttuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "2281" _null_ _null_ _null_ _null_ btpeeknexttuple _null_ _null_ _null_ ));
+DESCR("btree(internal)");
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
DATA(insert OID = 636 ( btgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ btgetbitmap _null_ _null_ _null_ ));
DESCR("btree(internal)");
DATA(insert OID = 331 ( btinsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ btinsert _null_ _null_ _null_ ));
--- src/include/pg_config_manual.h.orig 2014-05-31 17:19:08.005209176 -0400
+++ src/include/pg_config_manual.h 2014-05-31 19:53:09.740075246 -0400
@@ -138,9 +138,11 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * might in future be support for alternative low-level prefetch APIs --
+ * -- update October 2013 -- now there is such a new prefetch capability --
+ * async_io into postgres buffers - configuration parameter max_async_io_threads)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
#define USE_PREFETCH
#endif
--- src/include/access/nbtree.h.orig 2014-05-31 17:19:07.989209134 -0400
+++ src/include/access/nbtree.h 2014-05-31 19:53:09.768075317 -0400
@@ -19,6 +19,7 @@
#include "access/sdir.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "access/relscan.h"
#include "catalog/pg_index.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
@@ -524,6 +525,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -603,6 +605,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -655,7 +666,11 @@ extern Buffer _bt_getroot(Relation rel,
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P);
+extern struct pfch_index_item* _bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+extern int _bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status);
+extern void _bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
--- src/include/access/heapam.h.orig 2014-05-31 17:19:07.989209134 -0400
+++ src/include/access/heapam.h 2014-05-31 19:53:09.792075377 -0400
@@ -175,7 +175,7 @@ extern void heap_page_prune_execute(Buff
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
--- src/include/access/relscan.h.orig 2014-05-31 17:19:07.989209134 -0400
+++ src/include/access/relscan.h 2014-05-31 19:53:09.808075418 -0400
@@ -44,6 +44,24 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+
+ /* Unread_Pfetched is a "mostly" circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ ** "mostly" means that there may be gaps caused by storing entries for blocks which do not need to be discarded -
+ ** these are indicated by blockno = InvalidBlockNumber, and these slots are reused when found.
+ */
+ BlockNumber *rs_Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int rs_Unread_Pfetched_next; /* where the next unread blockno probably is relative to start --
+ ** this is only a hint which may be temporarily stale.
+ */
+ unsigned int rs_Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
+
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -55,6 +73,27 @@ typedef struct HeapScanDescData
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
+/* pfch_index_items track prefetched and unread index pages - chunks of blocknumbers are chained in singly-linked list from scan->pfch_index_item_list */
+struct pfch_index_item { /* index-relation BlockIds which we will/have prefetched */
+ BlockNumber pfch_blocknum; /* Blocknum which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+struct pfch_block_item {
+ struct BlockIdData pfch_blockid; /* BlockId which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+/* pfch_index_page_items track prefetched and unread index pages -
+** chunks of blocknumbers are chained backwards (newest first, oldest last)
+** in singly-linked list from scan->pfch_index_item_list
+*/
+struct pfch_index_pagelist { /* index-relation BlockIds which we will/have prefetched */
+ struct pfch_index_pagelist* pfch_index_pagelist_next; /* pointer to next chunk if any */
+ unsigned int pfch_index_item_count; /* number of used entries in this chunk */
+ struct pfch_index_item pfch_indexid[1]; /* in-line list of Blocknums which we will/have prefetched and whether to be discarded */
+};
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -75,8 +114,15 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_page_list; /* array of index-relation BlockIds which we will/have prefetched */
+ struct pfch_block_item* pfch_block_item_list; /* array of heap-relation BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
@@ -91,6 +137,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
--- src/include/nodes/tidbitmap.h.orig 2014-05-31 17:19:08.001209165 -0400
+++ src/include/nodes/tidbitmap.h 2014-05-31 19:53:09.836075488 -0400
@@ -41,6 +41,16 @@ typedef struct
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
+#ifdef USE_PREFETCH
+ /* Unread_Pfetched is a circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ */
+ BlockNumber *Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
OffsetNumber offsets[1]; /* VARIABLE LENGTH ARRAY */
} TBMIterateResult; /* VARIABLE LENGTH STRUCT */
@@ -62,5 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
extern void tbm_end_iterate(TBMIterator *iterator);
-
+extern void tbm_zero(TBMIterator *iterator); /* zero list of prefetched and unread blocknos */
+extern void tbm_add(TBMIterator *iterator, BlockNumber blockno); /* add this blockno to list of prefetched and unread blocknos */
+extern void tbm_subtract(TBMIterator *iterator, BlockNumber blockno); /* remove this blockno from list of prefetched and unread blocknos */
+extern TBMIterateResult *tbm_locate_IterateResult(TBMIterator *iterator); /* locate the TBMIterateResult of an iterator */
#endif /* TIDBITMAP_H */
--- src/include/utils/rel.h.orig 2014-05-31 17:19:08.013209197 -0400
+++ src/include/utils/rel.h 2014-05-31 19:53:09.864075560 -0400
@@ -61,6 +61,7 @@ typedef struct RelationAmInfo
FmgrInfo ammarkpos;
FmgrInfo amrestrpos;
FmgrInfo amcanreturn;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
--- src/include/pg_config.h.in.orig 2014-05-31 17:19:08.001209165 -0400
+++ src/include/pg_config.h.in 2014-05-31 19:53:09.896075641 -0400
@@ -1,4 +1,4 @@
-/* src/include/pg_config.h.in. Generated from configure.in by autoheader. */
+/* src/include/pg_config.h.in. Generated from - by autoheader. */
/* Define to the type of arg 1 of 'accept' */
#undef ACCEPT_TYPE_ARG1
@@ -748,6 +748,10 @@
/* Define to the appropriate snprintf format for unsigned 64-bit ints. */
#undef UINT64_FORMAT
+/* Define to select librt-style async io and the gcc atomic compare_and_swap.
+ */
+#undef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
On 05/31/14 20:44, johnlumby wrote:
On 05/30/14 09:36, Claudio Freire wrote:
Good point. I have included the guts of your little test program
(modified to do polling) into the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answer your concern about robustness,
as at least now if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.
I meant to add that by including this test, which involves a fork(),
in the autoconf tester, on Windows
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP would always be un-defined.
(But could then be defined manually if someone wanted to give it a try)
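For concreteness, the kind of behaviour such a configure probe has to detect can be sketched as a small standalone program along the following lines. This is only an illustration, not the actual test added to config/c-library.m4: the file name, sizes and polling interval are made up, and it deliberately exercises the implementation-defined case under discussion, namely polling aio_error() from a process other than the one that issued the aio_read(), with the aiocb kept in shared memory.

/* aio_probe.c - illustrative sketch only, not the real configure test.
 * The parent issues aio_read() on an aiocb kept in a shared mapping;
 * a forked child then polls aio_error() on that same aiocb.
 * Build with:  cc aio_probe.c -lrt
 */
#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    /* keep the aiocb (and the read buffer) in a shared anonymous mapping,
     * so that a forked child sees the completion status which the parent's
     * librt thread stores into it */
    struct aiocb *cb = mmap(NULL, sizeof(struct aiocb) + 512,
                            PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    char        tmpl[] = "/tmp/aio_probe_XXXXXX";   /* scratch file, name is arbitrary */
    int         fd;
    pid_t       child;
    int         status = 0;

    if (cb == MAP_FAILED || (fd = mkstemp(tmpl)) < 0)
        return 2;
    if (write(fd, "0123456789", 10) != 10)
        return 2;

    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = (char *) (cb + 1);    /* buffer follows the aiocb in the mapping */
    cb->aio_nbytes = 10;
    cb->aio_offset = 0;

    if (aio_read(cb) != 0)              /* parent issues the asynchronous read */
        return 2;

    child = fork();
    if (child == 0)
    {
        /* child: poll completion of an aio it did not issue itself */
        int rc;

        while ((rc = aio_error(cb)) == EINPROGRESS)
            usleep(1000);
        _exit(rc == 0 ? 0 : 1);
    }

    waitpid(child, &status, 0);
    unlink(tmpl);
    printf("cross-process poll of aio_error(): %s\n",
           (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? "saw completion" : "failed");
    return 0;
}

On glibc/librt this works because the worker thread simply stores the completion status into the (shared) aiocb; on an implementation that keeps that state private to the issuing process, the child would never see completion, which is exactly the situation a configure-time test is meant to catch (presumably with some bound on how long it waits).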
On Sat, May 31, 2014 at 9:44 PM, johnlumby <johnlumby@hotmail.com> wrote:
I'll try to do some measuring of performance with:
a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
a-c will be the fastest, d might take some while.
I'll let you know of the results as I get them.
Claudio, I am not quite sure if what I am submitting now is
quite the same as any of yours. As I promised before, but have
not yet done, I will package one or two of my benchmarks and
send them in.
It's a tad different. c will not do polling on the foreign process, I
will just let PG do the read again. d will be like polling, but
without the associated CPU overhead.
On 06/01/2014 03:44 AM, johnlumby wrote:
If you look at the new patch, you'll see that for the different-pid case,
I still call aio_suspend with a timeout.
As you or Claudio pointed out earlier, it could just as well sleep
for the same timeout,
but the small advantage of calling aio_suspend is if the io completed
just between
the aio_error returning EINPROGRESS and the aio_suspend call.
Also it makes the code simpler. In fact this change is quite small,
just a few lines
in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c
Based on this, I think it is not necessary to get rid of the polling
altogether
(and in any case, as far as I can see, very difficult).
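Condensed, the different-pid wait described here amounts to something like the sketch below. This is not the actual buf_async.c code; the function name, the 100-microsecond timeout and the handling of the return code are illustrative only.

/* sketch of the "poll with aio_suspend timeout" pattern described above */
#include <aio.h>
#include <errno.h>
#include <time.h>

/* Wait for an aio_read issued -- possibly by another backend -- on an aiocb
 * kept in shared memory.  Returns 0 on success, else the errno of the read. */
static int
wait_foreign_aio(const struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };
    struct timespec     tmout = { 0, 100 * 1000 };  /* 100 usec per poll */
    int                 rc;

    while ((rc = aio_error(cb)) == EINPROGRESS)
    {
        /* If the io completed between the aio_error() above and this call,
         * aio_suspend() returns at once; otherwise it sleeps for at most
         * tmout.  Called from a process other than the issuer this is in
         * effect a timed poll, which is the behaviour being debated. */
        (void) aio_suspend(list, 1, &tmout);
    }
    return rc;
}

The aio_suspend() call here behaves like a bounded sleep, but returns immediately if the read completed between the aio_error() check and the call, which is the small advantage referred to above.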
That's still just as wrong as it always has been. Just get rid of it.
Don't put aiocb structs in shared memory at all. They don't belong there.
Well, as mentioned earlier, it is not broken. Whether it is
efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can
verify that.
I don't see the point of pursuing this design further. Surely we don't want
to use polling here, and you're relying on undefined behavior anyway. I'm
pretty sure aio_return/aio_error won't work from a different process on all
platforms, even if it happens to work on Linux. Even on Linux, it might stop
working if the underlying implementation changes from the glibc pthread
emulation to something kernel-based.
Good point. I have included the guts of your little test program
(modified to do polling) into the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answer your concern about robustness,
as at least now if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.
No, that does not make it robust enough.
- Heikki
updated version of patch compatible with git head of 140608,
(adjusted proc oid and a couple of minor fixes)
Attachment: postgresql-9.4.140608.async_io_prefetching.patch (text/x-patch)
--- configure.in.orig 2014-06-08 11:26:27.000000000 -0400
+++ configure.in 2014-06-08 21:59:36.140095486 -0400
@@ -1771,6 +1771,12 @@ operating system; use --disable-thread-
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+ AC_DEFINE(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP, 1, [Define to select librt-style async io and the gcc atomic compare_and_swap.])
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
--- contrib/pg_prewarm/pg_prewarm.c.orig 2014-06-08 11:26:27.000000000 -0400
+++ contrib/pg_prewarm/pg_prewarm.c 2014-06-08 21:59:36.248095674 -0400
@@ -159,7 +159,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
*/
for (block = first_block; block <= last_block; ++block)
{
- PrefetchBuffer(rel, forkNumber, block);
+ PrefetchBuffer(rel, forkNumber, block, 0);
++blocks_done;
}
#else
--- contrib/pg_stat_statements/pg_stat_statements--1.3.sql.orig 2014-06-08 17:57:22.088976836 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.3.sql 2014-06-08 21:59:36.272095715 -0400
@@ -0,0 +1,52 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_stat_statements VERSION '1.3'" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_stat_statements_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+-- Register a view on the function for ease of use.
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
+
+-- Don't want this to be available to non-superusers.
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
--- contrib/pg_stat_statements/Makefile.orig 2014-06-08 11:26:27.000000000 -0400
+++ contrib/pg_stat_statements/Makefile 2014-06-08 21:59:36.292095750 -0400
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
OBJS = pg_stat_statements.o
EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
+DATA = pg_stat_statements--1.3.sql pg_stat_statements--1.2--1.3.sql \
+ pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
ifdef USE_PGXS
--- contrib/pg_stat_statements/pg_stat_statements.c.orig 2014-06-08 11:26:27.000000000 -0400
+++ contrib/pg_stat_statements/pg_stat_statements.c 2014-06-08 21:59:36.324095805 -0400
@@ -117,6 +117,7 @@ typedef enum pgssVersion
PGSS_V1_0 = 0,
PGSS_V1_1,
PGSS_V1_2
+ ,PGSS_V1_3
} pgssVersion;
/*
@@ -148,6 +149,16 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ int64 aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb control block */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which in-progress aio cancelled and disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -275,6 +286,7 @@ void _PG_fini(void);
PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
PG_FUNCTION_INFO_V1(pg_stat_statements);
static void pgss_shmem_startup(void);
@@ -1026,7 +1038,25 @@ pgss_ProcessUtility(Node *parsetree, con
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage_start.aio_read_noneed;
+ bufusage.aio_read_discrd =
+ pgBufferUsage.aio_read_discrd - bufusage_start.aio_read_discrd;
+ bufusage.aio_read_forgot =
+ pgBufferUsage.aio_read_forgot - bufusage_start.aio_read_forgot;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage_start.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage_start.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage_start.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage_start.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage_start.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -1041,6 +1071,7 @@ pgss_ProcessUtility(Node *parsetree, con
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1224,6 +1255,16 @@ pgss_store(const char *query, uint32 que
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_discrd += bufusage->aio_read_discrd;
+ e->counters.aio_read_forgot += bufusage->aio_read_forgot;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1257,7 +1298,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
#define PG_STAT_STATEMENTS_COLS_V1_0 14
#define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS_V1_2 19
-#define PG_STAT_STATEMENTS_COLS 19 /* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_3 27
+#define PG_STAT_STATEMENTS_COLS 27 /* maximum of above */
/*
* Retrieve statement statistics.
@@ -1270,6 +1312,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
* function. Unfortunately we weren't bright enough to do that for 1.1.
*/
Datum
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+{
+ bool showtext = PG_GETARG_BOOL(0);
+
+ pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
+
+ return (Datum) 0;
+}
+
+Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
bool showtext = PG_GETARG_BOOL(0);
@@ -1358,6 +1410,10 @@ pg_stat_statements_internal(FunctionCall
if (api_version != PGSS_V1_2)
elog(ERROR, "incorrect number of output arguments");
break;
+ case PG_STAT_STATEMENTS_COLS_V1_3:
+ if (api_version != PGSS_V1_3)
+ elog(ERROR, "incorrect number of output arguments");
+ break;
default:
elog(ERROR, "incorrect number of output arguments");
}
@@ -1534,11 +1590,24 @@ pg_stat_statements_internal(FunctionCall
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
+
+ if (api_version >= PGSS_V1_3)
+ {
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_discrd);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_forgot);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+ }
}
Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
+ api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
-1 /* fail if you forget to update this assert */ ));
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql.orig 2014-06-08 17:57:22.088976836 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql 2014-06-08 21:59:36.348095847 -0400
@@ -0,0 +1,51 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
--- postgresql-prefetching-asyncio.README.orig 2014-06-08 17:57:22.088976836 -0400
+++ postgresql-prefetching-asyncio.README 2014-06-08 21:59:36.372095888 -0400
@@ -0,0 +1,544 @@
+Postgresql -- Extended Prefetching using Asynchronous IO
+============================================================
+
+Postgresql currently (9.3.4) provides a limited prefetching capability
+using posix_fadvise to give hints to the Operating System kernel
+about which pages it expects to read in the near future.
+This capability is used only during the heap-scan phase of bitmap-index scans.
+It is controlled via the effective_io_concurrency configuration parameter.
+
+This capability is now extended in two ways :
+ . use asynchronous IO into Postgresql shared buffers as an
+ alternative to posix_fadvise
+ . Implement prefetching in other types of scan :
+ . non-bitmap (i.e. simple) index scans - index pages
+ currently only for B-tree indexes.
+ (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
+ . non-bitmap (i.e. simple) index scans - heap pages
+ currently only for B-tree indexes.
+ . simple heap scans
+
+Posix asynchronous IO is chosen as the function library for asynchronous IO,
+since this is well supported and also fits very well with the model of
+the prefetching process, particularly as regards checking for completion
+of an asynchronous read. On linux, Posix asynchronous IO is provided
+in the librt library. librt uses independently-schedulable threads to
+achieve the asynchronicity, rather than kernel functionality.
+
+In this implementation, use of asynchronous IO is limited to prefetching
+while performing one of the three types of scan
+ . B-tree bitmap index scan - heap pages (as already exists)
+ . B-tree non-bitmap (i.e. simple) index scans - index and heap pages
+ . simple heap scans
+on permanent relations. It is not used on temporary tables nor for writes.
+
+The advantages of Posix asynchronous IO into shared buffers
+compared to posix_fadvise are :
+ . Beneficial for non-sequential access patterns as well as sequential
+ . No restriction on the kinds of IO which can be used
+ (other kinds of asynchronous IO impose restrictions such as
+ buffer alignment, use of non-buffered IO).
+ . Does not interfere with standard linux kernel read-ahead functionality.
+ (It has been stated in
+ www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
+ that :
+ "the kernel stops doing read-ahead when a call to posix_fadvise comes.
+ I noticed the performance hit, and checked the kernel's code.
+ It effectively changes the prediction mode from sequential to fadvise,
+ negating the (assumed) kernel's prefetch logic")
+ . When the read request is issued after a prefetch has completed,
+ no delay associated with a kernel call to copy the page from
+ kernel page buffers into the Postgresql shared buffer,
+ since it is already there.
+ Also, in a memory-constrained environment, there is a greater
+ probability that the prefetched page will "stick" in memory
+ since the linux kernel victimizes the filesystem page cache in preference
+ to swapping out user process pages.
+ . Statistics on prefetch success can be gathered (see "Statistics" below)
+ which helps the administrator to tune the prefetching settings.
+
+These benefits are most likely to be obtained in a system whose usage profile
+(e.g. from iostat) shows:
+ . high IO wait from mostly-read activity
+ . disk access pattern is not entirely sequential
+ (so kernel readahead can't predict it but postgresql can)
+ . sufficient spare idle CPU to run the librt pthreads
+ or, stated another way, the CPU subsystem is relatively powerful
+ compared to the disk subsystem.
+In such ideal conditions, and with a workload with plenty of index scans,
+around 10% - 20% improvement in throughput has been achieved.
+In an admittedly extreme environment measured by this author, with a workload
+consisting of 8 client applications each running similar complex queries
+(same query structure but different predicates and constants),
+including 2 Bitmap Index Scans and 17 non-bitmap index scans,
+on a dual-core Intel laptop (4 hyperthreads) with the database on a single
+USB3-attached 500GB disk drive, and no part of the database in filesystem buffers
+initially, (filesystem freshly mounted), comparing unpatched build
+using posix_fadvise with effective_io_concurrency 4 against same build patched
+with async IO and effective_io_concurrency 4 and max_async_io_prefetchers 32,
+elapse time repeatably improved from around 640-670 seconds to around 530-550 seconds,
+a 17% - 18% improvement.
+
+The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
+ . probably higher CPU utilization:
+ Firstly, the extra work performed by the librt threads adds CPU
+ overhead, and secondly, if the asynchronous prefetching is effective,
+ then it will deliver better (greater) overlap of CPU with IO, which
+ will reduce elapsed times and hence increase CPU utilization percentage
+ still more (during that shorter elapsed time).
+ . more context switching, because of the additional threads.
+
+
+Statistics:
+___________
+
+A number of additional statistics relating to effectiveness of asynchronous IO
+are provided as an extension of the existing pg_stat_statements loadable module.
+Refer to the appendix "Additional Supplied Modules" in the current
+PostgreSQL Documentation for details of this module.
+
+The following additional statistics are provided for asynchronous IO prefetching:
+
+ . aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
+ . aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
+ . aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
+ . aio_read_noblok : number of prefetches for which no available BufferAiocb control block
+ . aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
+ . aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
+ . aio_read_waited : number of aio reads for which disk block used but had to wait for it
+ . aio_read_ontime : number of aio reads for which disk block used and ready on time when requested
+
+Some of these are (hopefully) self-explanatory. Some additional notes:
+
+ . aio_read_discrd and aio_read_forgot :
+ prefetch was wasted work since the buffer was not subsequently read
+ The discrd case indicates that the scanner realized this and discarded the buffer,
+ whereas the forgot case indicates that the scanner did not realize it,
+ which should not normally occur.
+ A high number in either suggests lowering effective_io_concurrency.
+
+ . aio_read_noblok :
+ Any significant number in relation to all the other numbers indicates that
+ max_async_io_prefetchers should be increased.
+
+ . aio_read_waited :
+ The page was prefetched but the asynchronous read had not completed by the time the
+ scanner requested to read it. This causes extra overhead in waiting and indicates
+ prefetching is not providing much if any benefit.
+ The disk subsystem may be underpowered/overloaded in relation to the available CPU power.
+
+ . aio_read_ontime :
+ The page was prefetched and the asynchronous read had completed by the time the
+ scanner requested to read it. Optimal behaviour. If this number is large
+ in relation to all the other numbers except (possibly) aio_read_noneed,
+ then prefetching is working well.
+
+To create the extension with support for these additional statistics, use the following syntax:
+ CREATE EXTENSION pg_stat_statements VERSION '1.3'
+or, if you run the new code against an existing database which already has the extension
+( see installation and migration below ), you can
+ ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'
+
+A suggested set of commands for displaying these statistics might be :
+
+ /* OPTIONALLY */ DROP extension pg_stat_statements;
+ CREATE extension pg_stat_statements VERSION '1.3';
+ /* run your workload */
+ select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
+ , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
+ from pg_stat_statements where shared_blks_read > 0;
+
+
+Installation and Build Configuration:
+_____________________________________
+
+1. First - a prerequisite:
+# as well as requiring all the usual package build tools such as gcc , make etc,
+# as described in the instructions for building postgresql,
+# the following is required :
+ gnu autoconf at version 2.69 :
+# run the following command
+autoconf -V
+# it *must* return
+autoconf (GNU Autoconf) 2.69
+
+2. If you don't have it or it is a different version,
+then you must obtain version 2.69 (which is the current version)
+from your distribution provider or from the gnu software download site.
+
+3. Also you must have the source tree for postgresql version 9.4 (development version).
+# all the following commands assume your current working directory is the top of the source tree.
+
+4. cd to top of source tree :
+# check it appears to be a postgresql source tree
+ls -ld configure.in src
+# should show both the file and the directory
+grep PostgreSQL COPYRIGHT
+# should show PostgreSQL Database Management System
+
+5. Apply the patch :
+patch -b -p0 -i <patch_file_path>
+# should report no errors, 45 files patched (see list at bottom of this README)
+# and all hunks applied
+# check the patch was applied to configure.in
+ls -ld configure.in.orig configure.in
+# should show both files
+
+6. Rebuild the configure script with the patched configure.in :
+mv configure configure.orig;
+autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
+ls -lrt configure.orig configure;
+
+7. run the new configure script :
+# if you have run configure before,
+# then you may first want to save existing config.status and config.log if they exist,
+# and then specify same configure flags and options as you specified before.
+# the patch does not alter or extend the set of configure options
+# if unsure, run ./configure --help
+# if still unsure, run ./configure
+./configure <other configure options as desired>
+
+
+
+8. now check that configure decided that this environment supports asynchronous IO :
+grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
+# it should show
+#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
+# if not, apparently your environment does not support asynch IO -
+# the config.log will show how it came to that conclusion,
+# also check for :
+# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
+# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
+# do not proceed without this define being set.
+
+9. do you want to use the new code on an existing cluster
+ that was created using the same code base but without the patch?
+ If so then run this nasty-looking command :
+ (cut-and-paste it into a terminal window or a shell-script file)
+ Otherwise continue to step 10.
+ see Migration note below for explanation.
+###############################################################################################
+ fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
+ bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
+ }; done;
+ if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
+ elif [[ -a $fl ]]; then {
+ mv $fl $bkfl && {
+ sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
+ str="diff -w $bkfl $fl";echo "$str"; eval "$str";
+ };
+ };
+ else echo "ooopppss $fl is missing";
+ fi;
+###############################################################################################
+# it should report something like
+diff -w Makefile.global.bak0 Makefile.global
+222c222
+< CFLAGS = XXXX
+---
+> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+# where XXXX is some set of flags
+
+
+10. now run the rest of the build process as usual -
+ follow instructions in file INSTALL if that file exists,
+ else e.g. run
+make && make install
+
+If the build fails with the following error:
+undefined reference to `aio_init'
+Then edit the following file
+src/include/pg_config_manual.h
+and add the following line at the bottom:
+
+#define DONT_HAVE_AIO_INIT
+
+and then run
+make clean && make && make install
+See notes to section Runtime Configuration below for more information on this.
+
+
+
+Migration , Runtime Configuration, and Use:
+___________________________________________
+
+
+Database Migration:
+___________________
+
+The new prefetching code for non-bitmap index scans introduces a new btree-index
+function named btpeeknexttuple. The correct way to add such a function involves
+also adding it to the catalog as an internal function in pg_proc.
+However, this results in the new built code considering an existing database to be
+incompatible, i.e requiring backup on the old code and restore on the new.
+This is normal behaviour for migration to a new version of postgresql, and is
+also a valid way of migrating a database for use with this asynchronous IO feature,
+but in this case it may be inconvenient.
+
+As an alternative, the new code may be compiled with the macro define
+AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+which does what it says by not altering the catalog. The patched build can then
+be run against an existing database cluster initdb'd using the unpatched build.
+
+There are no known ill-effects of so doing, but :
+ . in any case, it is strongly suggested to make a backup of any precious database
+ before accessing it with a patched build
+ . be aware that if this asynchronous IO feature is eventually released as part of postgresql,
+ migration will probably be required anyway.
+
+This option to avoid catalog migration is intended as a convenience for a quick test,
+and also makes it easier to obtain performance comparisons on the same database.
+
+
+
+Runtime Configuration:
+______________________
+
+One new configuration parameter settable in postgresql.conf and
+in any other way as described in the postgresql documentation :
+
+max_async_io_prefetchers
+ Maximum number of background processes concurrently using asynchronous
+ librt threads to prefetch pages into shared memory buffers
+
+This number can be thought of as the maximum number
+of librt threads concurrently active, each working on a list of
+from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).
+
+In practice, this number simply controls how many prefetch requests in total
+may be active concurrently :
+ max_async_io_prefetchers * target_prefetch_pages ( see note 1)
+
+default is max_connections/6
+and recall that the default for max_connections is 100
+
+
+note 1 a number based on effective_io_concurrency and approximately n * ln(n)
+ where n is effective_io_concurrency
+
+note 2 Provided that the gnu extension to Posix AIO which provides the
+aio_init() function is present, then aio_init() is called
+to set the librt maximum number of threads to max_async_io_prefetchers,
+and to set the maximum number of concurrent aio read requests to the product of
+ max_async_io_prefetchers * target_prefetch_pages
+
+
+As well as this regular configuration parameter,
+there are several other parameters that can be set via environment variable.
+The reason why they are environment vars rather than regular configuration parameters
+is that it is not expected that they should need to be set, but they may be useful :
+ variable name values default meaning
+ PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
+ PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
+ also numeric size of list of prefetched blocks
+ also whether to prefetch forward-sequential-pattern index pages
+ PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
+ PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans
+
+
+The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
+It can be set to Y or N to control prefetching of non-bitmap index scans;
+but in addition it can be set to an integer, which both implies Y
+and also sets the size of a list used to remember prefetched but unread heap pages.
+This list is an optimization used to avoid re-prefetching and to maximise the potential
+set of prefetchable blocks indexed by one index page.
+If set to an integer, the integer may be followed by either ,Y or ,N
+to specify whether to prefetch index pages which are being accessed forward-sequentially.
+Prefetching has been found to be of little benefit for this access pattern,
+so it is not the default, but it also does no harm (provided there is sufficient CPU capacity).
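+
+For example (values purely illustrative), the following could be placed in the
+environment of the postmaster before starting it, e.g. exported from the shell
+which runs pg_ctl start :
+
+    PG_TRY_PREFETCHING_FOR_ISCAN=512,N
+    PG_TRY_PREFETCHING_FOR_HEAP=Y
+
+This asks for prefetching of non-bitmap index scans with a 512-entry list of
+prefetched-but-unread heap blocks, no prefetching of forward-sequentially
+accessed index pages, and prefetching of plain (un-indexed) heap scans as well.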
+
+
+
+Usage :
+______
+
+
+There are no changes in usage other than as noted under Configuration and Statistics.
+However, in order to assess benefit from this feature, it will be useful to
+understand the query access plans of your workload using EXPLAIN. Before doing that,
+make sure that statistics are up to date using ANALYZE.
+
+
+
+Internals:
+__________
+
+
+Internal changes span two areas and the interface between them :
+
+ . buffer manager layer
+ . programming interface for scanner to call buffer manager
+ . scanner layer
+
+ . buffer manager layer
+ ____________________
+
+ changes comprise :
+ . allocating, pinning , unpinning buffers
+ this is complex and discussed briefly below in "Buffer Management"
+ . acquiring and releasing a BufferAiocb, the control block
+ associated with a single aio_read, and checking for its completion
+ a new file, backend/storage/buffer/buf_async.c, provides three new functions,
+ BufStartAsync BufReleaseAsync BufCheckAsync
+ which handle this.
+ . calling librt asynch io functions
+ this follows the example of all other filesystem interfaces
+ and is straightforward.
+ two new functions are provided in fd.c:
+ FileStartaio FileCompleteaio
+ and corresponding interfaces in smgr.c
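+
+       For reference, the declarations of two of these functions as they appear
+       in the bufmgr.c portion of the patch below (BufReleaseAsync is not
+       visible there and so is not reproduced here) :
+
+       /* start an asynchronous read of one block into a shared buffer;
+       ** returns 0 if started successfully (buffer newly pinned),
+       ** -1 if it failed, or 1+PrivateRefCount if the block was
+       ** already present in the buffer pool
+       */
+       extern int BufStartAsync(Relation reln, ForkNumber forkNum,
+                                BlockNumber blockNum, BufferAccessStrategy strategy);
+
+       /* check, complete or cancel an asynchronous read on a buffer,
+       ** according to the caller's stated intention
+       */
+       extern int BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln,
+                                volatile BufferDesc *buf_desc, int intention,
+                                BufferAccessStrategy strategy, int index_for_aio,
+                                bool spinLockHeld, LWLockId PartitionLock);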
+
+ . programming interface for scanner to call buffer manager
+ ________________________________________________________
+ . calling interface for existing function PrefetchBuffer is modified :
+ . one new argument, BufferAccessStrategy strategy
+ . now returns an int return code which indicates :
+ whether pin count on buffer has been increased by 1
+ whether block was already present in a buffer
+ . new function DiscardBuffer
+ . discard buffer used for a previously prefetched page
+ which scanner decides it does not want to read.
+ . same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
+          . note - this is different from the existing function ReleaseBuffer,
+            which takes as its argument a buffer which has already been read;
+            DiscardBuffer has a similar purpose but names the page by relation, fork and block number.
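+
+          As prototypes (inferred from this README and the bufmgr.c portion of
+          the patch below; DiscardBuffer's return type is assumed here to be void) :
+
+          /* returns a bitmask of PREFTCHRC_BUF_PIN_INCREASED (pin count on
+          ** the buffer was increased by 1) and PREFTCHRC_BLK_ALREADY_PRESENT
+          ** (block was already present in a buffer)
+          */
+          extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+                                    BlockNumber blockNum, BufferAccessStrategy strategy);
+
+          /* discard the buffer used for a previously prefetched page which
+          ** the scanner decides it does not want to read
+          */
+          extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
+                                    BlockNumber blockNum);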
+
+ . scanner layer
+ _____________
+    common to all scanners is that a scanner which wishes to prefetch must do the following (a sketch of the pattern appears at the end of this section):
+ . decide which pages to prefetch and call PrefetchBuffer to prefetch them
+ nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
+ . remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
+ removing each page from this list if and when it subsequently reads the page.
+ . at end of scan, call DiscardBuffer for every remembered (i.e. prefetched not unread) page
+ how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
+ . bitmap index scan - heap pages
+ . non-bitmap (i.e. simple) index scans - index pages
+ . non-bitmap (i.e. simple) index scans - heap pages
+ . simple heap scans
+ The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
+ . counted in aio_read_forgot (see "Statistics" above)
+ . may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
+ (the buffer is released at commit)
+ This does sometimes happen ...
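+
+    A minimal sketch of that pattern for a hypothetical scanner (the names
+    rel, strategy, next_blkno, pfch_blocks[] and npfch are invented for this
+    example; the real implementations in the patch differ in detail for each
+    scan type) :
+
+    BlockNumber pfch_blocks[256];   /* prefetched but not yet read */
+    int         npfch = 0;
+    int         i, rc;
+
+    /* while scanning : prefetch ahead of the blocks about to be needed */
+    rc = PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno, strategy);
+    if (rc & PREFTCHRC_BUF_PIN_INCREASED)
+        pfch_blocks[npfch++] = next_blkno;      /* remember it */
+
+    /* when a remembered block is actually read with ReadBuffer,
+    ** remove it from pfch_blocks[]
+    */
+
+    /* at end of scan : discard anything prefetched but never read,
+    ** releasing the pin taken by PrefetchBuffer
+    */
+    for (i = 0; i < npfch; i++)
+        DiscardBuffer(rel, MAIN_FORKNUM, pfch_blocks[i]);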
+
+
+
+Buffer Management
+_________________
+
+With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
+but every other part of the buffer manager must also allow for the possibility that a buffer may be in
+an async-io-in-progress state and be prepared to determine whether that IO has completed.
+That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
+Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
+namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
+and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+a separate set of shared control blocks, the BufferAiocb list -
+refer to include/storage/buf_internals.h
+Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
+Read the commentary for this function for more details.
+
+Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
+and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
+When a backend BK2 requests ReadBuffer of a page for which an asynch read is in progress,
+the buffer manager has to determine which backend BK1 pinned this buffer during the previous PrefetchBuffer,
+and, for example, must not pin the buffer a second time if BK2 is in fact BK1.
+Information concerning which backend initiated the prefetch is held in the BufferAiocb.
+
+The trickiest case concerns the scenario in which :
+ . BK1 initiates prefetch and acquires a pin
+ . BK2 possibly waits for completion and then reads the buffer, and perhaps later on
+ releases it by ReleaseBuffer.
+ . Since the asynchronous IO is no longer in progress, there is no longer any
+ BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
+ "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
+   . The solution to this problem is to introduce the concept of a "banked" pin,
+     which is a pin obtained when the prefetch was issued, identified as being in "banked" status only if and when
+     the associated asynchronous IO terminates, and redeemable by the next use by the same task,
+     either by ReadBuffer or DiscardBuffer.
+ The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
+ is stored in the buffer descriptor.
+ This is done without increasing size of the buffer descriptor, which is important since
+ there may be a very large number of these. This does overload the relevant field in the descriptor.
+ Refer to include/storage/buf_internals.h for more details
+ and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
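+
+  As a simplified illustration of how a banked pin is recognised and redeemed
+  (condensed from the PinBuffer changes in the bufmgr.c portion of the patch
+  below; lookup_pid_of_aio() stands in for the real BufferAiocb lookup and is
+  not an actual function in the patch) :
+
+  /* the pin is banked and redeemable by this backend if the flag is set
+  ** and the pid recorded for it - in the BufferAiocb while the aio is
+  ** still in progress, otherwise in the overloaded freeNext field -
+  ** is our own pid
+  */
+  banked_by_me = (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+              && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+                       ? lookup_pid_of_aio(buf)
+                       : (-(buf->freeNext)) ) == this_backend_pid );
+  if (banked_by_me)
+  {
+      /* redeem it : clear the flag and do not take a second pin */
+      buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+  }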
+
+______________________________________________________________________________
+The following 45 files are changed in this feature (output of the patch command) :
+
+patching file configure.in
+patching file contrib/pg_prewarm/pg_prewarm.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
+patching file contrib/pg_stat_statements/Makefile
+patching file contrib/pg_stat_statements/pg_stat_statements.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
+patching file postgresql-prefetching-asyncio.README
+patching file config/c-library.m4
+patching file src/backend/postmaster/postmaster.c
+patching file src/backend/executor/nodeBitmapHeapscan.c
+patching file src/backend/executor/nodeIndexscan.c
+patching file src/backend/executor/instrument.c
+patching file src/backend/storage/buffer/Makefile
+patching file src/backend/storage/buffer/bufmgr.c
+patching file src/backend/storage/buffer/buf_async.c
+patching file src/backend/storage/buffer/buf_init.c
+patching file src/backend/storage/smgr/md.c
+patching file src/backend/storage/smgr/smgr.c
+patching file src/backend/storage/file/fd.c
+patching file src/backend/storage/lmgr/proc.c
+patching file src/backend/access/heap/heapam.c
+patching file src/backend/access/heap/syncscan.c
+patching file src/backend/access/index/indexam.c
+patching file src/backend/access/index/genam.c
+patching file src/backend/access/nbtree/nbtsearch.c
+patching file src/backend/access/nbtree/nbtinsert.c
+patching file src/backend/access/nbtree/nbtpage.c
+patching file src/backend/access/nbtree/nbtree.c
+patching file src/backend/nodes/tidbitmap.c
+patching file src/backend/utils/misc/guc.c
+patching file src/backend/utils/mmgr/aset.c
+patching file src/include/executor/instrument.h
+patching file src/include/storage/bufmgr.h
+patching file src/include/storage/smgr.h
+patching file src/include/storage/fd.h
+patching file src/include/storage/buf_internals.h
+patching file src/include/catalog/pg_am.h
+patching file src/include/catalog/pg_proc.h
+patching file src/include/pg_config_manual.h
+patching file src/include/access/nbtree.h
+patching file src/include/access/heapam.h
+patching file src/include/access/relscan.h
+patching file src/include/nodes/tidbitmap.h
+patching file src/include/utils/rel.h
+patching file src/include/pg_config.h.in
+
+
+Future Possibilities:
+____________________
+
+There are several possible extensions of this feature :
+ . Extend prefetching of index scans to types of index
+ other than B-tree.
+ This should be fairly straightforward, but requires some
+ good base of benchmarkable workloads to prove the value.
+ . Investigate why asynchronous IO prefetching does not greatly
+ improve sequential relation heap scans and possibly find how to
+ achieve a benefit.
+  . Build knowledge of asynchronous IO prefetching into the
+    Query Planner costing.
+    This is far from straightforward.  The PostgreSQL Query Planner's
+    costing model is based on resource consumption rather than elapsed time.
+    Use of asynchronous IO prefetching is intended to improve elapsed time
+    at the expense of (probably) higher resource consumption.
+ Although Costing understands about the reduced cost of reading buffered
+ blocks, it does not take asynchronicity or overlap of CPU with disk
+ into account. A naive approach might be to try to tweak the Query
+ Planner's Cost Constant configuration parameters
+ such as seq_page_cost , random_page_cost
+ but this is hazardous as explained in the Documentation.
+
+
+
+John Lumby, johnlumby(at)hotmail(dot)com
--- config/c-library.m4.orig 2014-06-08 11:26:27.000000000 -0400
+++ config/c-library.m4 2014-06-08 21:59:36.448096020 -0400
@@ -367,3 +367,152 @@ if test "$pgac_cv_type_locale_t" = 'yes
AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
[Define to 1 if `locale_t' requires <xlocale.h>.])
fi])])# PGAC_HEADER_XLOCALE
+
+
+# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+# ---------------------------------------
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of both,
+# including verifying that aio_error can retrieve completion status
+# of aio_read issued by a different process
+#
+AC_DEFUN([PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP],
+[AC_MSG_CHECKING([whether have both librt-style async io and the gcc atomic compare_and_swap])
+AC_CACHE_VAL(pgac_cv_aio_atomic_builtin_comp_swap,
+pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+[AC_TRY_RUN([#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include "aio.h"
+#include <errno.h>
+
+char *shmem;
+
+/* returns rc of aio_read or -1 if some error */
+int
+processA(void)
+{
+ int fd , rc;
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ char *buf = shmem + sizeof(struct aiocb);
+
+ rc = fd = open("configure", O_RDONLY );
+ if (fd != -1) {
+
+ memset(aiocbp, 0, sizeof(struct aiocb));
+ aiocbp->aio_fildes = fd;
+ aiocbp->aio_offset = 0;
+ aiocbp->aio_buf = buf;
+ aiocbp->aio_nbytes = 8;
+ aiocbp->aio_reqprio = 0;
+ aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+ rc = aio_read(aiocbp);
+ }
+ return rc;
+}
+
+/* returns result of aio_error - 0 if io completed successfully */
+int
+processB(void)
+{
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ const struct aiocb * const pl[1] = { aiocbp };
+ int rv;
+ int returnCode;
+ struct timespec my_timeout = { 0 , 10000 };
+ int max_iters , max_polls;
+
+ rv = aio_error(aiocbp);
+ max_iters = 100;
+ while ( (max_iters-- > 0) && (rv == EINPROGRESS) ) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ while ((returnCode < 0) && (EAGAIN == errno) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ }
+ rv = aio_error(aiocbp);
+ }
+
+ return rv;
+}
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ int pidB;
+ int child_status;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* first test -- __sync_bool_compare_and_swap
+ ** set zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ if (rc == 0) {
+ /* second test -- process A start aio_read
+ ** and process B checks completion by polling
+ */
+ rc = -1; /* pessimistic */
+
+ shmem = mmap(NULL, sizeof(struct aiocb) + 2048,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
+ -1, 0);
+ if (shmem != MAP_FAILED) {
+
+ /*
+ * Start the I/O request in parent process, then fork and try to wait
+ * for it to finish from the child process.
+ */
+ rc = processA();
+ if (rc >= 0) {
+
+ rc = pidB = fork();
+ if (pidB != -1) {
+ if (pidB != 0) {
+ /* parent */
+ wait (&child_status);
+ if (WIFEXITED(child_status)) {
+ rc = WEXITSTATUS(child_status);
+ }
+ } else {
+ /* child */
+ rc = processB();
+ exit(rc);
+ }
+ }
+ }
+ }
+ }
+
+ return rc;
+}],
+[pgac_cv_aio_atomic_builtin_comp_swap=yes],
+[pgac_cv_aio_atomic_builtin_comp_swap=no],
+[pgac_cv_aio_atomic_builtin_comp_swap=cross])
+])dnl AC_CACHE_VAL
+AC_MSG_RESULT([$pgac_cv_aio_atomic_builtin_comp_swap])
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+])# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
--- src/backend/postmaster/postmaster.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/postmaster/postmaster.c 2014-06-08 21:59:36.532096166 -0400
@@ -123,6 +123,11 @@
#include "storage/spin.h"
#endif
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+void ReportFreeBAiocbs(void);
+int CountInuseBAiocbs(void);
+extern int hwmBufferAiocbs; /* high water mark of in-use BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Possible types of a backend. Beyond being the possible bkend_type values in
@@ -1493,9 +1498,15 @@ ServerLoop(void)
fd_set readmask;
int nSockets;
time_t now,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time,
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
last_touch_time;
last_touch_time = time(NULL);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time = time(NULL);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
nSockets = initMasks(&readmask);
@@ -1654,6 +1665,19 @@ ServerLoop(void)
last_touch_time = now;
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* maintain the hwm of used baiocbs every 10 seconds */
+ if ((now - count_baiocb_time) >= 10)
+ {
+ int inuseBufferAiocbs; /* current in-use BufferAiocbs in pool */
+ inuseBufferAiocbs = CountInuseBAiocbs();
+ if (inuseBufferAiocbs > hwmBufferAiocbs) {
+ hwmBufferAiocbs = inuseBufferAiocbs;
+ }
+ count_baiocb_time = now;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
@@ -3444,6 +3468,9 @@ PostmasterStateMachine(void)
signal_child(PgStatPID, SIGQUIT);
}
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ReportFreeBAiocbs();
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
}
}
--- src/backend/executor/nodeBitmapHeapscan.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/executor/nodeBitmapHeapscan.c 2014-06-08 21:59:36.552096200 -0400
@@ -34,6 +34,8 @@
* ExecEndBitmapHeapScan releases all storage.
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "access/relscan.h"
#include "access/transam.h"
@@ -47,6 +49,10 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_bitmap_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
@@ -111,10 +117,21 @@ BitmapHeapNext(BitmapHeapScanState *node
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
- if (target_prefetch_pages > 0)
- {
+ if ( prefetch_bitmap_scans
+ && (target_prefetch_pages > 0)
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ /* sufficient number of blocks - at least twice the target_prefetch_pages */
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
node->prefetch_iterator = prefetch_iterator = tbm_begin_iterate(tbm);
node->prefetch_pages = 0;
+ if (prefetch_iterator) {
+ tbm_zero(prefetch_iterator); /* zero list of prefetched and unread blocknos */
+ }
node->prefetch_target = -1;
}
#endif /* USE_PREFETCH */
@@ -138,12 +155,14 @@ BitmapHeapNext(BitmapHeapScanState *node
}
#ifdef USE_PREFETCH
+ if (prefetch_iterator) {
if (node->prefetch_pages > 0)
{
/* The main iterator has closed the distance by one page */
node->prefetch_pages--;
+ tbm_subtract(prefetch_iterator, tbmres->blockno); /* remove this blockno from list of prefetched and unread blocknos */
}
- else if (prefetch_iterator)
+ else
{
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -151,6 +170,7 @@ BitmapHeapNext(BitmapHeapScanState *node
if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
+ }
#endif /* USE_PREFETCH */
/*
@@ -239,16 +259,26 @@ BitmapHeapNext(BitmapHeapScanState *node
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ int PrefetchBufferRc; /* return value from PrefetchBuffer - refer to bufmgr.h */
+
if (tbmpre == NULL)
{
/* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = prefetch_iterator = NULL;
+ /* let ExecEndBitmapHeapScan terminate the prefetch_iterator
+ ** tbm_end_iterate(prefetch_iterator);
+ ** node->prefetch_iterator = NULL;
+ */
+ prefetch_iterator = NULL;
break;
}
node->prefetch_pages++;
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno , 0);
+ /* add this blockno to list of prefetched and unread blocknos
+ ** if pin count did not increase then indicate so in the Unread_Pfetched list
+ */
+ tbm_add(prefetch_iterator
+ ,( (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) ? tbmpre->blockno : InvalidBlockNumber ) );
}
}
#endif /* USE_PREFETCH */
@@ -482,12 +512,31 @@ ExecEndBitmapHeapScan(BitmapHeapScanStat
{
Relation relation;
HeapScanDesc scanDesc;
+ TBMIterator *prefetch_iterator;
/*
* extract information from the node
*/
relation = node->ss.ss_currentRelation;
scanDesc = node->ss.ss_currentScanDesc;
+ prefetch_iterator = node->prefetch_iterator;
+
+#ifdef USE_PREFETCH
+ /* before any other cleanup, discard any prefetched but unread buffers */
+ if (prefetch_iterator != NULL) {
+ TBMIterateResult *tbmpre = tbm_locate_IterateResult(prefetch_iterator);
+ BlockNumber *Unread_Pfetched_base = tbmpre->Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = tbmpre->Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = tbmpre->Unread_Pfetched_count;
+
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scanDesc->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* Free the exprcontext
--- src/backend/executor/nodeIndexscan.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/executor/nodeIndexscan.c 2014-06-08 21:59:36.584096256 -0400
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,12 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+ /* note - at this point all scan controlblock resources have been freed by IndexScanEnd called by index_endscan */
+
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +619,33 @@ ExecInitIndexScan(IndexScan *node, EStat
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ indexstate->iss_ScanDesc->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_block_item_list = (struct pfch_block_item*)0;
+ if ( prefetch_index_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_index_page_list = palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ indexstate->iss_ScanDesc->pfch_block_item_list = palloc( prefetch_index_scans * sizeof(struct pfch_block_item) );
+ if ( ( (struct pfch_index_pagelist*)0 != indexstate->iss_ScanDesc->pfch_index_page_list )
+ && ( (struct pfch_block_item*)0 != indexstate->iss_ScanDesc->pfch_block_item_list )
+ ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = prefetch_index_scans; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_pagelist_next = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_item_count = 0;
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
--- src/backend/executor/instrument.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/executor/instrument.c 2014-06-08 21:59:36.620096318 -0400
@@ -41,6 +41,14 @@ InstrAlloc(int n, int instrument_options
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_discrd = 0;
+ instr[i].bufusage_start.aio_read_forgot = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -143,6 +151,16 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_discrd += add->aio_read_discrd - sub->aio_read_discrd;
+ dst->aio_read_forgot += add->aio_read_forgot - sub->aio_read_forgot;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
--- src/backend/storage/buffer/Makefile.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/buffer/Makefile 2014-06-08 21:59:36.652096374 -0400
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o buf_async.o
include $(top_srcdir)/src/backend/common.mk
--- src/backend/storage/buffer/bufmgr.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/buffer/bufmgr.c 2014-06-08 21:59:36.696096450 -0400
@@ -29,7 +29,7 @@
* buf_table.c -- manages the buffer lookup table
*/
#include "postgres.h"
-
+#include <sys/types.h>
#include <sys/file.h>
#include <unistd.h>
@@ -50,7 +50,6 @@
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
-
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -63,6 +62,11 @@
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
#define DROP_RELS_BSEARCH_THRESHOLD 20
/* GUC variables */
@@ -78,26 +82,33 @@ bool track_io_timing = false;
*/
int target_prefetch_pages = 0;
-/* local state for StartBufferIO and related functions */
+/* local state for StartBufferIO and related functions
+** but ONLY for synchronous IO - not altered for aio
+*/
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+pid_t this_backend_pid = 0; /* pid of this backend */
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
-
-static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+extern int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+extern int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc, int intention
+ ,BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
-static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
-static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+ bool *hit , int index_for_aio);
+bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+void PinBuffer_Locked(volatile BufferDesc *buf);
+void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
-static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+static bool StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio );
+void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -106,24 +117,66 @@ static volatile BufferDesc *BufferAlloc(
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
+ int *foundPtr , int index_for_aio );
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static int rnode_comparator(const void *p1, const void *p2);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
- * This is named by analogy to ReadBuffer but doesn't actually allocate a
- * buffer. Instead it tries to ensure that a future ReadBuffer for the given
- * block will not be delayed by the I/O. Prefetching is optional.
+ * This is named by analogy to ReadBuffer but allocates a buffer only if using asynchronous I/O.
+ * Its purpose is to try to ensure that a future ReadBuffer for the given block
+ * will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
- */
-void
-PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
-{
+ *
+ * Originally the prefetch simply called posix_fadvise() to recommend read-ahead into kernel page cache.
+ * Extended to provide an alternative of issuing an asynchronous aio_read() to read into a buffer.
+ * This extension has an implication on how this bufmgr component manages concurrent requests
+ * for the same disk block.
+ *
+ * Synchronous IO (read()) does not provide a means for waiting on another task's read if in progress,
+ * and bufmgr implements its own scheme in StartBufferIO, WaitIO, and TerminateBufferIO.
+ *
+ * Asynchronous IO (aio_read()) provides a means for waiting on this or another task's read if in progress,
+ * namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+ * are called as part of asynchronous prefetching, their role is limited to maintaining the buffer desc flags,
+ * and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+ * a separate set of shared control blocks, the BufferAiocb list -
+ * refer to include/storage/buf_internals.h and storage/buffer/buf_init.c
+ *
+ * Another implication of asynchronous IO concerns buffer pinning.
+ * The buffer used for the prefetch is pinned before aio_read is issued.
+ * It is expected that the same task (and possibly others) will later ask to read the page
+ * and eventually release and unpin the buffer.
+ * However, if the task which issued the aio_read later decides not to read the page,
+ * and return code indicates delta_pin_count > 0 (see below)
+ * it *must* instead issue a DiscardBuffer() (see function later in this file)
+ * so that its pin is released.
+ * Therefore, each client which uses the PrefetchBuffer service must either always read all
+ * prefetched pages, or keep track of prefetched pages and discard unread ones at end of scan.
+ *
+ * return code: is an int bitmask defined in bufmgr.h
+ PREFTCHRC_BUF_PIN_INCREASED 0x01 pin count on buffer has been increased by 1
+ PREFTCHRC_BLK_ALREADY_PRESENT 0x02 block was already present in a buffer
+ *
+ * PREFTCHRC_BLK_ALREADY_PRESENT is a hint to caller that the prefetch may be unnecessary
+ */
+int
+PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy)
+{
+ Buffer buf_id; /* indicates buffer containing the requested block */
+ int PrefetchBufferRc = 0; /* return value as described above */
+ int PinCountOnEntry = 0; /* pin count on entry */
+ int PinCountdelta = 0; /* pin count delta increase */
+
+
#ifdef USE_PREFETCH
+
+ buf_id = -1;
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
@@ -146,7 +199,12 @@ PrefetchBuffer(Relation reln, ForkNumber
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
+ int BufStartAsyncrc = -1; /* retcode from BufStartAsync :
+ ** 0 if started successfully (which implies buffer was newly pinned )
+ ** -1 if failed for some reason
+ ** 1+PrivateRefCount if we found desired buffer in buffer pool
+ ** and we set it likewise if we find buffer in buffer pool
+ */
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
@@ -158,28 +216,121 @@ PrefetchBuffer(Relation reln, ForkNumber
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ if (buf_id >= 0) {
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ BufStartAsyncrc = 1 + PinCountOnEntry; /* indicate this backends pin count - see above comment */
+ PrefetchBufferRc = PREFTCHRC_BLK_ALREADY_PRESENT; /* indicate buffer present */
+ } else {
+ PrefetchBufferRc = 0; /* indicate buffer not present */
+ }
LWLockRelease(newPartitionLock);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ not_in_buffers:
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (buf_id < 0) {
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* try using async aio_read with a buffer */
+ BufStartAsyncrc = BufStartAsync( reln, forkNum, blockNum , strategy );
+ if (BufStartAsyncrc < 0) {
+ pgBufferUsage.aio_read_noblok++;
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP so try the alternative that does not read the block into a postgresql buffer */
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ }
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
+ if ( (buf_id >= 0) || (BufStartAsyncrc >= 1) ) {
+ /* The block *is* in buffers. */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ pgBufferUsage.aio_read_noneed++;
+#ifndef USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT /* jury is out on whether the following wins but it ought to ... */
+ /*
+ ** If this backend already had pinned it,
+ ** or another backend had banked a pin on it,
+ ** or there is an IO in progress,
+ ** or it is not marked valid,
+ ** then do nothing.
+ ** Otherwise pin it and mark the buffer's pin as banked by this backend.
+ ** Note - it may or not be pinned by another backend -
+ ** it is ok for us to bank a pin on it
+ ** *provided* the other backend did not bank its pin.
+ ** The reason for this is that the banked-pin indicator is global -
+ ** it can identify at most one process.
+ */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ if (BufStartAsyncrc == 1) { /* not pinned by me */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ /* note - all we can say with certainty is that the buffer is not pinned by me
+ ** we cannot be sure that it is still in buffer pool
+ ** so must go through the entire locking and searching all over again ...
*/
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ /* since the block is now present,
+ ** save the current pin count to ensure final delta is calculated correctly
+ */
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ if ( PinCountOnEntry == 0) { /* paranoid check it's still not pinned by me */
+ volatile BufferDesc *buf_desc;
+
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ LockBufHdr(buf_desc);
+ if ( (buf_desc->flags & BM_VALID) /* buffer is valid */
+ && (!(buf_desc->flags & (BM_IO_IN_PROGRESS|BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))) /* buffer is not any of ... */
+ ) {
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* note - we can call PinBuffer_Locked with the BM_AIO_PREFETCH_PIN_BANKED flag set because it is not yet pinned by me */
+ buf_desc->freeNext = -(this_backend_pid); /* remember which pid banked it */
+ /* pgBufferUsage.aio_read_wasted--; overload counter - not wasted after all - only for debugging */
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ PinBuffer_Locked(buf_desc);
+ }
+ else {
+ UnlockBufHdr(buf_desc);
+ }
+ }
+ }
+ LWLockRelease(newPartitionLock);
+ /* although unlikely, maybe it was evicted while we were puttering about */
+ if (buf_id < 0) {
+ pgBufferUsage.aio_read_noneed--; /* back out the accounting */
+ goto not_in_buffers; /* and try again */
}
-#endif /* USE_PREFETCH */
}
+#endif /* USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ }
+
+ if (buf_id >= 0) {
+ PinCountdelta = PrivateRefCount[buf_id] - PinCountOnEntry; /* pin count delta increase */
+ if ( (PinCountdelta < 0) || (PinCountdelta > 1) ) {
+ elog(ERROR,
+ "PrefetchBuffer #%d : incremented pin count by %d on bufdesc %p refcount %u localpins %d\n"
+ ,(buf_id+1) , PinCountdelta , &BufferDescriptors[buf_id] ,BufferDescriptors[buf_id].refcount , PrivateRefCount[buf_id]);
+}
+ } else
+ if (BufStartAsyncrc == 0) { /* aio started successfully (which implies buffer was newly pinned ) */
+ PinCountdelta = 1;
+ }
+
+ /* set final PrefetchBufferRc according to previous value */
+ PrefetchBufferRc |= PinCountdelta; /* set the PREFTCHRC_BUF_PIN_INCREASED bit */
+ }
+
+#endif /* USE_PREFETCH */
+
+ return PrefetchBufferRc; /* return value as described above */
+}
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -252,7 +403,7 @@ ReadBufferExtended(Relation reln, ForkNu
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
+ forkNum, blockNum, mode, strategy, &hit , 0);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
@@ -280,7 +431,7 @@ ReadBufferWithoutRelcache(RelFileNode rn
Assert(InRecovery);
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
- mode, strategy, &hit);
+ mode, strategy, &hit , 0);
}
@@ -288,15 +439,18 @@ ReadBufferWithoutRelcache(RelFileNode rn
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
+ * index_for_aio , if -ve , is negative of ( index of the aiocb in the BufferAiocbs array + 3 )
+ * which is passed through to StartBufferIO
*/
-static Buffer
+Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy, bool *hit , int index_for_aio )
{
volatile BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ int allocrc; /* retcode from BufferAlloc */
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -328,16 +482,40 @@ ReadBuffer_common(SMgrRelation smgr, cha
}
else
{
+ allocrc = mode; /* pass mode to BufferAlloc since it must not wait for async io if RBM_NOREAD_FOR_PREFETCH */
/*
* lookup the buffer. IO_IN_PROGRESS is set if the requested block is
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
- if (found)
- pgBufferUsage.shared_blks_hit++;
+ strategy, &allocrc , index_for_aio );
+ if (allocrc < 0) {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ bufBlock = BufHdrGetBlock(bufHdr);
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ found = true;
+ }
+ else if (allocrc > 0) {
+ pgBufferUsage.shared_blks_hit++;
+ found = true;
+ }
+ else {
pgBufferUsage.shared_blks_read++;
+ found = false;
+ }
}
/* At this point we do NOT hold any locks. */
@@ -410,7 +588,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
Assert(bufHdr->flags & BM_VALID);
bufHdr->flags &= ~BM_VALID;
UnlockBufHdr(bufHdr);
- } while (!StartBufferIO(bufHdr, true));
+ } while (!StartBufferIO(bufHdr, true, 0));
}
}
@@ -430,6 +608,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (mode != RBM_NOREAD_FOR_PREFETCH) {
if (isExtend)
{
/* new buffers are zero-filled */
@@ -499,6 +678,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
VacuumPageMiss++;
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageMiss;
+ }
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -520,21 +700,39 @@ ReadBuffer_common(SMgrRelation smgr, cha
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
- * The returned buffer is pinned and is already marked as holding the
- * desired page. If it already did have the desired page, *foundPtr is
- * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be used for any StartBufferIO performed by this routine.
+ * In this case, if block not found in buffer pool and we allocate a new buffer,
+ * then we must maintain the spinlock on the buffer and pass it back to caller.
+ *
+ * foundPtr is input and output :
+ * . input - indicates the read-buffer mode ( see bufmgr.h )
+ * . output - indicates the status of the buffer - see below
+ *
+ * Except for the case of RBM_NOREAD_FOR_PREFETCH and buffer is found,
+ * the returned buffer is pinned and is already marked as holding the
+ * desired page.
+ * If it already did have the desired page and page content is valid,
+ * *foundPtr is set to 1
+ * If it already did have the desired page and mode is RBM_NOREAD_FOR_PREFETCH
+ * and StartBufferIO returned false
+ * (meaning it could not initialise the buffer for aio)
+ * *foundPtr is set to 2
+ * If it already did have the desired page but page content is invalid,
+ * *foundPtr is set to -1
+ * this can happen only if the buffer was read by an async read
+ * and the aio is still in progress or pinned by the issuer of the startaio.
+ * Otherwise, *foundPtr is set to 0 and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
- *
- * No locks are held either at entry or exit.
+ * No locks are held either at entry or exit EXCEPT for case noted above
+ * of passing an empty buffer back to async io caller ( index_for_aio set ).
*/
static volatile BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ int *foundPtr , int index_for_aio )
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
@@ -546,6 +744,13 @@ BufferAlloc(SMgrRelation smgr, char relp
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ int IntentionBufferrc; /* retcode from BufCheckAsync */
+ bool StartBufferIOrc; /* retcode from StartBufferIO */
+ ReadBufferMode mode;
+
+
+ mode = *foundPtr;
+ *foundPtr = 0;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -560,21 +765,53 @@ BufferAlloc(SMgrRelation smgr, char relp
if (buf_id >= 0)
{
/*
- * Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * Found it.
*/
+ *foundPtr = 1;
buf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
+ /* If prefetch mode, then return immediately indicating found,
+ ** and NOTE in this case only, we did not pin buffer.
+ ** In theory we might try to check whether the buffer is valid, io in progress, etc
+ ** but in practice it is simpler to abandon the prefetch if the buffer exists
+ */
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ /* release the mapping lock and return */
LWLockRelease(newPartitionLock);
+ } else {
+ /* note that the current request is for same tag as the one associated with the aio -
+ ** so simply complete the aio and we have our buffer.
+ ** If an aio was started on this buffer,
+ ** check complete and wait for it if not.
+ ** And, if aio had been started, then the task
+ ** which issued the start aio already pinned it for this read,
+ ** so if that task was me and the aio was successful,
+ ** pass the current pin to this read without dropping and re-acquiring.
+ ** this is all done by BufCheckAsync
+ */
+ IntentionBufferrc = BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_WANT , strategy , index_for_aio , false , newPartitionLock );
- *foundPtr = TRUE;
+ /* check to see if the correct data has been loaded into the buffer. */
+ valid = (IntentionBufferrc == BUF_INTENT_RC_VALID);
- if (!valid)
- {
+ /* check for serious IO errors */
+ if (!valid) {
+ if ( (IntentionBufferrc != BUF_INTENT_RC_INVALID_NO_AIO)
+ && (IntentionBufferrc != BUF_INTENT_RC_INVALID_AIO)
+ ) {
+ *foundPtr = -1; /* inform caller of serious error */
+ }
+ else
+ if (IntentionBufferrc == BUF_INTENT_RC_INVALID_AIO) {
+ goto proceed_with_not_found; /* yes, I know, a goto ... think of it as a break out of the if */
+ }
+ }
+
+ /* BufCheckAsync pinned the buffer */
+ /* so can now release the mapping lock */
+ LWLockRelease(newPartitionLock);
+
+ if (!valid) {
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
@@ -582,19 +819,21 @@ BufferAlloc(SMgrRelation smgr, char relp
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(buf, true, index_for_aio))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ }
}
}
return buf;
}
+ proceed_with_not_found:
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
@@ -619,8 +858,10 @@ BufferAlloc(SMgrRelation smgr, char relp
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf->flags;
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ /* If an aio was started on this buffer,
+ ** check complete and cancel it if not.
+ */
+ BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_REJECT_OBTAIN_PIN , 0 , index_for_aio, true , 0 );
/* Now it's safe to release the freelist lock */
if (lock_held)
@@ -791,13 +1032,18 @@ BufferAlloc(SMgrRelation smgr, char relp
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc)
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ } else
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
}
}
@@ -860,10 +1106,17 @@ BufferAlloc(SMgrRelation smgr, char relp
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
- if (StartBufferIO(buf, true))
- *foundPtr = FALSE;
- else
- *foundPtr = TRUE;
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc) {
+ *foundPtr = 0;
+ } else {
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
+ } else {
+ *foundPtr = 1;
+ }
+ }
return buf;
}
@@ -970,6 +1223,10 @@ retry:
/*
* Insert the buffer at the head of the list of free buffers.
*/
+ /* avoid confusing freelist with strange-looking freeNext */
+ if (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN) { /* means was used for aiocb index */
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ }
StrategyFreeBuffer(buf);
}
@@ -1022,6 +1279,56 @@ MarkBufferDirty(Buffer buffer)
UnlockBufHdr(bufHdr);
}
+/* return the blocknum of the block in a buffer if it is valid
+** if a shared buffer, it must be pinned
+*/
+BlockNumber
+BlocknumOfBuffer(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ BlockNumber rc = 0;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc = bufHdr->tag.blockNum;
+ }
+
+ return rc;
+}
+
+/* report whether specified buffer contains same or different block
+** if a shared buffer, it must be pinned
+*/
+bool
+BlocknotinBuffer(Buffer buffer,
+ Relation relation,
+ BlockNumber blockNum)
+{
+ volatile BufferDesc *bufHdr;
+ bool rc = false;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc =
+ ( (bufHdr->tag.blockNum != blockNum)
+ || (!(RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) ))
+ || (bufHdr->tag.forkNum != MAIN_FORKNUM)
+ );
+ }
+
+ return rc;
+}
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
@@ -1040,18 +1347,18 @@ ReleaseAndReadBuffer(Buffer buffer,
Relation relation,
BlockNumber blockNum)
{
- ForkNumber forkNum = MAIN_FORKNUM;
volatile BufferDesc *bufHdr;
+ bool isDifferentBlock; /* requesting different block from that already in buffer ? */
if (BufferIsValid(buffer))
{
+ /* if a shared buff, we have pin, so it's ok to examine tag without spinlock */
+ isDifferentBlock = BlocknotinBuffer(buffer,relation,blockNum); /* requesting different block from that already in buffer ? */
if (BufferIsLocal(buffer))
{
Assert(LocalRefCount[-buffer - 1] > 0);
bufHdr = &LocalBufferDescriptors[-buffer - 1];
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ if (!isDifferentBlock)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
LocalRefCount[-buffer - 1]--;
@@ -1060,12 +1367,12 @@ ReleaseAndReadBuffer(Buffer buffer,
{
Assert(PrivateRefCount[buffer - 1] > 0);
bufHdr = &BufferDescriptors[buffer - 1];
- /* we have pin, so it's ok to examine tag without spinlock */
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ BufCheckAsync(0 , relation , bufHdr , ( isDifferentBlock ? BUF_INTENTION_REJECT_FORGET
+ : BUF_INTENTION_REJECT_KEEP_PIN )
+ , 0 , 0 , false , 0 ); /* end any IO and maybe unpin */
+ if (!isDifferentBlock) {
return buffer;
- UnpinBuffer(bufHdr, true);
+ }
}
}
@@ -1090,11 +1397,12 @@ ReleaseAndReadBuffer(Buffer buffer,
* Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows
* some callers to avoid an extra spinlock cycle.
*/
-static bool
+bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
int b = buf->buf_id;
bool result;
+ bool pin_already_banked_by_me = 0; /* buffer is already pinned by me and redeemable */
if (PrivateRefCount[b] == 0)
{
@@ -1116,12 +1424,34 @@ PinBuffer(volatile BufferDesc *buf, Buff
else
{
/* If we previously pinned the buffer, it must surely be valid */
+ /* Errr - is that really true ??? I don't think so :
+ ** what if I pin, try an IO, in progress, then mistakenly pin again
result = true;
+ */
+ LockBufHdr(buf);
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ result = (buf->flags & BM_VALID) != 0;
+ UnlockBufHdr(buf);
}
+
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
return result;
}
@@ -1138,19 +1468,36 @@ PinBuffer(volatile BufferDesc *buf, Buff
* to save a spin lock/unlock cycle, because we need to pin a buffer before
* its state can change under us.
*/
-static void
+void
PinBuffer_Locked(volatile BufferDesc *buf)
{
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (PrivateRefCount[b] == 0)
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (PrivateRefCount[b] == 0) {
buf->refcount++;
+ }
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer_Locked : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
}
+}
/*
* UnpinBuffer -- make buffer available for replacement.
@@ -1160,29 +1507,68 @@ PinBuffer_Locked(volatile BufferDesc *bu
* Most but not all callers want CurrentResourceOwner to be adjusted.
* Those that don't should pass fixOwner = FALSE.
*/
-static void
+void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
+
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (fixOwner)
+ if (fixOwner) {
ResourceOwnerForgetBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
{
+
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
LockBufHdr(buf);
+ /* this backend has released last pin - buffer should not have pin banked by me
+ ** and if AIO in progress then there should be another backend pin
+ */
+ pin_already_banked_by_me = ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+ ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext))
+ ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ /* this is an anomalous situation - the caller held a banked pin (which callers
+ ** are not supposed to know about) but has either discovered it or has
+ ** over-counted how many pins it holds
+ */
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the pin although it is now of no use since about to release */
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+
+ /* temporarily suppress logging this error to avoid performance degradation -
+ ** either this task really does not need the buffer, in which case the error is harmless,
+ ** or a more severe error will be detected later (possibly immediately below)
+ elog(LOG, "UnpinBuffer : released last this-backend pin on buffer %d rel=%s, blockNum=%u, but had banked pin flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ */
+ }
+
/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;
+ if ( (buf->refcount == 0) && (buf->flags & BM_AIO_IN_PROGRESS) ) {
+
+ elog(ERROR, "UnpinBuffer : released last any-backend pin on buffer %d rel=%s, blockNum=%u, but AIO in progress flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ }
+
+
/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
@@ -1657,6 +2043,7 @@ SyncOneBuffer(int buf_id, bool skip_rece
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
+
/*
* Check whether buffer needs writing.
*
@@ -1744,6 +2131,16 @@ void
InitBufferPoolBackend(void)
{
on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* initialize the aio subsystem's max number of threads and max number of requests :
+ ** max number of threads <--> max_async_io_prefetchers
+ ** max number of requests <--> numBufferAiocbs = (target_prefetch_pages * max_async_io_prefetchers)
+ ** smgrinitaio returns no status, so failures cannot be detected here.
+ */
+ smgrinitaio(max_async_io_prefetchers , numBufferAiocbs);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
}
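
As a worked example of the sizing relationship in the comment above (illustrative numbers
only, not defaults from the patch): with max_async_io_prefetchers = 8 and
target_prefetch_pages = 32, the pool would be sized as

numBufferAiocbs = target_prefetch_pages * max_async_io_prefetchers
                = 32 * 8
                = 256

so smgrinitaio() is asked for up to 8 librt worker threads and 256 concurrently
outstanding read requests.
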
/*
@@ -1789,6 +2186,8 @@ PrintBufferLeakWarning(Buffer buffer)
char *path;
BackendId backend;
+
+
Assert(BufferIsValid(buffer));
if (BufferIsLocal(buffer))
{
@@ -1799,12 +2198,28 @@ PrintBufferLeakWarning(Buffer buffer)
else
{
buf = &BufferDescriptors[buffer - 1];
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* If the reason that this buffer is pinned
+ ** is that it was prefetched with async_io
+ ** and never read or discarded, then omit the
+ ** warning, because this is expected in some
+ ** cases when a scan is closed abnormally.
+ ** Note that the buffer will be released soon by our caller.
+ */
+ if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED) {
+ pgBufferUsage.aio_read_forgot++; /* account for it */
+ return;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
loccount = PrivateRefCount[buffer - 1];
backend = InvalidBackendId;
}
+/* #if defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
/* theoretically we should lock the bufhdr here */
path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+
+
elog(WARNING,
"buffer refcount leak: [%03d] "
"(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1812,6 +2227,7 @@ PrintBufferLeakWarning(Buffer buffer)
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
pfree(path);
+/* #endif defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
}
/*
@@ -1928,7 +2344,7 @@ FlushBuffer(volatile BufferDesc *buf, SM
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, 0))
return;
/* Setup error traceback support for ereport() */
@@ -2512,6 +2928,70 @@ FlushDatabaseBuffers(Oid dbid)
}
}
+#ifdef USE_PREFETCH
+/*
+ * DiscardBuffer -- discard shared buffer used for a previously
+ * prefetched but unread block of a relation
+ *
+ * If the buffer is found and pinned with a banked pin, then :
+ * . if AIO in progress, terminate AIO without waiting
+ * . if AIO had already completed successfully,
+ * then mark buffer valid (in case someone else wants it)
+ * . redeem the banked pin and unpin it.
+ *
+ * This function is similar in purpose to ReleaseBuffer (below)
+ * but sufficiently different that it is a separate function.
+ * Two important differences are :
+ * . caller identifies buffer by blocknumber, not buffer number
+ * . we unpin buffer *only* if the pin is banked,
+ * *never* if pinned but not banked.
+ * This is essential as caller may perform a sequence of
+ * SCAN1 . PrefetchBuffer (and remember block was prefetched)
+ * SCAN2 . ReadBuffer (but fails to connect this read to the prefetch by SCAN1)
+ * SCAN1 . DiscardBuffer (SCAN1 terminates early)
+ * SCAN2 . access tuples in buffer
+ * Clearly the Discard *must not* unpin the buffer since SCAN2 needs it!
+ *
+ *
+ * caller may pass InvalidBlockNumber as blockNum to mean do nothing
+ */
+void
+DiscardBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLockId newPartitionLock; /* buffer partition lock for it */
+ Buffer buf_id;
+ volatile BufferDesc *buf_desc;
+
+ if (!SmgrIsTemp(reln->rd_smgr)) {
+ Assert(RelationIsValid(reln));
+ if (BlockNumberIsValid(blockNum)) {
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ BufCheckAsync(0 , reln, buf_desc , BUF_INTENTION_REJECT_UNBANK , 0 , 0 , false , 0); /* end the IO and unpin if banked */
+ pgBufferUsage.aio_read_discrd++; /* account for it */
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
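
To make the intended calling pattern concrete, here is a sketch of how a scan that is shutting
down early might discard its prefetched-but-unread blocks. The pfch_blocks array, pfch_count and
the helper name are hypothetical bookkeeping assumed for illustration; only DiscardBuffer itself
comes from the patch:

/* Sketch only: on early scan termination, discard blocks that were
 * prefetched by this scan but never handed to ReadBuffer.
 */
static void
discard_unread_prefetches(Relation rel, ForkNumber forkNum,
                          BlockNumber *pfch_blocks, int pfch_count)
{
    int     i;

    for (i = 0; i < pfch_count; i++)
    {
        /* DiscardBuffer ignores InvalidBlockNumber entries */
        DiscardBuffer(rel, forkNum, pfch_blocks[i]);
    }
}

Blocks that some other scan has since read and pinned are left alone, because DiscardBuffer only
unpins when the pin is still banked by this backend.
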
+
/*
* ReleaseBuffer -- release the pin on a buffer
*/
@@ -2520,26 +3000,23 @@ ReleaseBuffer(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
if (BufferIsLocal(buffer))
{
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]--;
return;
}
-
- bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- PrivateRefCount[buffer - 1]--;
else
- UnpinBuffer(bufHdr, false);
+ {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ BufCheckAsync(0 , 0 , bufHdr , BUF_INTENTION_REJECT_NOADJUST , 0 , 0 , false , 0 );
+ }
}
/*
@@ -2565,14 +3042,41 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
+ volatile BufferDesc *buf; /* descriptor for a shared buffer */
+
Assert(BufferIsPinned(buffer));
+
+ if (!(BufferIsLocal(buffer))) {
+ buf = &BufferDescriptors[buffer - 1];
+ LockBufHdr(buf);
+ pin_already_banked_by_me =
+ ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ }
+
+ if (!pin_already_banked_by_me) {
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
+ }
+
if (BufferIsLocal(buffer))
LocalRefCount[-buffer - 1]++;
- else
+ else {
+ if (pin_already_banked_by_me) {
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[buffer - 1]++;
}
+ }
+}
/*
* MarkBufferDirtyHint
@@ -2994,61 +3498,138 @@ WaitIO(volatile BufferDesc *buf)
*
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
- * has already started I/O on this buffer then we will block on the
+ * has already started synchronous I/O on this buffer then we will block on the
* io_in_progress lock until he's done.
*
+ * If an async io is in progress and we are doing synchronous io,
+ * then ReadBuffer waits by calling smgrcompleteaio,
+ * and so we treat this request as if no io were in progress.
+ *
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
* so we can always tell if the work is already done.
*
+ * index_for_aio is an input parameter which, if non-zero, identifies a BufferAiocb
+ * acquired by the caller, to be attached to the buffer header for use with async io.
+ *
* Returns TRUE if we successfully marked the buffer as I/O busy,
* FALSE if someone else already did the work.
*/
static bool
-StartBufferIO(volatile BufferDesc *buf, bool forInput)
+StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio )
{
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ if (!index_for_aio)
Assert(!InProgressBuf);
for (;;)
{
+ if (!index_for_aio) {
/*
* Grab the io_in_progress lock so that other processes can wait for
* me to finish the I/O.
*/
LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
+ }
LockBufHdr(buf);
- if (!(buf->flags & BM_IO_IN_PROGRESS))
+ /* The following test is intended to distinguish between :
+ ** . a buffer which :
+ ** . has io in progress
+ ** AND is not associated with a current or recent aio (i.e. synchronous io)
+ ** . anything else
+ ** Here, "recent" means an aio marked by buf->freeNext <= FREENEXT_BAIOCB_ORIGIN but no longer in progress -
+ ** this situation arises when the aio has just been cancelled and this process now wishes to recycle the buffer.
+ ** In that case, the first such would-be recycler (i.e. me) must :
+ ** . avoid waiting for the cancelled aio to complete
+ ** . if not itself doing an async read, assume responsibility for posting other future readbuffers.
+ */
+ if ( (buf->flags & BM_AIO_IN_PROGRESS)
+ || (!(buf->flags & BM_IO_IN_PROGRESS))
+ )
break;
/*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
+ * The only way BM_IO_IN_PROGRESS could be set (without AIO in progress) while the io_in_progress
* lock isn't held is if the process doing the I/O is recovering from
* an error (see AbortBufferIO). If that's the case, we must wait for
* him to get unwedged.
*/
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
WaitIO(buf);
}
- /* Once we get here, there is definitely no I/O active on this buffer */
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* Once we get here, there is definitely no synchronous I/O active on this buffer
+ ** but if being asked to attach a BufferAiocb to the buf header,
+ ** then we must also check if there is any async io currently
+ ** in progress or pinned started by a different task.
+ */
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext);
+ if ( (buf->flags & (BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))
+ && (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN)
+ && (BAiocb->pidOfAio != this_backend_pid)
+ ) {
+ /* someone else already doing async I/O */
+ UnlockBufHdr(buf);
+ return false;
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
if (forInput ? (buf->flags & BM_VALID) : !(buf->flags & BM_DIRTY))
{
/* someone else already did the I/O */
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
return false;
}
buf->flags |= BM_IO_IN_PROGRESS;
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - index_for_aio);
+ /* insist that no other buffer is using this BufferAiocb for async IO */
+ if (BAiocb->BAiocbbufh == (struct sbufdesc *)0) {
+ BAiocb->BAiocbbufh = buf;
+ }
+ if (BAiocb->BAiocbbufh != buf) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block %p to be used by %p already in use by %p"
+ ,BAiocb ,buf , BAiocb->BAiocbbufh)));
+ }
+ /* note - there is no need to register self as a dependent of the BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ buf->flags |= BM_AIO_IN_PROGRESS;
+ buf->freeNext = index_for_aio;
+ /* at this point, this buffer appears to have an in-progress aio_read,
+ ** and any other task which is able to look inside the buffer might try waiting on that aio -
+ ** except we have not yet issued the aio! So we must keep the buffer header locked
+ ** from here all the way back to the BufStartAsync caller
+ */
+ } else {
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
UnlockBufHdr(buf);
InProgressBuf = buf;
IsForInput = forInput;
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
return true;
}
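
The two calling modes of the modified StartBufferIO leave the buffer in different lock states.
Here is a sketch of the synchronous caller contract, assuming buf is already pinned and smgr,
forkNum and blockNum identify the block; checksum/validity checks and error handling are omitted
and the helper name is illustrative only:

/* Sketch only: synchronous read path through the new StartBufferIO signature. */
static void
read_block_sync(volatile BufferDesc *buf, SMgrRelation smgr,
                ForkNumber forkNum, BlockNumber blockNum)
{
    /* index_for_aio == 0: behaves as before - takes io_in_progress_lock and
     * returns with the buffer header spinlock released */
    if (StartBufferIO(buf, true, 0))
    {
        smgrread(smgr, forkNum, blockNum, (char *) BufHdrGetBlock(buf));
        TerminateBufferIO(buf, false, BM_VALID);
    }
    /* else someone else already read the page in */
}

With a non-zero index_for_aio the contract differs: a true return leaves the buffer header
spinlock held, and the caller (BufStartAsync) must keep it held until the aio_read has actually
been issued, after which it calls UnlockBufHdr(buf).
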
@@ -3058,7 +3639,7 @@ StartBufferIO(volatile BufferDesc *buf,
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
+ * if no async IO is in progress, then we hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
@@ -3070,26 +3651,32 @@ StartBufferIO(volatile BufferDesc *buf,
* BM_IO_ERROR in a failure case. For successful completion it could
* be 0, or BM_VALID if we just finished reading in the page.
*/
-static void
+void
TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits)
{
- Assert(buf == InProgressBuf);
+ int flags_on_entry;
LockBufHdr(buf);
+ flags_on_entry = buf->flags;
+
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) )
+ Assert( buf == InProgressBuf );
+
Assert(buf->flags & BM_IO_IN_PROGRESS);
- buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
+ buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) ) {
InProgressBuf = NULL;
-
LWLockRelease(buf->io_in_progress_lock);
}
+}
/*
* AbortBufferIO: Clean up any active buffer I/O after an error.
--- src/backend/storage/buffer/buf_async.c.orig 2014-06-08 17:57:22.096976856 -0400
+++ src/backend/storage/buffer/buf_async.c 2014-06-08 21:59:36.748096540 -0400
@@ -0,0 +1,923 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_async.c
+ * buffer manager asynchronous disk read routines
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/buffer/buf_async.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * Principal entry points:
+ *
+ * BufStartAsync() -- start an asynchronous read of a block into a buffer and
+ * pin it so that no one can destroy it while this process is using it.
+ *
+ * BufCheckAsync() -- check completion of an asynchronous read
+ * and either claim buffer or discard it
+ *
+ * Private helper
+ *
+ * BufReleaseAsync() -- release the BAiocb resources used for an asynchronous read
+ *
+ * See also these files:
+ * bufmgr.c -- main buffer manager functions
+ * buf_init.c -- initialisation of resources
+ */
+#include "postgres.h"
+#include <sys/types.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "catalog/catalog.h"
+#include "common/relpath.h"
+#include "executor/instrument.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "storage/standby.h"
+#include "utils/rel.h"
+#include "utils/resowner_private.h"
+
+/*
+ * GUC parameters
+ */
+int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
+
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+extern int maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+extern int maxRelBAiocbTries; /* max times we will try to release a BufferAiocb back to freelist */
+extern pid_t this_backend_pid; /* pid of this backend */
+
+extern bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+extern void PinBuffer_Locked(volatile BufferDesc *buf);
+extern Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+ ForkNumber forkNum, BlockNumber blockNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy,
+ bool *hit , int index_for_aio);
+extern void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+extern void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+ int set_flag_bits);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+int BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc
+ ,int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
+static struct BufferAiocb volatile * cachedBAiocb = (struct BufferAiocb*)0; /* locally cached BufferAiocbs, chained via BAiocbnext, for use with aio */
+
+#ifdef USE_PREFETCH
+/* BufReleaseAsync releases a BufferAiocb and returns 0 if successful else non-zero
+** it *must* be called :
+** EITHER with a valid BAiocb->BAiocbbufh -> buf_desc
+** and that buf_desc must be spin-locked
+** OR with BAiocb->BAiocbbufh == 0
+*/
+static int
+BufReleaseAsync(struct BufferAiocb volatile * BAiocb)
+{
+ int LockTries; /* max times we will try to release the BufferAiocb */
+ volatile struct BufferAiocb *BufferAiocbs;
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+
+ int failed = 1; /* by end of this function, non-zero will indicate if we failed to return the BAiocb */
+
+
+ if ( ( BAiocb == (struct BufferAiocb*)0 )
+ || ( BAiocb == (struct BufferAiocb*)BAIOCB_OCCUPIED )
+ || ( ((unsigned long)BAiocb) & 0x1 )
+ ) {
+ elog(ERROR,
+ "AIO control block corruption on release of aiocb %p - invalid BAiocb"
+ ,BAiocb);
+ }
+ else
+ if ( (0 == BAiocb->BAiocbDependentCount) /* no dependents */
+ && ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) /* not already on freelist */
+ ) {
+
+ if ((struct sbufdesc*)0 != BAiocb->BAiocbbufh) { /* if a buffer was attached */
+ volatile BufferDesc *buf_desc = BAiocb->BAiocbbufh;
+
+ /* spinlock held so instead of TerminateBufferIO(buf, false , 0); ... */
+ if (buf_desc->flags & BM_AIO_PREFETCH_PIN_BANKED) { /* if a pid banked the pin */
+ buf_desc->freeNext = -(BAiocb->pidOfAio); /* then remember which pid */
+ }
+ else if (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* disconnect BufferAiocb from buf_desc */
+ }
+ buf_desc->flags &= ~BM_AIO_IN_PROGRESS;
+ }
+
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* disconnect buf_desc from BufferAiocb */
+ BAiocb->pidOfAio = 0; /* clean */
+ LockTries = maxRelBAiocbTries; /* max times we will try to release the BufferAiocb */
+ do {
+ register long long int dividend , remainder;
+
+ /* retrieve old value of FreeBAiocbs */
+ BAiocb->BAiocbnext = oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* this is a volatile value unprotected by any lock, so we must validate it;
+ ** the safest check is to verify that it is identical to one of the BufferAiocbs.
+ ** To do so, verify by direct division that its address offset from the first control block
+ ** is an integral multiple of the control block size
+ ** and that the quotient lies within the range [ 0 , (numBufferAiocbs-1) ]
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ failed = (int)remainder;
+ if (!failed) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ failed = ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) );
+ if (!failed) {
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, BAiocb)) {
+ LockTries = 0; /* end the do loop */
+
+ goto cheering; /* can't simply break because then failed would be set incorrectly */
+ }
+ }
+ }
+ /* if we reach here, this attempt failed and "failed" is set to a non-zero value */
+
+ cheering: ;
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while (LockTries-- > 0);
+
+ if (failed) {
+#ifdef LOG_RELBAIOCB_DEPLETION
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p unreleased after tries= %d\n"
+ ,BAiocb,maxRelBAiocbTries);
+#endif /* LOG_RELBAIOCB_DEPLETION */
+ }
+
+ }
+ else
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p either has dependents= %d or is already on freelist %p or has no buf_header %p\n"
+ ,BAiocb , BAiocb->BAiocbDependentCount , BAiocb->BAiocbnext , BAiocb->BAiocbbufh);
+ return failed;
+}
+
+/* try using asynchronous aio_read to prefetch into a buffer
+** return code :
+** 0 if started successfully
+** -1 if failed for some reason
+** 1+PrivateRefCount if we found desired buffer in buffer pool
+**
+** There is a harmless race condition here :
+** two different backends may both arrive here simultaneously
+** to prefetch the same buffer. This is not unlikely when a syncscan is in progress.
+** . One will acquire the buffer and issue the smgrstartaio
+** . Other will find the buffer on return from ReadBuffer_common with hit = true
+** Only the first task has a pin on the buffer since ReadBuffer_common knows not to get a pin
+** on a found buffer in prefetch mode.
+** Therefore - the second task must simply abandon the prefetch if it finds the buffer in the buffer pool.
+**
+** if we fail to acquire a BAiocb because of concurrent theft from freelist by other backend,
+** retry up to maxGetBAiocbTries times provided that there actually was at least one BAiocb on the freelist.
+*/
+int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy) {
+
+ int retcode = -1;
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+ int smgrstartaio_rc = -1; /* retcode from smgrstartaio */
+ bool do_unpin_buffer = false; /* unpin must be deferred until after buffer descriptor is unlocked */
+ Buffer buf_id;
+ bool hit = false;
+ volatile BufferDesc *buf_desc = (BufferDesc *)0;
+
+ int LockTries; /* max times we will try to get a free BufferAiocb */
+
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+ struct BufferAiocb volatile * newFreeBAiocb; /* new value of FreeBAiocbs */
+
+
+ /* return immediately if no async io resources */
+ if (numBufferAiocbs > 0) {
+ buf_id = (Buffer)0;
+
+ if ( (struct BAiocbAnchor *)0 != BAiocbAnchr ) {
+
+ volatile struct BufferAiocb *BufferAiocbs;
+
+ if ((struct BufferAiocb*)0 != cachedBAiocb) { /* any cached BufferAiocb ? */
+ BAiocb = cachedBAiocb; /* yes use it */
+ cachedBAiocb = BAiocb->BAiocbnext;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+ } else {
+
+ LockTries = maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+ do {
+ register long long int dividend = -1 , remainder;
+ /* check if we have a free BufferAiocb */
+
+ oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+
+ /* BAiocbAnchr->FreeBAiocbs is a volatile value unprotected by any lock,
+ ** and use of compare-and-swap to add and remove items from the list has
+ ** two potential pitfalls, both relating to the fact that we must
+ ** access data de-referenced from this pointer before the compare-and-swap.
+ ** 1) The value we load may be corrupt, e.g. mixture of bytes from
+ ** two different values, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+ ** to do so, verify by direct division that its address offset from
+ ** first control block is an integral multiple of the control block size
+ ** that lies within the range [ 0 , (numBufferAiocbs-1) ]
+ ** Thus we completely prevent this pitfall.
+ ** 2) The content of the item's next pointer may have changed between the
+ ** time we de-reference it and the time of the compare-and-swap.
+ ** Thus even though the compare-and-swap succeeds, we might set the
+ ** new head of the freelist to an invalid value (either a free item
+ ** that is not the first in the free chain - resulting only in
+ ** loss of the orphaned free items, or, much worse, an in-use item).
+ ** In practice this is extremely unlikely because of the implied huge delay
+ ** in this window interval in this (current) process. Here are two scenarios:
+ ** legend:
+ ** P0 - this (current) process, P1, P2 , ... other processes
+ ** content of freelist shown as BAiocbAnchr->FreeBAiocbs -> first item -> 2nd item ...
+ ** @[X] means address of X
+ ** | timeline of window of exposure to problems
+ ** successive lines in chronological order content of freelist
+ ** 2.1 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 IS IN USE !! CORRUPT !!
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had become in-use during the window.
+ ** 2.2 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P3 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I2] F -> I2 -> I3 ...
+ ** | P3 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I3] F -> I2 -> I3 ...
+ ** | P3 swap-remove I2, place I3 at head of list F -> I3 ...
+ ** | P2 complete aio, replace I1 at head of list F -> I1 -> I3 ...
+ ** | P3 complete aio, replace I2 at head of list F -> I2 -> I1 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I1 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 -> I3 ... ! I2 is orphaned !
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had moved further down the free list during the window.
+ ** Unfortunately, we cannot prevent this pitfall but we can detect it (after the fact),
+ ** by checking that the next pointer of the item we have just removed for our use still points to the same item.
+ ** This test is not subject to any timing or uncertainty since :
+ ** . The fact that the compare-and-swap succeeded implies that the item we removed
+ ** was definitely on the freelist (at the head) when it was removed,
+ ** and therefore cannot be in use, and therefore its next pointer is no longer volatile.
+ ** . Although pointers of the anchor and items on the freelist are volatile,
+ ** the addresses of items never change - they are in an allocated array and never move.
+ ** E.g. in the above two scenarios, the test is that I0.next still -> I1,
+ ** and this is true if and only if the second item on the freelist is
+ ** still the same at the end of the window as it was at the start of the window.
+ ** Note that we do not insist that it did not change during the window,
+ ** only that it is still the correct new head of freelist.
+ ** If this test fails, we abort immediately as the subsystem is damaged and cannot be repaired.
+ ** Note that at least one aio must have been issued *and* completed during the window
+ ** for this to occur, and since the window is just one single machine instruction,
+ ** it is very unlikely in practice.
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ if (remainder == 0) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ }
+ if ( (remainder == 0)
+ && ( (dividend >= 0 ) && ( dividend < numBufferAiocbs) )
+ )
+ {
+ newFreeBAiocb = oldFreeBAiocb->BAiocbnext; /* tentative new value is second on free list */
+ /* Here we are in the exposure window referred to in the above comments,
+ ** so moving along rapidly ...
+ */
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, newFreeBAiocb)) { /* did we get it ? */
+ /* We have successfully swapped head of freelist pointed to by oldFreeBAiocb off the list;
+ ** Here we check that the item we just placed at head of freelist, pointed to by newFreeBAiocb,
+ ** is the right one
+ **
+ ** Also check that the BAiocb we have acquired was not already in use,
+ ** i.e. that scenario 2.1 above did not occur just before our compare-and-swap.
+ **
+ ** In one hypothetical case,
+ ** we can be certain that there is no corruption -
+ ** the case where newFreeBAiocb == 0 and oldFreeBAiocb->BAiocbnext != BAIOCB_OCCUPIED -
+ ** i.e. we have set the freelist to empty but we have a baiocb chained from ours.
+ ** In this case our comp_swap removed all BAiocbs from the list (including ours),
+ ** so the others chained from ours are either orphaned (no harm done)
+ ** or in use by another backend and will eventually be returned (fine).
+ */
+ if ((struct BufferAiocb *)0 == newFreeBAiocb) {
+ if ((struct BufferAiocb *)BAIOCB_OCCUPIED == oldFreeBAiocb->BAiocbnext) {
+ goto baiocb_corruption;
+ } else if ((struct BufferAiocb *)0 != oldFreeBAiocb->BAiocbnext) {
+ elog(LOG,
+ "AIO control block inconsistency on acquiring aiocb %p - its next free %p may be orphaned (no corruption has occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext);
+ }
+ } else {
+ /* case of newFreeBAiocb not null - so must check more carefully ... */
+ remainder = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ dividend = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+
+ if ( (newFreeBAiocb != oldFreeBAiocb->BAiocbnext)
+ || (remainder != 0)
+ || ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) )
+ ) {
+ goto baiocb_corruption;
+ }
+ }
+ BAiocb = oldFreeBAiocb;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+
+ LockTries = 0; /* end the do loop */
+
+ }
+ }
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while ( ((struct BufferAiocb*)0 == BAiocb) /* did not get a BAiocb */
+ && ((struct BufferAiocb*)0 != oldFreeBAiocb) /* there was a free BAiocb */
+ && (LockTries-- > 0) /* told to retry */
+ );
+ }
+ }
+
+ if ( BAiocb != (struct BufferAiocb*)0 ) {
+ /* try an async io */
+ BAiocb->BAiocbthis.aio_fildes = -1; /* necessary to ensure any thief realizes aio not yet started */
+ BAiocb->pidOfAio = this_backend_pid;
+
+ /* now try to acquire a buffer :
+ ** note - ReadBuffer_common returns hit=true if the block is found in the buffer pool,
+ ** in which case there is no need to prefetch.
+ ** otherwise ReadBuffer_common pins returned buffer and calls StartBufferIO
+ ** and StartBufferIO :
+ ** . sets buf_desc->freeNext to negative of ( index of the aiocb in the BufferAiocbs array + 3 )
+ ** . sets BAiocb->BAiocbbufh -> buf_desc
+ ** and in this case the buffer spinlock is held.
+ ** This is essential as no other task must issue any intention with respect
+ ** to the buffer until we have started the aio_read.
+ ** Also note that ReadBuffer_common handles enlarging the ResourceOwner buffer list as needed
+ ** so I don't need to do that here
+ */
+ buf_id = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
+ forkNum, blockNum
+ ,RBM_NOREAD_FOR_PREFETCH /* tells ReadBuffer not to do any read, just alloc buf */
+ ,strategy , &hit , (FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))));
+ buf_desc = &BufferDescriptors[buf_id-1]; /* find buffer descriptor */
+
+ /* normally hit will be false as presumably it was not in the pool
+ ** when our caller looked - but it could be there now ...
+ */
+ if (hit) {
+ /* see earlier comments - we must abandon the prefetch */
+ retcode = 1 + PrivateRefCount[buf_id - 1]; /* PrivateRefCount is indexed by (Buffer - 1) */
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ } else
+ if ( (buf_id > 0) && ((BufferDesc *)0 != buf_desc) && (buf_desc == BAiocb->BAiocbbufh) ) {
+ /* buff descriptor header lock should be held.
+ ** However, just to be safe , now validate that
+ ** we are still the owner and no other task already stole it.
+ */
+
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* ensure no banked pin */
+ /* there should not be any other pid waiting on this buffer,
+ ** so check that neither BM_VALID nor BM_PIN_COUNT_WAITER is set
+ */
+ if ( ( !(buf_desc->flags & (BM_VALID|BM_PIN_COUNT_WAITER) ) )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) /* it is still mine */
+ && (-1 == BAiocb->BAiocbthis.aio_fildes) /* no thief stole it */
+ && (0 == BAiocb->BAiocbDependentCount) /* no dependent */
+ ) {
+ /* we have an empty buffer for our use */
+
+ BAiocb->BAiocbthis.aio_buf = (void *)(BufHdrGetBlock(buf_desc)); /* Location of actual buffer. */
+
+ /* note - there is no need to register self as a dependent of the BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ /* smgrstartaio retcode is returned in smgrstartaio_rc -
+ ** it indicates whether started or not
+ */
+ smgrstartaio(reln->rd_smgr, forkNum, blockNum , (char *)&(BAiocb->BAiocbthis) , &smgrstartaio_rc );
+
+ if (smgrstartaio_rc == 0) {
+ retcode = 0;
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* we did not register self as a dependent of the BAiocb so no need to unregister */
+ } else {
+ /* failed - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ /* spinlock held so instead of TerminateBufferIO(buf_desc, false , 0); ... */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS | BM_AIO_PREFETCH_PIN_BANKED | BM_VALID);
+ /* we did not register self as a dependent of the BAiocb so no need to unregister */
+
+ /* return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+
+ pgBufferUsage.aio_read_failed++;
+ smgrstartaio_rc = 1; /* to distinguish from aio not even attempted */
+ }
+ }
+ else {
+ /* buffer was stolen or in use by other task - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ }
+
+ UnlockBufHdr(buf_desc);
+ if (do_unpin_buffer) {
+ if (smgrstartaio_rc >= 0) { /* if aio was attempted */
+ TerminateBufferIO(buf_desc, false , 0);
+ }
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+ else {
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ }
+
+ if ((struct sbufdesc*)0 == BAiocb->BAiocbbufh) { /* we did not associate a buffer */
+ /* so return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+ }
+ }
+
+ return retcode;
+
+ baiocb_corruption:;
+ elog(PANIC,
+ "AIO control block corruption on acquiring aiocb %p - its next free %p conflicts with new freelist pointer %p which may be invalid (corruption may have occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext , newFreeBAiocb);
+}
+#endif /* USE_PREFETCH */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
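
The validated compare-and-swap pop described in the long scenario comment above reduces to the
following shape. This is a simplified sketch: the retry loop, statistics, the cached-BAiocb fast
path and the orphan-versus-corruption distinction of the real code are omitted, and the helper
name pop_free_baiocb is illustrative, not part of the patch:

/* Sketch only: single attempt at a validated lock-free pop from the
 * BufferAiocb freelist; returns NULL if the list is empty or the
 * attempt should be retried.
 */
static volatile struct BufferAiocb *
pop_free_baiocb(void)
{
    volatile struct BufferAiocb *head = BAiocbAnchr->FreeBAiocbs;
    volatile struct BufferAiocb *base = BAiocbAnchr->BufferAiocbs;
    volatile struct BufferAiocb *next;
    long long int offset, index;

    if (head == (struct BufferAiocb *) 0)
        return NULL;                        /* freelist empty */

    /* validate the unlocked head pointer: it must coincide with an array slot */
    offset = (long long int) head - (long long int) base;
    if (offset % (long long int) sizeof(struct BufferAiocb) != 0)
        return NULL;
    index = offset / (long long int) sizeof(struct BufferAiocb);
    if (index < 0 || index >= numBufferAiocbs)
        return NULL;

    next = head->BAiocbnext;                /* start of the exposure window */
    if (!__sync_bool_compare_and_swap(&(BAiocbAnchr->FreeBAiocbs), head, next))
        return NULL;                        /* lost the race */

    /* after-the-fact check for scenarios 2.1/2.2: the item we removed must
     * still point at the item we installed as the new head */
    if (next != (struct BufferAiocb *) 0 && head->BAiocbnext != next)
        elog(PANIC, "BAiocb freelist corrupted during compare-and-swap window");

    head->BAiocbnext = (struct BufferAiocb *) BAIOCB_OCCUPIED;
    return head;
}
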
+
+/*
+ * BufCheckAsync -- act upon caller's intention regarding a shared buffer,
+ * primarily in connection with any async io in progress on the buffer.
+ * The intention parameter has two main classes and some subvalues within those :
+ * class subvalue
+ * +ve 1 want <=> caller wants the buffer :
+ * wait for in-progress aio and then always pin
+ * -ve -1, -2, -3, ... reject <=> caller does not want the buffer :
+ * (see below) if there are no dependents, then cancel the aio
+ * and then optionally unpin
+ * Used when there may have been a previous fetch or prefetch.
+ *
+ * buffer is assumed to be an existing member of the shared buffer pool
+ * as returned by BufTableLookup.
+ * if AIO in progress, then :
+ * . terminate AIO, waiting for completion if +ve intention, else without waiting
+ * . if the AIO had already completed successfully, then mark buffer valid
+ * . pin/unpin as requested
+ *
+ * +ve intention indicates that buffer must be pinned :
+ * if the strategy parameter is null, then use the PinBuffer_Locked optimization
+ * to pin and unlock in one operation. But always update buffer usage count.
+ *
+ * -ve intention indicates whether and how to unpin :
+ * BUF_INTENTION_REJECT_KEEP_PIN -1 pin already held, do not unpin, (caller wants to keep it)
+ * BUF_INTENTION_REJECT_OBTAIN_PIN -2 obtain pin, caller wants it for same buffer
+ * BUF_INTENTION_REJECT_FORGET -3 unpin and tell resource owner to forget
+ * BUF_INTENTION_REJECT_NOADJUST -4 unpin and call ResourceOwnerForgetBuffer myself
+ * instead of telling UnpinBuffer to adjust CurrentResource owner
+ * (quirky simulation of ReleaseBuffer logic)
+ * BUF_INTENTION_REJECT_UNBANK -5 unpin only if pin banked by caller
+ * The behaviour for the -ve case is based on that of ReleaseBuffer, adding handling of async io.
+ *
+ * pin/unpin action must take account of whether this backend holds a "disposable" pin on the particular buffer.
+ * A "disposable" pin is a pin acquired by buffer manager without caller knowing, such as :
+ * when required to safeguard an async AIO - pin can be held across multiple bufmgr calls
+ * when required to safeguard waiting for an async AIO - pin acquired and released within this function
+ * if a disposable pin is held, then :
+ * if a new pin is requested, the disposable pin must be retained (redeemed) and any flags relating to it unset
+ * if an unpin is requested, then :
+ * if either no AIO in progress or this backend did not initiate the AIO
+ * then the disposable pin must be dropped (redeemed) and any flags relating to it unset
+ * else log warning and do nothing
+ * i.e. in either case, there is no longer a disposable pin after this function has completed.
+ * Note that if intention is BUF_INTENTION_REJECT_UNBANK,
+ * then caller expects there to be a disposable banked pin
+ * and if there isn't one, we do nothing
+ * for all other intentions, if there is no disposable pin, we pin/unpin normally.
+ *
+ * index_for_aio indicates the BAiocb to be used for next aio (see PrefetchBuffer)
+ * spinLockHeld indicates whether buffer header spinlock is held
+ * PartitionLock is the buffer partition lock to be used
+ *
+ * return code (meaningful ONLY if intention is +ve) indicates validity of buffer :
+ * -1 buffer is invalid and failed PageHeaderIsValid check
+ * 0 buffer is not valid
+ * 1 buffer is valid
+ * 2 buffer is valid but tag changed - (so content does not match the relation block that caller expects)
+ */
+int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, BufferDesc volatile * buf_desc, int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock )
+{
+
+ int retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ bool valid = false;
+
+
+#ifdef USE_PREFETCH
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ int smgrcompleteaio_rc; /* retcode from smgrcompleteaio */
+ SMgrRelation smgr = caller_smgr;
+ int BAiocbDependentCount_after_aio_finished = -1; /* for debugging - can be printed in gdb */
+ BufferTag origTag = buf_desc->tag; /* original identity of selected buffer */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int aio_successful = -1; /* did the aio_read succeed ? -1 = no aio, 0 unsuccessful , 1 successful */
+ BufFlags flags_on_entry; /* for debugging - can be printed in gdb */
+ int freeNext_on_entry; /* for debugging - can be printed in gdb */
+ bool disposable_pin = false; /* this backend had a disposable pin on entry or pins the buffer while waiting for aio_read to complete */
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
+ int local_intention;
+#endif /* USE_PREFETCH */
+
+
+
+#ifdef USE_PREFETCH
+ if (!spinLockHeld) {
+ /* lock buffer header */
+ LockBufHdr(buf_desc);
+ }
+
+ flags_on_entry = buf_desc->flags;
+ freeNext_on_entry = buf_desc->freeNext;
+ pin_already_banked_by_me =
+ ( (flags_on_entry & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (flags_on_entry & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - freeNext_on_entry))->pidOfAio )
+ : (-(freeNext_on_entry)) ) == this_backend_pid )
+ );
+
+ if (pin_already_banked_by_me) {
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* but do we actually have a pin ?? */
+ /* this is an anomalous situation - somehow our disposable pin was lost without us noticing
+ ** if AIO is in progress and we started it,
+ ** then this is disastrous - two backends might both issue IO on same buffer
+ ** otherwise, it is harmless, and simply means we have no disposable pin,
+ ** but we must update flags to "notice" the fact now
+ */
+ if (flags_on_entry & BM_AIO_IN_PROGRESS) {
+ elog(ERROR, "BufCheckAsync : AIO control block issuer of aio_read lost pin with BM_AIO_IN_PROGRESS on buffer %d rel=%s, blockNum=%u, flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ } else {
+ elog(LOG, "BufCheckAsync : AIO control block issuer of aio_read lost pin on buffer %d rel=%s, blockNum=%u, with flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* since AIO not in progress, disconnect the buffer from banked pin */
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ pin_already_banked_by_me = false;
+ }
+ } else {
+ disposable_pin = true;
+ }
+ }
+
+ /* the case of BUF_INTENTION_REJECT_UNBANK is handled specially :
+ ** if this backend has a banked pin, then proceed just as for BUF_INTENTION_REJECT_FORGET
+ ** else the call is a no-op -- unlock buf header and return immediately
+ */
+ local_intention = intention;
+ if (intention == BUF_INTENTION_REJECT_UNBANK) {
+ if (pin_already_banked_by_me) {
+ local_intention = BUF_INTENTION_REJECT_FORGET;
+ } else {
+ goto unlock_buf_header; /* code following the unlock will do nothing since local_intention still set to BUF_INTENTION_REJECT_UNBANK */
+ }
+ }
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* we do not expect that BM_AIO_IN_PROGRESS is set without freeNext identifying the BAiocb */
+ if ( (buf_desc->flags & BM_AIO_IN_PROGRESS) && (buf_desc->freeNext == FREENEXT_NOT_IN_LIST) ) {
+
+ elog(ERROR, "BufCheckAsync : found BM_AIO_IN_PROGRESS without a BAiocb on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ }
+ /* check whether aio in progress */
+ if ( ( (struct BAiocbAnchor *)0 != BAiocbAnchr )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) /* has a valid BAiocb */
+ && ((FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext) < numBufferAiocbs) /* double-check */
+ ) { /* this is aio */
+ struct BufferAiocb volatile * BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext); /* BufferAiocb associated with this aio */
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) { /* ensure BAiocb is occupied */
+ aio_successful = 0; /* tentatively the aio_read did not succeed */
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+
+ if (smgr == NULL) {
+ if (caller_reln == NULL) {
+ smgr = smgropen(buf_desc->tag.rnode, InvalidBackendId);
+ } else {
+ smgr = caller_reln->rd_smgr;
+ }
+ }
+
+ /* assert that this AIO is not using the same BufferAiocb as the one caller asked us to use */
+ if ((index_for_aio < 0) && (index_for_aio == buf_desc->freeNext)) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block index %d to be used by %p already in use by %p"
+ ,index_for_aio, buf_desc, BAiocb->BAiocbbufh)));
+ }
+
+ /* Call smgrcompleteaio only if either we want buffer or there are no dependents.
+ ** In the other case of reject and there are dependents,
+ ** then one of them will do it.
+ */
+ if ( (local_intention > 0) || (0 == BAiocb->BAiocbDependentCount) ) {
+ if (local_intention > 0) {
+ /* wait for the in-progress aio and then pin.
+ ** If I did not issue the aio and do not already have a pin,
+ ** then pin now, before waiting, to ensure the buffer does not become unpinned while I wait.
+ ** We may potentially wait for the io to complete,
+ ** so release the buf header lock so that others may also wait here.
+ */
+ BAiocb->BAiocbDependentCount++; /* register self as dependent */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* if this buffer not pinned by me */
+ disposable_pin = true; /* this backend has pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ } else {
+ UnlockBufHdr(buf_desc);
+ }
+ LWLockRelease(PartitionLock);
+
+ smgrcompleteaio_rc = 1 /* tell smgrcompleteaio to wait */
+ + ( BAiocb->pidOfAio == this_backend_pid ); /* and whether I initiated the aio */
+ } else {
+ smgrcompleteaio_rc = 0; /* tell smgrcompleteaio to cancel */
+ }
+
+ smgrcompleteaio( smgr , (char *)&(BAiocb->BAiocbthis) , &smgrcompleteaio_rc );
+ if ( (smgrcompleteaio_rc == 0) || (smgrcompleteaio_rc == 1) ) {
+ aio_successful = 1;
+ }
+
+ /* statistics */
+ if (local_intention > 0) {
+ if (smgrcompleteaio_rc == 0) {
+ /* completed successfully and did not have to wait */
+ pgBufferUsage.aio_read_ontime++;
+ } else if (smgrcompleteaio_rc == 1) {
+ /* completed successfully and did have to wait */
+ pgBufferUsage.aio_read_waited++;
+ } else {
+ /* bad news - read failed and so buffer not usable
+ ** the buffer is still pinned so unpin and proceed with "not found" case
+ */
+ pgBufferUsage.aio_read_failed++;
+ }
+
+ /* regain locks and handle the validity of the buffer and intention regarding it */
+ LWLockAcquire(PartitionLock, LW_SHARED);
+ LockBufHdr(buf_desc);
+ BAiocb->BAiocbDependentCount--; /* unregister self as dependent */
+ } else {
+ pgBufferUsage.aio_read_wasted++; /* regardless of whether aio_successful */
+ }
+
+
+ if (local_intention > 0) {
+ /* verify the buffer is still ours and has same identity
+ ** There is one slightly tricky point here -
+ ** if there are other dependents, then each of them will perform this same check.
+ ** This is unavoidable, as the correct setting of retcode and the BM_VALID flag
+ ** is required by each dependent, so we may not leave it to the last one to do it.
+ ** It should not do any harm, and it is easier to let them all do it than to try to avoid it.
+ */
+ if ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) { /* it is still mine */
+
+ if (aio_successful) {
+ /* validate page header. If valid, then mark the buffer as valid */
+ if (PageIsVerified((Page)(BufHdrGetBlock(buf_desc)) , ((BAiocb->BAiocbthis).aio_offset/BLCKSZ))) {
+ buf_desc->flags |= BM_VALID;
+ if (BUFFERTAGS_EQUAL(origTag , buf_desc->tag)) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else {
+ retcode = BUF_INTENT_RC_CHANGED_TAG;
+ }
+ } else {
+ retcode = BUF_INTENT_RC_BADPAGE;
+ }
+ }
+ }
+ }
+
+ BAiocbDependentCount_after_aio_finished = BAiocb->BAiocbDependentCount;
+
+ /* if no dependents, then disconnect the BAiocb and update buffer header */
+ if (BAiocbDependentCount_after_aio_finished == 0 ) {
+
+
+ /* return the BufferAiocb to the free list */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+
+ }
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ /* note whether buffer is valid before unlocking spinlock */
+ valid = ((buf_desc->flags & BM_VALID) != 0);
+
+ /* if there was a disposable pin on entry to this function (i.e. marked in buffer flags)
+ ** then unmark it - refer to prologue comments talking about :
+ ** if a disposable pin is held, then :
+ ** ...
+ ** i.e. in either case, there is no longer a disposable pin after this function has completed.
+ */
+ if (pin_already_banked_by_me) {
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* if AIO not in progress, then disconnect the buffer from BAiocb and/or banked pin */
+ if (!(buf_desc->flags & BM_AIO_IN_PROGRESS)) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ }
+ /********** for debugging *****************
+ else elog(LOG, "BufCheckAsync : found BM_AIO_IN_PROGRESS when redeeming banked pin on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ ********** for debugging *****************/
+ }
+
+ /* If we are to obtain new pin, then use pin optimization - pin and unlock.
+ ** However, if the caller is the same backend who issued the aio_read,
+ ** then he ought to have obtained the pin at that time and must not acquire
+ ** a "second" one since this is logically the same read - he would have obtained
+ ** a single pin if using synchronous read and we emulate that behaviour.
+ ** It's important to understand that the caller is not aware that he already obtained a pin -
+ ** because calling PrefetchBuffer did not imply a pin -
+ ** so we must track that via the pidOfAio field in the BAiocb.
+ ** And to add one further complication :
+ ** we assume that although PrefetchBuffer pinned the buffer,
+ ** it did not increment the usage count.
+ ** (because it called PinBuffer_Locked which does not do that)
+ ** so in this case, we must increment the usage count without double-pinning.
+ ** Yes, it's ugly - and there's a goto!
+ */
+ if ( (local_intention > 0)
+ || (local_intention == BUF_INTENTION_REJECT_OBTAIN_PIN)
+ ) {
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* here we really want a version of PinBuffer_Locked which updates usage count ... */
+ if ( (PrivateRefCount[buf_desc->buf_id] == 0) /* if this buffer not previously pinned by me */
+ || pin_already_banked_by_me /* or I had a disposable pin on entry */
+ ) {
+ if (strategy == NULL)
+ {
+ if (buf_desc->usage_count < BM_MAX_USAGE_COUNT)
+ buf_desc->usage_count++;
+ }
+ else
+ {
+ if (buf_desc->usage_count == 0)
+ buf_desc->usage_count = 1;
+ }
+ }
+
+ /* now pin buffer unless we have a disposable */
+ if (!disposable_pin) { /* this backend neither banked pin for aio nor pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ goto unlocked_it;
+ }
+ else
+ /* if this task previously issued the aio or pinned the buffer while waiting for aio_read to complete
+ ** and aio was unsuccessful, then release the pin
+ */
+ if ( disposable_pin
+ && (aio_successful == 0) /* aio_read failed ? */
+ ) {
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+
+ unlock_buf_header:
+ UnlockBufHdr(buf_desc);
+ unlocked_it:
+#endif /* USE_PREFETCH */
+
+ /* now do any requested pin (if not done immediately above) or unpin/forget */
+ if (local_intention == BUF_INTENTION_REJECT_KEEP_PIN) {
+ /* the caller is supposed to hold a pin already so there should be nothing to do ... */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) {
+ elog(LOG, "request to keep pin on unpinned buffer %d",buf_desc->buf_id);
+
+ valid = PinBuffer(buf_desc, strategy);
+ }
+ }
+ else
+ if ( ( (local_intention == BUF_INTENTION_REJECT_FORGET)
+ || (local_intention == BUF_INTENTION_REJECT_NOADJUST)
+ )
+ && (PrivateRefCount[buf_desc->buf_id] > 0) /* if this buffer was previously pinned by me ... */
+ ) {
+
+ if (local_intention == BUF_INTENTION_REJECT_FORGET) {
+ UnpinBuffer(buf_desc, true); /* ... then release the pin */
+ } else
+ if (local_intention == BUF_INTENTION_REJECT_NOADJUST) {
+ /* following code moved from ReleaseBuffer :
+ ** not sure why we can't simply UnpinBuffer(buf_desc, true)
+ ** but better leave it the way it was
+ */
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf_desc));
+ if (PrivateRefCount[buf_desc->buf_id] > 1) {
+ PrivateRefCount[buf_desc->buf_id]--;
+ } else {
+ UnpinBuffer(buf_desc, false);
+ }
+ }
+ }
+
+ /* if retcode has not been set to one of the unusual conditions,
+ ** namely failed page header validation or a changed tag,
+ ** then the value of "valid" takes precedence
+ ** over whatever retcode is currently set to.
+ */
+ if ( ( (retcode == BUF_INTENT_RC_INVALID_NO_AIO) || (retcode == BUF_INTENT_RC_INVALID_AIO) ) && valid) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else
+ if ((retcode == BUF_INTENT_RC_VALID) && (!valid)) {
+ if (aio_successful == -1) { /* aio not attempted */
+ retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ } else {
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+ }
+ }
+
+ return retcode;
+}
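
For orientation, the call sites visible in this patch pass BUF_INTENTION_REJECT_NOADJUST
(ReleaseBuffer) and BUF_INTENTION_REJECT_UNBANK (DiscardBuffer). A positive-intention caller
might look roughly like the sketch below; the constant name BUF_INTENTION_WANT and the helper
name are assumptions for illustration, and the actual positive-intention call sites are not
visible in this part of the patch:

/* Sketch only: claim a buffer found via BufTableLookup which may have a
 * prefetch aio in flight.  The caller holds partitionLock in shared mode;
 * BufCheckAsync may release and re-acquire it while waiting for the aio.
 */
static bool
claim_prefetched_buffer(Relation reln, volatile BufferDesc *buf_desc,
                        BufferAccessStrategy strategy, LWLockId partitionLock)
{
    int     rc;

    rc = BufCheckAsync(NULL, reln, buf_desc, BUF_INTENTION_WANT /* +ve: want */,
                       strategy, 0, false, partitionLock);

    /* BUF_INTENT_RC_VALID means the page is usable as-is; any other code
     * (invalid, bad page, or changed tag) means a synchronous read is needed */
    return (rc == BUF_INTENT_RC_VALID);
}
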
--- src/backend/storage/buffer/buf_init.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/buffer/buf_init.c 2014-06-08 21:59:36.776096588 -0400
@@ -13,15 +13,89 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
-
+#include <stdlib.h> /* for getenv() */
+#include <errno.h> /* for errno, used to detect strtoul() errors */
BufferDesc *BufferDescriptors;
char *BufferBlocks;
-int32 *PrivateRefCount;
+int32 *PrivateRefCount; /* array of counts per buffer of how many times this task has pinned this buffer */
+
+volatile struct BAiocbAnchor *BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+
+int CountInuseBAiocbs(void); /* keep compiler happy */
+void ReportFreeBAiocbs(void); /* keep compiler happy */
+
+extern int MaxConnections; /* max number of client connections which postmaster will allow */
+int numBufferAiocbs = 0; /* total number of BufferAiocbs in pool (0 <=> no async io) */
+int hwmBufferAiocbs = 0; /* high water mark of in-use BufferAiocbs in pool
+ ** (not required to be accurate, kindly maintained for us somehow by postmaster)
+ */
+#ifdef USE_PREFETCH
+unsigned int prefetch_dbOid = 0; /* database oid of relations on which prefetching to be done - 0 means all */
+unsigned int prefetch_bitmap_scans = 1; /* boolean whether to prefetch bitmap heap scans */
+unsigned int prefetch_heap_scans = 0; /* boolean whether to prefetch non-bitmap heap scans */
+unsigned int prefetch_sequential_index_scans = 0; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+unsigned int prefetch_index_scans = 256; /* boolean whether to prefetch non-bitmap index scans; also the numeric size of pfch_list */
+unsigned int prefetch_btree_heaps = 1; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+#endif /* USE_PREFETCH */
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int maxGetBAiocbTries = 1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = 1; /* max times we will try to release a BufferAiocb back to freelist */
+
+/* locking protocol for manipulating the BufferAiocbs and FreeBAiocbs list :
+** 1. ownership of a BufferAiocb :
+** to gain ownership of a BufferAiocb, a task must
+** EITHER remove it from FreeBAiocbs (it is now temporary owner and no other task can find it)
+** if decision is to attach it to a buffer descriptor header, then
+** . lock the buffer descriptor header
+** . check NOT flags & BM_AIO_IN_PROGRESS
+** . attach to buffer descriptor header
+** . increment the BufferAiocb.dependent_count
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to unlock
+** OR locate it by dereferencing the pointer in a buffer descriptor,
+** in which case :
+** . lock the buffer descriptor header
+** . check flags & BM_AIO_IN_PROGRESS
+** . increment the BufferAiocb.dependent_count
+** . if decision is to return to FreeBAiocbs,
+** then (with buffer descriptor header still locked)
+** . turn off BM_AIO_IN_PROGRESS
+** . IF the BufferAiocb.dependent_count == 1 (I am sole dependent)
+** . THEN
+** . . decrement the BufferAiocb.dependent_count
+** . return to FreeBAiocbs (see below)
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to either return to FreeBAiocbs or unlock
+** 2. adding and removing from FreeBAiocbs :
+** two alternative methods - controlled by conditional macro definition LOCK_BAIOCB_FOR_GET_REL
+** 2.1 LOCK_BAIOCB_FOR_GET_REL is defined - use a lock
+** . lock BufFreelistLock exclusive
+** . add / remove from FreeBAiocbs
+** . unlock BufFreelistLock exclusive
+** advantage of this method - never fails to add or remove
+** 2.2 LOCK_BAIOCB_FOR_GET_REL is not defined - use compare_and_swap
+** . retrieve the current Freelist pointer and validate
+** . compare_and_swap on/off the FreeBAiocb list
+** . if the compare_and_swap fails, retry (bounded by maxGetBAiocbTries / maxRelBAiocbTries)
+** advantage of this method - never waits
+** to avoid losing a free BufferAiocb, save it in a process-local cache and reuse it
+** (a sketch of this lock-free path follows this declaration block)
+*/
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ struct BAiocbAnchor dummy_BAiocbAnchr = { (struct BufferAiocb*)0 , (struct BufferAiocb*)0 };
+int maxGetBAiocbTries = -1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = -1; /* max times we will try to release a BufferAiocb back to freelist */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
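For illustration, a minimal sketch of the lock-free path described in 2.2 above - popping one BufferAiocb off the shared free list - assuming the GCC __sync_bool_compare_and_swap builtin that USE_AIO_ATOMIC_BUILTIN_COMP_SWAP implies; names follow the declarations above, and retry/ABA handling is simplified:

/* Sketch only: bounded-retry pop from the shared free list using
 * compare-and-swap.  Returns NULL if the list is empty or the retry
 * budget is exhausted; the caller then falls back to a synchronous read.
 */
static struct BufferAiocb *
BAiocbFreelistPop(void)
{
    struct BufferAiocb *head;
    int                 tries = maxGetBAiocbTries;

    while (tries-- > 0)
    {
        head = (struct BufferAiocb *) BAiocbAnchr->FreeBAiocbs;
        if (head == NULL)
            break;                               /* free list is empty */
        if (__sync_bool_compare_and_swap(&BAiocbAnchr->FreeBAiocbs,
                                         head, head->BAiocbnext))
            return head;                         /* we now own this BufferAiocb */
    }
    return NULL;                                 /* give up; caller falls back */
}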
/*
* Data Structures:
@@ -73,7 +147,16 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ , foundAiocbs
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ ;
+#if defined(USE_PREFETCH) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+ char *envvarpointer = (char *)0; /* might point to an environment variable string */
+ char *charptr;
+#endif /* USE_PREFETCH || USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
BufferDescriptors = (BufferDesc *)
ShmemInitStruct("Buffer Descriptors",
@@ -83,6 +166,134 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+ if (max_async_io_prefetchers < 0) { /* negative value indicates to initialize to something sensible during buf_init */
+ max_async_io_prefetchers = MaxConnections/6; /* default allows for average of MaxConnections/6 concurrent prefetchers - reasonable ??? */
+ }
+
+ if ((target_prefetch_pages > 0) && (max_async_io_prefetchers > 0)) {
+ int ix;
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb * volatile FreeBAiocbs;
+
+ numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers); /* target_prefetch_pages per prefetcher */
+ BAiocbAnchr = (struct BAiocbAnchor *)
+ ShmemInitStruct("Buffer Aiocbs",
+ sizeof(struct BAiocbAnchor) + (numBufferAiocbs * sizeof(struct BufferAiocb)), &foundAiocbs);
+ if (BAiocbAnchr) {
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs = (struct BufferAiocb*)(((char *)BAiocbAnchr) + sizeof(struct BAiocbAnchor));
+ FreeBAiocbs = (struct BufferAiocb*)0;
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbnext = FreeBAiocbs; /* init the free list, last one -> 0 */
+ (BufferAiocbs+ix)->BAiocbbufh = (struct sbufdesc*)0;
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0;
+ (BufferAiocbs+ix)->pidOfAio = 0;
+ FreeBAiocbs = (BufferAiocbs+ix);
+
+ }
+ BAiocbAnchr->FreeBAiocbs = FreeBAiocbs;
+ envvarpointer = getenv("PG_MAX_GET_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxGetBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ envvarpointer = getenv("PG_MAX_REL_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxRelBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ }
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ BAiocbAnchr = &dummy_BAiocbAnchr;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
+#ifdef USE_PREFETCH
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BITMAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_ISCAN");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_index_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_index_scans = 1;
+ } else
+ if ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) ) {
+ prefetch_index_scans = strtol(envvarpointer, &charptr, 10);
+ if (charptr && (',' == *charptr)) { /* optional sequential prefetch in index scans */
+ charptr++; /* following the comma ... */
+ if ( ('Y' == *charptr) || ('y' == *charptr) || ('1' == *charptr) ) {
+ prefetch_sequential_index_scans = 1;
+ }
+ }
+ }
+ /* if prefetching for ISCAN, then we require the size of pfch_list to be at least target_prefetch_pages */
+ if ( (prefetch_index_scans > 0)
+ && (prefetch_index_scans < target_prefetch_pages)
+ ) {
+ prefetch_index_scans = target_prefetch_pages;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BTREE");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_btree_heaps = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_btree_heaps = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_HEAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_heap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_heap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_PREFETCH_DBOID");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ errno = 0; /* required in order to distinguish error from 0 */
+ prefetch_dbOid = (unsigned int)strtoul((const char *)envvarpointer, 0, 10);
+ if (errno) {
+ prefetch_dbOid = 0;
+ }
+ }
+ elog(LOG, "prefetching initialised with target_prefetch_pages= %d "
+ ", max_async_io_prefetchers= %d implying aio concurrency= %d "
+ ", prefetching_for_bitmap= %s "
+ ", prefetching_for_heap= %s "
+ ", prefetching_for_iscan= %d with sequential_index_page_prefetching= %s "
+ ", prefetching_for_btree= %s"
+ ,target_prefetch_pages ,max_async_io_prefetchers ,numBufferAiocbs
+ ,(prefetch_bitmap_scans ? "Y" : "N")
+ ,(prefetch_heap_scans ? "Y" : "N")
+ ,prefetch_index_scans
+ ,(prefetch_sequential_index_scans ? "Y" : "N")
+ ,(prefetch_btree_heaps ? "Y" : "N")
+ );
+#endif /* USE_PREFETCH */
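The Y/N environment switches parsed above all follow the same pattern; purely as a sketch (the helper name is hypothetical, not part of the patch), they could be routed through one small function:

/* Sketch: parse a yes/no environment switch, falling back to a default.
 * Uses the same leading-character convention as the code above.
 */
static unsigned int
env_yes_no(const char *name, unsigned int default_value)
{
    const char *val = getenv(name);

    if (val != NULL)
    {
        if (*val == 'Y' || *val == 'y')
            return 1;
        if (*val == 'N' || *val == 'n')
            return 0;
    }
    return default_value;
}

so that, for example, prefetch_btree_heaps = env_yes_no("PG_TRY_PREFETCHING_FOR_BTREE", 1);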
+
+
if (foundDescs || foundBufs)
{
/* both should be present or neither */
@@ -176,3 +387,82 @@ BufferShmemSize(void)
return size;
}
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/* imprecise count of number of in-use BAiocbs at any time
+ * we scan the array read-only without latching so are subject to unstable result
+ * (but since the array is in well-known contiguous storage,
+ * we are not subject to segmentation violation)
+ * This function may be called at any time and just does its best;
+ * it returns whatever count it arrived at.
+ */
+int
+CountInuseBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ int count = 0;
+ int ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->BufferAiocbs; /* start of list */
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == (BAiocb+ix)->BAiocbnext) { /* not on freelist ? */
+ count++;
+ }
+ }
+ }
+ return count;
+}
+
+/*
+ * report how many free BAiocbs at shutdown
+ * DO NOT call this while backends are actively working!!
+ * this report is useful when compare_and_swap method used (see above)
+ * as it can be used to deduce how many BAiocbs were in process-local caches -
+ * (original_number_on_freelist_at_startup - this_reported_number_at_shutdown)
+ */
+void
+ReportFreeBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ volatile struct BufferAiocb *BufferAiocbs;
+ int count = 0;
+ int fx , ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->FreeBAiocbs; /* start of free list */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0; /* clear the marker used below to detect items already seen on the freelist */
+ }
+ for (fx = (numBufferAiocbs-1); ( (fx>=0) && ( BAiocb != (struct BufferAiocb*)0 ) ); fx--) {
+
+ /* check if it is a valid BufferAiocb */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((BufferAiocbs+ix) == BAiocb) { /* is it this one ? */
+ break;
+ }
+ }
+ if (ix >= 0) {
+ if (BAiocb->BAiocbDependentCount) { /* seen it already ? */
+ elog(LOG, "ReportFreeBAiocbs closed cycle on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ BAiocb->BAiocbDependentCount = 1; /* mark as already seen on the freelist */
+ count++;
+ BAiocb = BAiocb->BAiocbnext;
+ } else {
+ elog(LOG, "ReportFreeBAiocbs invalid item on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ }
+ }
+ elog(LOG, "ReportFreeBAiocbs AIO control block list : poolsize= %d in-use-hwm= %d final-free= %d" ,numBufferAiocbs , hwmBufferAiocbs , count);
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
--- src/backend/storage/smgr/md.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/smgr/md.c 2014-06-08 21:59:36.804096637 -0400
@@ -647,6 +647,62 @@ mdprefetch(SMgrRelation reln, ForkNumber
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * mdinitaio() -- initialize the aio subsystem's max number of threads and max number of requests
+ */
+void
+mdinitaio(int max_aio_threads, int max_aio_num)
+{
+ FileInitaio( max_aio_threads, max_aio_num );
+}
+
+/*
+ * mdstartaio() -- start aio read of the specified block of a relation
+ */
+void
+mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+#ifdef USE_PREFETCH
+ off_t seekpos;
+ MdfdVec *v;
+ int local_retcode;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+
+ seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ local_retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ , aiocbp);
+ if (retcode) {
+ *retcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+
+
+/*
+ * mdcompleteaio() -- complete aio read of the specified block of a relation
+ * on entry, *inoutcode should indicate :
+ * . non-0 <=> check if complete and wait if not
+ * . 0 <=> cancel io immediately
+ */
+void
+mdcompleteaio( char *aiocbp , int *inoutcode )
+{
+#ifdef USE_PREFETCH
+ int local_retcode;
+
+ local_retcode = FileCompleteaio(aiocbp, (inoutcode ? *inoutcode : 0));
+ if (inoutcode) {
+ *inoutcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
/*
* mdread() -- Read the specified block from a relation.
*/
--- src/backend/storage/smgr/smgr.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/smgr/smgr.c 2014-06-08 21:59:36.828096679 -0400
@@ -49,6 +49,12 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ void (*smgr_initaio) (int max_aio_threads, int max_aio_num);
+ void (*smgr_startaio) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode );
+ void (*smgr_completeaio) ( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
@@ -66,7 +72,11 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+ mdprefetch
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ,mdinitaio, mdstartaio, mdcompleteaio
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ , mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
mdpreckpt, mdsync, mdpostckpt
}
};
@@ -612,6 +622,35 @@ smgrprefetch(SMgrRelation reln, ForkNumb
(*(smgrsw[reln->smgr_which].smgr_prefetch)) (reln, forknum, blocknum);
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * smgrinitaio() -- initialize the aio subsystem's max number of threads and max number of requests
+ */
+void
+smgrinitaio(int max_aio_threads, int max_aio_num)
+{
+ (*(smgrsw[0].smgr_initaio)) ( max_aio_threads, max_aio_num );
+}
+
+/*
+ * smgrstartaio() -- Initiate aio read of the specified block of a relation.
+ */
+void
+smgrstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_startaio)) (reln, forknum, blocknum , aiocbp , retcode );
+}
+
+/*
+ * smgrcompleteaio() -- Complete aio read of the specified block of a relation.
+ */
+void
+smgrcompleteaio(SMgrRelation reln, char *aiocbp , int *inoutcode )
+{
+ (*(smgrsw[reln->smgr_which].smgr_completeaio)) ( aiocbp , inoutcode );
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* smgrread() -- read a particular block from a relation into the supplied
* buffer.
--- src/backend/storage/file/fd.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/file/fd.c 2014-06-08 21:59:36.856096727 -0400
@@ -77,6 +77,9 @@
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* We must leave some file descriptors free for system(), the dynamic loader,
@@ -1239,6 +1242,10 @@ FileClose(File file)
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
* to read into.
+ * Also note that a new, different implementation of asynchronous prefetch
+ * using librt, not libaio, is provided by the two functions following this one,
+ * FileStartaio and FileCompleteaio. These also require a buffer to be provided
+ * to read into, which the new async_io support supplies.
*/
int
FilePrefetch(File file, off_t offset, int amount)
@@ -1266,6 +1273,145 @@ FilePrefetch(File file, off_t offset, in
#endif
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * FileInitaio - initialize the aio subsystem's max number of threads and max number of requests
+ * input parms
+ * max_aio_threads; maximum number of threads
+ * max_aio_num; maximum number of concurrent aio read requests
+ *
+ * on linux, the man page for the librt implementation of aio_init() says :
+ * This function is a GNU extension.
+ * If your posix aio does not have it, then add the following line to
+ * src/include/pg_config_manual.h
+ * #define DONT_HAVE_AIO_INIT
+ * to render it as a no-op
+ */
+void
+FileInitaio(int max_aio_threads, int max_aio_num )
+{
+#ifndef DONT_HAVE_AIO_INIT
+ struct aioinit aioinit_struct; /* structure to pass to aio_init */
+
+ aioinit_struct.aio_threads = max_aio_threads; /* maximum number of threads */
+ aioinit_struct.aio_num = max_aio_num; /* maximum number of concurrent aio read requests */
+ aioinit_struct.aio_idle_time = 1; /* we don't want to alter this, but aio_init does not ignore it, so set it to the default */
+ aio_init(&aioinit_struct);
+#endif /* ndef DONT_HAVE_AIO_INIT */
+ return;
+}
+
+/*
+ * FileStartaio - initiate asynchronous read of a given range of the file.
+ * The logical seek position is unaffected.
+ *
+ * use standard posix aio (librt)
+ * ASSUME BufferAiocb.aio_buf has already been set to point at the buffer by the caller
+ * return 0 if successfully started, else non-zero
+ */
+int
+FileStartaio(File file, off_t offset, int amount , char *aiocbp )
+{
+ int returnCode;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartaio: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode >= 0) {
+
+ my_aiocbp->aio_fildes = VfdCache[file].fd;
+ my_aiocbp->aio_lio_opcode = LIO_READ;
+ my_aiocbp->aio_nbytes = amount;
+ my_aiocbp->aio_offset = offset;
+ returnCode = aio_read(my_aiocbp);
+ }
+
+ return returnCode;
+}
+
+/*
+ * FileCompleteaio - complete asynchronous aio read
+ * normal_wait indicates whether to cancel or wait -
+ * 0 <=> cancel
+ * 1 <=> wait by polling the aiocb
+ * 2 <=> wait by suspending on the aiocb
+ *
+ * use standard posix aio (librt)
+ * return 0 if successful and did not have to wait,
+ * 1 if successful and had to wait,
+ * else 0xff
+ */
+int
+FileCompleteaio( char *aiocbp , int normal_wait )
+{
+ int returnCode;
+ int aio_errno;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+ const struct aiocb *cblist[1];
+ int fd;
+ struct timespec my_timeout = { 0 , 10000 };
+ struct timespec *suspend_timeout_P; /* the timeout actually used depending on normal_wait */
+ int max_polls;
+
+ fd = my_aiocbp->aio_fildes;
+ cblist[0] = my_aiocbp;
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* note that aio_error returns 0 if op already completed successfully */
+
+ /* first handle normal case of waiting for op to complete */
+ if (normal_wait) {
+ /* if told not to poll, then specify no timeout */
+ suspend_timeout_P = (normal_wait == 1 ? &my_timeout : (struct timespec *)0);
+ while (aio_errno == EINPROGRESS) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ while ( (returnCode < 0) && (max_polls-- > 0)
+ && ((EAGAIN == errno) || (EINTR == errno))
+ ) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* now return_code is from aio_error */
+ if (returnCode == 0) {
+ returnCode = 1; /* successful but had to wait */
+ }
+ }
+ if (aio_errno) {
+ elog(LOG, "FileCompleteaio: %d %d", fd, returnCode);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ } else {
+ if (aio_errno == EINPROGRESS) {
+ do {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ while ((returnCode == AIO_NOTCANCELED) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ } while (aio_errno == EINPROGRESS);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ if (returnCode != 0)
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+ DO_DB(elog(LOG, "FileCompleteaio: %d %d",
+ fd, returnCode));
+
+ return returnCode;
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
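To make the calling convention of the two functions above concrete, here is a sketch of how a caller might drive them (the helper name and the fallback are illustrative only; as documented above, the aiocb's aio_buf must already point at the destination buffer):

/* Sketch: start an asynchronous read and later wait for it by polling.
 * Returns true if the block ended up in the caller's buffer.
 */
static bool
example_async_read(File file, off_t offset, char *dest)
{
    struct aiocb cb;

    memset(&cb, 0, sizeof(cb));
    cb.aio_buf = dest;                            /* caller supplies the buffer */

    if (FileStartaio(file, offset, BLCKSZ, (char *) &cb) != 0)
        return false;                             /* could not start the async read */

    /* ... other useful work could happen here ... */

    /* normal_wait = 1: wait for completion by polling the aiocb */
    return FileCompleteaio((char *) &cb, 1) != 0xff;
}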
+
int
FileRead(File file, char *buffer, int amount)
{
--- src/backend/storage/lmgr/proc.c.orig 2014-06-08 11:26:30.000000000 -0400
+++ src/backend/storage/lmgr/proc.c 2014-06-08 21:59:36.888096782 -0400
@@ -52,6 +52,7 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
+extern pid_t this_backend_pid; /* pid of this backend */
/* GUC variables */
int DeadlockTimeout = 1000;
@@ -361,6 +362,7 @@ InitProcess(void)
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
+ this_backend_pid = getpid(); /* pid of this backend */
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
--- src/backend/access/heap/heapam.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/heap/heapam.c 2014-06-08 21:59:36.932096859 -0400
@@ -71,6 +71,28 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "executor/instrument.h"
+
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_heap_scans; /* boolean whether to prefetch non-bitmap heap scans */
+
+/* special values for scan->rs_prefetch_target indicating as follows : */
+#define PREFETCH_MAYBE 0xffffffff /* prefetch permitted but not yet in effect */
+#define PREFETCH_DISABLED 0xfffffffe /* prefetch disabled and not permitted */
+/* PREFETCH_WRAP_POINT indicates a prefetcher that has reached the point where the scan would wrap -
+** at this point the prefetcher runs on the spot until the scan catches up.
+** This *must* be < maximum valid setting of target_prefetch_pages aka effective_io_concurrency.
+*/
+#define PREFETCH_WRAP_POINT 0x0fffffff
+
+#endif /* USE_PREFETCH */
+
/* GUC variable */
bool synchronize_seqscans = true;
@@ -115,6 +137,8 @@ static XLogRecPtr log_heap_new_cid(Relat
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
bool *copy);
+static void heap_unread_add(HeapScanDesc scan, BlockNumber blockno);
+static void heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno);
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -292,9 +316,149 @@ initscan(HeapScanDesc scan, ScanKey key,
* Currently, we don't have a stats counter for bitmap heap scans (but the
* underlying bitmap index scans will be counted).
*/
- if (!scan->rs_bitmapscan)
+#ifdef USE_PREFETCH
+ /* by default, no prefetching on any scan */
+ scan->rs_prefetch_target = PREFETCH_DISABLED; /* tentatively disable */
+ scan->rs_pfchblock = 0; /* scanner will reset this to be ahead of scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)0; /* list of prefetched but unread blocknos */
+ scan->rs_Unread_Pfetched_next = 0; /* next unread blockno */
+ scan->rs_Unread_Pfetched_count = 0; /* number of valid unread blocknos */
+#endif /* USE_PREFETCH */
+ if (!scan->rs_bitmapscan) {
+
pgstat_count_heap_scan(scan->rs_rd);
+#ifdef USE_PREFETCH
+ /* bitmap scans do their own prefetching -
+ ** for others, set up prefetching now
+ */
+ if ( prefetch_heap_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(scan->rs_rd))
+ ) {
+ /* prefetch_dbOid may be set to a database Oid to specify only prefetch in that db */
+ if ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ ) {
+ scan->rs_prefetch_target = PREFETCH_MAYBE; /* permitted but let the scan decide */
+ }
+ else {
+ }
+ }
+#endif /* USE_PREFETCH */
+ }
+}
+
+/* add this blockno to list of prefetched and unread blocknos
+** use the slot identified by the ((next+count) modulo circumference) index if it is unused,
+** else search for the first available slot if there is one,
+** else error.
+*/
+static void
+heap_unread_add(HeapScanDesc scan, BlockNumber blockno)
+{
+ BlockNumber *available_P; /* where to store new blockno */
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next
+ + scan->rs_Unread_Pfetched_count; /* index of next unused slot */
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if (blockno != InvalidBlockNumber) {
+
+ /* ensure there is some room somewhere */
+ if (scan->rs_Unread_Pfetched_count < target_prefetch_pages) {
+
+ /* try the "next+count" one */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages; /* modulo circumference */
+ }
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ goto store_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ /* before storing this blockno,
+ ** since the next pointer did not locate an unused slot,
+ ** set it to one which is more likely to be so for the next time
+ */
+ scan->rs_Unread_Pfetched_next = Unread_Pfetched_index;
+ goto store_blockno;
+ }
+ }
+ }
+ }
+
+ /* if we reach here, either there was no available slot
+ ** or we thought there was one and didn't find any --
+ */
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("heap_unread_add overflowed list cannot add blockno %d", blockno)));
+
+ return;
+
+ store_blockno:
+ *available_P = blockno;
+ scan->rs_Unread_Pfetched_count++; /* update count */
+
+ }
+
+ return;
+}
+
+/* remove specified blockno from list of prefetched and unread blocknos.
+** Usually this will be found at the rs_Unread_Pfetched_next item -
+** else search for it. If not found, ignore it - no error results.
+*/
+static void
+heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno)
+{
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next; /* index of next unread blockno */
+ BlockNumber *candidate_P; /* location of callers blockno - maybe */
+ BlockNumber nextUnreadPfetched;
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if ( (blockno != InvalidBlockNumber)
+ && ( scan->rs_Unread_Pfetched_count > 0 ) /* if the list is not empty */
+ ) {
+
+ /* take modulo of the circumference.
+ ** actually rs_Unread_Pfetched_next should never exceed the circumference but check anyway.
+ */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages;
}
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);
+ nextUnreadPfetched = *candidate_P;
+
+ if ( nextUnreadPfetched == blockno ) {
+ goto remove_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where the caller's blockno might be */
+ if (*candidate_P == blockno) { /* found it */
+ goto remove_blockno;
+ }
+ }
+ return; /* not found anywhere - ignore it, as described above */
+ }
+
+ remove_blockno:
+ *candidate_P = InvalidBlockNumber;
+
+ scan->rs_Unread_Pfetched_next = (Unread_Pfetched_index+1); /* update next pfchd unread */
+ if (scan->rs_Unread_Pfetched_next >= target_prefetch_pages) {
+ scan->rs_Unread_Pfetched_next = 0;
+ }
+ scan->rs_Unread_Pfetched_count--; /* update count */
+ }
+
+ return;
+}
+
/*
* heapgetpage - subroutine for heapgettup()
@@ -304,7 +468,7 @@ initscan(HeapScanDesc scan, ScanKey key,
* which tuples on the page are visible.
*/
static void
-heapgetpage(HeapScanDesc scan, BlockNumber page)
+heapgetpage(HeapScanDesc scan, BlockNumber page , BlockNumber prefetchHWM)
{
Buffer buffer;
Snapshot snapshot;
@@ -314,6 +478,10 @@ heapgetpage(HeapScanDesc scan, BlockNumb
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
+#ifdef USE_PREFETCH
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+#endif /* USE_PREFETCH */
+
Assert(page < scan->rs_nblocks);
@@ -336,6 +504,98 @@ heapgetpage(HeapScanDesc scan, BlockNumb
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
+#ifdef USE_PREFETCH
+
+ heap_unread_subtract(scan, page);
+
+ /* maybe prefetch some pages starting with rs_pfchblock */
+ if (scan->rs_prefetch_target >= 0) { /* prefetching enabled on this scan ? */
+ int next_block_to_be_read = (page+1); /* next block to be read = lowest possible prefetchable block */
+ int num_to_pfch_this_time; /* eventually holds the number of blocks to prefetch now */
+ int prefetchable_range; /* size of the area ahead of the current prefetch position */
+
+ /* check if prefetcher reached wrap point and the scan has now wrapped */
+ if ( (page == 0) && (scan->rs_prefetch_target == PREFETCH_WRAP_POINT) ) {
+ scan->rs_prefetch_target = 1;
+ scan->rs_pfchblock = next_block_to_be_read;
+ } else
+ if (scan->rs_pfchblock < next_block_to_be_read) {
+ scan->rs_pfchblock = next_block_to_be_read; /* next block to be prefetched must be ahead of one we just read */
+ }
+
+ /* now we know where we would start prefetching -
+ ** next question - if this is a sync scan, ensure we do not prefetch behind the HWM
+ ** debatable whether to require strict inequality or >= - >= works better in practice
+ */
+ if ( (!scan->rs_syncscan) || (scan->rs_pfchblock >= prefetchHWM) ) {
+
+ /* now we know where we will start prefetching -
+ ** next question - how many?
+ ** apply two limits :
+ ** 1. target prefetch distance
+ ** 2. number of available blocks ahead of us
+ */
+
+ /* 1. target prefetch distance */
+ num_to_pfch_this_time = next_block_to_be_read + scan->rs_prefetch_target; /* page beyond prefetch target */
+ num_to_pfch_this_time -= scan->rs_pfchblock; /* convert to offset */
+
+ /* first do prefetching up to our current limit ...
+ ** highest page number that a scan (pre)-fetches is scan->rs_nblocks-1
+ ** note - the prefetcher does not wrap a prefetch range -
+ ** instead it just stops and then starts again if and when the main scan wraps
+ */
+ if (scan->rs_pfchblock <= scan->rs_startblock) { /* if on second leg towards startblock */
+ prefetchable_range = ((int)(scan->rs_startblock) - (int)(scan->rs_pfchblock));
+ }
+ else { /* on first leg towards nblocks */
+ prefetchable_range = ((int)(scan->rs_nblocks) - (int)(scan->rs_pfchblock));
+ }
+ if (prefetchable_range > 0) { /* if there's a range to prefetch */
+
+ /* 2. number of available blocks ahead of us */
+ if (num_to_pfch_this_time > prefetchable_range) {
+ num_to_pfch_this_time = prefetchable_range;
+ }
+ while (num_to_pfch_this_time-- > 0) {
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_pfchblock, scan->rs_strategy);
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ if (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) {
+ heap_unread_add(scan, scan->rs_pfchblock);
+ }
+ scan->rs_pfchblock++;
+ /* if syncscan and requested block was already in buffer pool,
+ ** this suggests that another scanner is ahead of us and we should advance
+ */
+ if ( (scan->rs_syncscan) && (PrefetchBufferRc & PREFTCHRC_BLK_ALREADY_PRESENT) ) {
+ scan->rs_pfchblock++;
+ num_to_pfch_this_time--;
+ }
+ }
+ }
+ else {
+ /* we must not modify scan->rs_pfchblock here
+ ** because it is needed for possible DiscardBuffer at end of scan ...
+ ** ... instead ...
+ */
+ scan->rs_prefetch_target = PREFETCH_WRAP_POINT; /* mark this prefetcher as waiting to wrap */
+ }
+
+ /* ... then adjust prefetching limit : by doubling on each iteration */
+ if (scan->rs_prefetch_target == 0) {
+ scan->rs_prefetch_target = 1;
+ }
+ else {
+ scan->rs_prefetch_target *= 2;
+ if (scan->rs_prefetch_target > target_prefetch_pages) {
+ scan->rs_prefetch_target = target_prefetch_pages;
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
+
if (!scan->rs_pageatatime)
return;
@@ -452,6 +712,8 @@ heapgettup(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineoff, given scan direction
@@ -470,7 +732,25 @@ heapgettup(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (bitmap scans do their own prefetching)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineoff = FirstOffsetNumber; /* first offnum */
scan->rs_inited = true;
}
@@ -516,7 +796,7 @@ heapgettup(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -557,7 +837,7 @@ heapgettup(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -660,8 +940,10 @@ heapgettup(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -671,6 +953,22 @@ heapgettup(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -678,7 +976,7 @@ heapgettup(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
@@ -727,6 +1025,8 @@ heapgettup_pagemode(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+ int ix;
/*
* calculate next starting lineindex, given scan direction
@@ -745,7 +1045,25 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (bitmap scans do their own prefetching)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineindex = 0;
scan->rs_inited = true;
}
@@ -788,7 +1106,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -826,7 +1144,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -921,8 +1239,10 @@ heapgettup_pagemode(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+ prefetchHWM = scan->rs_pfchblock;
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -932,6 +1252,22 @@ heapgettup_pagemode(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -939,7 +1275,7 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
dp = (Page) BufferGetPage(scan->rs_cbuf);
lines = scan->rs_ntuples;
@@ -1394,6 +1730,23 @@ void
heap_rescan(HeapScanDesc scan,
ScanKey key)
{
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1418,6 +1771,23 @@ heap_endscan(HeapScanDesc scan)
{
/* Note: no locking manipulations needed */
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1435,6 +1805,10 @@ heap_endscan(HeapScanDesc scan)
if (scan->rs_strategy != NULL)
FreeAccessStrategy(scan->rs_strategy);
+ if (scan->rs_Unread_Pfetched_base) {
+ pfree(scan->rs_Unread_Pfetched_base);
+ }
+
if (scan->rs_temp_snap)
UnregisterSnapshot(scan->rs_snapshot);
@@ -1464,7 +1838,6 @@ heap_endscan(HeapScanDesc scan)
#define HEAPDEBUG_3
#endif /* !defined(HEAPDEBUGALL) */
-
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
@@ -6347,6 +6720,25 @@ heap_markpos(HeapScanDesc scan)
void
heap_restrpos(HeapScanDesc scan)
{
+
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* XXX no amrestrpos checking that ammarkpos called */
if (!ItemPointerIsValid(&scan->rs_mctid))
--- src/backend/access/heap/syncscan.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/heap/syncscan.c 2014-06-08 21:59:36.968096921 -0400
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locatio
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, Block
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumbe
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumbe
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNu
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
--- src/backend/access/index/indexam.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/index/indexam.c 2014-06-08 21:59:37.012096997 -0400
@@ -79,6 +79,55 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit);
+
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch non-bitmap index scans; also the numeric size of pfch_list */
+
+/* if specified block number is present in the prefetch array,
+** then either mark it as not to be discarded or evict it according to input param
+*/
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ /* no need to check for scan->pfch_next < prefetch_index_scans
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) == blocknumber) {
+ if (markit) {
+ /* mark it as not to be discarded */
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard &= ~PREFTCHRC_BUF_PIN_INCREASED;
+ } else {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)), (&(((scan->pfch_block_item_list)+pfchz)->pfch_blockid)));
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard = ((scan->pfch_block_item_list)+pfchz)->pfch_discard;
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx = pfchz; /* advance */
+ }
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -253,6 +302,11 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -277,6 +331,11 @@ index_beginscan_bitmap(Relation indexRel
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -311,6 +370,9 @@ index_beginscan_internal(Relation indexR
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -342,6 +404,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -373,10 +441,30 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (((scan->pfch_block_item_list)+pfchx)->pfch_discard) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)));
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -472,6 +560,12 @@ index_getnext_tid(IndexScanDesc scan, Sc
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -479,6 +573,11 @@ index_getnext_tid(IndexScanDesc scan, Sc
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -502,6 +601,10 @@ index_getnext_tid(IndexScanDesc scan, Sc
* enough information to do it efficiently in the general case.
* ----------------
*/
+#if defined(USE_PREFETCH) && defined(AVOID_CATALOG_MIGRATION_FOR_ASYNCIO)
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -509,16 +612,109 @@ index_fetch_heap(IndexScanDesc scan)
bool all_dead = false;
bool got_heap_tuple;
+
+
/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
if (!scan->xs_continue_hot)
{
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+
+ /* If the old block is different from the new block, then evict the old
+ ** block from the prefetched array. It is arguable that we should leave it
+ ** in the array because it is likely to remain in the buffer pool
+ ** for a while, but in that case, if we encounter the block
+ ** again, prefetching it again does no harm
+ ** (and note that, if it is not pinned, prefetching it will try to
+ ** pin it, since prefetch tries to bank a pin for a buffer already in the buffer pool),
+ ** so evicting it here should usually win.
+ */
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < prefetch_index_scans) /* ensure there is an entry */
+ ) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(prev_buf) , 0);
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+ /* If the new block had been prefetched and pinned,
+ ** then mark that it no longer needs to be discarded.
+ ** Of course, we don't evict the entry,
+ ** because we want to remember that it was recently prefetched.
+ */
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 1);
+#endif /* USE_PREFETCH */
+
+ scan->heap_tids_fetched++;
+
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found; /* did we find the "next" heap tid in current index page */
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+
+ if (scan->do_prefetch) {
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple; /* is incorrect but avoids adding function to catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ if (RegProcedureIsValid(scan->indexRelation->rd_am->ampeeknexttuple)) {
+ GET_SCAN_PROCEDURE(ampeeknexttuple); /* is correct but requires adding function to catalog */
+ } else {
+ procedure = 0;
+ }
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
+ if ( procedure /* does the index access method support peektuple? */
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ && procedure->fn_addr /* procedure->fn_addr is non-null only if in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ ) {
+ int iterations = 1; /* how many iterations of prefetching shall we try -
+ ** if used entries in prefetch list is < target_prefetch_pages
+ ** then 2, else 1
+ ** this should result in gradually and smoothly increasing up to target_prefetch_pages
+ */
+ /* note we trust InitIndexScan verified this scan is forwards only and so set that */
+ if (scan->pfch_used < target_prefetch_pages) {
+ iterations = 2;
+ }
+ do {
+ found = DatumGetBool(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ btpeeknexttuple(scan) /* pass scan as a direct parameter since we can't use fmgr because it is not in the catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ FunctionCall1(procedure, PointerGetDatum(scan)) /* use fmgr to call it because in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ );
+ if (found) {
+ /* btpeeknexttuple set pfch_next to point to the item in block_item_list to be prefetched */
+ PrefetchBufferRc = PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((&((scan->pfch_block_item_list + scan->pfch_next))->pfch_blockid)) , 0);
+ /* elog(LOG,"index_fetch_heap prefetched rel %u blockNum %u"
+ ,scan->heapRelation->rd_node.relNode ,BlockIdGetBlockNumber(scan->pfch_block_item_list + scan->pfch_next));
+ */
+
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ (scan->pfch_block_item_list + scan->pfch_next)->pfch_discard = (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED);
+
+
+ }
+ } while (--iterations > 0);
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
--- src/backend/access/index/genam.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/index/genam.c 2014-06-08 21:59:37.036097039 -0400
@@ -77,6 +77,12 @@ RelationGetIndexScan(Relation indexRelat
scan = (IndexScanDesc) palloc(sizeof(IndexScanDescData));
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
+
scan->heapRelation = NULL; /* may be set later */
scan->indexRelation = indexRelation;
scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */
@@ -139,6 +145,19 @@ RelationGetIndexScan(Relation indexRelat
void
IndexScanEnd(IndexScanDesc scan)
{
+#ifdef USE_PREFETCH
+ if (scan->do_prefetch) {
+ if ( (struct pfch_block_item*)0 != scan->pfch_block_item_list ) {
+ pfree(scan->pfch_block_item_list);
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+ }
+ if ( (struct pfch_index_pagelist*)0 != scan->pfch_index_page_list ) {
+ pfree(scan->pfch_index_page_list);
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
if (scan->keyData != NULL)
pfree(scan->keyData);
if (scan->orderByData != NULL)
--- src/backend/access/nbtree/nbtsearch.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/nbtree/nbtsearch.c 2014-06-08 21:59:37.064097087 -0400
@@ -23,13 +23,16 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+extern unsigned int prefetch_btree_heaps; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+extern unsigned int prefetch_sequential_index_scans; /* boolean whether to prefetch sequential-access non-bitmap index scans */
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
+static Buffer _bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -226,7 +229,7 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
continue;
}
@@ -1005,7 +1008,7 @@ _bt_first(IndexScanDesc scan, ScanDirect
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1040,6 +1043,8 @@ _bt_next(IndexScanDesc scan, ScanDirecti
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1052,11 +1057,53 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on next page, provided:
+ ** EITHER we were reading non-sequentially previously or are for this block
+ ** OR the user explicitly specified prefetching for a sequential pattern,
+ ** as it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
else
{
@@ -1065,11 +1112,53 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on next page, provided:
+ ** EITHER we were reading non-sequentially previously or are for this block
+ ** OR the user explicitly specified prefetching for a sequential pattern,
+ ** as it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
/* OK, itemIndex says what to return */
@@ -1119,9 +1208,11 @@ _bt_readpage(IndexScanDesc scan, ScanDir
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1156,6 +1247,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1187,6 +1279,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1224,7 +1317,7 @@ _bt_saveitem(BTScanOpaque so, int itemIn
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1278,7 +1371,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list);
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1287,9 +1380,20 @@ _bt_steppage(IndexScanDesc scan, ScanDir
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if ( prefetch && so->currPos.moreRight
+ /* start prefetch on next page, provided:
+ ** EITHER we're reading non-sequentially for this block
+ ** OR the user explicitly specified prefetching for a sequential pattern,
+ ** as it may be counterproductive otherwise
+ */
+ && (prefetch_sequential_index_scans || opaque->btpo_next != (blkno+1))
+ ) {
+ _bt_prefetchbuf(rel, opaque->btpo_next , &scan->pfch_index_page_list);
+ }
break;
}
+ }
/* nope, keep going */
blkno = opaque->btpo_next;
}
@@ -1317,7 +1421,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ so->currPos.buf = _bt_walk_left(scan , rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -1332,14 +1436,58 @@ _bt_steppage(IndexScanDesc scan, ScanDir
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and blindly increase the prefetch window
+ * downwards 2 blocks at a time. This only works in our favor
+ * for index-only scans, by merging read requests in the kernel,
+ * so we want to inflate target_prefetch_pages since merged
+ * back-sequential requests are about as expensive as a single one
+ */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev , &scan->pfch_index_page_list);
+ }
+ }
break;
}
}
}
+ }
return true;
}
@@ -1359,7 +1507,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
@@ -1387,7 +1535,7 @@ _bt_walk_left(Relation rel, Buffer buf)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ , scan->pfch_index_page_list );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1631,7 +1779,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDir
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
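
The heap-prefetch trigger added to _bt_next above fires only once a minimum sample has been seen and roughly 15/16 (~94%) of the TIDs seen so far actually needed a heap fetch; the test is done in integer arithmetic as (seen - seen/16) <= fetched. A standalone sketch of just that test (hypothetical names):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Decide whether heap prefetching looks worthwhile: require a minimum
     * sample size, then check that the fetch ratio is at least 15/16 (~94%),
     * using the same integer formulation as the patch.
     */
    static bool
    heap_prefetch_worthwhile(uint64_t tids_seen, uint64_t tids_fetched)
    {
        if (tids_seen <= 256)           /* too little evidence yet */
            return false;
        return (tids_seen - tids_seen / 16) <= tids_fetched;
    }
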
--- src/backend/access/nbtree/nbtinsert.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/nbtree/nbtinsert.c 2014-06-08 21:59:37.104097157 -0400
@@ -793,7 +793,7 @@ _bt_insertonpg(Relation rel,
{
Assert(!P_ISLEAF(lpageop));
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -972,7 +972,7 @@ _bt_split(Relation rel, Buffer buf, Buff
bool isleaf;
/* Acquire a new page to split into */
- rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1175,7 +1175,7 @@ _bt_split(Relation rel, Buffer buf, Buff
if (!P_RIGHTMOST(oopaque))
{
- sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
spage = BufferGetPage(sbuf);
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -1817,7 +1817,7 @@ _bt_finish_split(Relation rel, Buffer lb
Assert(P_INCOMPLETE_SPLIT(lpageop));
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE , (struct pfch_index_pagelist*)0);
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
@@ -1829,7 +1829,7 @@ _bt_finish_split(Relation rel, Buffer lb
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1877,7 +1877,7 @@ _bt_getstackbuf(Relation rel, BTStack st
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2008,12 +2008,12 @@ _bt_newroot(Relation rel, Buffer lbuf, B
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
--- src/backend/access/nbtree/nbtpage.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/nbtree/nbtpage.c 2014-06-08 21:59:37.136097213 -0400
@@ -127,7 +127,7 @@ _bt_getroot(Relation rel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ , (struct pfch_index_pagelist*)0);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -153,7 +153,7 @@ _bt_getroot(Relation rel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -209,7 +209,7 @@ _bt_getroot(Relation rel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE , (struct pfch_index_pagelist*)0);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -350,7 +350,7 @@ _bt_gettrueroot(Relation rel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -436,7 +436,7 @@ _bt_getrootheight(Relation rel)
Page metapg;
BTPageOpaque metaopaque;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -562,6 +562,170 @@ _bt_log_reuse_page(Relation rel, BlockNu
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ * and keep track of prefetched and unread blocknums in pagelist.
+ * Input parameters:
+ * rel and blkno identify the block to be prefetched as usual
+ * pfch_index_page_list_P points to the pointer anchoring the head of the index page list
+ * Since the pagelist is only an optimization,
+ * handle palloc failure by quietly skipping the bookkeeping.
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P)
+{
+
+ int rc = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_item* found_item = 0;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_plp = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_plp = *pfch_index_page_list_P;
+ }
+
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* prefetch an existing block of the relation
+ ** but first, check that it has not already been prefetched recently and is still unread
+ */
+ found_item = _bt_find_block(blkno , pfch_index_plp);
+ if ((struct pfch_index_item*)0 == found_item) { /* not found */
+
+ rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno , 0);
+
+ /* add the pagenum to the list, recording its discard status;
+ ** since it's only an optimization, ignore failures such as exceeding the allowed space
+ */
+ _bt_add_block( blkno , pfch_index_page_list_P , (uint32)(rc & PREFTCHRC_BUF_PIN_INCREASED));
+
+ }
+ }
+ return;
+}
+
+/* _bt_find_block finds the item referencing specified Block in index page list if present
+** and returns the pointer to the pfch_index_item if found, or null if not
+*/
+struct pfch_index_item*
+_bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+
+ struct pfch_index_item* found_item = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ int ix, tx;
+
+ pfch_index_plp = pfch_index_page_list;
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ ix = 0;
+ tx = pfch_index_plp->pfch_index_item_count;
+ while ( (ix < tx)
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ found_item = &pfch_index_plp->pfch_indexid[ix];
+ }
+ ix++;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+
+ return found_item;
+}
+
+/* _bt_add_block adds the specified Block to the index page list
+** and returns 0 if successful, non-zero if not
+*/
+int
+_bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status)
+{
+ int rc = 1;
+ int ix;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_pagelist* pfch_index_page_list_anchor; /* pointer to first chunk if any */
+ /* allow expansion of the pagelist to 16 chunks,
+ ** which accommodates backwards-sequential index scans
+ ** where the scanner increases target_prefetch_pages by a factor of up to 16
+ ** (see the code in _bt_steppage).
+ ** note - this creates an undesirable weak dependency on that number in _bt_steppage,
+ ** but there is no disaster if the numbers disagree - just sub-optimal use of the list.
+ ** implementing a proper interface would require chunks of variable size,
+ ** which would require an extra size variable in each chunk
+ */
+ int num_chunks = 16;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_page_list_anchor = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_page_list_anchor = *pfch_index_page_list_P;
+ }
+ pfch_index_plp = pfch_index_page_list_anchor; /* pointer to current chunk */
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ if (ix < target_prefetch_pages) {
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = (ix+1);
+ rc = 0;
+ goto stored_pagenum;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ num_chunks--; /* count down the remaining chunk allowance */
+ }
+
+ /* we did not find any free space in existing chunks -
+ ** create new chunk if within our limit and we have a pfch_index_page_list
+ */
+ if ( (num_chunks > 0) && ((struct pfch_index_pagelist*)0 != pfch_index_page_list_anchor) ) {
+ pfch_index_plp = (struct pfch_index_pagelist*)palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ if ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ pfch_index_plp->pfch_index_pagelist_next = pfch_index_page_list_anchor; /* old head of list is next after this */
+ pfch_index_plp->pfch_indexid[0].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[0].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = 1;
+ *pfch_index_page_list_P = pfch_index_plp; /* new head of list is the new chunk */
+ rc = 0;
+ }
+ }
+
+ stored_pagenum:;
+ return rc;
+}
+
+/* _bt_subtract_block removes a block from the prefetched-but-unread pagelist if present */
+void
+_bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+ struct pfch_index_pagelist* pfch_index_plp = pfch_index_page_list;
+ if ( (blkno != P_NEW) && (blkno != P_NONE) ) {
+ int ix , jx;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ /* move the last item to the current (now deleted) position and decrement count */
+ jx = (pfch_index_plp->pfch_index_item_count-1); /* index of last item ... */
+ if (jx > ix) { /* ... is not the current one so move is required */
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = pfch_index_plp->pfch_indexid[jx].pfch_blocknum;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = pfch_index_plp->pfch_indexid[jx].pfch_discard;
+ ix = jx;
+ }
+ pfch_index_plp->pfch_index_item_count = ix;
+ goto done_subtract;
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+ }
+ done_subtract: return;
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
@@ -573,7 +737,7 @@ _bt_log_reuse_page(Relation rel, BlockNu
* _bt_checkpage to sanity-check the page (except in P_NEW case).
*/
Buffer
-_bt_getbuf(Relation rel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list)
{
Buffer buf;
@@ -581,6 +745,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
{
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
+
+ /* if the block is in the prefetched-but-unread pagelist, remove it */
+ _bt_subtract_block( blkno , pfch_index_page_list);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
}
@@ -702,6 +870,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
* bufmgr when one would do. However, now it's mainly just a notational
* convenience. The only case where it saves work over _bt_relbuf/_bt_getbuf
* is when the target page is the same one already in the buffer.
+ *
+ * if prefetching of index pages is changed to use this function,
+ * then it should be extended to take the index_page_list as a parameter
+ * and call _bt_subtract_block in the same way that _bt_getbuf does.
*/
Buffer
_bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
@@ -712,6 +884,7 @@ _bt_relandgetbuf(Relation rel, Buffer ob
if (BufferIsValid(obuf))
LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
return buf;
@@ -965,7 +1138,7 @@ _bt_is_page_halfdead(Relation rel, Block
BTPageOpaque opaque;
bool result;
- buf = _bt_getbuf(rel, blk, BT_READ);
+ buf = _bt_getbuf(rel, blk, BT_READ , (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1069,7 +1242,7 @@ _bt_lock_branch_parent(Relation rel, Blo
Page lpage;
BTPageOpaque lopaque;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
@@ -1265,7 +1438,7 @@ _bt_pagedel(Relation rel, Buffer buf)
BTPageOpaque lopaque;
Page lpage;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ, (struct pfch_index_pagelist*)0);
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
@@ -1340,7 +1513,7 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ buf = _bt_getbuf(rel, rightsib, BT_WRITE, (struct pfch_index_pagelist*)0);
}
return ndeleted;
@@ -1593,7 +1766,7 @@ _bt_unlink_halfdead_page(Relation rel, B
target = topblkno;
/* fetch the block number of the topmost parent's left sibling */
- buf = _bt_getbuf(rel, topblkno, BT_READ);
+ buf = _bt_getbuf(rel, topblkno, BT_READ, (struct pfch_index_pagelist*)0);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
@@ -1632,7 +1805,7 @@ _bt_unlink_halfdead_page(Relation rel, B
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -1646,7 +1819,7 @@ _bt_unlink_halfdead_page(Relation rel, B
RelationGetRelationName(rel));
return false;
}
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
@@ -1701,7 +1874,7 @@ _bt_unlink_halfdead_page(Relation rel, B
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE , (struct pfch_index_pagelist*)0);
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
@@ -1731,7 +1904,7 @@ _bt_unlink_halfdead_page(Relation rel, B
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE , (struct pfch_index_pagelist*)0);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
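
The three helpers added to nbtpage.c above maintain a small set of "prefetched but not yet read" block numbers: lookup is a linear scan, insertion goes into the first chunk with room, and removal swaps the last entry into the vacated slot. A simplified, single-chunk sketch of the same bookkeeping (hypothetical names, no chunk chaining):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNum;          /* stand-in for BlockNumber */

    typedef struct
    {
        int      count;
        int      capacity;              /* must be <= 64 for this sketch */
        BlockNum blocks[64];
    } PrefetchedSet;

    static int
    set_find(const PrefetchedSet *s, BlockNum blkno)
    {
        for (int i = 0; i < s->count; i++)
            if (s->blocks[i] == blkno)
                return i;
        return -1;
    }

    static bool
    set_add(PrefetchedSet *s, BlockNum blkno)
    {
        if (s->count >= s->capacity)
            return false;               /* only an optimization: OK to drop */
        s->blocks[s->count++] = blkno;
        return true;
    }

    static void
    set_remove(PrefetchedSet *s, BlockNum blkno)
    {
        int i = set_find(s, blkno);

        if (i >= 0)
            s->blocks[i] = s->blocks[--s->count];   /* swap last into the hole */
    }
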
--- src/backend/access/nbtree/nbtree.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/access/nbtree/nbtree.c 2014-06-08 21:59:37.172097275 -0400
@@ -30,6 +30,18 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch non-bitmap index scans also numeric size of pfch_list */
+#endif /* USE_PREFETCH */
+
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+);
/* Working state for btbuild and its callback */
typedef struct
@@ -332,6 +344,74 @@ btgettuple(PG_FUNCTION_ARGS)
}
/*
+ * btpeeknexttuple() -- peek at the next tuple whose heap block differs from every blocknum in pfch_block_item_list,
+ * without reading a new index page
+ * and without causing any side-effects such as altering values in control blocks.
+ * If found, store its blocknum in the next element of pfch_block_item_list.
+ */
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+)
+{
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+ /*
+ * If we've already initialized this scan, we can peek ahead from the
+ * current position. If we haven't done so yet, bail out.
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop advances until we find a different data block or reach the end of the index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < prefetch_index_scans) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= prefetch_index_scans) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((&((scan->pfch_block_item_list + scan->pfch_next)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+ PG_RETURN_BOOL(res);
+}
+
+/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
Datum
@@ -425,6 +505,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -516,6 +602,23 @@ btendscan(PG_FUNCTION_ARGS)
{
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ struct pfch_index_pagelist* pfch_index_plp;
+ int ix;
+
+#ifdef USE_PREFETCH
+
+ /* discard all prefetched but unread index pages listed in the pagelist */
+ pfch_index_plp = scan->pfch_index_page_list;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_discard) {
+ DiscardBuffer( scan->indexRelation , MAIN_FORKNUM , pfch_index_plp->pfch_indexid[ix].pfch_blocknum);
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+#endif /* USE_PREFETCH */
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
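
btpeeknexttuple above scans forward through the items already saved from the current index page and reports the first one whose heap block is not yet present in the small prefetch array. A standalone sketch of that selection loop (hypothetical names; the round-robin overwrite of a full array is omitted):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNum;

    /*
     * Return true and set *out if some item past curIndex maps to a heap block
     * not already listed in seen[]; mirrors the peek loop above.
     */
    static bool
    peek_next_new_block(const BlockNum *itemBlocks, int curIndex, int lastIndex,
                        const BlockNum *seen, int nseen, BlockNum *out)
    {
        for (int i = curIndex + 1; i <= lastIndex; i++)
        {
            bool matched = false;

            for (int j = 0; j < nseen; j++)
            {
                if (seen[j] == itemBlocks[i])
                {
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                *out = itemBlocks[i];
                return true;
            }
        }
        return false;
    }
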
--- src/backend/nodes/tidbitmap.c.orig 2014-06-08 11:26:29.000000000 -0400
+++ src/backend/nodes/tidbitmap.c 2014-06-08 21:59:37.216097351 -0400
@@ -44,6 +44,9 @@
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "utils/hsearch.h"
+#ifdef USE_PREFETCH
+extern int target_prefetch_pages;
+#endif /* USE_PREFETCH */
/*
* The maximum number of tuples per page is not large (typically 256 with
@@ -572,7 +575,12 @@ tbm_begin_iterate(TIDBitmap *tbm)
* needs of the TBMIterateResult sub-struct.
*/
iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)
+#ifdef USE_PREFETCH
+ /* space for remembering every prefetched but unread blockno */
+ + (target_prefetch_pages * sizeof(BlockNumber))
+#endif /* USE_PREFETCH */
+ );
iterator->tbm = tbm;
/*
@@ -1020,3 +1028,68 @@ tbm_comparator(const void *left, const v
return 1;
return 0;
}
+
+void
+tbm_zero(TBMIterator *iterator) /* zero list of prefetched and unread blocknos */
+{
+ /* locate the list of prefetched but unread blocknos immediately following the array of offsets
+ ** and note that tbm_begin_iterate allocates space for (1 + MAX_TUPLES_PER_PAGE) offsets -
+ ** 1 included in struct TBMIterator and MAX_TUPLES_PER_PAGE additional
+ */
+ iterator->output.Unread_Pfetched_base = ((BlockNumber *)(&(iterator->output.offsets[MAX_TUPLES_PER_PAGE+1])));
+ iterator->output.Unread_Pfetched_next = iterator->output.Unread_Pfetched_count = 0;
+}
+
+void
+tbm_add(TBMIterator *iterator, BlockNumber blockno) /* add this blockno to list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next + iterator->output.Unread_Pfetched_count++;
+
+ if (iterator->output.Unread_Pfetched_count > target_prefetch_pages) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_add overflowed list cannot add blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index -= target_prefetch_pages;
+ *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index) = blockno;
+}
+
+void
+tbm_subtract(TBMIterator *iterator, BlockNumber blockno) /* remove this blockno from list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next++;
+ BlockNumber nextUnreadPfetched;
+
+ /* make a weak check that the next blockno is the one to be removed,
+ ** although in case of disagreement we actually ignore the caller's blockno and remove the next one anyway,
+ ** which is really what the caller wants
+ */
+ if ( iterator->output.Unread_Pfetched_count == 0 ) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract empty list cannot subtract blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index = 0;
+ nextUnreadPfetched = *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index);
+ if ( ( nextUnreadPfetched != blockno )
+ && ( nextUnreadPfetched != InvalidBlockNumber ) /* don't report it if the block in the list was InvalidBlockNumber */
+ ) {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract will subtract blockno %d not %d",
+ nextUnreadPfetched, blockno)));
+ }
+ if (iterator->output.Unread_Pfetched_next >= target_prefetch_pages)
+ iterator->output.Unread_Pfetched_next = 0;
+ iterator->output.Unread_Pfetched_count--;
+}
+
+TBMIterateResult *
+tbm_locate_IterateResult(TBMIterator *iterator)
+{
+ return &(iterator->output);
+}
--- src/backend/utils/misc/guc.c.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/backend/utils/misc/guc.c 2014-06-08 21:59:37.284097469 -0400
@@ -2264,6 +2264,25 @@ static struct config_int ConfigureNamesI
},
{
+ {"max_async_io_prefetchers",
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ PGC_USERSET,
+#else
+ PGC_INTERNAL,
+#endif
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum number of background processes concurrently using asynchronous librt threads to prefetch pages into shared memory buffers."),
+ },
+ &max_async_io_prefetchers,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ -1, 0, 8192, /* boot val -1 indicates to initialize to something sensible during buf_init */
+#else
+ 0, 0, 0,
+#endif
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
--- src/backend/utils/mmgr/aset.c.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/backend/utils/mmgr/aset.c 2014-06-08 21:59:37.324097538 -0400
@@ -733,6 +733,48 @@ AllocSetAlloc(MemoryContext context, Siz
*/
fidx = AllocSetFreeIndex(size);
chunk = set->freelist[fidx];
+#ifdef MEMORY_CONTEXT_CHECKING
+ /* an instance of a segfault caused by a rogue value in set->freelist[fidx]
+ ** has been seen - check for it with a crude sanity check based on neighbours:
+ ** if at least one neighbour is sufficiently close, then pass, else fail
+ */
+ if (chunk != 0) {
+ int frx, nrx; /* frx is index, nrx is index of failing neighbour for errmsg */
+ for (nrx = -1, frx = 0; (frx < ALLOCSET_NUM_FREELISTS); frx++) {
+ if ( (frx != fidx) /* not the chosen one */
+ && ( ( (unsigned long)(set->freelist[frx]) ) != 0 ) /* not empty */
+ ) {
+ if ( ( (unsigned long)chunk < ( ( (unsigned long)(set->freelist[frx]) ) / 2 ) )
+ && ( ( (unsigned long)(set->freelist[frx]) ) < 0x4000000 )
+ /*** || ( (unsigned long)chunk > ( ( (unsigned long)(set->freelist[frx]) ) * 2 ) ) ***/
+ ) {
+ nrx = frx;
+ } else {
+ nrx = -1;
+ break;
+ }
+ }
+ }
+
+ if (nrx >= 0) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d compared with neighbour %p whose chunksize %d"
+ , chunk , fidx , set->freelist[nrx] , set->freelist[nrx]->size);
+ chunk = NULL;
+ }
+ }
+#else /* if not MEMORY_CONTEXT_CHECKING, make a very simple-minded check */
+ if ( (chunk != 0) && ( (unsigned long)chunk < 0x40000 ) ) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d"
+ , chunk , fidx);
+ chunk = NULL;
+ }
+#endif
if (chunk != NULL)
{
Assert(chunk->size >= size);
--- src/include/executor/instrument.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/executor/instrument.h 2014-06-08 21:59:37.608098030 -0400
@@ -28,8 +28,18 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+
+ long aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_discrd; /* # of prefetches for which the prefetched block was discarded before being read */
+ long aio_read_forgot; /* # of prefetches for which the prefetched block was forgotten before being read */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
} BufferUsage;
/* Flag bits included in InstrAlloc's instrument_options bitmask */
--- src/include/storage/bufmgr.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/storage/bufmgr.h 2014-06-08 21:59:37.644098093 -0400
@@ -41,6 +41,7 @@ typedef enum
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
+ ,RBM_NOREAD_FOR_PREFETCH /* Don't read from disk, don't zero buffer, find buffer only */
} ReadBufferMode;
/* in globals.c ... this duplicates miscadmin.h */
@@ -57,6 +58,9 @@ extern int target_prefetch_pages;
extern PGDLLIMPORT char *BufferBlocks;
extern PGDLLIMPORT int32 *PrivateRefCount;
+/* in buf_async.c */
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
@@ -159,9 +163,15 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
- * prototypes for functions in bufmgr.c
+ * prototypes for external functions in bufmgr.c and buf_async.c
*/
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
+extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum , BufferAccessStrategy strategy);
+/* return code is an int bitmask : */
+#define PREFTCHRC_BUF_PIN_INCREASED 0x01 /* pin count on buffer has been increased by 1 */
+#define PREFTCHRC_BLK_ALREADY_PRESENT 0x02 /* block was already present in a buffer */
+
+extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
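
With the changed return type above, PrefetchBuffer reports via a bitmask whether the block was already resident and whether a pin was banked on its buffer; a caller should remember the pin bit so the buffer can be discarded later if the block is never read. A hedged sketch of such a caller (the tracking struct and function are hypothetical, not part of the patch):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    typedef struct
    {
        BlockNumber blkno;
        bool        pin_banked;     /* must be discarded if never read */
    } PrefetchedEntry;

    static void
    remember_prefetch(Relation rel, BlockNumber blkno, PrefetchedEntry *entry)
    {
        int     rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno, NULL);

        entry->blkno = blkno;
        entry->pin_banked = (rc & PREFTCHRC_BUF_PIN_INCREASED) != 0;
        /* PREFTCHRC_BLK_ALREADY_PRESENT could feed statistics, etc. */
    }
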
--- src/include/storage/smgr.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/storage/smgr.h 2014-06-08 21:59:37.664098127 -0400
@@ -92,6 +92,12 @@ extern void smgrextend(SMgrRelation reln
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void smgrinitaio(int max_aio_threads, int max_aio_num);
+extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode);
+extern void smgrcompleteaio( SMgrRelation reln, char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
@@ -118,6 +124,11 @@ extern void mdextend(SMgrRelation reln,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void mdinitaio(int max_aio_threads, int max_aio_num);
+extern void mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode );
+extern void mdcompleteaio( char *aiocbp , int *inoutcode );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
--- src/include/storage/fd.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/storage/fd.h 2014-06-08 21:59:37.684098162 -0400
@@ -69,6 +69,11 @@ extern File PathNameOpenFile(FileName fi
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void FileInitaio(int max_aio_threads, int max_aio_num );
+extern int FileStartaio(File file, off_t offset, int amount , char *aiocbp);
+extern int FileCompleteaio( char *aiocbp , int normal_wait );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern int FileRead(File file, char *buffer, int amount);
extern int FileWrite(File file, char *buffer, int amount);
extern int FileSync(File file);
--- src/include/storage/buf_internals.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/storage/buf_internals.h 2014-06-08 21:59:37.708098204 -0400
@@ -22,7 +22,9 @@
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Flags for buffer descriptors
@@ -38,8 +40,23 @@
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1 << 7) /* must write for checkpoint */
-#define BM_PERMANENT (1 << 8) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1 << 8) /* permanent relation (not unlogged) */
+#define BM_AIO_IN_PROGRESS (1 << 9) /* aio in progress */
+#define BM_AIO_PREFETCH_PIN_BANKED (1 << 10) /* pinned when prefetch issued
+ ** and this pin is banked - i.e.
+ ** redeemable by the next use by same task
+ ** note that for any one buffer, a pin can be banked
+ ** by at most one process globally,
+ ** that is, only one process may bank a pin on the buffer
+ ** and it may do so only once (may not be stacked)
+ */
+
+/*********
+For asynchronous aio-read prefetching, two golden rules concerning buffer pinning and buffer-header flags must be observed:
+ R1. a buffer marked as BM_AIO_IN_PROGRESS must be pinned by at least one backend
+ R2. a buffer marked as BM_AIO_PREFETCH_PIN_BANKED must be pinned by the backend identified by
+ (buf->flags & BM_AIO_IN_PROGRESS) ? ((BAiocbAnchr->BufferAiocbs) + (FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio : (-(buf->freeNext))
+*********/
typedef bits16 BufFlags;
@@ -140,17 +157,83 @@ typedef struct sbufdesc
BufFlags flags; /* see bit definitions above */
uint16 usage_count; /* usage counter for clock sweep code */
unsigned refcount; /* # of backends holding pins on buffer */
- int wait_backend_pid; /* backend PID of pin-count waiter */
+ int wait_backend_pid; /* if flags & BM_PIN_COUNT_WAITER
+ ** then backend PID of pin-count waiter
+ ** else not set
+ */
slock_t buf_hdr_lock; /* protects the above fields */
int buf_id; /* buffer's index number (from 0) */
- int freeNext; /* link in freelist chain */
+ int volatile freeNext; /* overloaded and much-abused field :
+ ** EITHER
+ ** if >= 0
+ ** then link in freelist chain
+ ** OR
+ ** if < 0
+ ** then EITHER
+ ** if flags & BM_AIO_IN_PROGRESS
+ ** then negative of (the index of the aiocb in the BufferAiocbs array + 3)
+ ** else if flags & BM_AIO_PREFETCH_PIN_BANKED
+ ** then -(pid of task that issued aio_read and pinned buffer)
+ ** else one of the special values -1 or -2 listed below
+ */
LWLock *io_in_progress_lock; /* to wait for I/O to complete */
LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
+/* structures for control blocks for our implementation of async io */
+
+/* if USE_AIO_ATOMIC_BUILTIN_COMP_SWAP is not defined, the following struct is not put into use at runtime
+** but it is easier to let the compiler find the definition but hide the reference to aiocb
+** which is the only type it would not understand
+*/
+
+struct BufferAiocb {
+ struct BufferAiocb volatile * volatile BAiocbnext; /* next free entry or value of BAIOCB_OCCUPIED means in use */
+ struct sbufdesc volatile * volatile BAiocbbufh; /* there can be at most one BufferDesc marked BM_AIO_IN_PROGRESS
+ ** and using this BufferAiocb -
+ ** if there is one, BAiocbbufh points to it, else BAiocbbufh is zero
+ ** NOTE BAiocbbufh should be zero for every BufferAiocb on the free list
+ */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct aiocb volatile BAiocbthis; /* the aio library's control block for one async io */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int volatile BAiocbDependentCount; /* count of tasks who depend on this BufferAiocb
+ ** in the sense that they are waiting for io completion.
+ ** only a Dependent may move the BufferAiocb onto the freelist
+ ** and only when that Dependent is the *only* Dependent (count == 1)
+ ** BAiocbDependentCount is protected by bufferheader spinlock
+ ** and must be updated only when that spinlock is held
+ */
+ pid_t volatile pidOfAio; /* pid of backend who issued an aio_read using this BAiocb -
+ ** this backend must have pinned the associated buffer.
+ */
+};
+
+#define BAIOCB_OCCUPIED 0x75f1 /* distinct indicator of a BufferAiocb.BAiocbnext that is NOT on free list */
+#define BAIOCB_FREE 0x7b9d /* distinct indicator of a BufferAiocb.BAiocbbufh that IS on free list */
+
+struct BAiocbAnchor { /* anchor for all control blocks pertaining to aio */
+ volatile struct BufferAiocb* BufferAiocbs; /* aiocbs ... */
+ volatile struct BufferAiocb* volatile FreeBAiocbs; /* ... and their free list */
+};
+
+/* values for BufCheckAsync input and retcode */
+#define BUF_INTENTION_WANT 1 /* wants the buffer, wait for in-progress aio and then pin */
+#define BUF_INTENTION_REJECT_KEEP_PIN -1 /* pin already held, do not unpin */
+#define BUF_INTENTION_REJECT_OBTAIN_PIN -2 /* obtain pin, caller wants it for same buffer */
+#define BUF_INTENTION_REJECT_FORGET -3 /* unpin and tell resource owner to forget */
+#define BUF_INTENTION_REJECT_NOADJUST -4 /* unpin and call ResourceOwnerForgetBuffer */
+#define BUF_INTENTION_REJECT_UNBANK -5 /* unpin only if pin banked by caller */
+
+#define BUF_INTENT_RC_CHANGED_TAG -5
+#define BUF_INTENT_RC_BADPAGE -4
+#define BUF_INTENT_RC_INVALID_AIO -3 /* invalid and aio was in progress */
+#define BUF_INTENT_RC_INVALID_NO_AIO -1 /* invalid and no aio was in progress */
+#define BUF_INTENT_RC_VALID 1
+
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
/*
@@ -159,6 +242,7 @@ typedef struct sbufdesc
*/
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+#define FREENEXT_BAIOCB_ORIGIN (-3)
/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
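
Putting the freeNext encoding documented above into code form may help; the following is only an illustrative sketch of how a reader of this header might decode the field (the helper is hypothetical and assumes the patched definitions above are in scope):

    #include "postgres.h"
    #include "storage/buf_internals.h"

    /* classify what freeNext currently encodes, per the comments above */
    static void
    decode_freeNext(volatile BufferDesc *buf, struct BAiocbAnchr *anchor)
    {
        if (buf->freeNext >= 0)
        {
            /* ordinary freelist link */
        }
        else if (buf->flags & BM_AIO_IN_PROGRESS)
        {
            /* negative of (aiocb index + 3): recover the BufferAiocb */
            int     aiocb_index = FREENEXT_BAIOCB_ORIGIN - buf->freeNext;
            volatile struct BufferAiocb *baiocb = anchor->BufferAiocbs + aiocb_index;

            (void) baiocb;          /* baiocb->pidOfAio identifies the pin banker */
        }
        else if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
        {
            /* pid of the backend that banked the pin */
            pid_t   banker = (pid_t) -(buf->freeNext);

            (void) banker;
        }
        /* else FREENEXT_END_OF_LIST or FREENEXT_NOT_IN_LIST */
    }
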
--- src/include/catalog/pg_am.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/catalog/pg_am.h 2014-06-08 21:59:37.744098266 -0400
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
--- src/include/catalog/pg_proc.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/catalog/pg_proc.h 2014-06-08 21:59:37.796098356 -0400
@@ -536,6 +536,12 @@ DESCR("convert float4 to int4");
DATA(insert OID = 330 ( btgettuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ btgettuple _null_ _null_ _null_ ));
DESCR("btree(internal)");
+
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+DATA(insert OID = 3255 ( btpeeknexttuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "2281" _null_ _null_ _null_ _null_ btpeeknexttuple _null_ _null_ _null_ ));
+DESCR("btree(internal)");
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
DATA(insert OID = 636 ( btgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ btgetbitmap _null_ _null_ _null_ ));
DESCR("btree(internal)");
DATA(insert OID = 331 ( btinsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ btinsert _null_ _null_ _null_ ));
--- src/include/pg_config_manual.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/pg_config_manual.h 2014-06-08 21:59:37.836098426 -0400
@@ -138,9 +138,11 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * might in future be support for alternative low-level prefetch APIs.
+ * Update, October 2013: there is now such a prefetch capability:
+ * async_io into postgres buffers; see the configuration parameter max_async_io_threads.)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
#define USE_PREFETCH
#endif
--- src/include/access/nbtree.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/access/nbtree.h 2014-06-08 21:59:37.864098474 -0400
@@ -19,6 +19,7 @@
#include "access/sdir.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "access/relscan.h"
#include "catalog/pg_index.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
@@ -524,6 +525,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -603,6 +605,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -655,7 +666,11 @@ extern Buffer _bt_getroot(Relation rel,
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access , struct pfch_index_pagelist* pfch_index_page_list);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P);
+extern struct pfch_index_item* _bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+extern int _bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status);
+extern void _bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
--- src/include/access/heapam.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/access/heapam.h 2014-06-08 21:59:37.892098523 -0400
@@ -175,7 +175,7 @@ extern void heap_page_prune_execute(Buff
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
--- src/include/access/relscan.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/access/relscan.h 2014-06-08 21:59:37.920098571 -0400
@@ -44,6 +44,24 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+
+ /* Unread_Pfetched is a "mostly" circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ ** "mostly" means that there may be gaps caused by storing entries for blocks which do not need to be discarded -
+ ** these are indicated by blockno = InvalidBlockNumber, and these slots are reused when found.
+ */
+ BlockNumber *rs_Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int rs_Unread_Pfetched_next; /* where the next unread blockno probably is relative to start --
+ ** this is only a hint which may be temporarily stale.
+ */
+ unsigned int rs_Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
+
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -55,6 +73,27 @@ typedef struct HeapScanDescData
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
+/* pfch_index_items track prefetched and unread index pages - chunks of blocknumbers are chained in singly-linked list from scan->pfch_index_item_list */
+struct pfch_index_item { /* index-relation BlockIds which we will/have prefetched */
+ BlockNumber pfch_blocknum; /* Blocknum which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+struct pfch_block_item {
+ struct BlockIdData pfch_blockid; /* BlockId which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+/* pfch_index_page_items track prefetched and unread index pages -
+** chunks of blocknumbers are chained backwards (newest first, oldest last)
+** in singly-linked list from scan->pfch_index_item_list
+*/
+struct pfch_index_pagelist { /* index-relation BlockIds which we will/have prefetched */
+ struct pfch_index_pagelist* pfch_index_pagelist_next; /* pointer to next chunk if any */
+ unsigned int pfch_index_item_count; /* number of used entries in this chunk */
+ struct pfch_index_item pfch_indexid[1]; /* in-line list of Blocknums which we will/have prefetched and whether to be discarded */
+};
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -75,8 +114,15 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_page_list; /* array of index-relation BlockIds which we will/have prefetched */
+ struct pfch_block_item* pfch_block_item_list; /* array of heap-relation BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
@@ -91,6 +137,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
--- src/include/nodes/tidbitmap.h.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/nodes/tidbitmap.h 2014-06-08 21:59:37.952098627 -0400
@@ -41,6 +41,16 @@ typedef struct
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
+#ifdef USE_PREFETCH
+ /* Unread_Pfetched is a circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ */
+ BlockNumber *Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
OffsetNumber offsets[1]; /* VARIABLE LENGTH ARRAY */
} TBMIterateResult; /* VARIABLE LENGTH STRUCT */
@@ -62,5 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
extern void tbm_end_iterate(TBMIterator *iterator);
-
+extern void tbm_zero(TBMIterator *iterator); /* zero list of prefetched and unread blocknos */
+extern void tbm_add(TBMIterator *iterator, BlockNumber blockno); /* add this blockno to list of prefetched and unread blocknos */
+extern void tbm_subtract(TBMIterator *iterator, BlockNumber blockno); /* remove this blockno from list of prefetched and unread blocknos */
+extern TBMIterateResult *tbm_locate_IterateResult(TBMIterator *iterator); /* locate the TBMIterateResult of an iterator */
#endif /* TIDBITMAP_H */
--- src/include/utils/rel.h.orig 2014-06-08 11:26:32.000000000 -0400
+++ src/include/utils/rel.h 2014-06-08 21:59:37.996098703 -0400
@@ -61,6 +61,7 @@ typedef struct RelationAmInfo
FmgrInfo ammarkpos;
FmgrInfo amrestrpos;
FmgrInfo amcanreturn;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
--- src/include/pg_config.h.in.orig 2014-06-08 11:26:31.000000000 -0400
+++ src/include/pg_config.h.in 2014-06-08 21:59:38.024098751 -0400
@@ -1,4 +1,4 @@
-/* src/include/pg_config.h.in. Generated from configure.in by autoheader. */
+/* src/include/pg_config.h.in. Generated from - by autoheader. */
/* Define to the type of arg 1 of 'accept' */
#undef ACCEPT_TYPE_ARG1
@@ -748,6 +748,10 @@
/* Define to the appropriate snprintf format for unsigned 64-bit ints. */
#undef UINT64_FORMAT
+/* Define to select librt-style async io and the gcc atomic compare_and_swap.
+ */
+#undef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
I'm having trouble setting max_async_io_prefetchers in postgresql.conf
It says it cannot be changed.
Is that fixed at initdb time? (sounds like a bad idea if it is)
On Sun, Jun 8, 2014 at 11:12 PM, johnlumby <johnlumby@hotmail.com> wrote:
updated version of patch compatible with git head of 140608,
(adjusted proc oid and a couple of minor fixes)
On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
How portable is POSIX aio nowadays? Googling around, it still seems that on
Linux, it's implemented using threads. Does the thread-emulation
implementation cause problems with the rest of the backend, which assumes
that there is only a single thread? In any case, I think we'll want to
encapsulate the AIO implementation behind some kind of an API, to allow
other implementations to co-exist.
I think POSIX aio is pretty damn standard and it's a pretty fiddly
interface. If we abstract it behind an i/o interface we're going to
lose a lot of the power. Abstracting it behind a set of buffer manager
operations (initiate i/o on buffer, complete i/o on buffer, abort i/o
on buffer) should be fine but that's basically what we have, no?
I don't think the threaded implementation on Linux is the one to use
though. I find this *super* confusing but the kernel definitely
supports aio syscalls, glibc also has a threaded implementation it
uses if run on a kernel that doesn't implement the syscalls, and I
think there are existing libaio and librt libraries from outside glibc
that do one or the other. Which you build against seems to make a big
difference. My instinct is that anything but the kernel native
implementation will be worthless. The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial
win.
The main advantage of posix aio is that we can actually receive the
data out of order. With posix_fadvise we can get the i/o and cpu
overlap but we will never process the later blocks until the earlier
requests are satisfied and processed in order. With aio you could do a
sequential scan, initiating i/o on 1,000 blocks and then processing
them as they arrive, initiating new requests as those blocks are
handled.
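For readers unfamiliar with the POSIX AIO calls under discussion, here is a minimal standalone sketch of that submit-many, handle-in-completion-order pattern. It is not code from the patch; the file name, block size and block count are invented, and error handling is omitted. It uses only the standard aio_read(), aio_suspend(), aio_error() and aio_return() calls.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 16
#define BLCKSZ  8192

static void process_block(char *buf, int blkno) { (void) buf; (void) blkno; }

int main(void)
{
    int fd = open("relation.dat", O_RDONLY);
    struct aiocb cbs[NBLOCKS];
    const struct aiocb *wait_list[NBLOCKS];
    char (*bufs)[BLCKSZ] = malloc(sizeof(char[NBLOCKS][BLCKSZ]));
    int pending = NBLOCKS;

    for (int i = 0; i < NBLOCKS; i++)
    {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = (off_t) i * BLCKSZ;
        aio_read(&cbs[i]);                      /* initiate; do not wait */
        wait_list[i] = &cbs[i];
    }

    while (pending > 0)
    {
        aio_suspend(wait_list, NBLOCKS, NULL);  /* sleep until something finishes */
        for (int i = 0; i < NBLOCKS; i++)
        {
            if (wait_list[i] && aio_error(&cbs[i]) != EINPROGRESS)
            {
                if (aio_return(&cbs[i]) == BLCKSZ)
                    process_block(bufs[i], i);  /* handled in completion order */
                wait_list[i] = NULL;            /* NULL entries are ignored by aio_suspend */
                pending--;
            }
        }
    }
    return 0;
}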
When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question) which is you give aio a callback which gets called on a
separate thread when the i/o completes. The alternative is you give
aio a list of operation control blocks and it tells you the state of
all the i/o operations. But it's not clear to me how you arrange to do
that regularly, promptly, and reliably.
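As an illustration of the callback flavour of those two interfaces, here is a sketch (again not from the patch) using a sigevent with SIGEV_THREAD: the AIO implementation runs the notification function on a separate thread when the read completes, which is exactly the "called on a separate thread" behaviour being weighed here.

#include <aio.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Runs on a thread created by the AIO implementation, not in the backend's
 * normal flow of control -- hence the question of what it may safely do. */
static void io_done(union sigval sv)
{
    struct aiocb *cb = sv.sival_ptr;

    fprintf(stderr, "aio finished, status %d\n", aio_error(cb));
}

static int start_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb->aio_sigevent.sigev_notify_function = io_done;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;
    return aio_read(cb);        /* returns immediately; io_done fires later */
}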
The other gotcha here is that the kernel implementation only does
anything useful on DIRECT_IO files. That means you have to do *all*
the prefetching and i/o scheduling yourself. You would be doing that
anyways for sequential scans and bitmap scans -- and we already do it
with things like synchronised scans and posix_fadvise -- but index
scans would need to get some intelligence for when it makes sense to
read more than one page at a time. It might be possible to do
something fairly coarse like having our i/o operators keep track of
how often i/o on a relation falls within a certain number of blocks of
an earlier i/o and autotune number of blocks to read based on that. It
might not be hard to do better than the kernel with even basic info
like what level of the index we're reading or what type of pointer
we're following.
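A coarse heuristic of the kind suggested above could look roughly like the sketch below. Everything in it (the PrefetchTuner struct and the constants) is hypothetical and invented purely for illustration; nothing like it exists in PostgreSQL or in the patch.

typedef struct PrefetchTuner
{
    unsigned last_blkno;        /* heap block of the previous fetch */
    unsigned near_hits;         /* fetches within NEAR_WINDOW of the previous one */
    unsigned total;             /* fetches seen since the last adjustment */
    unsigned target_distance;   /* how many blocks ahead to prefetch */
} PrefetchTuner;

#define NEAR_WINDOW   32
#define SAMPLE_PERIOD 128
#define MAX_DISTANCE  256

static void
tuner_observe(PrefetchTuner *t, unsigned blkno)
{
    unsigned gap = (blkno > t->last_blkno) ? blkno - t->last_blkno
                                           : t->last_blkno - blkno;

    if (gap <= NEAR_WINDOW)
        t->near_hits++;
    t->last_blkno = blkno;

    if (++t->total == SAMPLE_PERIOD)
    {
        /* mostly-local access pattern: prefetching further ahead pays off */
        if (t->near_hits > SAMPLE_PERIOD / 2 && t->target_distance < MAX_DISTANCE)
            t->target_distance *= 2;
        else if (t->near_hits < SAMPLE_PERIOD / 8 && t->target_distance > 1)
            t->target_distance /= 2;
        t->near_hits = 0;
        t->total = 0;
    }
}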
Finally, when I did the posix_fadvise work I wrote a synthetic
benchmark for testing the equivalent i/o pattern of a bitmap scan. It
let me simulate bitmap scans of varying densities with varying
parameters, notably how many i/o to keep in flight at once. It
supported posix_fadvise or aio. You should look it up in the archives,
it made for some nice looking graphs. IIRC I could not find any build
environment where aio offered any performance boost at all. I think
this means I just didn't know how to build it against the right
libraries or wasn't using the right kernel or there was some skew
between them at the time.
--
greg
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote:
I don't think the threaded implementation on Linux is the one to use
though. I find this *super* confusing but the kernel definitely
supports aio syscalls, glibc also has a threaded implementation it
uses if run on a kernel that doesn't implement the syscalls, and I
think there are existing libaio and librt libraries from outside glibc
that do one or the other. Which you build against seems to make a big
difference. My instinct is that anything but the kernel native
implementation will be worthless. The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial
win.
What overhead?
The only communication is through a "done" bit and associated
synchronization structure when *and only when* you want to wait on it.
Furthermore, posix_fadvise is braindead on this use case, been there,
done that. What you win with threads is a better postgres-kernel
interaction, even if you loose some CPU performance it's gonna beat
posix_fadvise by a large margin.
The main advantage of posix aio is that we can actually receive the
data out of order. With posix_fadvise we can get the i/o and cpu
overlap but we will never process the later blocks until the earlier
requests are satisfied and processed in order. With aio you could do a
sequential scan, initiating i/o on 1,000 blocks and then processing
them as they arrive, initiating new requests as those blocks are
handled.
And each and every I/O you issue with it counts as such on the kernel side.
It's not the case with posix_fadvise, mind you, and that's terribly
damaging for performance.
When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question) which is you give aio a callback which gets called on a
separate thread when the i/o completes. The alternative is you give
aio a list of operation control blocks and it tells you the state of
all the i/o operations. But it's not clear to me how you arrange to do
that regularly, promptly, and reliably.
Indeed we've been musing about using librt's support of completion
callbacks for that.
The other gotcha here is that the kernel implementation only does
anything useful on DIRECT_IO files. That means you have to do *all*
the prefetching and i/o scheduling yourself.
If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.
You would be doing that
anyways for sequential scans and bitmap scans -- and we already do it
with things like synchronised scans and posix_fadvise
That only works because the patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info.
The kernel is the one with hardware info.
Finally, when I did the posix_fadvise work I wrote a synthetic
benchmark for testing the equivalent i/o pattern of a bitmap scan. It
let me simulate bitmap scans of varying densities with varying
parameters, notably how many i/o to keep in flight at once. It
supported posix_fadvise or aio. You should look it up in the archives,
it made for some nice looking graphs. IIRC I could not find any build
environment where aio offered any performance boost at all. I think
this means I just didn't know how to build it against the right
libraries or wasn't using the right kernel or there was some skew
between them at the time.
If it's old, it's probable you didn't hit the kernel's braindeadness,
since it was introduced somewhat recently (i.e., before 3.x, I believe).
Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that might be sorted from time to time.
When traversing an index, you do a mostly sequential pattern due to
physical correlation, but not completely sequential. Not only that,
with the assumption of random I/O, and the uncertainty of when will
the scan be aborted too, you don't read ahead as much as you could if
you knew it was sequential or a full scan. That kills performance. You
don't fetch enough ahead of time to avoid stalls, and the kernel
doesn't do read-ahead either because posix_fadvise effectively
disables it, resulting in the equivalent of direct I/O with bad
scheduling.
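For comparison, the posix_fadvise-style prefetch being discussed amounts to nothing more than a hint like the sketch below (an illustration, not PostgreSQL's actual PrefetchBuffer code): the caller gets no completion notification and still pays a full read() later.

#include <fcntl.h>

#define BLCKSZ 8192

static void
hint_block(int fd, unsigned blkno)
{
    /* may start the range reading in the background; gives no way to know when */
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
}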
Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right. How many spindles, how many sectors per cylinder if it's
rotational, how big the segments if it's flash, etc, etc... all stuff
hidden away inside the kernel.
It's not a good idea to try to do the kernel's job. Aio, even
threaded, lets you avoid that.
If you still have the benchmark around, I suggest you shuffle the
sectors a little bit (but not fully) and try them with semi-random
I/O.
On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby@hotmail.com> wrote:
updated version of patch compatible with git head of 140608,
(adjusted proc oid and a couple of minor fixes)
Compilation of patched version on MacOS failed. The error messages were
gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c
heapam.c: In function 'heap_unread_add':
heapam.c:362: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:363: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:369: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:375: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:381: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:387: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:405: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heap_unread_subtract':
heapam.c:419: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:425: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:434: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:442: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:452: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:453: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:454: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:456: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heapgettup':
heapam.c:944: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:716: warning: unused variable 'ix'
heapam.c: In function 'heapgettup_pagemode':
heapam.c:1243: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:1029: warning: unused variable 'ix'
heapam.c: In function 'heap_endscan':
heapam.c:1808: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:1809: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
make[4]: *** [heapam.o] Error 1
make[3]: *** [heap-recursive] Error 2
make[2]: *** [access-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2
Huge patch is basically not easy to review. What about simplifying the patch
by excluding non-core parts like the change of pg_stat_statements, so that
reviewers can easily read the patch? We can add such non-core parts later.
Regards,
--
Fujii Masao
Thanks Fujii, that is a bug -- an #ifdef USE_PREFETCH is missing in heapam.c
(maybe several)
I will fix it in the next patch version.
I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts :
. async io (librt functions)
. buffer management (allocating, locking and pinning etc)
. scanners making prefetch calls
. statistics
and the autoconf input program
I will see what I can do. Maybe putting an indicator against each modified file?
I am currently working on two things :
. alternative way for non-originator of an aio_read to wait on completion
(LWlock instead of polling the aiocb)
This was talked about in several earlier posts and Claudio is also working on something there
. package up my benchmark
Cheers John
On 6/20/14, 5:12 PM, John Lumby wrote:
I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts :
. async io (librt functions)
. buffer management (allocating, locking and pinning etc)
. scanners making prefetch calls
. statistics
and the autoconf input program
I will see what I can do. Maybe putting an indicator against each modified file?
Generally the best way to handle cases like this is to create multiple patches, each of which builds upon previous ones.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
----------------------------------------
Date: Thu, 19 Jun 2014 15:43:44 -0300
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
From: klaussfreire@gmail.com
To: stark@mit.edu
CC: hlinnakangas@vmware.com; johnlumby@hotmail.com; pgsql-hackers@postgresql.org
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote:
I don't think the threaded implementation on Linux is the one to use
though. [... ] The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial
win.
What overhead?
The only communication is through a "done" bit and associated
synchronization structure when *and only when* you want to wait on it.
Threads do cost some extra CPU, but provided the system has some
spare CPU capacity, performance improves because of reduced IO wait.
I quoted a measured improvement of 17% - 18% in the README,
along with some more explanation of when the async IO gives an improvement.
Furthermore, posix_fadvise is braindead on this use case, been there,
done that. What you win with threads is a better postgres-kernel
interaction, even if you loose some CPU performance it's gonna beat
posix_fadvise by a large margin.[...]
When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question) which is you give aio a callback which gets called on a
separate thread when the i/o completes. The alternative is you give
aio a list of operation control blocks and it tells you the state of
all the i/o operations. But it's not clear to me how you arrange to do
that regularly, promptly, and reliably.
Indeed we've been musing about using librt's support of completion
callbacks for that.
For the most common case, where a backend issues a PrefetchBuffer
and then that *same* backend issues ReadBuffer, the posix aio works
ideally: there is no need for any callback or completion signal;
we simply check "is it complete" during the ReadBuffer.
It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier. The current version of the patch did polling for that case
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.
In an earlier posting I reported that , in my benchmark,
99.8% of [FileCompleteaio] calls are from originator and only < 0.2% are not.
So, from a performance perspective, only the common case really matters.
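The originator-side check described here can be as simple as the following sketch built from standard aio calls. It is not the patch's FileCompleteaio (whose real logic also handles the non-originator cohort); it only shows the shape of the common case.

#include <aio.h>
#include <errno.h>

static int
wait_for_my_aio(struct aiocb *cb, size_t expected)
{
    const struct aiocb *list[1] = { cb };
    int rc;

    /* the originator simply waits on its own control block */
    while ((rc = aio_error(cb)) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    if (rc != 0)
        return -1;              /* I/O failed; caller falls back to a plain read() */
    return (aio_return(cb) == (ssize_t) expected) ? 0 : -1;
}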
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier. The current version of the patch did polling for that case
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.
I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
indicating I/O is in progress. If another backend does ReadBuffer for
the same block they'll get the same buffer and then wait until the
first backend's I/O completes. ReadBuffer goes through some hoops to
handle this (and all the corner cases such as the other backend's I/O
completing and the buffer being reused for another block before the
first backend reawakens). It would be a shame to reinvent the wheel.
The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.
--
greg
----------------------------------------
From: stark@mit.edu
Date: Mon, 23 Jun 2014 16:04:50 -0700
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
To: johnlumby@hotmail.com
CC: klaussfreire@gmail.com; hlinnakangas@vmware.com; pgsql-hackers@postgresql.org
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier. The current version of the patch did polling for that case
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.
I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
Good question. Let me explain.
Yes, postgresql has code for the case where a backend is inside a synchronous
read() or write(), performed from a ReadBuffer(), and some other backend
wants that buffer. asynchronous aio is initiated not from ReadBuffer
but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
postgresql buffer. This is entirely different from the synchronous io case.
Why? Because the issuer of the aio_read (the "originator") is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes, such as more prefetching or any other operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.
Therefore, asynchronous IO is different from synchronous IO, and
a new bit, BM_AIO_IN_PROGRESS, in the buf_header is required to
track this aio operation until completion.
I would encourage you to read the new
postgresql-prefetching-asyncio.README
in the patch file where this is explained in greater detail.
indicating I/O is in progress. If another backend does ReadBuffer for
the same block they'll get the same buffer and then wait until the
first backend's I/O completes. ReadBuffer goes through some hoops to
handle this (and all the corner cases such as the other backend's I/O
completing and the buffer being reused for another block before the
first backend reawakens). It would be a shame to reinvent the wheel.
No re-invention! Actually some effort has been made to use the
existing functions in bufmgr.c as much as possible rather than
rewriting them.
The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.
They both work reliably, but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as
being supported (although it happens to work), hence the proposed
change to use sigevent.
By the way, on the "will it actually work though?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all postgresql
releases from 9.1.4 to 9.5 devel. I would guess it has had around
500 hours of operation by now. I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
--
greg
On 06/24/2014 04:29 PM, John Lumby wrote:
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier. The current version of the patch did polling for that case
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.
I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
Good question. Let me explain.
Yes, postgresql has code for the case of a backend is inside a synchronous
read() or write(), performed from a ReadBuffer(), and some other backend
wants that buffer. asynchronous aio is initiated not from ReadBuffer
but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
postgresql buffer. This is entirely different from the synchronous io case.
Why? Because the issuer of the aio_read (the "originator") is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes, such as more prefetching or any other operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.
I still don't see the difference. Once an asynchronous read is initiated
on the buffer, it can't be used for anything else until the read has
finished. This is exactly the same situation as with a synchronous read:
after read() is called, the buffer can't be used for anything else until
the call finishes.
In particular, consider the situation from another backend's point of
view. Looking from another backend (i.e. one that didn't initiate the
read), there's no difference between a synchronous and asynchronous
read. So why do we need a different IPC mechanism for the synchronous
and asynchronous cases? We don't.
I understand that *within the backend*, you need to somehow track the
I/O, and you'll need to treat synchronous and asynchronous I/Os
differently. But that's all within the same backend, and doesn't need to
involve the flags or locks in shared memory at all. The inter-process
communication doesn't need any changes.
The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.
They both work reliably, but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as
being supported (although it happens to work), hence the proposed
change to use sigevent.
You didn't understand what Greg meant. You need to handle the completion
of the I/O in the same process that initiated it, by clearing the
in-progress bit of the buffer and releasing the I/O in-progress lwlock
on it. And you need to do that very quickly after the I/O has finished,
because there might be another backend waiting for the buffer and you
don't want him to wait longer than necessary.
The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?
By the way, on the "will it actually work though?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all postgresql
releases from 9.1.4 to 9.5 devel. I would guess it has had around
500 hours of operation by now. I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
Well, a committable version of this patch is going to look quite
different from the first version that you posted, so I don't put much
weight on how long you've tested the first version.
- Heikki
Thanks Heikki,
----------------------------------------
Date: Tue, 24 Jun 2014 17:02:38 +0300
From: hlinnakangas@vmware.com
To: johnlumby@hotmail.com; stark@mit.edu
CC: klaussfreire@gmail.com; pgsql-hackers@postgresql.org
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
On 06/24/2014 04:29 PM, John Lumby wrote:
On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby@hotmail.com> wrote:
It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier. The current version of the patch did polling for that case
but that drew criticism, and so an imminent new version of the patch
uses the sigevent mechanism. And there are other ways still.
I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
Good question. Let me explain.
Yes, postgresql has code for the case of a backend is inside a synchronous
read() or write(), performed from a ReadBuffer(), and some other backend
wants that buffer. asynchronous aio is initiated not from ReadBuffer
but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
postgresql buffer. This is entirely different from the synchronous io case.
Why? Because the issuer of the aio_read (the "originator") is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes, such as more prefetching or any other operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.
I still don't see the difference. Once an asynchronous read is initiated
on the buffer, it can't be used for anything else until the read has
finished. This is exactly the same situation as with a synchronous read:
after read() is called, the buffer can't be used for anything else until
the call finishes.
Ah, now I see what you and Greg are asking. See my next imbed below.
In particular, consider the situation from another backend's point of
view. Looking from another backend (i.e. one that didn't initiate the
read), there's no difference between a synchronous and asynchronous
read. So why do we need a different IPC mechanism for the synchronous
and asynchronous cases? We don't.
I understand that *within the backend*, you need to somehow track the
I/O, and you'll need to treat synchronous and asynchronous I/Os
differently. But that's all within the same backend, and doesn't need to
involve the flags or locks in shared memory at all. The inter-process
communication doesn't need any changes.
The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.
They both work reliably, but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as
being supported (although it happens to work), hence the proposed
change to use sigevent.
You didn't understand what Greg meant. You need to handle the completion
of the I/O in the same process that initiated it, by clearing the
in-progress bit of the buffer and releasing the I/O in-progress lwlock
on it. And you need to do that very quickly after the I/O has finished,
because there might be another backend waiting for the buffer and you
don't want him to wait longer than necessary.
I think I understand the question now. I didn't spell out the details earlier.
Let me explain a little more.
With this patch, when read is issued, it is either a synchronous IO
(as before), or an asynchronous aio_read (new, represented by
both BM_IO_IN_PROGRESS *and* BM_AIO_IN_PROGRESS).
The way other backends wait on a synchronous IO in progress is unchanged.
But if BM_AIO_IN_PROGRESS, then *any* backend which requests
ReadBuffer on this block (including originator) follows a new path
through BufCheckAsync() which, depending on various flags and context,
sends the backend down to FileCompleteaio to check and maybe wait.
So *all* backends who are waiting for a BM_AIO_IN_PROGRESS buffer
will wait in that way.
The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?
In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and , when signalled, in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, it cannot be safely released inside the signal handler or in a
separate thread). Whenever the mainline passes a CHECK_FOR_INTERRUPTS macro
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWlocks. This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.
Meanwhile, any other backend will simply acquire Shared and release.
I think you are right that the existing io_in_progress_lock LWlock in the
buf_header could be used for this, because if there is a aio in progress,
then that lock cannot be in use for synchronous IO. I chose not to use it
because I preferred to keep the wait/post for asynch io separate,
but they could both use the same LWlock. However, the way the LWlock
is acquired and released would still be a bit different because of the need
to have the originator release it in its mainline.
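A rough illustration of that deferred-release scheme is sketched below. All names in it are hypothetical (they are not taken from the patch), and it deliberately glosses over the point being debated here, namely how to make the handler-side push and the mainline walk safe against each other.

typedef struct PendingLockRelease
{
    struct PendingLockRelease *next;
    void       *lock;               /* stand-in for the LWLock to be released */
} PendingLockRelease;

static PendingLockRelease *volatile pending_aio_locks = NULL;

/*
 * Called from the sigevent notification (signal handler or notify thread).
 * It only queues the lock; a real version must also keep the handler from
 * interrupting the walk below, e.g. by blocking the signal around it.
 */
static void
note_aio_complete(PendingLockRelease *entry)
{
    entry->next = pending_aio_locks;
    pending_aio_locks = entry;
}

/* Called from CHECK_FOR_INTERRUPTS() and a few points in bufmgr. */
static void
release_pending_aio_locks(void (*release) (void *lock))
{
    while (pending_aio_locks != NULL)
    {
        PendingLockRelease *e = pending_aio_locks;

        pending_aio_locks = e->next;
        release(e->lock);           /* in the patch this would release the LWLock */
    }
}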
By the way, on the "will it actually work though?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all postgresql
releases from 9.1.4 to 9.5 devel. I would guess it has had around
500 hours of operation by now. I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
Well, a committable version of this patch is going to look quite
different from the first version that you posted, so I don't put much
weight on how long you've tested the first version.
Yes, I am quite willing to change it, time permitting.
I take the words "committable version" as a positive sign ...
- Heikki
On 06/24/2014 06:08 PM, John Lumby wrote:
The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?
In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and , when signalled, in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, It cannot be safely released inside the signal handler or in a
separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWlocks. This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.
Meanwhile, any other backend will simply acquire Shared and release.
Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast
enough? We have never made any hard guarantees on how often
CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd
party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls
for many seconds, or even more. That's a long time to hold onto a buffer
I/O lock. I don't think that's acceptable :-(.
I think you are right that the existing io_in_progress_lock LWlock in the
buf_header could be used for this, because if there is a aio in progress,
then that lock cannot be in use for synchronous IO. I chose not to use it
because I preferred to keep the wait/post for asynch io separate,
but they could both use the same LWlock. However, the way the LWlock
is acquired and released would still be a bit different because of the need
to have the originator release it in its mainline.
It would be nice to use the same LWLock.
However, if releasing a regular LWLock in a signal handler is not safe,
and cannot be made safe, perhaps we should, after all, invent a whole
new mechanism. One that would make it safe to release the lock in a
signal handler.
By the way, on the "will it actually work though?" question which several folks
have raised, I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all postgresql
releases from 9.1.4 to 9.5 devel. I would guess it has had around
500 hours of operation by now. I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.
Well, a committable version of this patch is going to look quite
different from the first version that you posted, so I don't put much
weight on how long you've tested the first version.
Yes, I am quite willing to change it, time permitting.
I take the words "committable version" as a positive sign ...
BTW, sorry if I sound negative, I'm actually quite excited about this
feature. A patch like this take a lot of work, and usually several
rewrites, until it's ready ;-). But I'm looking forward for it.
- Heikki
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby@hotmail.com> wrote:
The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?
In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and , when signalled, in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, It cannot be safely released inside the signal handler or in a
separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWlocks. This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.
I suggest using a semaphore instead.
Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.
The problem with the process-local queue is that it may take time to
be processed (the time it takes to get to a CHECK_INTERRUPTS macro,
which as it happened with regexes, it can be quite high).
On 06/25/2014 09:20 PM, Claudio Freire wrote:
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby@hotmail.com> wrote:
The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?
In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and , when signalled, in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, It cannot be safely released inside the signal handler or in a
separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWlocks. This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.
I suggest using a semaphore instead.
Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.
LWLocks are implemented with semaphores, so if you can increment a
semaphore in the signal handler / callback thread, then in theory you
should be able to release a LWLock. You'll need some additional
synchronization within the same process, to make sure you don't release
a lock in signal handler while you're just doing the same thing in the
main thread. I'm not sure I want to buy into the notion that
LWLockRelease must be generally safe to call from a signal handler, but
maybe it's possible to have a variant of it that is.
On Linux at least we use System V semaphores, which are (unsurprisingly)
not listed as safe for using in a signal handler. See list
Async-signal-safe functions in signal(7) man page. The function used to
increment a POSIX semaphore, sem_post(), is in the list, however.
- Heikki
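A minimal sketch of that sem_post() route follows. It is not patch code; in a real backend the semaphore would live in shared memory rather than in a static variable, and the error handling is reduced to the EINTR retry.

#include <aio.h>
#include <errno.h>
#include <semaphore.h>
#include <signal.h>

/* In a real backend this would sit in shared memory and be set up with
 * sem_init(&io_done_sem, 1, 0); a static variable keeps the sketch
 * self-contained. */
static sem_t io_done_sem;

/* sigevent notification: sem_post() is on the async-signal-safe list, so
 * this is legal from a signal handler as well as from a notify thread. */
static void
aio_completion_note(union sigval sv)
{
    (void) sv;
    (void) sem_post(&io_done_sem);
}

/* a backend waiting for the prefetched I/O to finish */
static void
wait_for_completion(void)
{
    while (sem_wait(&io_done_sem) == -1 && errno == EINTR)
        ;                       /* retry if interrupted by a signal */
}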
On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
LWLocks are implemented with semaphores, so if you can increment a semaphore
in the signal handler / callback thread, then in theory you should be able
to release a LWLock.
I don't think that's a convincing argument even if semop et al were
signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
anything with the same spinlock both inside and outside a signal
handler.
I don't think we're going to get lwlock.c/LWLockRelease to work
reasonably from a signal handler without a lot of work.
On Linux at least we use System V semaphores, which are (unsurprisingly) not
listed as safe for using in a signal handler. See list Async-signal-safe
functions in signal(7) man page. The function used to increment a POSIX
semaphore, sem_post(), is in the list, however.
Heh, just wrote the same after reading the beginning of your email ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I am attaching the new version of the patch with support for use of sigevent
to report completion of asynch io and the new LWlock for waiters to
wait for originator to release it.
This patch is based on up-to-date git head but the asyncio design is
a bit behind the curve of recent discussions here. Specifically,
sigevent is still optional (see README for how to configure it)
and non-originators still refer to the originator's aiocb to check the
completion code (but not for waiting for it to complete).
All sigevent-related code is marked by
#ifdef USE_AIO_SIGEVENT
to make it easier to find.
Also although it "works" it is not functionally complete -
search for FIXME in
src/backend/storage/buffer/buf_async.c
and has not had nearly as much testing.
________________________________________________________________________
I will try to paste earlier points and imbed comments :
Heikki wrote :
Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast
enough? We have never made any hard guarantees on how often
CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd
party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls
for many seconds, or even more. That's a long time to hold onto a buffer
I/O lock. I don't think that's acceptable :-(.
True, but remember this case is the very infrequent one, less than 0.2%
(in my benchmark).
I think you are right that the existing io_in_progress_lock LWlock in the
buf_header could be used for this, because if there is a aio in progress,
then that lock cannot be in use for synchronous IO. I chose not to use it
because I preferred to keep the wait/post for asynch io separate,
but they could both use the same LWlock. However, the way the LWlock
is acquired and released would still be a bit different because of the need
to have the originator release it in its mainline.
It would be nice to use the same LWLock.
I think that will work.
However, if releasing a regular LWLock in a signal handler is not safe,
and cannot be made safe, perhaps we should, after all, invent a whole
new mechanism. One that would make it safe to release the lock in a
signal handler.
It would take rewriting lwlock.c and still be devilishly hard to test -
I would prefer not.
Well, a committable version of this patch is going to look quite
different from the first version that you posted, so I don't put much
weight on how long you've tested the first version.
Yes, I am quite willing to change it, time permitting.
I take the words "committable version" as a positive sign ...
BTW, sorry if I sound negative, I'm actually quite excited about this
feature. A patch like this take a lot of work, and usually several
rewrites, until it's ready ;-). But I'm looking forward for it.
I am grateful you spent the time to study it.
Knowing the code, I am not surprised it made you a bit grumpy ...
________________________________________________________
Discussion about the difference between a synchronous read
and an asynchronous read, as far as a non-originator waiting on it is concerned.
I thought a bit more about this. There are currently two differences,
one of which can easily be changed and one not so easy.
1) The current code, even with sigevent, still makes the non-originator waiter
call aio_error on the originator's aiocb to get the completion code.
For the sigevent variation, this is easily changed to have the originator always call aio_error
(from its CHECK_FOR_INTERRUPTS or from FileCompleteaio)
and store the result in the BAiocb.
My reason for not doing it that way was that, by having the non-originator check the aiocb,
the waiter could proceed sooner. But for a different reason it actually
can't (the non-originator must still wait for the LWlock release).
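To make that concrete, the change would amount to something like the
following sketch (struct and function names are made up, not the patch's
actual definitions): the originator records the completion code once, and
non-originators read a plain field instead of calling aio_error on the
originator's aiocb.

#include <aio.h>
#include <errno.h>

struct BAiocb_sketch
{
    struct aiocb    ba_aiocb;       /* the originator's control block */
    volatile int    ba_completion;  /* EINPROGRESS until originator fills it */
};

/* called only by the originator, e.g. from its CHECK_FOR_INTERRUPTS
 * path or from FileCompleteaio */
static void
record_aio_completion(struct BAiocb_sketch *bap)
{
    int rv = aio_error(&bap->ba_aiocb);   /* 0, EINPROGRESS, or an errno */

    if (rv != EINPROGRESS)
        bap->ba_completion = rv;          /* non-originators read this field */
}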
2) Buffer pinning and returning the BufferAiocb to the free list
With synchronous IO, each backend that calls a ReadBuffer must pin the buffer
early in the process.
With asynchronous IO, initially only the originator gets the pin
(and that is during PrefetchBuffer, not ReadBuffer)
When the aio completes and some backend checks that completion,
then the backend has various responsibilities:
. pin the buffer if it did not already have one (from prefetch)
. if it was the last such backend to make that check
(amongst the cohort waiting on it),
then return the BufferAiocb to the free list
Those functions are different from the synchronous IO case.
I think code clarity alone may dictate keeping these separate.
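For clarity, here is a minimal sketch of those two responsibilities with
stand-in types and helpers (the real logic lives in BufCheckAsync in
buf_async.c and is considerably more involved):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int refcount;  } BufferDescSketch;   /* stand-in */
typedef struct { atomic_int n_waiters; } BufferAiocbSketch;  /* stand-in */

static void pin_buffer(BufferDescSketch *buf)
{
    atomic_fetch_add(&buf->refcount, 1);
}

static void return_baiocb_to_free_list(BufferAiocbSketch *bap)
{
    (void) bap;                 /* push back onto the shared free list */
}

/* a backend that has just observed completion of the asynchronous read */
static void
on_aio_completion_checked(BufferDescSketch *buf, BufferAiocbSketch *bap,
                          bool already_hold_prefetch_pin)
{
    /* pin the buffer unless this backend already pinned it at prefetch */
    if (!already_hold_prefetch_pin)
        pin_buffer(buf);

    /* the last backend of the waiting cohort recycles the BufferAiocb */
    if (atomic_fetch_sub(&bap->n_waiters, 1) == 1)
        return_baiocb_to_free_list(bap);
}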
___________________________________________________________________
various discussion of semaphores and LWlocks:
----------------------------------------
Date: Wed, 25 Jun 2014 23:17:53 +0200
From: andres@2ndquadrant.com
To: hlinnakangas@vmware.com
CC: klaussfreire@gmail.com; johnlumby@hotmail.com; stark@mit.edu; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
LWLocks are implemented with semaphores, so if you can increment a semaphore
in the signal handler / callback thread, then in theory you should be able
to release a LWLock.I don't think that's a convincing argument even if semop et al were
signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
anything with the same spinlock both inside and outside a signal
handler.
I don't think we're going to get lwlock.c/LWLockRelease to work
reasonably from a signal handler without a lot of work.
I agree - see earlier.
Attachments:
postgresql-9.5.140625.140625-181050.noDEBUG.patch (application/octet-stream)
--- configure.in.orig 2014-06-25 16:37:59.233618849 -0400
+++ configure.in 2014-06-25 18:10:50.760519904 -0400
@@ -1771,6 +1771,12 @@ operating system; use --disable-thread-
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+ AC_DEFINE(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP, 1, [Define to select librt-style async io and the gcc atomic compare_and_swap.])
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
--- contrib/pg_prewarm/pg_prewarm.c.orig 2014-06-25 16:37:59.281618859 -0400
+++ contrib/pg_prewarm/pg_prewarm.c 2014-06-25 18:10:50.772519950 -0400
@@ -159,7 +159,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
*/
for (block = first_block; block <= last_block; ++block)
{
- PrefetchBuffer(rel, forkNumber, block);
+ PrefetchBuffer(rel, forkNumber, block, 0);
++blocks_done;
}
#else
--- contrib/pg_stat_statements/pg_stat_statements--1.3.sql.orig 2014-06-25 17:33:03.164961952 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.3.sql 2014-06-25 18:10:50.792520029 -0400
@@ -0,0 +1,52 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_stat_statements VERSION '1.3'" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_stat_statements_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+-- Register a view on the function for ease of use.
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
+
+-- Don't want this to be available to non-superusers.
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
--- contrib/pg_stat_statements/Makefile.orig 2014-06-25 16:37:59.281618859 -0400
+++ contrib/pg_stat_statements/Makefile 2014-06-25 18:10:50.812520107 -0400
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
OBJS = pg_stat_statements.o
EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
+DATA = pg_stat_statements--1.3.sql pg_stat_statements--1.2--1.3.sql \
+ pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
ifdef USE_PGXS
--- contrib/pg_stat_statements/pg_stat_statements.c.orig 2014-06-25 16:37:59.285618860 -0400
+++ contrib/pg_stat_statements/pg_stat_statements.c 2014-06-25 18:10:50.844520233 -0400
@@ -117,6 +117,7 @@ typedef enum pgssVersion
PGSS_V1_0 = 0,
PGSS_V1_1,
PGSS_V1_2
+ ,PGSS_V1_3
} pgssVersion;
/*
@@ -148,6 +149,16 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ int64 aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb control block */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which in-progress aio cancelled and disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -275,6 +286,7 @@ void _PG_fini(void);
PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
PG_FUNCTION_INFO_V1(pg_stat_statements);
static void pgss_shmem_startup(void);
@@ -1026,7 +1038,25 @@ pgss_ProcessUtility(Node *parsetree, con
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage_start.aio_read_noneed;
+ bufusage.aio_read_discrd =
+ pgBufferUsage.aio_read_discrd - bufusage_start.aio_read_discrd;
+ bufusage.aio_read_forgot =
+ pgBufferUsage.aio_read_forgot - bufusage_start.aio_read_forgot;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage_start.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage_start.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage_start.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage_start.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage_start.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -1041,6 +1071,7 @@ pgss_ProcessUtility(Node *parsetree, con
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1224,6 +1255,16 @@ pgss_store(const char *query, uint32 que
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_discrd += bufusage->aio_read_discrd;
+ e->counters.aio_read_forgot += bufusage->aio_read_forgot;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1257,7 +1298,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
#define PG_STAT_STATEMENTS_COLS_V1_0 14
#define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS_V1_2 19
-#define PG_STAT_STATEMENTS_COLS 19 /* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_3 27
+#define PG_STAT_STATEMENTS_COLS 27 /* maximum of above */
/*
* Retrieve statement statistics.
@@ -1270,6 +1312,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
* function. Unfortunately we weren't bright enough to do that for 1.1.
*/
Datum
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+{
+ bool showtext = PG_GETARG_BOOL(0);
+
+ pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
+
+ return (Datum) 0;
+}
+
+Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
bool showtext = PG_GETARG_BOOL(0);
@@ -1358,6 +1410,10 @@ pg_stat_statements_internal(FunctionCall
if (api_version != PGSS_V1_2)
elog(ERROR, "incorrect number of output arguments");
break;
+ case PG_STAT_STATEMENTS_COLS_V1_3:
+ if (api_version != PGSS_V1_3)
+ elog(ERROR, "incorrect number of output arguments");
+ break;
default:
elog(ERROR, "incorrect number of output arguments");
}
@@ -1534,11 +1590,24 @@ pg_stat_statements_internal(FunctionCall
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
+
+ if (api_version >= PGSS_V1_3)
+ {
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_discrd);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_forgot);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+ }
}
Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
+ api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
-1 /* fail if you forget to update this assert */ ));
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql.orig 2014-06-25 17:33:03.168961964 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql 2014-06-25 18:10:50.872520343 -0400
@@ -0,0 +1,51 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
--- postgresql-prefetching-asyncio.README.orig 2014-06-25 17:33:03.168961964 -0400
+++ postgresql-prefetching-asyncio.README 2014-06-25 18:10:50.892520421 -0400
@@ -0,0 +1,588 @@
+Postgresql -- Extended Prefetching using Asynchronous IO
+============================================================
+
+Postgresql currently (9.3.4) provides a limited prefetching capability
+using posix_fadvise to give hints to the Operating System kernel
+about which pages it expects to read in the near future.
+This capability is used only during the heap-scan phase of bitmap-index scans.
+It is controlled via the effective_io_concurrency configuration parameter.
+
+This capability is now extended in two ways :
+ . use asynchronous IO into Postgresql shared buffers as an
+ alternative to posix_fadvise
+ . Implement prefetching in other types of scan :
+ . non-bitmap (i.e. simple) index scans - index pages
+ currently only for B-tree indexes.
+ (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
+ . non-bitmap (i.e. simple) index scans - heap pages
+ currently only for B-tree indexes.
+ . simple heap scans
+
+Posix asynchronous IO is chosen as the function library for asynchronous IO,
+since this is well supported and also fits very well with the model of
+the prefetching process, particularly as regards checking for completion
+of an asynchronous read. On linux, Posix asynchronous IO is provided
+in the librt library. librt uses independently-schedulable threads to
+achieve the asynchronicity, rather than kernel functionality.
+
+In this implementation, use of asynchronous IO is limited to prefetching
+while performing one of the three types of scan
+ . B-tree bitmap index scan - heap pages (as already exists)
+ . B-tree non-bitmap (i.e. simple) index scans - index and heap pages
+ . simple heap scans
+on permanent relations. It is not used on temporary tables nor for writes.
+
+The advantages of Posix asynchronous IO into shared buffers
+compared to posix_fadvise are :
+ . Beneficial for non-sequential access patterns as well as sequential
+ . No restriction on the kinds of IO which can be used
+ (other kinds of asynchronous IO impose restrictions such as
+ buffer alignment, use of non-buffered IO).
+ . Does not interfere with standard linux kernel read-ahead functionality.
+ (It has been stated in
+ www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
+ that :
+ "the kernel stops doing read-ahead when a call to posix_fadvise comes.
+ I noticed the performance hit, and checked the kernel's code.
+ It effectively changes the prediction mode from sequential to fadvise,
+ negating the (assumed) kernel's prefetch logic")
+ . When the read request is issued after a prefetch has completed,
+ no delay associated with a kernel call to copy the page from
+ kernel page buffers into the Postgresql shared buffer,
+ since it is already there.
+ Also, in a memory-constrained environment, there is a greater
+ probability that the prefetched page will "stick" in memory
+ since the linux kernel victimizes the filesystem page cache in preference
+ to swapping out user process pages.
+ . Statistics on prefetch success can be gathered (see "Statistics" below)
+ which helps the administrator to tune the prefetching settings.
+
+These benefits are most likely to be obtained in a system whose usage profile
+(e.g. from iostat) shows:
+ . high IO wait from mostly-read activity
+ . disk access pattern is not entirely sequential
+ (so kernel readahead can't predict it but postgresql can)
+ . sufficient spare idle CPU to run the librt pthreads
+ or, stated another way, the CPU subsystem is relatively powerful
+ compared to the disk subsystem.
+In such ideal conditions, and with a workload with plenty of index scans,
+around 10% - 20% improvement in throughput has been achieved.
+In an admittedly extreme environment measured by this author, with a workload
+consisting of 8 client applications each running similar complex queries
+(same query structure but different predicates and constants),
+including 2 Bitmap Index Scans and 17 non-bitmap index scans,
+on a dual-core Intel laptop (4 hyperthreads) with the database on a single
+USB3-attached 500GB disk drive, and no part of the database in filesystem buffers
+initially, (filesystem freshly mounted), comparing unpatched build
+using posix_fadvise with effective_io_concurrency 4 against same build patched
+with async IO and effective_io_concurrency 4 and max_async_io_prefetchers 32,
+elapsed time repeatably improved from around 640-670 seconds to around 530-550 seconds,
+a 17% - 18% improvement.
+
+The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
+ . probably higher CPU utilization:
+ Firstly, the extra work performed by the librt threads adds CPU
+ overhead, and secondly, if the asynchronous prefetching is effective,
+ then it will deliver better (greater) overlap of CPU with IO, which
+ will reduce elapsed times and hence increase CPU utilization percentage
+ still more (during that shorter elapsed time).
+ . more context switching, because of the additional threads.
+
+
+Statistics:
+___________
+
+A number of additional statistics relating to effectiveness of asynchronous IO
+are provided as an extension of the existing pg_stat_statements loadable module.
+Refer to the appendix "Additional Supplied Modules" in the current
+PostgreSQL Documentation for details of this module.
+
+The following additional statistics are provided for asynchronous IO prefetching:
+
+ . aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
+ . aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
+ . aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
+ . aio_read_noblok : number of prefetches for which no available BufferAiocb control block
+ . aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
+ . aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
+ . aio_read_waited : number of aio reads for which disk block used but had to wait for it
+ . aio_read_ontime : number of aio reads for which disk block used and ready on time when requested
+
+Some of these are (hopefully) self-explanatory. Some additional notes:
+
+ . aio_read_discrd and aio_read_forgot :
+ prefetch was wasted work since the buffer was not subsequently read
+ The discrd case indicates that the scanner realized this and discarded the buffer,
+ whereas the forgot case indicates that the scanner did not realize it,
+ which should not normally occur.
+ A high number in either suggests lowering effective_io_concurrency.
+
+ . aio_read_noblok :
+ Any significant number in relation to all the other numbers indicates that
+ max_async_io_prefetchers should be increased.
+
+ . aio_read_waited :
+ The page was prefetched but the asynchronous read had not completed by the time the
+ scanner requested to read it. This causes extra overhead in waiting and indicates
+ prefetching is not providing much if any benefit.
+ The disk subsystem may be underpowered/overloaded in relation to the available CPU power.
+
+ . aio_read_ontime :
+ The page was prefetched and the asynchronous read had completed by the time the
+ scanner requested to read it. Optimal behaviour. If this number is large
+ in relation to all the other numbers except (possibly) aio_read_noneed,
+ then prefetching is working well.
+
+To create the extension with support for these additional statistics, use the following syntax:
+ CREATE EXTENSION pg_stat_statements VERSION '1.3'
+or, if you run the new code against an existing database which already has the extension
+( see installation and migration below ), you can
+ ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'
+
+A suggested set of commands for displaying these statistics might be :
+
+ /* OPTIONALLY */ DROP extension pg_stat_statements;
+ CREATE extension pg_stat_statements VERSION '1.3';
+ /* run your workload */
+ select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
+ , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
+ from pg_stat_statements where shared_blks_read > 0;
+
+
+Installation and Build Configuration:
+_____________________________________
+
+1. First - a prerequisite:
+# as well as requiring all the usual package build tools such as gcc , make etc,
+# as described in the instructions for building postgresql,
+# the following is required :
+ gnu autoconf at version 2.69 :
+# run the following command
+autoconf -V
+# it *must* return
+autoconf (GNU Autoconf) 2.69
+
+2. If you don't have it or it is a different version,
+then you must obtain version 2.69 (which is the current version)
+from your distribution provider or from the gnu software download site.
+
+3. Also you must have the source tree for postgresql version 9.4 (development version).
+# all the following commands assume your current working directory is the top of the source tree.
+
+4. cd to top of source tree :
+# check it appears to be a postgresql source tree
+ls -ld configure.in src
+# should show both the file and the directory
+grep PostgreSQL COPYRIGHT
+# should show PostgreSQL Database Management System
+
+5. Apply the patch :
+patch -b -p0 -i <patch_file_path>
+# should report no errors, 48 files patched (see list at bottom of this README)
+# and all hunks applied
+# check the patch was applied to configure.in
+ls -ld configure.in.orig configure.in
+# should show both files
+
+6. Rebuild the configure script with the patched configure.in :
+mv configure configure.orig;
+autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
+ls -lrt configure.orig configure;
+
+7. run the new configure script :
+# if you have run configure before,
+# then you may first want to save existing config.status and config.log if they exist,
+# and then specify same configure flags and options as you specified before.
+# the patch does not alter or extend the set of configure options
+# if unsure, run ./configure --help
+# if still unsure, run ./configure
+./configure <other configure options as desired>
+
+
+
+8. now check that configure decided that this environment supports asynchronous IO :
+grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
+# it should show
+#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
+# if not, apparently your environment does not support asynch IO -
+# the config.log will show how it came to that conclusion,
+# also check for :
+# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
+# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
+# do not proceed without this define being set.
+
+9. do you want to use the new code on an existing cluster
+ that was created using the same code base but without the patch?
+ If so then run this nasty-looking command :
+ (cut-and-paste it into a terminal window or a shell-script file)
+ Otherwise continue to step 10.
+ see Migration note below for explanation.
+###############################################################################################
+ fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
+ bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
+ }; done;
+ if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
+ elif [[ -a $fl ]]; then {
+ mv $fl $bkfl && {
+ sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
+ str="diff -w $bkfl $fl";echo "$str"; eval "$str";
+ };
+ };
+ else echo "ooopppss $fl is missing";
+ fi;
+###############################################################################################
+# it should report something like
+diff -w Makefile.global.bak0 Makefile.global
+222c222
+< CFLAGS = XXXX
+---
+> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+# where XXXX is some set of flags
+
+
+10. now run the rest of the build process as usual -
+ follow instructions in file INSTALL if that file exists,
+ else e.g. run
+make && make install
+
+If the build fails with the following error:
+undefined reference to `aio_init'
+Then edit the following file
+src/include/pg_config_manual.h
+and add the following line at the bottom:
+
+#define DONT_HAVE_AIO_INIT
+
+and then run
+make clean && make && make install
+See notes to section Runtime Configuration below for more information on this.
+
+If you would like to use the sigevent mechanism for signalling completion
+of asynchronous io to non-originating backends, instead of the polling method,
+(see section Checking AIO Completion below)
+then add these lines to src/include/pg_config_manual.h
+
+#define USE_AIO_SIGEVENT 1
+#define AIO_SIGEVENT_SIGNALNUM SIGIO /* or signal num of your choice */
+
+Here's the context
+
+/*
+ * USE_PREFETCH code should be compiled only if we have a way to implement
+ * prefetching. (This is decoupled from USE_POSIX_FADVISE because there
+ * might in future be support for alternative low-level prefetch APIs --
+ * -- update October 2013 -- now there is such a new prefetch capability --
+ * async_io into postgres buffers - configuration parameter max_async_io_threads)
+ */
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+#define USE_AIO_SIGEVENT 1
+/* AIO_SIGEVENT_SIGNALNUM is the signal used to indicate completion
+ * of an aio operation. Choose a signal that is not used elsewhere
+ * in postgresql and which can be caught by signal handler.
+*/
+#define AIO_SIGEVENT_SIGNALNUM SIGIO
+#define USE_PREFETCH
+#endif
+
+
+
+
+Migration , Runtime Configuration, and Use:
+___________________________________________
+
+
+Database Migration:
+___________________
+
+The new prefetching code for non-bitmap index scans introduces a new btree-index
+function named btpeeknexttuple. The correct way to add such a function involves
+also adding it to the catalog as an internal function in pg_proc.
+However, this results in the new built code considering an existing database to be
+incompatible, i.e requiring backup on the old code and restore on the new.
+This is normal behaviour for migration to a new version of postgresql, and is
+also a valid way of migrating a database for use with this asynchronous IO feature,
+but in this case it may be inconvenient.
+
+As an alternative, the new code may be compiled with the macro define
+AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+which does what it says by not altering the catalog. The patched build can then
+be run against an existing database cluster initdb'd using the unpatched build.
+
+There are no known ill-effects of so doing, but :
+ . in any case, it is strongly suggested to make a backup of any precious database
+ before accessing it with a patched build
+ . be aware that if this asynchronous IO feature is eventually released as part of postgresql,
+ migration will probably be required anyway.
+
+This option to avoid catalog migration is intended as a convenience for a quick test,
+and also makes it easier to obtain performance comparisons on the same database.
+
+
+
+Runtime Configuration:
+______________________
+
+One new configuration parameter settable in postgresql.conf and
+in any other way as described in the postgresql documentation :
+
+max_async_io_prefetchers
+ Maximum number of background processes concurrently using asynchronous
+ librt threads to prefetch pages into shared memory buffers
+
+This number can be thought of as the maximum number
+of librt threads concurrently active, each working on a list of
+from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).
+
+In practice, this number simply controls how many prefetch requests in total
+may be active concurrently :
+ max_async_io_prefetchers * target_prefetch_pages ( see note 1)
+
+default is max_connections/6
+and recall that the default for max_connections is 100
+
+
+note 1 target_prefetch_pages is a number derived from effective_io_concurrency, approximately n * ln(n)
+ where n is effective_io_concurrency
+
+note 2 Provided that the gnu extension to Posix AIO which provides the
+aio_init() function is present, then aio_init() is called
+to set the librt maximum number of threads to max_async_io_prefetchers,
+and to set the maximum number of concurrent aio read requests to the product of
+ max_async_io_prefetchers * target_prefetch_pages
+
+
+As well as this regular configuration parameter,
+there are several other parameters that can be set via environment variable.
+The reason why they are environment vars rather than regular configuration parameters
+is that it is not expected that they should need to be set, but they may be useful :
+ variable name values default meaning
+ PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
+ PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
+ also numeric size of list of prefetched blocks
+ also whether to prefetch forward-sequential-pattern index pages
+ PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
+ PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans
+
+
+The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
+It can be set to Y or N to control prefetching of non-bitmap index scans;
+But in addition it can be set to an integer, which both implies Y
+and also sets the size of a list used to remember prefetched but unread heap pages.
+This list is an optimization used to avoid re-prefetching and maximise the potential
+set of prefetchable blocks indexed by one index page.
+And if set to an integer, this integer may be followed by either ,Y or ,N
+to specify to prefetch index pages which are being accessed forward-sequentially.
+It has been found that prefetching is not of great benefit for this access pattern,
+and so it is not the default, but also does no harm (provided sufficient CPU capacity).
+
+
+
+Usage :
+______
+
+
+There are no changes in usage other than as noted under Configuration and Statistics.
+However, in order to assess benefit from this feature, it will be useful to
+understand the query access plans of your workload using EXPLAIN. Before doing that,
+make sure that statistics are up to date using ANALYZE.
+
+
+
+Internals:
+__________
+
+
+Internal changes span two areas and the interface between them :
+
+ . buffer manager layer
+ . programming interface for scanner to call buffer manager
+ . scanner layer
+
+ . buffer manager layer
+ ____________________
+
+ changes comprise :
+ . allocating, pinning , unpinning buffers
+ this is complex and discussed briefly below in "Buffer Management"
+ . acquiring and releasing a BufferAiocb, the control block
+ associated with a single aio_read, and checking for its completion
+ a new file, backend/storage/buffer/buf_async.c, provides three new functions,
+ BufStartAsync BufReleaseAsync BufCheckAsync
+ which handle this.
+ . calling librt asynch io functions
+ this follows the example of all other filesystem interfaces
+ and is straightforward.
+ two new functions are provided in fd.c:
+ FileStartaio FileCompleteaio
+ and corresponding interfaces in smgr.c
+
+ . programming interface for scanner to call buffer manager
+ ________________________________________________________
+ . calling interface for existing function PrefetchBuffer is modified :
+ . one new argument, BufferAccessStrategy strategy
+ . now returns an int return code which indicates :
+ whether pin count on buffer has been increased by 1
+ whether block was already present in a buffer
+ . new function DiscardBuffer
+ . discard buffer used for a previously prefetched page
+ which scanner decides it does not want to read.
+ . same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
+ . note - this is different from the existing function ReleaseBuffer
+ in that ReleaseBuffer takes a buffer_descriptor as argument
+ for a buffer which has been read, but has similar purpose.
+
+ . scanner layer
+ _____________
+ common to all scanners is that the scanner which wishes to prefetch must do two things:
+ . decide which pages to prefetch and call PrefetchBuffer to prefetch them
+ nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
+ . remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
+ removing each page from this list if and when it subsequently reads the page.
+ . at end of scan, call DiscardBuffer for every remembered (i.e. prefetched not unread) page
+ how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
+ . bitmap index scan - heap pages
+ . non-bitmap (i.e. simple) index scans - index pages
+ . non-bitmap (i.e. simple) index scans - heap pages
+ . simple heap scans
+ The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
+ . counted in aio_read_forgot (see "Statistics" above)
+ . may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
+ (the buffer is released at commit)
+ This does sometimes happen ...
+
+
+
+Buffer Management
+_________________
+
+With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
+but also every other part of buffer manager must know about the possibility that a buffer may be in
+an async_io_in_progress state and be prepared to determine the possible completion.
+That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
+Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
+namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
+and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+a separate set of shared control blocks, the BufferAiocb list -
+refer to include/storage/buf_internals.h
+Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
+Read the commentary for this function for more details.
+
+Checking AIO Completion
+_______________________
+There is a difference in how completion is checked depending on whether the backend doing the checking is :
+ . the same as the backend which started the asynch io (the "originator")
+ . a different backend ("non-originator")
+The "originator" case is most common and also simplest -
+ FileCompleteaio simply issues the appropriate aio_xxxx calls and suspends if not complete.
+The "non-originator" case is more complex and two methods are currently designed and implemented :
+ . polling the aiocb for completion
+ . use of LWlocks and sigevent to cause a signal to be delivered to the originator.
+ The originator locks eXclusive at aio start and releases after delivery of the signal.
+ The non-originator locks Shared and releases.
+
+Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
+and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
+When a backend BK2 requests ReadBuffer of a page for which asynch read is in progress,
+buffer manager has to determine which backend BK1 pinned this buffer during previous PrefetchBuffer,
+and, for example, must not pin it a second time if BK2 is the same backend as BK1.
+Information concerning which backend initiated the prefetch is held in the BufferAiocb.
+
+The trickiest case concerns the scenario in which :
+ . BK1 initiates prefetch and acquires a pin
+ . BK2 possibly waits for completion and then reads the buffer, and perhaps later on
+ releases it by ReleaseBuffer.
+ . Since the asynchronous IO is no longer in progress, there is no longer any
+ BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
+ "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
+ . The solution to this problem is to invent the concept of a "banked" pin,
+ which is a pin obtained when prefetch was issued, identified as in "banked" status only if and when
+ the associated asynchronous IO terminates, and redeemable by the next use by the same task,
+ either by ReadBuffer or DiscardBuffer.
+ The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
+ is stored in the buffer descriptor.
+ This is done without increasing size of the buffer descriptor, which is important since
+ there may be a very large number of these. This does overload the relevant field in the descriptor.
+ Refer to include/storage/buf_internals.h for more details
+ and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
+
+______________________________________________________________________________
+The following 48 files are changed in this feature (output of the patch command) :
+
+patching file configure.in
+patching file contrib/pg_prewarm/pg_prewarm.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
+patching file contrib/pg_stat_statements/Makefile
+patching file contrib/pg_stat_statements/pg_stat_statements.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
+patching file postgresql-prefetching-asyncio.README
+patching file config/c-library.m4
+patching file src/backend/postmaster/postmaster.c
+patching file src/backend/executor/nodeBitmapHeapscan.c
+patching file src/backend/executor/nodeIndexscan.c
+patching file src/backend/executor/instrument.c
+patching file src/backend/storage/buffer/Makefile
+patching file src/backend/storage/buffer/bufmgr.c
+patching file src/backend/storage/buffer/buf_async.c
+patching file src/backend/storage/buffer/buf_init.c
+patching file src/backend/storage/smgr/md.c
+patching file src/backend/storage/smgr/smgr.c
+patching file src/backend/storage/file/fd.c
+patching file src/backend/storage/lmgr/lwlock.c
+patching file src/backend/storage/lmgr/proc.c
+patching file src/backend/access/heap/heapam.c
+patching file src/backend/access/heap/syncscan.c
+patching file src/backend/access/index/indexam.c
+patching file src/backend/access/index/genam.c
+patching file src/backend/access/nbtree/nbtsearch.c
+patching file src/backend/access/nbtree/nbtinsert.c
+patching file src/backend/access/nbtree/nbtpage.c
+patching file src/backend/access/nbtree/nbtree.c
+patching file src/backend/nodes/tidbitmap.c
+patching file src/backend/utils/misc/guc.c
+patching file src/backend/utils/mmgr/aset.c
+patching file src/include/executor/instrument.h
+patching file src/include/storage/bufmgr.h
+patching file src/include/storage/lwlock.h
+patching file src/include/storage/smgr.h
+patching file src/include/storage/fd.h
+patching file src/include/storage/buf_internals.h
+patching file src/include/catalog/pg_am.h
+patching file src/include/catalog/pg_proc.h
+patching file src/include/pg_config_manual.h
+patching file src/include/miscadmin.h
+patching file src/include/access/nbtree.h
+patching file src/include/access/heapam.h
+patching file src/include/access/relscan.h
+patching file src/include/nodes/tidbitmap.h
+patching file src/include/utils/rel.h
+patching file src/include/pg_config.h.in
+
+
+Future Possibilities:
+____________________
+
+There are several possible extensions of this feature :
+ . Extend prefetching of index scans to types of index
+ other than B-tree.
+ This should be fairly straightforward, but requires some
+ good base of benchmarkable workloads to prove the value.
+ . Investigate why asynchronous IO prefetching does not greatly
+ improve sequential relation heap scans and possibly find how to
+ achieve a benefit.
+ . Build knowledge of asynchronous IO prefetching into the
+ Query Planner costing.
+ This is far from straightforward. The Postgresql Query Planner's
+ costing model is based on resource consumption rather than elapsed time.
+ Use of asynchronous IO prefetching is intended to improve elapsed time
+ at the expense of (probably) higher resource consumption.
+ Although Costing understands about the reduced cost of reading buffered
+ blocks, it does not take asynchronicity or overlap of CPU with disk
+ into account. A naive approach might be to try to tweak the Query
+ Planner's Cost Constant configuration parameters
+ such as seq_page_cost , random_page_cost
+ but this is hazardous as explained in the Documentation.
+
+
+
+John Lumby, johnlumby(at)hotmail(dot)com
--- config/c-library.m4.orig 2014-06-25 16:37:59.229618848 -0400
+++ config/c-library.m4 2014-06-25 18:10:50.920520531 -0400
@@ -367,3 +367,152 @@ if test "$pgac_cv_type_locale_t" = 'yes
AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
[Define to 1 if `locale_t' requires <xlocale.h>.])
fi])])# PGAC_HEADER_XLOCALE
+
+
+# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+# ---------------------------------------
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of both,
+# including verifying that aio_error can retrieve completion status
+# of aio_read issued by a different process
+#
+AC_DEFUN([PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP],
+[AC_MSG_CHECKING([whether have both librt-style async io and the gcc atomic compare_and_swap])
+AC_CACHE_VAL(pgac_cv_aio_atomic_builtin_comp_swap,
+pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+[AC_TRY_RUN([#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include "aio.h"
+#include <errno.h>
+
+char *shmem;
+
+/* returns rc of aio_read or -1 if some error */
+int
+processA(void)
+{
+ int fd , rc;
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ char *buf = shmem + sizeof(struct aiocb);
+
+ rc = fd = open("configure", O_RDONLY );
+ if (fd != -1) {
+
+ memset(aiocbp, 0, sizeof(struct aiocb));
+ aiocbp->aio_fildes = fd;
+ aiocbp->aio_offset = 0;
+ aiocbp->aio_buf = buf;
+ aiocbp->aio_nbytes = 8;
+ aiocbp->aio_reqprio = 0;
+ aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+ rc = aio_read(aiocbp);
+ }
+ return rc;
+}
+
+/* returns result of aio_error - 0 if io completed successfully */
+int
+processB(void)
+{
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ const struct aiocb * const pl[1] = { aiocbp };
+ int rv;
+ int returnCode;
+ struct timespec my_timeout = { 0 , 10000 };
+ int max_iters , max_polls;
+
+ rv = aio_error(aiocbp);
+ max_iters = 100;
+ while ( (max_iters-- > 0) && (rv == EINPROGRESS) ) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ while ((returnCode < 0) && (EAGAIN == errno) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ }
+ rv = aio_error(aiocbp);
+ }
+
+ return rv;
+}
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ int pidB;
+ int child_status;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* first test -- __sync_bool_compare_and_swap
+ ** set zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ if (rc == 0) {
+ /* second test -- process A start aio_read
+ ** and process B checks completion by polling
+ */
+ rc = -1; /* pessimistic */
+
+ shmem = mmap(NULL, sizeof(struct aiocb) + 2048,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
+ -1, 0);
+ if (shmem != MAP_FAILED) {
+
+ /*
+ * Start the I/O request in parent process, then fork and try to wait
+ * for it to finish from the child process.
+ */
+ rc = processA();
+ if (rc >= 0) {
+
+ rc = pidB = fork();
+ if (pidB != -1) {
+ if (pidB != 0) {
+ /* parent */
+ wait (&child_status);
+ if (WIFEXITED(child_status)) {
+ rc = WEXITSTATUS(child_status);
+ }
+ } else {
+ /* child */
+ rc = processB();
+ exit(rc);
+ }
+ }
+ }
+ }
+ }
+
+ return rc;
+}],
+[pgac_cv_aio_atomic_builtin_comp_swap=yes],
+[pgac_cv_aio_atomic_builtin_comp_swap=no],
+[pgac_cv_aio_atomic_builtin_comp_swap=cross])
+])dnl AC_CACHE_VAL
+AC_MSG_RESULT([$pgac_cv_aio_atomic_builtin_comp_swap])
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+])# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
--- src/backend/postmaster/postmaster.c.orig 2014-06-25 16:37:59.445618891 -0400
+++ src/backend/postmaster/postmaster.c 2014-06-25 18:10:50.960520688 -0400
@@ -123,6 +123,11 @@
#include "storage/spin.h"
#endif
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+void ReportFreeBAiocbs(void);
+int CountInuseBAiocbs(void);
+extern int hwmBufferAiocbs; /* high water mark of in-use BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Possible types of a backend. Beyond being the possible bkend_type values in
@@ -1489,9 +1494,15 @@ ServerLoop(void)
fd_set readmask;
int nSockets;
time_t now,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time,
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
last_touch_time;
last_touch_time = time(NULL);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time = time(NULL);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
nSockets = initMasks(&readmask);
@@ -1650,6 +1661,19 @@ ServerLoop(void)
last_touch_time = now;
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* maintain the hwm of used baiocbs every 10 seconds */
+ if ((now - count_baiocb_time) >= 10)
+ {
+ int inuseBufferAiocbs; /* current in-use BufferAiocbs in pool */
+ inuseBufferAiocbs = CountInuseBAiocbs();
+ if (inuseBufferAiocbs > hwmBufferAiocbs) {
+ hwmBufferAiocbs = inuseBufferAiocbs;
+ }
+ count_baiocb_time = now;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
@@ -3440,6 +3464,9 @@ PostmasterStateMachine(void)
signal_child(PgStatPID, SIGQUIT);
}
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ReportFreeBAiocbs();
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
}
}
--- src/backend/executor/nodeBitmapHeapscan.c.orig 2014-06-25 16:37:59.389618879 -0400
+++ src/backend/executor/nodeBitmapHeapscan.c 2014-06-25 18:10:50.988520798 -0400
@@ -34,6 +34,8 @@
* ExecEndBitmapHeapScan releases all storage.
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "access/relscan.h"
#include "access/transam.h"
@@ -47,6 +49,10 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_bitmap_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
@@ -111,10 +117,21 @@ BitmapHeapNext(BitmapHeapScanState *node
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
- if (target_prefetch_pages > 0)
- {
+ if ( prefetch_bitmap_scans
+ && (target_prefetch_pages > 0)
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ /* sufficient number of blocks - at least twice the target_prefetch_pages */
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
node->prefetch_iterator = prefetch_iterator = tbm_begin_iterate(tbm);
node->prefetch_pages = 0;
+ if (prefetch_iterator) {
+ tbm_zero(prefetch_iterator); /* zero list of prefetched and unread blocknos */
+ }
node->prefetch_target = -1;
}
#endif /* USE_PREFETCH */
@@ -138,12 +155,14 @@ BitmapHeapNext(BitmapHeapScanState *node
}
#ifdef USE_PREFETCH
+ if (prefetch_iterator) {
if (node->prefetch_pages > 0)
{
/* The main iterator has closed the distance by one page */
node->prefetch_pages--;
+ tbm_subtract(prefetch_iterator, tbmres->blockno); /* remove this blockno from list of prefetched and unread blocknos */
}
- else if (prefetch_iterator)
+ else
{
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -151,6 +170,7 @@ BitmapHeapNext(BitmapHeapScanState *node
if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
+ }
#endif /* USE_PREFETCH */
/*
@@ -239,16 +259,26 @@ BitmapHeapNext(BitmapHeapScanState *node
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ int PrefetchBufferRc; /* return value from PrefetchBuffer - refer to bufmgr.h */
+
if (tbmpre == NULL)
{
/* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = prefetch_iterator = NULL;
+ /* let ExecEndBitmapHeapScan terminate the prefetch_iterator
+ ** tbm_end_iterate(prefetch_iterator);
+ ** node->prefetch_iterator = NULL;
+ */
+ prefetch_iterator = NULL;
break;
}
node->prefetch_pages++;
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno , 0);
+ /* add this blockno to list of prefetched and unread blocknos
+ ** if pin count did not increase then indicate so in the Unread_Pfetched list
+ */
+ tbm_add(prefetch_iterator
+ ,( (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) ? tbmpre->blockno : InvalidBlockNumber ) );
}
}
#endif /* USE_PREFETCH */
@@ -482,12 +512,31 @@ ExecEndBitmapHeapScan(BitmapHeapScanStat
{
Relation relation;
HeapScanDesc scanDesc;
+ TBMIterator *prefetch_iterator;
/*
* extract information from the node
*/
relation = node->ss.ss_currentRelation;
scanDesc = node->ss.ss_currentScanDesc;
+ prefetch_iterator = node->prefetch_iterator;
+
+#ifdef USE_PREFETCH
+ /* before any other cleanup, discard any prefetched but unread buffers */
+ if (prefetch_iterator != NULL) {
+ TBMIterateResult *tbmpre = tbm_locate_IterateResult(prefetch_iterator);
+ BlockNumber *Unread_Pfetched_base = tbmpre->Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = tbmpre->Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = tbmpre->Unread_Pfetched_count;
+
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scanDesc->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* Free the exprcontext
--- src/backend/executor/nodeIndexscan.c.orig 2014-06-25 16:37:59.393618881 -0400
+++ src/backend/executor/nodeIndexscan.c 2014-06-25 18:10:51.012520893 -0400
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* whether (and list size) to prefetch non-bitmap index scans */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,12 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+ /* note - at this point all scan controlblock resources have been freed by IndexScanEnd called by index_endscan */
+
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +619,33 @@ ExecInitIndexScan(IndexScan *node, EStat
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ indexstate->iss_ScanDesc->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_block_item_list = (struct pfch_block_item*)0;
+ if ( prefetch_index_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_index_page_list = palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ indexstate->iss_ScanDesc->pfch_block_item_list = palloc( prefetch_index_scans * sizeof(struct pfch_block_item) );
+ if ( ( (struct pfch_index_pagelist*)0 != indexstate->iss_ScanDesc->pfch_index_page_list )
+ && ( (struct pfch_block_item*)0 != indexstate->iss_ScanDesc->pfch_block_item_list )
+ ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = prefetch_index_scans; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_pagelist_next = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_item_count = 0;
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
--- src/backend/executor/instrument.c.orig 2014-06-25 16:37:59.389618879 -0400
+++ src/backend/executor/instrument.c 2014-06-25 18:10:51.036520987 -0400
@@ -41,6 +41,14 @@ InstrAlloc(int n, int instrument_options
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_discrd = 0;
+ instr[i].bufusage_start.aio_read_forgot = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -143,6 +151,16 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_discrd += add->aio_read_discrd - sub->aio_read_discrd;
+ dst->aio_read_forgot += add->aio_read_forgot - sub->aio_read_forgot;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
--- src/backend/storage/buffer/Makefile.orig 2014-06-25 16:37:59.457618893 -0400
+++ src/backend/storage/buffer/Makefile 2014-06-25 18:10:51.060521080 -0400
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o buf_async.o
include $(top_srcdir)/src/backend/common.mk
--- src/backend/storage/buffer/bufmgr.c.orig 2014-06-25 16:37:59.457618893 -0400
+++ src/backend/storage/buffer/bufmgr.c 2014-06-25 18:10:51.092521205 -0400
@@ -29,7 +29,7 @@
* buf_table.c -- manages the buffer lookup table
*/
#include "postgres.h"
-
+#include <sys/types.h>
#include <sys/file.h>
#include <unistd.h>
@@ -50,7 +50,6 @@
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
-
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -63,6 +62,14 @@
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+#if defined(USE_PREFETCH) && defined(USE_AIO_SIGEVENT)
+extern struct BAiocbIolock_chain_item volatile * volatile BAiocbIolock_anchor; /* anchor for chain of awaiting-release LWLock ptrs */
+#endif /* defined(USE_PREFETCH) && defined(USE_AIO_SIGEVENT) */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
#define DROP_RELS_BSEARCH_THRESHOLD 20
/* GUC variables */
@@ -78,26 +85,33 @@ bool track_io_timing = false;
*/
int target_prefetch_pages = 0;
-/* local state for StartBufferIO and related functions */
+/* local state for StartBufferIO and related functions
+** but ONLY for synchronous IO - not altered for aio
+*/
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+pid_t this_backend_pid = 0; /* pid of this backend */
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
-
-static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+extern int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+extern int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc, int intention
+ ,BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
-static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
-static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+ bool *hit , int index_for_aio);
+bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+void PinBuffer_Locked(volatile BufferDesc *buf);
+void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
-static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+static bool StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio );
+void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -106,28 +120,76 @@ static volatile BufferDesc *BufferAlloc(
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
+ int *foundPtr , int index_for_aio );
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
- * This is named by analogy to ReadBuffer but doesn't actually allocate a
- * buffer. Instead it tries to ensure that a future ReadBuffer for the given
- * block will not be delayed by the I/O. Prefetching is optional.
+ * This is named by analogy to ReadBuffer but allocates a buffer only if using asynchronous I/O.
+ * Its purpose is to try to ensure that a future ReadBuffer for the given block
+ * will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
- */
-void
-PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
-{
+ *
+ * Originally the prefetch simply called posix_fadvise() to recommend read-ahead into kernel page cache.
+ * Extended to provide an alternative of issuing an asynchronous aio_read() to read into a buffer.
+ * This extension has implications for how this bufmgr component manages concurrent requests
+ * for the same disk block.
+ *
+ * Synchronous IO (read()) does not provide a means for waiting on another task's read if in progress,
+ * and bufmgr implements its own scheme in StartBufferIO, WaitIO, and TerminateBufferIO.
+ *
+ * Asynchronous IO (aio_read()) provides a means for waiting on this or another task's read if in progress,
+ * namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+ * are called as part of asynchronous prefetching, their role is limited to maintaining the buffer desc flags,
+ * and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+ * a separate set of shared control blocks, the BufferAiocb list -
+ * refer to include/storage/buf_internals.h and storage/buffer/buf_init.c
+ *
+ * Another implication of asynchronous IO concerns buffer pinning.
+ * The buffer used for the prefetch is pinned before aio_read is issued.
+ * It is expected that the same task (and possibly others) will later ask to read the page
+ * and eventually release and unpin the buffer.
+ * However, if the task which issued the aio_read later decides not to read the page,
+ * and the return code indicates that the pin count was increased (see below),
+ * it *must* instead issue a DiscardBuffer() (see function later in this file)
+ * so that its pin is released.
+ * Therefore, each client which uses the PrefetchBuffer service must either always read all
+ * prefetched pages, or keep track of prefetched pages and discard unread ones at end of scan.
+ *
+ * return code: is an int bitmask defined in bufmgr.h
+ PREFTCHRC_BUF_PIN_INCREASED 0x01 pin count on buffer has been increased by 1
+ PREFTCHRC_BLK_ALREADY_PRESENT 0x02 block was already present in a buffer
+ *
+ * PREFTCHRC_BLK_ALREADY_PRESENT is a hint to caller that the prefetch may be unnecessary
+ */
+int
+PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy)
+{
+ Buffer buf_id; /* indicates buffer containing the requested block */
+ int PrefetchBufferRc = 0; /* return value as described above */
+ int PinCountOnEntry = 0; /* pin count on entry */
+ int PinCountdelta = 0; /* pin count delta increase */
+
+
#ifdef USE_PREFETCH
+
+ buf_id = -1;
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+ /* now is a good time to check whether any BAiocb waiter-locks are pending for release */
+ if (BAiocbIolock_anchor != (struct BAiocbIolock_chain_item*)0) {
+ ProcessPendingBAiocbIolocks();
+ }
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
/* Open it at the smgr level if not already done */
RelationOpenSmgr(reln);
@@ -147,7 +209,12 @@ PrefetchBuffer(Relation reln, ForkNumber
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
+ int BufStartAsyncrc = -1; /* retcode from BufStartAsync :
+ ** 0 if started successfully (which implies buffer was newly pinned )
+ ** -1 if failed for some reason
+ ** 1+PrivateRefCount if we found desired buffer in buffer pool
+				** (we also set it this way here when we find the buffer in the buffer pool)
+ */
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
@@ -159,28 +226,121 @@ PrefetchBuffer(Relation reln, ForkNumber
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ if (buf_id >= 0) {
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+		BufStartAsyncrc = 1 + PinCountOnEntry; /* indicate this backend's pin count - see above comment */
+ PrefetchBufferRc = PREFTCHRC_BLK_ALREADY_PRESENT; /* indicate buffer present */
+ } else {
+ PrefetchBufferRc = 0; /* indicate buffer not present */
+ }
LWLockRelease(newPartitionLock);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ not_in_buffers:
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (buf_id < 0) {
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* try using async aio_read with a buffer */
+ BufStartAsyncrc = BufStartAsync( reln, forkNum, blockNum , strategy );
+ if (BufStartAsyncrc < 0) {
+ pgBufferUsage.aio_read_noblok++;
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP so try the alternative that does not read the block into a postgresql buffer */
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ }
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
+ if ( (buf_id >= 0) || (BufStartAsyncrc >= 1) ) {
+ /* The block *is* in buffers. */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ pgBufferUsage.aio_read_noneed++;
+#ifndef USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT /* jury is out on whether the following wins but it ought to ... */
+ /*
+ ** If this backend already had pinned it,
+ ** or another backend had banked a pin on it,
+ ** or there is an IO in progress,
+ ** or it is not marked valid,
+ ** then do nothing.
+ ** Otherwise pin it and mark the buffer's pin as banked by this backend.
+	** Note - it may or may not be pinned by another backend -
+ ** it is ok for us to bank a pin on it
+ ** *provided* the other backend did not bank its pin.
+ ** The reason for this is that the banked-pin indicator is global -
+ ** it can identify at most one process.
+ */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ if (BufStartAsyncrc == 1) { /* not pinned by me */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ /* note - all we can say with certainty is that the buffer is not pinned by me
+ ** we cannot be sure that it is still in buffer pool
+ ** so must go through the entire locking and searching all over again ...
*/
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ /* since the block is now present,
+ ** save the current pin count to ensure final delta is calculated correctly
+ */
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ if ( PinCountOnEntry == 0) { /* paranoid check it's still not pinned by me */
+ volatile BufferDesc *buf_desc;
+
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ LockBufHdr(buf_desc);
+ if ( (buf_desc->flags & BM_VALID) /* buffer is valid */
+ && (!(buf_desc->flags & (BM_IO_IN_PROGRESS|BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))) /* buffer is not any of ... */
+ ) {
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* note - we can call PinBuffer_Locked with the BM_AIO_PREFETCH_PIN_BANKED flag set because it is not yet pinned by me */
+ buf_desc->freeNext = -(this_backend_pid); /* remember which pid banked it */
+ /* pgBufferUsage.aio_read_wasted--; overload counter - not wasted after all - only for debugging */
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ PinBuffer_Locked(buf_desc);
+ }
+ else {
+ UnlockBufHdr(buf_desc);
+ }
+ }
+ }
+ LWLockRelease(newPartitionLock);
+ /* although unlikely, maybe it was evicted while we were puttering about */
+ if (buf_id < 0) {
+ pgBufferUsage.aio_read_noneed--; /* back out the accounting */
+ goto not_in_buffers; /* and try again */
}
-#endif /* USE_PREFETCH */
}
+#endif /* USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ }
+
+ if (buf_id >= 0) {
+ PinCountdelta = PrivateRefCount[buf_id] - PinCountOnEntry; /* pin count delta increase */
+ if ( (PinCountdelta < 0) || (PinCountdelta > 1) ) {
+ elog(ERROR,
+ "PrefetchBuffer #%d : incremented pin count by %d on bufdesc %p refcount %u localpins %d\n"
+ ,(buf_id+1) , PinCountdelta , &BufferDescriptors[buf_id] ,BufferDescriptors[buf_id].refcount , PrivateRefCount[buf_id]);
+		}
+ } else
+ if (BufStartAsyncrc == 0) { /* aio started successfully (which implies buffer was newly pinned ) */
+ PinCountdelta = 1;
+ }
+
+ /* set final PrefetchBufferRc according to previous value */
+ PrefetchBufferRc |= PinCountdelta; /* set the PREFTCHRC_BUF_PIN_INCREASED bit */
+ }
+
+#endif /* USE_PREFETCH */
+
+ return PrefetchBufferRc; /* return value as described above */
+}
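To make the calling discipline described in the header comment above concrete, a caller would do roughly the following (a sketch only: MyScanState and remember_prefetched_block() are hypothetical bookkeeping, while PrefetchBuffer(), DiscardBuffer() and the PREFTCHRC_* bits are the ones this patch introduces):

    static void
    prefetch_one_block(Relation rel, BlockNumber blkno,
                       BufferAccessStrategy strategy, MyScanState *scan)
    {
        int rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno, strategy);

        /* only blocks whose pin count we raised need a later DiscardBuffer()
         * if the scan ends before we ever read them */
        if (rc & PREFTCHRC_BUF_PIN_INCREASED)
            remember_prefetched_block(scan, blkno);
    }

    /* ... and at early scan termination, for each remembered block that was
     * never read:  DiscardBuffer(rel, MAIN_FORKNUM, blkno);  */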
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -234,6 +394,13 @@ ReadBufferExtended(Relation reln, ForkNu
bool hit;
Buffer buf;
+#if defined(USE_PREFETCH) && defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+ /* now is a good time to check whether any BAiocb waiter-locks are pending for release */
+ if (BAiocbIolock_anchor != (struct BAiocbIolock_chain_item*)0) {
+ ProcessPendingBAiocbIolocks();
+ }
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
+
/* Open it at the smgr level if not already done */
RelationOpenSmgr(reln);
@@ -253,7 +420,7 @@ ReadBufferExtended(Relation reln, ForkNu
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
+ forkNum, blockNum, mode, strategy, &hit , 0);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
@@ -281,7 +448,7 @@ ReadBufferWithoutRelcache(RelFileNode rn
Assert(InRecovery);
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
- mode, strategy, &hit);
+ mode, strategy, &hit , 0);
}
@@ -289,15 +456,18 @@ ReadBufferWithoutRelcache(RelFileNode rn
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
+ * index_for_aio, if negative, is the negative of (the index of the aiocb in the BufferAiocbs array + 3),
+ * which is passed through to StartBufferIO
*/
-static Buffer
+Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy, bool *hit , int index_for_aio )
{
volatile BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ int allocrc; /* retcode from BufferAlloc */
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -329,16 +499,40 @@ ReadBuffer_common(SMgrRelation smgr, cha
}
else
{
+ allocrc = mode; /* pass mode to BufferAlloc since it must not wait for async io if RBM_NOREAD_FOR_PREFETCH */
/*
* lookup the buffer. IO_IN_PROGRESS is set if the requested block is
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
- if (found)
- pgBufferUsage.shared_blks_hit++;
+ strategy, &allocrc , index_for_aio );
+ if (allocrc < 0) {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ bufBlock = BufHdrGetBlock(bufHdr);
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ found = true;
+ }
+ else if (allocrc > 0) {
+ pgBufferUsage.shared_blks_hit++;
+ found = true;
+ }
+ else {
pgBufferUsage.shared_blks_read++;
+ found = false;
+ }
}
/* At this point we do NOT hold any locks. */
@@ -411,7 +605,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
Assert(bufHdr->flags & BM_VALID);
bufHdr->flags &= ~BM_VALID;
UnlockBufHdr(bufHdr);
- } while (!StartBufferIO(bufHdr, true));
+ } while (!StartBufferIO(bufHdr, true, 0));
}
}
@@ -431,6 +625,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (mode != RBM_NOREAD_FOR_PREFETCH) {
if (isExtend)
{
/* new buffers are zero-filled */
@@ -500,6 +695,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
VacuumPageMiss++;
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageMiss;
+ }
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -521,21 +717,39 @@ ReadBuffer_common(SMgrRelation smgr, cha
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
- * The returned buffer is pinned and is already marked as holding the
- * desired page. If it already did have the desired page, *foundPtr is
- * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be used for any StartBufferIO performed by this routine.
+ * In this case, if block not found in buffer pool and we allocate a new buffer,
+ * then we must maintain the spinlock on the buffer and pass it back to caller.
+ *
+ * foundPtr is input and output :
+ * . input - indicates the read-buffer mode ( see bufmgr.h )
+ * . output - indicates the status of the buffer - see below
+ *
+ * Except for the case of RBM_NOREAD_FOR_PREFETCH and buffer is found,
+ * the returned buffer is pinned and is already marked as holding the
+ * desired page.
+ * If it already did have the desired page and page content is valid,
+ * *foundPtr is set to 1
+ * If it already did have the desired page and mode is RBM_NOREAD_FOR_PREFETCH
+ * and StartBufferIO returned false
+ * (meaning it could not initialise the buffer for aio)
+ * *foundPtr is set to 2
+ * If it already did have the desired page but page content is invalid,
+ * *foundPtr is set to -1
+ * this can happen only if the buffer was read by an async read
+ * and the aio is still in progress or pinned by the issuer of the startaio.
+ * Otherwise, *foundPtr is set to 0 and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
- *
- * No locks are held either at entry or exit.
+ * No locks are held either at entry or exit EXCEPT for case noted above
+ * of passing an empty buffer back to async io caller ( index_for_aio set ).
*/
static volatile BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ int *foundPtr , int index_for_aio )
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
@@ -547,6 +761,13 @@ BufferAlloc(SMgrRelation smgr, char relp
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ int IntentionBufferrc; /* retcode from BufCheckAsync */
+ bool StartBufferIOrc; /* retcode from StartBufferIO */
+ ReadBufferMode mode;
+
+
+ mode = *foundPtr;
+ *foundPtr = 0;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -561,21 +782,53 @@ BufferAlloc(SMgrRelation smgr, char relp
if (buf_id >= 0)
{
/*
- * Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * Found it.
*/
+ *foundPtr = 1;
buf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
+ /* If prefetch mode, then return immediately indicating found,
+ ** and NOTE in this case only, we did not pin buffer.
+ ** In theory we might try to check whether the buffer is valid, io in progress, etc
+ ** but in practice it is simpler to abandon the prefetch if the buffer exists
+ */
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ /* release the mapping lock and return */
LWLockRelease(newPartitionLock);
+ } else {
+ /* note that the current request is for same tag as the one associated with the aio -
+ ** so simply complete the aio and we have our buffer.
+ ** If an aio was started on this buffer,
+ ** check complete and wait for it if not.
+ ** And, if aio had been started, then the task
+ ** which issued the start aio already pinned it for this read,
+ ** so if that task was me and the aio was successful,
+ ** pass the current pin to this read without dropping and re-acquiring.
+ ** this is all done by BufCheckAsync
+ */
+ IntentionBufferrc = BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_WANT , strategy , index_for_aio , false , newPartitionLock );
- *foundPtr = TRUE;
+ /* check to see if the correct data has been loaded into the buffer. */
+ valid = (IntentionBufferrc == BUF_INTENT_RC_VALID);
- if (!valid)
- {
+ /* check for serious IO errors */
+ if (!valid) {
+ if ( (IntentionBufferrc != BUF_INTENT_RC_INVALID_NO_AIO)
+ && (IntentionBufferrc != BUF_INTENT_RC_INVALID_AIO)
+ ) {
+ *foundPtr = -1; /* inform caller of serious error */
+ }
+ else
+ if (IntentionBufferrc == BUF_INTENT_RC_INVALID_AIO) {
+ goto proceed_with_not_found; /* yes, I know, a goto ... think of it as a break out of the if */
+ }
+ }
+
+ /* BufCheckAsync pinned the buffer */
+ /* so can now release the mapping lock */
+ LWLockRelease(newPartitionLock);
+
+ if (!valid) {
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
@@ -583,19 +836,21 @@ BufferAlloc(SMgrRelation smgr, char relp
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(buf, true, index_for_aio))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ }
}
}
return buf;
}
+ proceed_with_not_found:
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
@@ -620,8 +875,10 @@ BufferAlloc(SMgrRelation smgr, char relp
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf->flags;
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ /* If an aio was started on this buffer,
+ ** check complete and cancel it if not.
+ */
+ BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_REJECT_OBTAIN_PIN , 0 , index_for_aio, true , 0 );
/* Now it's safe to release the freelist lock */
if (lock_held)
@@ -792,13 +1049,18 @@ BufferAlloc(SMgrRelation smgr, char relp
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc)
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ } else
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
}
}
@@ -861,10 +1123,17 @@ BufferAlloc(SMgrRelation smgr, char relp
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
- if (StartBufferIO(buf, true))
- *foundPtr = FALSE;
- else
- *foundPtr = TRUE;
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc) {
+ *foundPtr = 0;
+ } else {
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
+ } else {
+ *foundPtr = 1;
+ }
+ }
return buf;
}
@@ -971,6 +1240,10 @@ retry:
/*
* Insert the buffer at the head of the list of free buffers.
*/
+ /* avoid confusing freelist with strange-looking freeNext */
+ if (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN) { /* means was used for aiocb index */
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ }
StrategyFreeBuffer(buf);
}
@@ -1023,6 +1296,56 @@ MarkBufferDirty(Buffer buffer)
UnlockBufHdr(bufHdr);
}
+/* return the blocknum of the block in a buffer if it is valid
+** if a shared buffer, it must be pinned
+*/
+BlockNumber
+BlocknumOfBuffer(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ BlockNumber rc = 0;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc = bufHdr->tag.blockNum;
+ }
+
+ return rc;
+}
+
+/* report whether the specified buffer contains a block different from the one requested (returns true if different)
+** if a shared buffer, it must be pinned
+*/
+bool
+BlocknotinBuffer(Buffer buffer,
+ Relation relation,
+ BlockNumber blockNum)
+{
+ volatile BufferDesc *bufHdr;
+ bool rc = false;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc =
+ ( (bufHdr->tag.blockNum != blockNum)
+ || (!(RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) ))
+ || (bufHdr->tag.forkNum != MAIN_FORKNUM)
+ );
+ }
+
+ return rc;
+}
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
@@ -1041,18 +1364,18 @@ ReleaseAndReadBuffer(Buffer buffer,
Relation relation,
BlockNumber blockNum)
{
- ForkNumber forkNum = MAIN_FORKNUM;
volatile BufferDesc *bufHdr;
+ bool isDifferentBlock; /* requesting different block from that already in buffer ? */
if (BufferIsValid(buffer))
{
+		/* if a shared buffer, we have pin, so it's ok to examine tag without spinlock */
+ isDifferentBlock = BlocknotinBuffer(buffer,relation,blockNum); /* requesting different block from that already in buffer ? */
if (BufferIsLocal(buffer))
{
Assert(LocalRefCount[-buffer - 1] > 0);
bufHdr = &LocalBufferDescriptors[-buffer - 1];
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ if (!isDifferentBlock)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
LocalRefCount[-buffer - 1]--;
@@ -1061,12 +1384,12 @@ ReleaseAndReadBuffer(Buffer buffer,
{
Assert(PrivateRefCount[buffer - 1] > 0);
bufHdr = &BufferDescriptors[buffer - 1];
- /* we have pin, so it's ok to examine tag without spinlock */
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ BufCheckAsync(0 , relation , bufHdr , ( isDifferentBlock ? BUF_INTENTION_REJECT_FORGET
+ : BUF_INTENTION_REJECT_KEEP_PIN )
+ , 0 , 0 , false , 0 ); /* end any IO and maybe unpin */
+ if (!isDifferentBlock) {
return buffer;
- UnpinBuffer(bufHdr, true);
+ }
}
}
@@ -1091,11 +1414,12 @@ ReleaseAndReadBuffer(Buffer buffer,
* Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows
* some callers to avoid an extra spinlock cycle.
*/
-static bool
+bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
int b = buf->buf_id;
bool result;
+ bool pin_already_banked_by_me = 0; /* buffer is already pinned by me and redeemable */
if (PrivateRefCount[b] == 0)
{
@@ -1117,12 +1441,34 @@ PinBuffer(volatile BufferDesc *buf, Buff
else
{
/* If we previously pinned the buffer, it must surely be valid */
+		/* Errr - is that really true??? I don't think so:
+		** what if I pin, start an IO that is still in progress, then mistakenly pin again?
result = true;
+ */
+ LockBufHdr(buf);
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ result = (buf->flags & BM_VALID) != 0;
+ UnlockBufHdr(buf);
}
+
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
return result;
}
@@ -1139,19 +1485,36 @@ PinBuffer(volatile BufferDesc *buf, Buff
* to save a spin lock/unlock cycle, because we need to pin a buffer before
* its state can change under us.
*/
-static void
+void
PinBuffer_Locked(volatile BufferDesc *buf)
{
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (PrivateRefCount[b] == 0)
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (PrivateRefCount[b] == 0) {
buf->refcount++;
+ }
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer_Locked : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
}
+}
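The banked-pin ownership test repeated inline in PinBuffer(), PinBuffer_Locked(), UnpinBuffer() and IncrBufferRefCount() boils down to the sketch below (PinBankedByMe is a hypothetical helper, not part of the patch; some call sites additionally require PrivateRefCount[] > 0, and the buffer header spinlock must be held):

    static inline bool
    PinBankedByMe(volatile BufferDesc *buf)
    {
        pid_t   owner;

        if (!(buf->flags & BM_AIO_PREFETCH_PIN_BANKED))
            return false;

        if (buf->flags & BM_AIO_IN_PROGRESS)
            /* freeNext points (encoded) at the BufferAiocb, which records the pid */
            owner = (BAiocbAnchr->BufferAiocbs
                     + (FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio;
        else
            /* otherwise the banking pid was stored negated in freeNext */
            owner = -(buf->freeNext);

        return owner == this_backend_pid;
    }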
/*
* UnpinBuffer -- make buffer available for replacement.
@@ -1161,29 +1524,68 @@ PinBuffer_Locked(volatile BufferDesc *bu
* Most but not all callers want CurrentResourceOwner to be adjusted.
* Those that don't should pass fixOwner = FALSE.
*/
-static void
+void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
+
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (fixOwner)
+ if (fixOwner) {
ResourceOwnerForgetBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
{
+
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
LockBufHdr(buf);
+ /* this backend has released last pin - buffer should not have pin banked by me
+ ** and if AIO in progress then there should be another backend pin
+ */
+ pin_already_banked_by_me = ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+ ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext))
+ ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ /* this is a strange situation - caller had a banked pin (which callers are supposed not to know about)
+ ** but either discovered it had it or has over-counted how many pins it has
+ */
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the pin although it is now of no use since about to release */
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+
+ /* temporarily suppress logging error to avoid performance degradation -
+ ** either this task really does not need the buffer in which case the error is harmless
+ ** or a more severe error will be detected later (possibly immediately below)
+ elog(LOG, "UnpinBuffer : released last this-backend pin on buffer %d rel=%s, blockNum=%u, but had banked pin flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ */
+ }
+
/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;
+ if ( (buf->refcount == 0) && (buf->flags & BM_AIO_IN_PROGRESS) ) {
+
+ elog(ERROR, "UnpinBuffer : released last any-backend pin on buffer %d rel=%s, blockNum=%u, but AIO in progress flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ }
+
+
/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
@@ -1658,6 +2060,7 @@ SyncOneBuffer(int buf_id, bool skip_rece
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
+
/*
* Check whether buffer needs writing.
*
@@ -1724,6 +2127,16 @@ void
InitBufferPoolBackend(void)
{
on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* init the aio subsystem max number of threads and max number of requests
+ ** max number of threads <--> max_async_io_prefetchers
+ ** max number of requests <--> numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers)
+ ** there is no return code so we just hope.
+ */
+ smgrinitaio(max_async_io_prefetchers , numBufferAiocbs);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
}
/*
@@ -1779,6 +2192,8 @@ PrintBufferLeakWarning(Buffer buffer)
char *path;
BackendId backend;
+
+
Assert(BufferIsValid(buffer));
if (BufferIsLocal(buffer))
{
@@ -1789,12 +2204,28 @@ PrintBufferLeakWarning(Buffer buffer)
else
{
buf = &BufferDescriptors[buffer - 1];
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* If reason that this buffer is pinned
+ ** is that it was prefetched with async_io
+ ** and never read or discarded, then omit the
+ ** warning, because this is expected in some
+ ** cases when a scan is closed abnormally.
+ ** Note that the buffer will be released soon by our caller.
+ */
+ if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED) {
+ pgBufferUsage.aio_read_forgot++; /* account for it */
+ return;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
loccount = PrivateRefCount[buffer - 1];
backend = InvalidBackendId;
}
+/* #if defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
/* theoretically we should lock the bufhdr here */
path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+
+
elog(WARNING,
"buffer refcount leak: [%03d] "
"(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1802,6 +2233,7 @@ PrintBufferLeakWarning(Buffer buffer)
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
pfree(path);
+/* #endif defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
}
/*
@@ -1918,7 +2350,7 @@ FlushBuffer(volatile BufferDesc *buf, SM
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, 0))
return;
/* Setup error traceback support for ereport() */
@@ -2502,6 +2934,77 @@ FlushDatabaseBuffers(Oid dbid)
}
}
+#ifdef USE_PREFETCH
+/*
+ * DiscardBuffer -- discard shared buffer used for a previously
+ * prefetched but unread block of a relation
+ *
+ * If the buffer is found and pinned with a banked pin, then :
+ * . if AIO in progress, terminate AIO without waiting
+ * . if AIO had already completed successfully,
+ * then mark buffer valid (in case someone else wants it)
+ * . redeem the banked pin and unpin it.
+ *
+ * This function is similar in purpose to ReleaseBuffer (below)
+ * but sufficiently different that it is a separate function.
+ * Two important differences are :
+ * . caller identifies buffer by blocknumber, not buffer number
+ * . we unpin buffer *only* if the pin is banked,
+ * *never* if pinned but not banked.
+ * This is essential as caller may perform a sequence of
+ * SCAN1 . PrefetchBuffer (and remember block was prefetched)
+ * SCAN2 . ReadBuffer (but fails to connect this read to the prefetch by SCAN1)
+ * SCAN1 . DiscardBuffer (SCAN1 terminates early)
+ * SCAN2 . access tuples in buffer
+ * Clearly the Discard *must not* unpin the buffer since SCAN2 needs it!
+ *
+ *
+ * caller may pass InvalidBlockNumber as blockNum to mean do nothing
+ */
+void
+DiscardBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLockId newPartitionLock; /* buffer partition lock for it */
+ Buffer buf_id;
+ volatile BufferDesc *buf_desc;
+
+#if defined(USE_PREFETCH) && defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+ /* now is a good time to check whether any BAiocb waiter-locks are pending for release */
+ if (BAiocbIolock_anchor != (struct BAiocbIolock_chain_item*)0) {
+ ProcessPendingBAiocbIolocks();
+ }
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
+
+ if (!SmgrIsTemp(reln->rd_smgr)) {
+ Assert(RelationIsValid(reln));
+ if (BlockNumberIsValid(blockNum)) {
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ BufCheckAsync(0 , reln, buf_desc , BUF_INTENTION_REJECT_UNBANK , 0 , 0 , false , 0); /* end the IO and unpin if banked */
+ pgBufferUsage.aio_read_discrd++; /* account for it */
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/*
* ReleaseBuffer -- release the pin on a buffer
*/
@@ -2510,26 +3013,23 @@ ReleaseBuffer(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
if (BufferIsLocal(buffer))
{
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]--;
return;
}
-
- bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- PrivateRefCount[buffer - 1]--;
else
- UnpinBuffer(bufHdr, false);
+ {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ BufCheckAsync(0 , 0 , bufHdr , BUF_INTENTION_REJECT_NOADJUST , 0 , 0 , false , 0 );
+ }
}
/*
@@ -2555,14 +3055,41 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
+ volatile BufferDesc *buf; /* descriptor for a shared buffer */
+
Assert(BufferIsPinned(buffer));
+
+ if (!(BufferIsLocal(buffer))) {
+ buf = &BufferDescriptors[buffer - 1];
+ LockBufHdr(buf);
+ pin_already_banked_by_me =
+ ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ }
+
+ if (!pin_already_banked_by_me) {
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
+ }
+
if (BufferIsLocal(buffer))
LocalRefCount[-buffer - 1]++;
- else
+ else {
+ if (pin_already_banked_by_me) {
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[buffer - 1]++;
}
+ }
+}
/*
* MarkBufferDirtyHint
@@ -2984,61 +3511,138 @@ WaitIO(volatile BufferDesc *buf)
*
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
- * has already started I/O on this buffer then we will block on the
+ * has already started synchronous I/O on this buffer then we will block on the
* io_in_progress lock until he's done.
*
+ * if an async io is in progress and we are doing synchronous io,
+ * then readbuffer waits via a call to smgrcompleteaio,
+ * and so here we treat this request as if no io were in progress
+ *
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
* so we can always tell if the work is already done.
*
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be attached to the buffer header for use with async io
+ *
* Returns TRUE if we successfully marked the buffer as I/O busy,
* FALSE if someone else already did the work.
*/
static bool
-StartBufferIO(volatile BufferDesc *buf, bool forInput)
+StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio )
{
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ if (!index_for_aio)
Assert(!InProgressBuf);
for (;;)
{
+ if (!index_for_aio) {
/*
* Grab the io_in_progress lock so that other processes can wait for
* me to finish the I/O.
*/
LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
+ }
LockBufHdr(buf);
- if (!(buf->flags & BM_IO_IN_PROGRESS))
+ /* the following test is intended to distinguish between :
+ ** . buffer which :
+ ** . has io in progress
+ ** AND is not associated with a current aio
+ ** . not the above
+ ** Here, "recent" means an aio marked by buf->freeNext <= FREENEXT_BAIOCB_ORIGIN but no longer in progress -
+ ** this situation arises when the aio has just been cancelled and this process now wishes to recycle the buffer.
+ ** In this case, the first such would-be recycler (i.e. me) must :
+ ** . avoid waiting for the cancelled aio to complete
+ ** . if not myself doing async read, then assume responsibility for posting other future readbuffers.
+ */
+ if ( (buf->flags & BM_AIO_IN_PROGRESS)
+ || (!(buf->flags & BM_IO_IN_PROGRESS))
+ )
break;
/*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
+ * The only way BM_IO_IN_PROGRESS (with no AIO in progress) could be set when the io_in_progress
* lock isn't held is if the process doing the I/O is recovering from
* an error (see AbortBufferIO). If that's the case, we must wait for
* him to get unwedged.
*/
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
WaitIO(buf);
}
- /* Once we get here, there is definitely no I/O active on this buffer */
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* Once we get here, there is definitely no synchronous I/O active on this buffer
+ ** but if being asked to attach a BufferAiocb to the buf header,
+ ** then we must also check if there is any async io currently
+ ** in progress or pinned started by a different task.
+ */
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext);
+ if ( (buf->flags & (BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))
+ && (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN)
+ && (BAiocb->pidOfAio != this_backend_pid)
+ ) {
+ /* someone else already doing async I/O */
+ UnlockBufHdr(buf);
+ return false;
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
if (forInput ? (buf->flags & BM_VALID) : !(buf->flags & BM_DIRTY))
{
/* someone else already did the I/O */
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
return false;
}
buf->flags |= BM_IO_IN_PROGRESS;
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - index_for_aio);
+ /* insist that no other buffer is using this BufferAiocb for async IO */
+ if (BAiocb->BAiocbbufh == (struct sbufdesc *)0) {
+ BAiocb->BAiocbbufh = buf;
+ }
+ if (BAiocb->BAiocbbufh != buf) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block %p to be used by %p already in use by %p"
+ ,BAiocb ,buf , BAiocb->BAiocbbufh)));
+ }
+		/* note - there is no need to register self as a dependent of BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ buf->flags |= BM_AIO_IN_PROGRESS;
+ buf->freeNext = index_for_aio;
+ /* at this point, this buffer appears to have an in-progress aio_read,
+ ** and any other task which is able to look inside the buffer might try waiting on that aio -
+ ** except we have not yet issued the aio! So we must keep the buffer header locked
+ ** from here all the way back to the BufStartAsync caller
+ */
+ } else {
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
UnlockBufHdr(buf);
InProgressBuf = buf;
IsForInput = forInput;
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
return true;
}
@@ -3048,7 +3652,7 @@ StartBufferIO(volatile BufferDesc *buf,
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
+ *	if no async IO in progress, then we hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
@@ -3060,26 +3664,32 @@ StartBufferIO(volatile BufferDesc *buf,
* BM_IO_ERROR in a failure case. For successful completion it could
* be 0, or BM_VALID if we just finished reading in the page.
*/
-static void
+void
TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits)
{
- Assert(buf == InProgressBuf);
+ int flags_on_entry;
LockBufHdr(buf);
+ flags_on_entry = buf->flags;
+
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) )
+ Assert( buf == InProgressBuf );
+
Assert(buf->flags & BM_IO_IN_PROGRESS);
- buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
+ buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) ) {
InProgressBuf = NULL;
-
LWLockRelease(buf->io_in_progress_lock);
}
+}
/*
* AbortBufferIO: Clean up any active buffer I/O after an error.
--- src/backend/storage/buffer/buf_async.c.orig 2014-06-25 17:33:03.176961989 -0400
+++ src/backend/storage/buffer/buf_async.c 2014-06-25 18:10:51.120521316 -0400
@@ -0,0 +1,931 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_async.c
+ * buffer manager asynchronous disk read routines
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/buffer/buf_async.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * Principal entry points:
+ *
+ * BufStartAsync() -- start an asynchronous read of a block into a buffer and
+ * pin it so that no one can destroy it while this process is using it.
+ *
+ * BufCheckAsync() -- check completion of an asynchronous read
+ * and either claim buffer or discard it
+ *
+ * Private helper
+ *
+ * BufReleaseAsync() -- release the BAiocb resources used for an asynchronous read
+ *
+ * See also these files:
+ * bufmgr.c -- main buffer manager functions
+ * buf_init.c -- initialisation of resources
+ */
+#include "postgres.h"
+#include <sys/types.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "catalog/catalog.h"
+#include "common/relpath.h"
+#include "executor/instrument.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "storage/standby.h"
+#include "utils/rel.h"
+#include "utils/resowner_private.h"
+
+/*
+ * GUC parameters
+ */
+int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
+
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+extern int maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+extern int maxRelBAiocbTries; /* max times we will try to release a BufferAiocb back to freelist */
+extern pid_t this_backend_pid; /* pid of this backend */
+
+extern bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+extern void PinBuffer_Locked(volatile BufferDesc *buf);
+extern Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+ ForkNumber forkNum, BlockNumber blockNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy,
+ bool *hit , int index_for_aio);
+extern void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+extern void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+ int set_flag_bits);
+int BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc
+ ,int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+
+static struct BufferAiocb volatile * cachedBAiocb = (struct BufferAiocb*)0; /* one cached BufferAiocb for use with aio */
+
+/* BufReleaseAsync releases a BufferAiocb and returns 0 if successful else non-zero
+** it *must* be called :
+** EITHER with a valid BAiocb->BAiocbbufh -> buf_desc
+** and that buf_desc must be spin-locked
+** OR with BAiocb->BAiocbbufh == 0
+*/
+static int
+BufReleaseAsync(struct BufferAiocb volatile * BAiocb)
+{
+ int LockTries; /* max times we will try to release the BufferAiocb */
+ volatile struct BufferAiocb *BufferAiocbs;
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+
+ int failed = 1; /* by end of this function, non-zero will indicate if we failed to return the BAiocb */
+
+
+ if ( ( BAiocb == (struct BufferAiocb*)0 )
+ || ( BAiocb == (struct BufferAiocb*)BAIOCB_OCCUPIED )
+ || ( ((unsigned long)BAiocb) & 0x1 )
+ ) {
+ elog(ERROR,
+ "AIO control block corruption on release of aiocb %p - invalid BAiocb"
+ ,BAiocb);
+ }
+ else
+ if ( (0 == BAiocb->BAiocbDependentCount) /* no dependents */
+ && ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) /* not already on freelist */
+ ) {
+
+ if ((struct sbufdesc*)0 != BAiocb->BAiocbbufh) { /* if a buffer was attached */
+ volatile BufferDesc *buf_desc = BAiocb->BAiocbbufh;
+
+ /* spinlock held so instead of TerminateBufferIO(buf, false , 0); ... */
+ if (buf_desc->flags & BM_AIO_PREFETCH_PIN_BANKED) { /* if a pid banked the pin */
+ buf_desc->freeNext = -(BAiocb->pidOfAio); /* then remember which pid */
+ }
+ else if (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* disconnect BufferAiocb from buf_desc */
+ }
+ buf_desc->flags &= ~BM_AIO_IN_PROGRESS;
+ }
+
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* disconnect buf_desc from BufferAiocb */
+ BAiocb->pidOfAio = 0; /* clean */
+ LockTries = maxRelBAiocbTries; /* max times we will try to release the BufferAiocb */
+ do {
+ register long long int dividend , remainder;
+
+ /* retrieve old value of FreeBAiocbs */
+ BAiocb->BAiocbnext = oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* this is a volatile value unprotected by any lock, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs
+		**	to do so, verify by direct division that its address offset from the first control block
+		**	is an integral multiple of the control block size,
+		**	with a quotient in the range [ 0 , (numBufferAiocbs-1) ]
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ failed = (int)remainder;
+ if (!failed) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ failed = ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) );
+ if (!failed) {
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, BAiocb)) {
+ LockTries = 0; /* end the do loop */
+
+						goto cheering;	/* can't simply break because then failed would be set incorrectly */
+ }
+ }
+ }
+			/* if we reach here, we have failed and "failed" is non-zero */
+
+ cheering: ;
+
+ if ( LockTries > 1 ) {
+			sched_yield();	   /* yield to another process (hopefully a backend) */
+ }
+ } while (LockTries-- > 0);
+
+ if (failed) {
+#ifdef LOG_RELBAIOCB_DEPLETION
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p unreleased after tries= %d\n"
+ ,BAiocb,maxRelBAiocbTries);
+#endif /* LOG_RELBAIOCB_DEPLETION */
+ }
+
+ }
+ else
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p either has dependents= %d or is already on freelist %p or has no buf_header %p\n"
+ ,BAiocb , BAiocb->BAiocbDependentCount , BAiocb->BAiocbnext , BAiocb->BAiocbbufh);
+ return failed;
+}
+
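For reference, the freelist acquisition in BufStartAsync() below follows the classic lock-free pop pattern; a minimal generic sketch (stand-alone types, not the patch's code) is shown here, and the ABA hazard it leaves open is exactly what the validation described in the comments inside BufStartAsync() guards against:

    typedef struct FreeNode
    {
        struct FreeNode *next;
    } FreeNode;

    static FreeNode *
    freelist_pop(FreeNode *volatile *head)
    {
        FreeNode   *old_head;
        FreeNode   *new_head;

        do
        {
            old_head = *head;
            if (old_head == NULL)
                return NULL;                /* freelist empty */
            new_head = old_head->next;      /* may be stale: ABA hazard */
        } while (!__sync_bool_compare_and_swap(head, old_head, new_head));

        return old_head;
    }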
+/* try using asynchronous aio_read to prefetch into a buffer
+** return code :
+** 0 if started successfully
+** -1 if failed for some reason
+** 1+PrivateRefCount if we found desired buffer in buffer pool
+**
+** There is a harmless race condition here :
+** two different backends may both arrive here simultaneously
+** to prefetch the same buffer. This is not unlikely when a syncscan is in progress.
+** . One will acquire the buffer and issue the smgrstartaio
+** . Other will find the buffer on return from ReadBuffer_common with hit = true
+**	  .  The other will find the buffer on return from ReadBuffer_common with hit = true
+** on a found buffer in prefetch mode.
+** Therefore - the second task must simply abandon the prefetch if it finds the buffer in the buffer pool.
+**
+** if we fail to acquire a BAiocb because of concurrent theft from the freelist by another backend,
+** retry up to maxGetBAiocbTries times provided that there actually was at least one BAiocb on the freelist.
+*/
+int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy) {
+
+ int retcode = -1;
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+ int smgrstartaio_rc = -1; /* retcode from smgrstartaio */
+ bool do_unpin_buffer = false; /* unpin must be deferred until after buffer descriptor is unlocked */
+ Buffer buf_id;
+ bool hit = false;
+ volatile BufferDesc *buf_desc = (BufferDesc *)0;
+
+ int LockTries; /* max times we will try to get a free BufferAiocb */
+
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+ struct BufferAiocb volatile * newFreeBAiocb; /* new value of FreeBAiocbs */
+
+
+ /* return immediately if no async io resources */
+ if (numBufferAiocbs > 0) {
+ buf_id = (Buffer)0;
+
+ if ( (struct BAiocbAnchor *)0 != BAiocbAnchr ) {
+
+ volatile struct BufferAiocb *BufferAiocbs;
+
+ if ((struct BufferAiocb*)0 != cachedBAiocb) { /* any cached BufferAiocb ? */
+ BAiocb = cachedBAiocb; /* yes use it */
+ cachedBAiocb = BAiocb->BAiocbnext;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+ } else {
+
+ LockTries = maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+ do {
+ register long long int dividend = -1 , remainder;
+ /* check if we have a free BufferAiocb */
+
+ oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* BAiocbAnchr->FreeBAiocbs is a volatile value unprotected by any lock,
+ ** and use of compare-and-swap to add and remove items from the list has
+ ** two potential pitfalls, both relating to the fact that we must
+ ** access data de-referenced from this pointer before the compare-and-swap.
+ ** 1) The value we load may be corrupt, e.g. mixture of bytes from
+ ** two different values, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+                               **      To do so, verify by direct division that its address offset from the
+                               **      first control block is an integral multiple of the control block size
+                               **      and that the resulting index lies within the range [ 0 , (numBufferAiocbs-1) ]
+ ** Thus we completely prevent this pitfall.
+ ** 2) The content of the item's next pointer may have changed between the
+ ** time we de-reference it and the time of the compare-and-swap.
+ ** Thus even though the compare-and-swap succeeds, we might set the
+ ** new head of the freelist to an invalid value (either a free item
+ ** that is not the first in the free chain - resulting only in
+ ** loss of the orphaned free items, or, much worse, an in-use item).
+                               **      In practice this is extremely unlikely because it would require a huge delay
+                               **      within this window interval in this (current) process.  Here are two scenarios:
+ ** legend:
+ ** P0 - this (current) process, P1, P2 , ... other processes
+ ** content of freelist shown as BAiocbAnchr->FreeBAiocbs -> first item -> 2nd item ...
+ ** @[X] means address of X
+ ** | timeline of window of exposure to problems
+ ** successive lines in chronological order content of freelist
+ ** 2.1 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 IS IN USE !! CORRUPT !!
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had become in-use during the window.
+ ** 2.2 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P3 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I2] F -> I2 -> I3 ...
+ ** | P3 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I3] F -> I2 -> I3 ...
+ ** | P3 swap-remove I2, place I3 at head of list F -> I3 ...
+ ** | P2 complete aio, replace I1 at head of list F -> I1 -> I3 ...
+ ** | P3 complete aio, replace I2 at head of list F -> I2 -> I1 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I1 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 -> I3 ... ! I2 is orphaned !
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had moved further down the free list during the window.
+ ** Unfortunately, we cannot prevent this pitfall but we can detect it (after the fact),
+                               **      by checking that the next pointer of the item we have just removed for our use still points to the item we installed as the new head.
+ ** This test is not subject to any timing or uncertainty since :
+ ** . The fact that the compare-and-swap succeeded implies that the item we removed
+                               **        was definitely on the freelist (at the head) when it was removed,
+ ** and therefore cannot be in use, and therefore its next pointer is no longer volatile.
+ ** . Although pointers of the anchor and items on the freelist are volatile,
+ ** the addresses of items never change - they are in an allocated array and never move.
+ ** E.g. in the above two scenarios, the test is that I0.next still -> I1,
+ ** and this is true if and only if the second item on the freelist is
+ ** still the same at the end of the window as it was at the start of the window.
+ ** Note that we do not insist that it did not change during the window,
+ ** only that it is still the correct new head of freelist.
+ ** If this test fails, we abort immediately as the subsystem is damaged and cannot be repaired.
+ ** Note that at least one aio must have been issued *and* completed during the window
+ ** for this to occur, and since the window is just one single machine instruction,
+ ** it is very unlikely in practice.
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ if (remainder == 0) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ }
+ if ( (remainder == 0)
+ && ( (dividend >= 0 ) && ( dividend < numBufferAiocbs) )
+ )
+ {
+ newFreeBAiocb = oldFreeBAiocb->BAiocbnext; /* tentative new value is second on free list */
+ /* Here we are in the exposure window referred to in the above comments,
+ ** so moving along rapidly ...
+ */
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, newFreeBAiocb)) { /* did we get it ? */
+ /* We have successfully swapped head of freelist pointed to by oldFreeBAiocb off the list;
+ ** Here we check that the item we just placed at head of freelist, pointed to by newFreeBAiocb,
+ ** is the right one
+ **
+ ** also check that the BAiocb we have acquired was not in use
+ ** i.e. that scenario 2.1 above did not occur just before our compare-and-swap
+ ** The test is that the BAiocb is not in use.
+ **
+ ** in one hypothetical case,
+ ** we can be certain that there is no corruption -
+ ** the case where newFreeBAiocb == 0 and oldFreeBAiocb->BAiocbnext != BAIOCB_OCCUPIED -
+ ** i.e. we have set the freelist to empty but we have a baiocb chained from ours.
+ ** in this case our comp_swap removed all BAiocbs from the list (including ours)
+ ** so the others chained from ours are either orphaned (no harm done)
+ ** or in use by another backend and will eventually be returned (fine).
+ */
+ if ((struct BufferAiocb *)0 == newFreeBAiocb) {
+ if ((struct BufferAiocb *)BAIOCB_OCCUPIED == oldFreeBAiocb->BAiocbnext) {
+ goto baiocb_corruption;
+ } else if ((struct BufferAiocb *)0 != oldFreeBAiocb->BAiocbnext) {
+ elog(LOG,
+ "AIO control block inconsistency on acquiring aiocb %p - its next free %p may be orphaned (no corruption has occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext);
+ }
+ } else {
+ /* case of newFreeBAiocb not null - so must check more carefully ... */
+ remainder = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ dividend = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+
+ if ( (newFreeBAiocb != oldFreeBAiocb->BAiocbnext)
+ || (remainder != 0)
+ || ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) )
+ ) {
+ goto baiocb_corruption;
+ }
+ }
+ BAiocb = oldFreeBAiocb;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+
+ LockTries = 0; /* end the do loop */
+
+ }
+ }
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while ( ((struct BufferAiocb*)0 == BAiocb) /* did not get a BAiocb */
+ && ((struct BufferAiocb*)0 != oldFreeBAiocb) /* there was a free BAiocb */
+ && (LockTries-- > 0) /* told to retry */
+ );
+ }
+ }
+
+ if ( BAiocb != (struct BufferAiocb*)0 ) {
+ /* try an async io */
+ BAiocb->BAiocbthis.aio_fildes = -1; /* necessary to ensure any thief realizes aio not yet started */
+ BAiocb->pidOfAio = this_backend_pid;
+
+ /* now try to acquire a buffer :
+ ** note - ReadBuffer_common returns hit=true if the block is found in the buffer pool,
+ ** in which case there is no need to prefetch.
+ ** otherwise ReadBuffer_common pins returned buffer and calls StartBufferIO
+ ** and StartBufferIO :
+ ** . sets buf_desc->freeNext to negative of ( index of the aiocb in the BufferAiocbs array + 3 )
+ ** . sets BAiocb->BAiocbbufh -> buf_desc
+ ** and in this case the buffer spinlock is held.
+ ** This is essential as no other task must issue any intention with respect
+ ** to the buffer until we have started the aio_read.
+ ** Also note that ReadBuffer_common handles enlarging the ResourceOwner buffer list as needed
+               ** so I don't need to do that
+ */
+ buf_id = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
+ forkNum, blockNum
+ ,RBM_NOREAD_FOR_PREFETCH /* tells ReadBuffer not to do any read, just alloc buf */
+ ,strategy , &hit , (FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))));
+ buf_desc = &BufferDescriptors[buf_id-1]; /* find buffer descriptor */
+
+ /* normally hit will be false as presumably it was not in the pool
+ ** when our caller looked - but it could be there now ...
+ */
+ if (hit) {
+ /* see earlier comments - we must abandon the prefetch */
+ retcode = 1 + PrivateRefCount[buf_id];
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ } else
+ if ( (buf_id > 0) && ((BufferDesc *)0 != buf_desc) && (buf_desc == BAiocb->BAiocbbufh) ) {
+               /* buffer descriptor header lock should be held.
+               ** However, just to be safe, validate now that
+               ** we are still the owner and no other task has stolen it.
+ */
+
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* ensure no banked pin */
+ /* there should not be any other pid waiting on this buffer
+                       ** so check that neither BM_VALID nor BM_PIN_COUNT_WAITER is set
+ */
+ if ( ( !(buf_desc->flags & (BM_VALID|BM_PIN_COUNT_WAITER) ) )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) /* it is still mine */
+ && (-1 == BAiocb->BAiocbthis.aio_fildes) /* no thief stole it */
+ && (0 == BAiocb->BAiocbDependentCount) /* no dependent */
+ ) {
+ /* we have an empty buffer for our use */
+
+ BAiocb->BAiocbthis.aio_buf = (void *)(BufHdrGetBlock(buf_desc)); /* Location of actual buffer. */
+
+                               /* note - there is no need to register self as a dependent of the BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ /* smgrstartaio retcode is returned in smgrstartaio_rc -
+ ** it indicates whether started or not
+ */
+ smgrstartaio(reln->rd_smgr, forkNum, blockNum , (char *)&(BAiocb->BAiocbthis) , &smgrstartaio_rc
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on BAiocb lock instead of polling */
+ , (void *)&(BAiocb->BAiocbIolockItem)
+#endif /* USE_AIO_SIGEVENT */
+ );
+
+ if (smgrstartaio_rc == 0) {
+ retcode = 0;
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+                                       /* we did not register ourselves as a dependent of the BAiocb, so no need to unregister */
+ } else {
+ /* failed - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ /* spinlock held so instead of TerminateBufferIO(buf_desc, false , 0); ... */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS | BM_AIO_PREFETCH_PIN_BANKED | BM_VALID);
+                                       /* we did not register ourselves as a dependent of the BAiocb, so no need to unregister */
+
+ /* return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+
+ pgBufferUsage.aio_read_failed++;
+ smgrstartaio_rc = 1; /* to distinguish from aio not even attempted */
+ }
+ }
+ else {
+ /* buffer was stolen or in use by other task - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ }
+
+ UnlockBufHdr(buf_desc);
+ if (do_unpin_buffer) {
+ if (smgrstartaio_rc >= 0) { /* if aio was attempted */
+ TerminateBufferIO(buf_desc, false , 0);
+ }
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+ else {
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ }
+
+ if ((struct sbufdesc*)0 == BAiocb->BAiocbbufh) { /* we did not associate a buffer */
+ /* so return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+ }
+ }
+
+ return retcode;
+
+ baiocb_corruption:;
+ elog(PANIC,
+ "AIO control block corruption on acquiring aiocb %p - its next free %p conflicts with new freelist pointer %p which may be invalid (corruption may have occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext , newFreeBAiocb);
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
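
The pop side and its after-the-fact staleness check (scenarios 2.1/2.2 in the comments above)
can likewise be sketched standalone.  This is illustrative only: the names are invented, the
empty-new-head special case is folded into a null check, the patch marks an acquired element
with BAIOCB_OCCUPIED where the sketch just nulls the link, and where the patch PANICs via
elog the sketch simply calls abort().

    #include <stddef.h>
    #include <stdlib.h>

    #define NELEM 8

    struct node
    {
        struct node *next;
    };

    static struct node  pool[NELEM];
    static struct node *free_head = NULL;

    static int
    valid_element(struct node *p)
    {
        ptrdiff_t off = (char *) p - (char *) pool;

        if (off % (ptrdiff_t) sizeof(struct node) != 0)
            return 0;
        off /= (ptrdiff_t) sizeof(struct node);
        return off >= 0 && off < NELEM;
    }

    /* pop the head of the freelist, or return NULL if empty / lost a race */
    static struct node *
    free_pop(void)
    {
        struct node *old_head = free_head;      /* unlocked read */
        struct node *new_head;

        if (old_head == NULL || !valid_element(old_head))
            return NULL;

        new_head = old_head->next;              /* "window of exposure" opens here */
        if (!__sync_bool_compare_and_swap(&free_head, old_head, new_head))
            return NULL;                        /* head changed under us; caller may retry */

        /*
         * After the swap old_head is ours, so its next pointer is stable: if it
         * no longer matches the head we installed, the list moved during the
         * window and the new head may be wrong (the patch PANICs at this point).
         */
        if (new_head != NULL &&
            (old_head->next != new_head || !valid_element(new_head)))
            abort();

        old_head->next = NULL;                  /* mark as off-list / in use */
        return old_head;
    }

    int
    main(void)
    {
        int i;

        for (i = NELEM - 1; i >= 0; i--)        /* build the freelist single-threaded */
        {
            pool[i].next = free_head;
            free_head = &pool[i];
        }
        while (free_pop() != NULL)
            ;
        return 0;
    }
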
+
+/*
+ * BufCheckAsync -- act upon caller's intention regarding a shared buffer,
+ * primarily in connection with any async io in progress on the buffer.
+ *  The intention argument has two main classes and some subvalues within those (listed below as class, subvalue, meaning) :
+ * +ve 1 . want <=> caller wants the buffer,
+ * wait for in-progress aio and then always pin
+ * -ve . reject <=> caller does not want the buffer,
+ * if there are no dependents, then cancel the aio
+ * -1, -2 , -3 , ... (see below) and then optionally unpin
+ * Used when there may have been a previous fetch or prefetch.
+ *
+ * buffer is assumed to be an existing member of the shared buffer pool
+ * as returned by BufTableLookup.
+ * if AIO in progress, then :
+ * . terminate AIO, waiting for completion if +ve intention, else without waiting
+ * . if the AIO had already completed successfully, then mark buffer valid
+ * . pin/unpin as requested
+ *
+ * +ve intention indicates that buffer must be pinned :
+ * if the strategy parameter is null, then use the PinBuffer_Locked optimization
+ * to pin and unlock in one operation. But always update buffer usage count.
+ *
+ * -ve intention indicates whether and how to unpin :
+ * BUF_INTENTION_REJECT_KEEP_PIN -1 pin already held, do not unpin, (caller wants to keep it)
+ * BUF_INTENTION_REJECT_OBTAIN_PIN -2 obtain pin, caller wants it for same buffer
+ * BUF_INTENTION_REJECT_FORGET -3 unpin and tell resource owner to forget
+ * BUF_INTENTION_REJECT_NOADJUST -4 unpin and call ResourceOwnerForgetBuffer myself
+ *                 instead of telling UnpinBuffer to adjust CurrentResourceOwner
+ * (quirky simulation of ReleaseBuffer logic)
+ * BUF_INTENTION_REJECT_UNBANK -5 unpin only if pin banked by caller
+ * The behaviour for the -ve case is based on that of ReleaseBuffer, adding handling of async io.
+ *
+ * pin/unpin action must take account of whether this backend holds a "disposable" pin on the particular buffer.
+ * A "disposable" pin is a pin acquired by buffer manager without caller knowing, such as :
+ * when required to safeguard an async AIO - pin can be held across multiple bufmgr calls
+ * when required to safeguard waiting for an async AIO - pin acquired and released within this function
+ * if a disposable pin is held, then :
+ * if a new pin is requested, the disposable pin must be retained (redeemed) and any flags relating to it unset
+ * if an unpin is requested, then :
+ * if either no AIO in progress or this backend did not initiate the AIO
+ * then the disposable pin must be dropped (redeemed) and any flags relating to it unset
+ * else log warning and do nothing
+ * i.e. in either case, there is no longer a disposable pin after this function has completed.
+ * Note that if intention is BUF_INTENTION_REJECT_UNBANK,
+ * then caller expects there to be a disposable banked pin
+ * and if there isn't one, we do nothing
+ * for all other intentions, if there is no disposable pin, we pin/unpin normally.
+ *
+ * index_for_aio indicates the BAiocb to be used for next aio (see PrefetchBuffer)
+ * spinLockHeld indicates whether the buffer header spinlock is already held
+ * PartitionLock is the buffer partition lock to be used
+ *
+ * return code (meaningful ONLY if intention is +ve) indicates validity of buffer :
+ * -1 buffer is invalid and failed PageHeaderIsValid check
+ * 0 buffer is not valid
+ * 1 buffer is valid
+ * 2 buffer is valid but tag changed - (so content does not match the relation block that caller expects)
+ */
+int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, BufferDesc volatile * buf_desc, int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock )
+{
+
+ int retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ bool valid = false;
+
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ int smgrcompleteaio_rc; /* retcode from smgrcompleteaio */
+ SMgrRelation smgr = caller_smgr;
+ int BAiocbDependentCount_after_aio_finished = -1; /* for debugging - can be printed in gdb */
+ BufferTag origTag = buf_desc->tag; /* original identity of selected buffer */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ BufFlags flags_on_entry; /* for debugging - can be printed in gdb */
+ int freeNext_on_entry; /* for debugging - can be printed in gdb */
+ bool disposable_pin = false; /* this backend had a disposable pin on entry or pins the buffer while waiting for aio_read to complete */
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
+
+ int aio_successful = -1; /* did the aio_read succeed ? -1 = no aio, 0 unsuccessful , 1 successful */
+ int local_intention = intention; /* copy of intention which in one special case below may be set differently to intention */
+
+ if (!spinLockHeld) {
+ /* lock buffer header */
+ LockBufHdr(buf_desc);
+ }
+
+ flags_on_entry = buf_desc->flags;
+ freeNext_on_entry = buf_desc->freeNext;
+ pin_already_banked_by_me =
+ ( (flags_on_entry & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (flags_on_entry & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - freeNext_on_entry))->pidOfAio )
+ : (-(freeNext_on_entry)) ) == this_backend_pid )
+ );
+
+ if (pin_already_banked_by_me) {
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* but do we actually have a pin ?? */
+                       /* this is an anomalous situation - somehow our disposable pin was lost without us noticing.
+                       ** If AIO is in progress and we started it,
+                       ** then this is disastrous - two backends might both issue IO on the same buffer.
+                       ** Otherwise, it is harmless, and simply means we have no disposable pin,
+                       ** but we must update flags to "notice" the fact now
+ */
+ if (flags_on_entry & BM_AIO_IN_PROGRESS) {
+ elog(ERROR, "BufCheckAsync : AIO control block issuer of aio_read lost pin with BM_AIO_IN_PROGRESS on buffer %d rel=%s, blockNum=%u, flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ } else {
+ elog(LOG, "BufCheckAsync : AIO control block issuer of aio_read lost pin on buffer %d rel=%s, blockNum=%u, with flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* since AIO not in progress, disconnect the buffer from banked pin */
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ pin_already_banked_by_me = false;
+ }
+ } else {
+ disposable_pin = true;
+ }
+ }
+
+ /* the case of BUF_INTENTION_REJECT_UNBANK is handled specially :
+ ** if this backend has a banked pin, then proceed just as for BUF_INTENTION_REJECT_FORGET
+ ** else the call is a no-op -- unlock buf header and return immediately
+ */
+
+ if (intention == BUF_INTENTION_REJECT_UNBANK) {
+ if (pin_already_banked_by_me) {
+ local_intention = BUF_INTENTION_REJECT_FORGET;
+ } else {
+ goto unlock_buf_header; /* code following the unlock will do nothing since local_intention still set to BUF_INTENTION_REJECT_UNBANK */
+ }
+ }
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* we do not expect that BM_AIO_IN_PROGRESS is set without freeNext identifying the BAiocb */
+ if ( (buf_desc->flags & BM_AIO_IN_PROGRESS) && (buf_desc->freeNext == FREENEXT_NOT_IN_LIST) ) {
+
+ elog(ERROR, "BufCheckAsync : found BM_AIO_IN_PROGRESS without a BAiocb on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ }
+ /* check whether aio in progress */
+ if ( ( (struct BAiocbAnchor *)0 != BAiocbAnchr )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) /* has a valid BAiocb */
+ && ((FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext) < numBufferAiocbs) /* double-check */
+ ) { /* this is aio */
+ struct BufferAiocb volatile * BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext); /* BufferAiocb associated with this aio */
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) { /* ensure BAiocb is occupied */
+ aio_successful = 0; /* tentatively the aio_read did not succeed */
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+
+ if (smgr == NULL) {
+ if (caller_reln == NULL) {
+ smgr = smgropen(buf_desc->tag.rnode, InvalidBackendId);
+ } else {
+ smgr = caller_reln->rd_smgr;
+ }
+ }
+
+ /* assert that this AIO is not using the same BufferAiocb as the one caller asked us to use */
+ if ((index_for_aio < 0) && (index_for_aio == buf_desc->freeNext)) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block index %d to be used by %p already in use by %p"
+ ,index_for_aio, buf_desc, BAiocb->BAiocbbufh)));
+ }
+
+               /* Call smgrcompleteaio only if either we want the buffer or there are no dependents.
+               ** In the other case (reject while dependents remain),
+               ** one of the dependents will do it.
+ */
+ if ( (local_intention > 0) || (0 == BAiocb->BAiocbDependentCount) ) {
+ if (local_intention > 0) {
+                       /* wait for in-progress aio and then pin;
+                       ** OR, if I did not issue the aio and do not hold a pin,
+                       ** then pin now, before waiting, to ensure the buffer does not become unpinned while I wait.
+                       ** We may potentially wait for the io to complete,
+                       ** so release the buf header lock so that others may also wait here
+ */
+ BAiocb->BAiocbDependentCount++; /* register self as dependent */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* if this buffer not pinned by me */
+ disposable_pin = true; /* this backend has pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ } else {
+ UnlockBufHdr(buf_desc);
+ }
+ LWLockRelease(PartitionLock);
+
+ smgrcompleteaio_rc = 1 /* tell smgrcompleteaio to wait */
+ + ( BAiocb->pidOfAio == this_backend_pid ); /* and whether I initiated the aio */
+ } else {
+ smgrcompleteaio_rc = 0; /* tell smgrcompleteaio to cancel */
+ }
+
+ smgrcompleteaio( smgr , (char *)&(BAiocb->BAiocbthis) , &smgrcompleteaio_rc
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on BAiocb lock instead of polling */
+ , (void *)&(BAiocb->BAiocbIolockItem)
+#endif /* USE_AIO_SIGEVENT */
+ );
+ if ( (smgrcompleteaio_rc == 0) || (smgrcompleteaio_rc == 1) ) {
+ aio_successful = 1;
+ }
+
+ /* statistics */
+ if (local_intention > 0) {
+ if (smgrcompleteaio_rc == 0) {
+ /* completed successfully and did not have to wait */
+ pgBufferUsage.aio_read_ontime++;
+ } else if (smgrcompleteaio_rc == 1) {
+ /* completed successfully and did have to wait */
+ pgBufferUsage.aio_read_waited++;
+ } else {
+                               /* bad news - the read failed and so the buffer is not usable.
+                               ** The buffer is still pinned, so unpin it and proceed with the "not found" case
+ */
+ pgBufferUsage.aio_read_failed++;
+ }
+
+ /* regain locks and handle the validity of the buffer and intention regarding it */
+ LWLockAcquire(PartitionLock, LW_SHARED);
+ LockBufHdr(buf_desc);
+ BAiocb->BAiocbDependentCount--; /* unregister self as dependent */
+ } else {
+ pgBufferUsage.aio_read_wasted++; /* regardless of whether aio_successful */
+ }
+
+
+ if (local_intention > 0) {
+ /* verify the buffer is still ours and has same identity
+ ** There is one slightly tricky point here -
+                       ** if there are other dependents, then each of them will perform this same check.
+                       ** This is unavoidable, as the correct setting of retcode and the BM_VALID flag
+                       ** is required by each dependent, so we may not leave it to the last one to do it.
+                       ** It should not do any harm, and it is easier to let them all do it than to try to avoid it.
+ */
+ if ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) { /* it is still mine */
+
+ if (aio_successful) {
+ /* validate page header. If valid, then mark the buffer as valid */
+ if (PageIsVerified((Page)(BufHdrGetBlock(buf_desc)) , ((BAiocb->BAiocbthis).aio_offset/BLCKSZ))) {
+ buf_desc->flags |= BM_VALID;
+ if (BUFFERTAGS_EQUAL(origTag , buf_desc->tag)) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else {
+ retcode = BUF_INTENT_RC_CHANGED_TAG;
+ }
+ } else {
+ retcode = BUF_INTENT_RC_BADPAGE;
+ }
+ }
+ }
+ }
+
+ BAiocbDependentCount_after_aio_finished = BAiocb->BAiocbDependentCount;
+
+ /* if no dependents, then disconnect the BAiocb and update buffer header */
+ if (BAiocbDependentCount_after_aio_finished == 0 ) {
+
+
+ /* return the BufferAiocb to the free list */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
+
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on BAiocb lock instead of polling */
+ // FIXME the BAiocb must not be freed for reuse until we are certain
+ // that the aio completion signal has been delivered
+ // and the BAiocbIolock_chain_item placed on *our*
+ // chain of awaiting-release LWLock ptrs.
+ // Reason is not because this backend needs that item,
+ // but because it must not be placed on some other backend's chain.
+#endif /* USE_AIO_SIGEVENT */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+
+ }
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ /* note whether buffer is valid before unlocking spinlock */
+ valid = ((buf_desc->flags & BM_VALID) != 0);
+
+ /* if there was a disposable pin on entry to this function (i.e. marked in buffer flags)
+ ** then unmark it - refer to prologue comments talking about :
+ ** if a disposable pin is held, then :
+ ** ...
+ ** i.e. in either case, there is no longer a disposable pin after this function has completed.
+ */
+ if (pin_already_banked_by_me) {
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* if AIO not in progress, then disconnect the buffer from BAiocb and/or banked pin */
+ if (!(buf_desc->flags & BM_AIO_IN_PROGRESS)) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ }
+ /********** for debugging *****************
+ else elog(LOG, "BufCheckAsync : found BM_AIO_IN_PROGRESS when redeeming banked pin on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ ********** for debugging *****************/
+ }
+
+ /* If we are to obtain new pin, then use pin optimization - pin and unlock.
+ ** However, if the caller is the same backend who issued the aio_read,
+ ** then he ought to have obtained the pin at that time and must not acquire
+ ** a "second" one since this is logically the same read - he would have obtained
+ ** a single pin if using synchronous read and we emulate that behaviour.
+       ** It's important to understand that the caller is not aware that he already obtained a pin -
+ ** because calling PrefetchBuffer did not imply a pin -
+ ** so we must track that via the pidOfAio field in the BAiocb.
+ ** And to add one further complication :
+ ** we assume that although PrefetchBuffer pinned the buffer,
+ ** it did not increment the usage count.
+ ** (because it called PinBuffer_Locked which does not do that)
+ ** so in this case, we must increment the usage count without double-pinning.
+       ** yes, it's ugly - and there's a goto!
+ */
+ if ( (local_intention > 0)
+ || (local_intention == BUF_INTENTION_REJECT_OBTAIN_PIN)
+ ) {
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* here we really want a version of PinBuffer_Locked which updates usage count ... */
+ if ( (PrivateRefCount[buf_desc->buf_id] == 0) /* if this buffer not previously pinned by me */
+ || pin_already_banked_by_me /* or I had a disposable pin on entry */
+ ) {
+ if (strategy == NULL)
+ {
+ if (buf_desc->usage_count < BM_MAX_USAGE_COUNT)
+ buf_desc->usage_count++;
+ }
+ else
+ {
+ if (buf_desc->usage_count == 0)
+ buf_desc->usage_count = 1;
+ }
+ }
+
+ /* now pin buffer unless we have a disposable */
+ if (!disposable_pin) { /* this backend neither banked pin for aio nor pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ goto unlocked_it;
+ }
+ else
+ /* if this task previously issued the aio or pinned the buffer while waiting for aio_read to complete
+ ** and aio was unsuccessful, then release the pin
+ */
+ if ( disposable_pin
+ && (aio_successful == 0) /* aio_read failed ? */
+ ) {
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+
+ unlock_buf_header:
+ UnlockBufHdr(buf_desc);
+ unlocked_it:
+
+ /* now do any requested pin (if not done immediately above) or unpin/forget */
+ if (local_intention == BUF_INTENTION_REJECT_KEEP_PIN) {
+ /* the caller is supposed to hold a pin already so there should be nothing to do ... */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) {
+ elog(LOG, "request to keep pin on unpinned buffer %d",buf_desc->buf_id);
+
+ valid = PinBuffer(buf_desc, strategy);
+ }
+ }
+ else
+ if ( ( (local_intention == BUF_INTENTION_REJECT_FORGET)
+ || (local_intention == BUF_INTENTION_REJECT_NOADJUST)
+ )
+ && (PrivateRefCount[buf_desc->buf_id] > 0) /* if this buffer was previously pinned by me ... */
+ ) {
+
+ if (local_intention == BUF_INTENTION_REJECT_FORGET) {
+ UnpinBuffer(buf_desc, true); /* ... then release the pin */
+ } else
+ if (local_intention == BUF_INTENTION_REJECT_NOADJUST) {
+ /* following code moved from ReleaseBuffer :
+ ** not sure why we can't simply UnpinBuffer(buf_desc, true)
+ ** but better leave it the way it was
+ */
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf_desc));
+ if (PrivateRefCount[buf_desc->buf_id] > 1) {
+ PrivateRefCount[buf_desc->buf_id]--;
+ } else {
+ UnpinBuffer(buf_desc, false);
+ }
+ }
+ }
+
+ /* if retcode has not been set to one of the unusual conditions
+ ** namely failed header validity or tag changed
+ ** then the setting of valid takes precedence
+ ** over whatever retcode may be currently set to.
+ */
+ if ( ( (retcode == BUF_INTENT_RC_INVALID_NO_AIO) || (retcode == BUF_INTENT_RC_INVALID_AIO) ) && valid) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else
+ if ((retcode == BUF_INTENT_RC_VALID) && (!valid)) {
+ if (aio_successful == -1) { /* aio not attempted */
+ retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ } else {
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+ }
+ }
+
+ return retcode;
+}
--- src/backend/storage/buffer/buf_init.c.orig 2014-06-25 16:37:59.457618893 -0400
+++ src/backend/storage/buffer/buf_init.c 2014-06-25 18:10:51.132521363 -0400
@@ -13,15 +13,193 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
-
+#include <stdlib.h> /* for getenv() */
+#include <errno.h> /* for strtoul() */
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+#include <sched.h>
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
BufferDesc *BufferDescriptors;
char *BufferBlocks;
-int32 *PrivateRefCount;
+int32 *PrivateRefCount; /* array of counts per buffer of how many times this task has pinned this buffer */
+
+volatile struct BAiocbAnchor *BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+
+int CountInuseBAiocbs(void); /* keep compiler happy */
+void ReportFreeBAiocbs(void); /* keep compiler happy */
+
+extern int MaxConnections; /* max number of client connections which postmaster will allow */
+int numBufferAiocbs = 0; /* total number of BufferAiocbs in pool (0 <=> no async io) */
+int hwmBufferAiocbs = 0; /* high water mark of in-use BufferAiocbs in pool
+ ** (not required to be accurate, kindly maintained for us somehow by postmaster)
+ */
+
+#ifdef USE_PREFETCH
+unsigned int prefetch_dbOid = 0; /* database oid of relations on which prefetching to be done - 0 means all */
+unsigned int prefetch_bitmap_scans = 1; /* boolean whether to prefetch bitmap heap scans */
+unsigned int prefetch_heap_scans = 0; /* boolean whether to prefetch non-bitmap heap scans */
+unsigned int prefetch_sequential_index_scans = 0; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+unsigned int prefetch_index_scans = 256;       /* whether to prefetch non-bitmap index scans; also the numeric size of pfch_list */
+unsigned int prefetch_btree_heaps = 1; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+#endif /* USE_PREFETCH */
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int maxGetBAiocbTries = 1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = 1; /* max times we will try to release a BufferAiocb back to freelist */
+
+/* locking protocol for manipulating the BufferAiocbs and FreeBAiocbs list :
+** 1. ownership of a BufferAiocb :
+** to gain ownership of a BufferAiocb, a task must
+** EITHER remove it from FreeBAiocbs (it is now temporary owner and no other task can find it)
+** if decision is to attach it to a buffer descriptor header, then
+** . lock the buffer descriptor header
+** . check NOT flags & BM_AIO_IN_PROGRESS
+** . attach to buffer descriptor header
+** . increment the BufferAiocb.dependent_count
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to unlock
+**      OR locate it by dereferencing the pointer in a buffer descriptor,
+** in which case :
+** . lock the buffer descriptor header
+** . check flags & BM_AIO_IN_PROGRESS
+** . increment the BufferAiocb.dependent_count
+** . if decision is to return to FreeBAiocbs,
+** then (with buffer descriptor header still locked)
+** . turn off BM_AIO_IN_PROGRESS
+** . IF the BufferAiocb.dependent_count == 1 (I am sole dependent)
+** . THEN
+** . . decrement the BufferAiocb.dependent_count
+** . return to FreeBAiocbs (see below)
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to either return to FreeBAiocbs or unlock
+** 2. adding and removing from FreeBAiocbs :
+** two alternative methods - controlled by conditional macro definition LOCK_BAIOCB_FOR_GET_REL
+** 2.1 LOCK_BAIOCB_FOR_GET_REL is defined - use a lock
+** . lock BufFreelistLock exclusive
+** . add / remove from FreeBAiocbs
+** . unlock BufFreelistLock exclusive
+** advantage of this method - never fails to add or remove
+** 2.2 LOCK_BAIOCB_FOR_GET_REL is not defined - use compare_and_swap
+** . retrieve the current Freelist pointer and validate
+** . compare_and_swap on/off the FreeBAiocb list
+**       . if the compare_and_swap fails, retry a bounded number of times
+**       advantage of this method  -  never waits
+**       to avoid losing a free BAiocb, save it in a process-local cache and reuse it
+*/
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ /* signal environment to invoke handlaiosignal */
+sigset_t emptymask, blockaiomask, tempmask;
+struct sigaction chsigaction , olsigAIOaction;
+
+/* the following lock anchor is defined in globals.c
+ * because it is referenced inside the CHECK_FOR_INTERRUPTS macro
+*/
+struct BAiocbIolock_chain_item volatile * volatile BAiocbIolock_anchor = (struct BAiocbIolock_chain_item*)0; /* anchor for chain of awaiting-release LWLock ptrs */
+
+/* ProcessPendingBAiocbIolocks is the companion to handlaiosignal;
+ * it is invoked from CHECK_FOR_INTERRUPTS to check whether there are any
+ * pending BAiocbIolockItems (representing just-completed aio operations)
+*/
+void ProcessPendingBAiocbIolocks(void);
+void ProcessPendingBAiocbIolocks() {
+ int cs_rc;
+ struct BAiocbIolock_chain_item volatile * volatile BAiocbIolockItemP; /* first item on chain */
+ struct BAiocbIolock_chain_item volatile * volatile instantaneous_BAiocbIolock_next; /* next item on chain */
+ BAiocbIolockItemP = BAiocbIolock_anchor; /* current first item on chain */
+ while (BAiocbIolockItemP != (struct BAiocbIolock_chain_item*)0) /* there is an item */
+ {
+ instantaneous_BAiocbIolock_next = BAiocbIolockItemP->next; /* next item on chain */
+ cs_rc = (__sync_bool_compare_and_swap (&(BAiocbIolock_anchor), BAiocbIolockItemP, instantaneous_BAiocbIolock_next)); /* swap it off chain */
+ if (cs_rc) {
+
+ /* there are two places where originator might release the BAiocb waiter-lock :
+ * . here (always called)
+ * . in completeaio (if originator actually calls completeaio)
+ * so each case must first check whether lock has already been released.
+ * Fortunately there is no possibility that both cases execute simultaneously
+ * because they are both performed in main-line (non-thread, non-interrupt) code
+ * by same process.
+ */
+ if (LWLockHeldByMe((struct LWLock *)(BAiocbIolockItemP->BAiocbIolock))) { /* still waiter-locked */
+ LWLockRelease((struct LWLock *)BAiocbIolockItemP->BAiocbIolock); /* removed item so release lock */
+ BAiocbIolockItemP->next = (struct BAiocbIolock_chain_item*)0;
+ }
+ } else {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ BAiocbIolockItemP = BAiocbIolock_anchor; /* current first item on chain */
+ }
+}
+
+ /* handlaiosignal is invoked whenever io completion signal is issued.
+ * siginfP->si_value.sival_ptr contains the address of the LWlock
+ * on which non-originator waiters wait and which we are to release
+ */
+void handlaiosignal(int sigsent, siginfo_t * siginfP, void *ucontext);
+void handlaiosignal(int sigsent, siginfo_t * siginfP, void *ucontext)
+{
+ if (sigsent == AIO_SIGEVENT_SIGNALNUM) { /* aio completion */
+ /* wake up any non-originator waiters */
+ /* we would like to issue a ...
+ LWLockRelease((LWLock *)siginfP->si_value.sival_ptr);
+ ... here but that fails as the lwlock code is not re-entrant.
+ So instead, hang the LWLockptr off the chain of awaiting-release
+ LWLock ptrs using atomic compare_and_swap ad infinitum.
+ */
+ struct BAiocbIolock_chain_item volatile * BAiocbIolockItemP = ((struct BAiocbIolock_chain_item*)siginfP->si_value.sival_ptr); /* address the item */
+ struct BAiocbIolock_chain_item volatile * anyItemP; /* for running the chain */
+ int cs_rc = 0;
+
+ /* two sanity checks :
+ * . next ptr of this item should be 0
+ * . anchor must not -> this item
+ */
+ if (BAiocbIolockItemP->next == (struct BAiocbIolock_chain_item*)0) {
+ /* run the chain and check that my item is not yet on it */
+ anyItemP = BAiocbIolock_anchor; /* current first item on chain */
+ while ( (anyItemP != (struct BAiocbIolock_chain_item*)0) /* not reached end */
+ && (anyItemP != BAiocbIolockItemP) /* not my item */
+ ) {
+ anyItemP = anyItemP->next;
+ }
+ if (anyItemP == (struct BAiocbIolock_chain_item*)0) {
+ BAiocbIolockItemP->next = BAiocbIolock_anchor; /* current first item on chain */
+ cs_rc = (__sync_bool_compare_and_swap (&(BAiocbIolock_anchor), BAiocbIolockItemP->next, BAiocbIolockItemP)); /* swap it onto chain */
+ while (!cs_rc) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ anyItemP = BAiocbIolock_anchor; /* current first item on chain */
+ while ( (anyItemP != (struct BAiocbIolock_chain_item*)0) /* not reached end */
+ && (anyItemP != BAiocbIolockItemP) /* not my item */
+ ) {
+ anyItemP = anyItemP->next;
+ }
+ if (anyItemP == (struct BAiocbIolock_chain_item*)0) {
+ BAiocbIolockItemP->next = BAiocbIolock_anchor; /* current first item on chain */
+ cs_rc = (__sync_bool_compare_and_swap (&(BAiocbIolock_anchor), BAiocbIolockItemP->next, BAiocbIolockItemP)); /* swap it onto chain */
+ }
+ else abort();
+ }
+ }
+ else abort();
+ }
+ else abort();
+ }
+}
+#endif /* USE_AIO_SIGEVENT */
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ struct BAiocbAnchor dummy_BAiocbAnchr = { (struct BufferAiocb*)0 , (struct BufferAiocb*)0 };
+int maxGetBAiocbTries = -1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = -1; /* max times we will try to release a BufferAiocb back to freelist */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
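
The handler/drain pair above (handlaiosignal / ProcessPendingBAiocbIolocks) follows a common
pattern: the signal handler may not call the non-reentrant LWLock code, so it only
compare-and-swaps a small item onto a chain, and main-line code drains the chain later.
A minimal standalone sketch of that pattern follows; the names (pending_item, on_completion,
drain_pending) are invented, and a printf stands in for the LWLockRelease done by the patch.

    #include <signal.h>
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>

    struct pending_item
    {
        struct pending_item *next;
        int                  completed_io_id;  /* stands in for the LWLock to release */
    };

    static struct pending_item *volatile pending_anchor = NULL;

    /* signal handler: restricted to loads, stores and compare-and-swap */
    static void
    on_completion(int signo, siginfo_t *info, void *ucontext)
    {
        struct pending_item *item = (struct pending_item *) info->si_value.sival_ptr;

        do
        {
            item->next = pending_anchor;
        } while (!__sync_bool_compare_and_swap(&pending_anchor, item->next, item));

        (void) signo;
        (void) ucontext;
    }

    /* called from main-line code (the patch does this from CHECK_FOR_INTERRUPTS) */
    static void
    drain_pending(void)
    {
        struct pending_item *item;

        while ((item = pending_anchor) != NULL)
        {
            if (__sync_bool_compare_and_swap(&pending_anchor, item, item->next))
                printf("releasing waiters for io %d\n", item->completed_io_id);
            else
                sched_yield();              /* racing drainer; reload and try again */
        }
    }

    int
    main(void)
    {
        struct sigaction    sa;
        struct pending_item it;
        union sigval        val;

        it.next = NULL;
        it.completed_io_id = 42;

        sa.sa_sigaction = on_completion;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        val.sival_ptr = &it;
        sigqueue(getpid(), SIGUSR1, val);   /* simulate the aio completion signal */

        while (pending_anchor == NULL)      /* normally delivered immediately */
            sched_yield();
        drain_pending();
        return 0;
    }
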
/*
* Data Structures:
@@ -73,7 +251,16 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ , foundAiocbs
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ ;
+#if defined(USE_PREFETCH) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+ char *envvarpointer = (char *)0; /* might point to an environment variable string */
+ char *charptr;
+#endif /* USE_PREFETCH || USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
BufferDescriptors = (BufferDesc *)
ShmemInitStruct("Buffer Descriptors",
@@ -83,6 +270,152 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+ if (max_async_io_prefetchers < 0) { /* negative value indicates to initialize to something sensible during buf_init */
+ max_async_io_prefetchers = MaxConnections/6; /* default allows for average of MaxConnections/6 concurrent prefetchers - reasonable ??? */
+ }
+
+ if ((target_prefetch_pages > 0) && (max_async_io_prefetchers > 0)) {
+ int ix;
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb * volatile FreeBAiocbs;
+
+ numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers); /* target_prefetch_pages per prefetcher */
+ BAiocbAnchr = (struct BAiocbAnchor *)
+ ShmemInitStruct("Buffer Aiocbs",
+ sizeof(struct BAiocbAnchor) + (numBufferAiocbs * sizeof(struct BufferAiocb)), &foundAiocbs);
+ if (BAiocbAnchr) {
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs = (struct BufferAiocb*)(((char *)BAiocbAnchr) + sizeof(struct BAiocbAnchor));
+ FreeBAiocbs = (struct BufferAiocb*)0;
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbnext = FreeBAiocbs; /* init the free list, last one -> 0 */
+ (BufferAiocbs+ix)->BAiocbbufh = (struct sbufdesc*)0;
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0;
+ (BufferAiocbs+ix)->pidOfAio = 0;
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on BAiocb lock instead of polling */
+ (BufferAiocbs+ix)->BAiocbIolockItem.BAiocbIolock = LWLockAssign();
+ (BufferAiocbs+ix)->BAiocbIolockItem.next = (struct BAiocbIolock_chain_item*)0; /* initially not on a chain */
+#endif /* USE_AIO_SIGEVENT */
+ FreeBAiocbs = (BufferAiocbs+ix);
+
+ }
+ BAiocbAnchr->FreeBAiocbs = FreeBAiocbs;
+ envvarpointer = getenv("PG_MAX_GET_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxGetBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ envvarpointer = getenv("PG_MAX_REL_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxRelBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+
+#ifdef USE_AIO_SIGEVENT /* signal waiters instead of letting them poll */
+ /* set our signal environment to invoke handlaiosignal */
+ sigemptyset(&emptymask);
+ sigemptyset(&blockaiomask);
+ sigaddset(&blockaiomask, AIO_SIGEVENT_SIGNALNUM);
+ sigprocmask(SIG_SETMASK,&emptymask,0); /* normal mask allows all signals */
+
+ /* chsigaction.sa_handler = (void(*)(int, siginfo_t *, void *ucontext))&handlaiosignal; */
+ chsigaction.sa_sigaction = (void(*)(int, siginfo_t *, void *ucontext))&handlaiosignal; /* call for siginfo */
+ chsigaction.sa_mask = emptymask;
+ chsigaction.sa_flags = SA_SIGINFO; /* call for siginfo */
+ sigaction(AIO_SIGEVENT_SIGNALNUM, &chsigaction, &olsigAIOaction);
+#endif /* USE_AIO_SIGEVENT */
+ }
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdefd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ BAiocbAnchr = &dummy_BAiocbAnchr;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
+#ifdef USE_PREFETCH
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BITMAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_ISCAN");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_index_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_index_scans = 1;
+ } else
+ if ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) ) {
+ prefetch_index_scans = strtol(envvarpointer, &charptr, 10);
+ if (charptr && (',' == *charptr)) { /* optional sequential prefetch in index scans */
+ charptr++; /* following the comma ... */
+ if ( ('Y' == *charptr) || ('y' == *charptr) || ('1' == *charptr) ) {
+ prefetch_sequential_index_scans = 1;
+ }
+ }
+ }
+       /* if prefetching for ISCAN, then we require size of pfch_list to be at least target_prefetch_pages */
+ if ( (prefetch_index_scans > 0)
+ && (prefetch_index_scans < target_prefetch_pages)
+ ) {
+ prefetch_index_scans = target_prefetch_pages;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BTREE");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_btree_heaps = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_btree_heaps = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_HEAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_heap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_heap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_PREFETCH_DBOID");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ errno = 0; /* required in order to distinguish error from 0 */
+ prefetch_dbOid = (unsigned int)strtoul((const char *)envvarpointer, 0, 10);
+ if (errno) {
+ prefetch_dbOid = 0;
+ }
+ }
+ elog(LOG, "prefetching initialised with target_prefetch_pages= %d "
+ ", max_async_io_prefetchers= %d implying aio concurrency= %d "
+ ", prefetching_for_bitmap= %s "
+ ", prefetching_for_heap= %s "
+ ", prefetching_for_iscan= %d with sequential_index_page_prefetching= %s "
+ ", prefetching_for_btree= %s"
+ ,target_prefetch_pages ,max_async_io_prefetchers ,numBufferAiocbs
+ ,(prefetch_bitmap_scans ? "Y" : "N")
+ ,(prefetch_heap_scans ? "Y" : "N")
+ ,prefetch_index_scans
+ ,(prefetch_sequential_index_scans ? "Y" : "N")
+ ,(prefetch_btree_heaps ? "Y" : "N")
+ );
+#endif /* USE_PREFETCH */
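
For example, starting the postmaster with PG_TRY_PREFETCHING_FOR_ISCAN=256,y and
PG_PREFETCH_DBOID=16384 in its environment would enable index-scan prefetching with a
pfch_list of 256 entries, turn on sequential index-page prefetching, and restrict
prefetching to the database with oid 16384 (the oid here is purely illustrative).  The
variables are read once, in InitBufferPool, so they need to be visible to the server
process that runs it - normally the environment from which the postmaster is started.
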
+
+
if (foundDescs || foundBufs)
{
/* both should be present or neither */
@@ -176,3 +509,82 @@ BufferShmemSize(void)
return size;
}
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/* imprecise count of number of in-use BAiocbs at any time
+ * we scan the array read-only without latching so are subject to unstable result
+ * (but since the array is in well-known contiguous storage,
+ * we are not subject to segmentation violation)
+ * This function may be called at any time and just does its best;
+ * it returns the count of what it observed.
+ */
+int
+CountInuseBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ int count = 0;
+ int ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->BufferAiocbs; /* start of list */
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == (BAiocb+ix)->BAiocbnext) { /* not on freelist ? */
+ count++;
+ }
+ }
+ }
+ return count;
+}
+
+/*
+ * report how many free BAiocbs at shutdown
+ * DO NOT call this while backends are actively working!!
+ * this report is useful when compare_and_swap method used (see above)
+ * as it can be used to deduce how many BAiocbs were in process-local caches -
+ * (original_number_on_freelist_at_startup - this_reported_number_at_shutdown)
+ */
+void
+ReportFreeBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ volatile struct BufferAiocb *BufferAiocbs;
+ int count = 0;
+ int fx , ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->FreeBAiocbs; /* start of free list */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0; /* use this as marker for finding it on freelist */
+ }
+ for (fx = (numBufferAiocbs-1); ( (fx>=0) && ( BAiocb != (struct BufferAiocb*)0 ) ); fx--) {
+
+ /* check if it is a valid BufferAiocb */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((BufferAiocbs+ix) == BAiocb) { /* is it this one ? */
+ break;
+ }
+ }
+ if (ix >= 0) {
+ if (BAiocb->BAiocbDependentCount) { /* seen it already ? */
+ elog(LOG, "ReportFreeBAiocbs closed cycle on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ BAiocb->BAiocbDependentCount = 1; /* use this as marker for finding it on freelist */
+ count++;
+ BAiocb = BAiocb->BAiocbnext;
+ } else {
+ elog(LOG, "ReportFreeBAiocbs invalid item on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ }
+ }
+ elog(LOG, "ReportFreeBAiocbs AIO control block list : poolsize= %d in-use-hwm= %d final-free= %d" ,numBufferAiocbs , hwmBufferAiocbs , count);
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
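
The bounded, self-validating walk in ReportFreeBAiocbs can be illustrated standalone as
well.  The sketch below uses invented names; the visited field plays the role that
ReportFreeBAiocbs gives to BAiocbDependentCount.  It bounds the walk by the pool size,
rejects links that are not pool elements, and uses the per-element marker to detect a
closed cycle.

    #include <stdio.h>

    #define NELEM 8

    struct node
    {
        struct node *next;
        int          visited;            /* marker, cleared before each walk */
    };

    static struct node pool[NELEM];

    static int
    count_free(struct node *head)
    {
        int count = 0;
        int steps;
        int ix;

        for (ix = 0; ix < NELEM; ix++)
            pool[ix].visited = 0;

        for (steps = 0; steps < NELEM && head != NULL; steps++)
        {
            for (ix = 0; ix < NELEM; ix++)   /* is head really a pool element ? */
                if (&pool[ix] == head)
                    break;
            if (ix == NELEM)
            {
                fprintf(stderr, "invalid item %p on freelist\n", (void *) head);
                break;
            }
            if (head->visited)
            {
                fprintf(stderr, "closed cycle detected at %p\n", (void *) head);
                break;
            }
            head->visited = 1;
            count++;
            head = head->next;
        }
        return count;
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < NELEM - 1; i++)
            pool[i].next = &pool[i + 1];
        pool[NELEM - 1].next = NULL;
        printf("free count: %d\n", count_free(&pool[0]));
        return 0;
    }
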
--- src/backend/storage/smgr/md.c.orig 2014-06-25 16:37:59.465618895 -0400
+++ src/backend/storage/smgr/md.c 2014-06-25 18:10:51.144521410 -0400
@@ -647,6 +647,78 @@ mdprefetch(SMgrRelation reln, ForkNumber
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * mdinitaio() -- init the aio subsystem max number of threads and max number of requests
+ */
+void
+mdinitaio(int max_aio_threads, int max_aio_num)
+{
+ FileInitaio( max_aio_threads, max_aio_num );
+}
+
+/*
+ * mdstartaio() -- start aio read of the specified block of a relation
+ */
+void
+mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+#ifdef USE_PREFETCH
+ off_t seekpos;
+ MdfdVec *v;
+ int local_retcode;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+
+ seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ local_retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ , aiocbp
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+ if (retcode) {
+ *retcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+
+
+/*
+ * mdcompleteaio() -- complete aio read of the specified block of a relation
+ * on entry, *inoutcode should indicate :
+ * . non-0 <=> check if complete and wait if not
+ * . 0 <=> cancel io immediately
+ */
+void
+mdcompleteaio( char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+#ifdef USE_PREFETCH
+ int local_retcode;
+
+ local_retcode = FileCompleteaio(aiocbp, (inoutcode ? *inoutcode : 0)
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+ if (inoutcode) {
+ *inoutcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
/*
* mdread() -- Read the specified block from a relation.
*/
--- src/backend/storage/smgr/smgr.c.orig 2014-06-25 16:37:59.465618895 -0400
+++ src/backend/storage/smgr/smgr.c 2014-06-25 18:10:51.164521488 -0400
@@ -49,6 +49,20 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ void (*smgr_initaio) (int max_aio_threads, int max_aio_num);
+ void (*smgr_startaio) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+ void (*smgr_completeaio) ( char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
@@ -66,7 +80,11 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+ mdprefetch
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ,mdinitaio, mdstartaio, mdcompleteaio
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ , mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
mdpreckpt, mdsync, mdpostckpt
}
};
@@ -612,6 +630,51 @@ smgrprefetch(SMgrRelation reln, ForkNumb
(*(smgrsw[reln->smgr_which].smgr_prefetch)) (reln, forknum, blocknum);
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * smgrinitaio() -- initialize the aio subsystem's maximum number of threads and requests
+ */
+void
+smgrinitaio(int max_aio_threads, int max_aio_num)
+{
+ (*(smgrsw[0].smgr_initaio)) ( max_aio_threads, max_aio_num );
+}
+
+/*
+ * smgrstartaio() -- Initiate aio read of the specified block of a relation.
+ */
+void
+smgrstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ (*(smgrsw[reln->smgr_which].smgr_startaio)) (reln, forknum, blocknum , aiocbp , retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+}
+
+/*
+ * smgrcompleteaio() -- Complete aio read of the specified block of a relation.
+ */
+void
+smgrcompleteaio(SMgrRelation reln, char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ (*(smgrsw[reln->smgr_which].smgr_completeaio)) ( aiocbp , inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
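On the f_smgr/smgrsw[] changes above: the storage manager dispatches through a table of function pointers, so the new entry points must appear in the initializer at the same positions, inside the same #ifdef, as the members added to the struct. A minimal standalone sketch of that pattern with a conditionally compiled member (hypothetical names, not PostgreSQL code):

#include <stdio.h>

#define HAVE_ASYNC_IO 1   /* stand-in for USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */

typedef struct io_ops
{
    void (*op_prefetch) (int blockno);
#if HAVE_ASYNC_IO
    void (*op_startaio) (int blockno);   /* member compiled in or out */
#endif
    void (*op_read)     (int blockno);
} io_ops;

static void md_prefetch(int blockno) { printf("prefetch %d\n", blockno); }
static void md_startaio(int blockno) { printf("startaio %d\n", blockno); }
static void md_read(int blockno)     { printf("read %d\n", blockno); }

/* initializer order must match member order, including the #if'd member */
static const io_ops table[] = {
    { md_prefetch,
#if HAVE_ASYNC_IO
      md_startaio,
#endif
      md_read }
};

int main(void)
{
    table[0].op_prefetch(7);
#if HAVE_ASYNC_IO
    table[0].op_startaio(7);
#endif
    table[0].op_read(7);
    return 0;
}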
/*
* smgrread() -- read a particular block from a relation into the supplied
* buffer.
--- src/backend/storage/file/fd.c.orig 2014-06-25 16:37:59.457618893 -0400
+++ src/backend/storage/file/fd.c 2014-06-25 18:10:51.180521551 -0400
@@ -77,6 +77,9 @@
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* We must leave some file descriptors free for system(), the dynamic loader,
@@ -1239,6 +1242,10 @@ FileClose(File file)
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
* to read into.
+ * Also note that a new, different implementation of asynchronous prefetch
+ * using librt, not libaio, is provided by the two functions following this one,
+ * FileStartaio and FileCompleteaio. These also require a buffer to be provided
+ * to read into, which the new async_io support provides.
*/
int
FilePrefetch(File file, off_t offset, int amount)
@@ -1266,6 +1273,224 @@ FilePrefetch(File file, off_t offset, in
#endif
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * FileInitaio - initialize the aio subsystem's maximum number of threads and requests
+ * input parameters:
+ * max_aio_threads: maximum number of threads
+ * max_aio_num: maximum number of concurrent aio read requests
+ *
+ * on Linux, the man page for the librt implementation of aio_init() says:
+ * This function is a GNU extension.
+ * If your posix aio does not have it, then add the following line to
+ * src/include/pg_config_manual.h
+ * #define DONT_HAVE_AIO_INIT
+ * to render it as a no-op
+ */
+void
+FileInitaio(int max_aio_threads, int max_aio_num )
+{
+#ifndef DONT_HAVE_AIO_INIT
+ struct aioinit aioinit_struct; /* structure to pass to aio_init */
+
+ aioinit_struct.aio_threads = max_aio_threads; /* maximum number of threads */
+ aioinit_struct.aio_num = max_aio_num; /* maximum number of concurrent aio read requests */
+ aioinit_struct.aio_idle_time = 1; /* we don't want to alter this, but aio_init does not ignore it, so set it to the default */
+ aio_init(&aioinit_struct);
+#endif /* ndef DONT_HAVE_AIO_INIT */
+ return;
+}
+
+/*
+ * FileStartaio - initiate asynchronous read of a given range of the file.
+ * The logical seek position is unaffected.
+ *
+ * use standard posix aio (librt)
+ * ASSUME the caller has already set BufferAiocb.aio_buf to point to the buffer
+ * return 0 if successfully started, else non-zero
+ */
+int
+FileStartaio(File file, off_t offset, int amount , char *aiocbp
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ int returnCode;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartaio: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode >= 0) {
+
+ my_aiocbp->aio_fildes = VfdCache[file].fd;
+ my_aiocbp->aio_lio_opcode = LIO_READ;
+ my_aiocbp->aio_nbytes = amount;
+ my_aiocbp->aio_offset = offset;
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ my_aiocbp->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
+ my_aiocbp->aio_sigevent.sigev_signo = AIO_SIGEVENT_SIGNALNUM;
+ my_aiocbp->aio_sigevent.sigev_value.sival_ptr = BAiocbIolockaiocbp; /* address of BAiocbIolock_chain_item
+ containing LWlock on which non-originator waiters may wait */
+ /* set the next ptr and take the exclusive lock before the aio because the signal handler may be called before return from the aio_read */
+ ((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->next = (struct BAiocbIolock_chain_item*)0;
+ if (!LWLockHeldByMe((struct LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock))) { /* not waiter-locked */
+ LWLockAcquire( (LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock), LW_EXCLUSIVE );
+ }
+#endif /* USE_AIO_SIGEVENT */
+ returnCode = aio_read(my_aiocbp);
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ if (returnCode != 0) {
+ /* the BAiocbIolock waiter-lock in this BAiocb should be currently locked,
+ * but best to check.
+ */
+ if (LWLockHeldByMe((struct LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock))) { /* waiter-locked */
+ /* the aio_read failed, so no completion signal will ever release the lock - release it here ourselves */
+ LWLockRelease( (LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock) );
+ }
+ }
+#endif /* USE_AIO_SIGEVENT */
+ }
+
+ return returnCode;
+}
+
+/*
+ * FileCompleteaio - complete asynchronous aio read
+ * normal_wait indicates whether to cancel or wait -
+ * 0 <=> cancel
+ * 1 <=> wait by polling the aiocb or waiting on lock
+ * 2 <=> wait by suspending on the aiocb
+ *
+ * use standard posix aio (librt)
+ * return 0 if successful and did not have to wait,
+ * 1 if successful and had to wait,
+ * else 0xff
+ */
+int
+FileCompleteaio( char *aiocbp , int normal_wait
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ int returnCode;
+ int aio_errno;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+ const struct aiocb *cblist[1];
+ int fd;
+ struct timespec my_timeout = { 0 , 10000 };
+ struct timespec *suspend_timeout_P; /* the timeout actually used depending on normal_wait */
+ int max_polls;
+
+ fd = my_aiocbp->aio_fildes;
+ cblist[0] = my_aiocbp;
+
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ /* if we are non-originator, then we must now wait on the BAiocb waiter-lock,
+ * even if the aio has completed.
+ * Reason:
+ * it is possible that this backend will eventually be the one to free the BAiocb
+ * At the time that a BAiocb is freed, there must not be a pending LWLockRelease,
+ * i.e. no pending queued completion events.
+ * waiting on the waiter-lock must be done only when *not* holding the buff-header spinlock,
+ * otherwise there is a risk of deadlock.
+ * Adding these conditions together ==>> we must wait now to ensure the LWLockRelease has been done
+ */
+ if (normal_wait == 1) { /* told to wait on lock i.e. non-originator */
+ /* wait to acquire shared use of LWlock until aio completes */
+ LWLockAcquire( (LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock), LW_SHARED);
+ LWLockRelease( (LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock));
+ }
+#endif /* USE_AIO_SIGEVENT */
+
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* note that aio_error returns 0 if op already completed successfully */
+
+ /* first handle normal case of waiting for op to complete */
+ if (normal_wait) {
+ /* if told not to poll, then specify no timeout */
+ suspend_timeout_P = ( (normal_wait == 1) ? &my_timeout : (struct timespec *)0 );
+
+ while (aio_errno == EINPROGRESS) {
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ /* when using the aio_sigevent and normal_wait == 1,
+ * it should be impossible for the aio to be still in-progress at this point
+ * since we have already waited for originator to post completion via signal and LWlock
+ */
+ if (normal_wait == 1) { /* told to wait on lock */
+ /* should this be WARNING (since we can continue) or ERROR (since something bad happened)? */
+ elog(WARNING, "FileCompleteaio: aio unexpectedly in-progress after completion signalled for aiocb at %p LWlock_item at %p LWlock at %p\n"
+ ,aiocbp ,BAiocbIolockaiocbp ,((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock);
+ }
+#endif /* USE_AIO_SIGEVENT */
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ while ( (returnCode < 0) && (max_polls-- > 0)
+ && ((EAGAIN == errno) || (EINTR == errno))
+ ) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ }
+
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* now returnCode is from aio_error */
+ if (returnCode == 0) {
+ returnCode = 1; /* successful but had to wait */
+ }
+ }
+ if (aio_errno) {
+ elog(LOG, "FileCompleteaio: %d %d", fd, returnCode);
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ /* if this bkend waiter-locked the waiters-lock exclusive, then release it now */
+ if ( (normal_wait == 2) /* told to suspend */
+ /* there are two places where originator might release the BAiocb waiter-lock :
+ * . here
+ * . in ProcessPendingBAiocbIolocks (always called)
+ * so each case must first check whether lock has already been released.
+ * Fortunately there is no possibility that both cases execute simultaneously
+ * because they are both performed in main-line (non-thread, non-interrupt) code
+ * by same process.
+ */
+ && (LWLockHeldByMe((struct LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock))) /* still waiter-locked */
+ ) {
+ LWLockRelease( (LWLock *)(((struct BAiocbIolock_chain_item*)BAiocbIolockaiocbp)->BAiocbIolock));
+ }
+#endif /* USE_AIO_SIGEVENT */
+ } else {
+ if (aio_errno == EINPROGRESS) {
+ do {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ while ((returnCode == AIO_NOTCANCELED) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ } while (aio_errno == EINPROGRESS);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ if (returnCode != 0)
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+ DO_DB(elog(LOG, "FileCompleteaio: %d %d",
+ fd, returnCode));
+
+ return returnCode;
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
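For readers unfamiliar with the librt calls behind FileStartaio()/FileCompleteaio(), here is a minimal standalone sketch of the same start/wait/complete sequence against an ordinary file, using only the POSIX aio_read/aio_error/aio_suspend/aio_return interface. The file name and buffer size are arbitrary, and this is not patch code; on glibc it typically needs -lrt at link time:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    struct aiocb cb;
    const struct aiocb *list[1];
    int fd = open("/etc/hostname", O_RDONLY);   /* arbitrary example file */

    if (fd < 0)
        return 1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0)                     /* start the read, as FileStartaio does */
        return 1;

    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)       /* wait for completion, as FileCompleteaio does */
        aio_suspend(list, 1, NULL);

    if (aio_error(&cb) == 0)
        printf("read %zd bytes\n", aio_return(&cb));

    close(fd);
    return 0;
}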
int
FileRead(File file, char *buffer, int amount)
{
--- src/backend/storage/lmgr/lwlock.c.orig 2014-06-25 16:37:59.461618894 -0400
+++ src/backend/storage/lmgr/lwlock.c 2014-06-25 18:10:51.192521598 -0400
@@ -45,6 +45,11 @@
#include "utils/hsearch.h"
#endif
+/*
+ * GUC and GUC-derived parameters
+ */
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+extern int target_prefetch_pages; /* How many buffers PrefetchBuffer callers should try to stay ahead of their ReadBuffer calls by */
/* We use the ShmemLock spinlock to protect LWLockAssign */
extern slock_t *ShmemLock;
@@ -231,6 +236,11 @@ NumLWLocks(void)
/* bufmgr.c needs two for each shared buffer */
numLocks += 2 * NBuffers;
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+ /* bufmgr needs one for each BAiocb */
+ numLocks += max_async_io_prefetchers * target_prefetch_pages;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP && USE_AIO_SIGEVENT */
+
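A quick worked example of the reservation added to NumLWLocks() above, with assumed (not default) settings, just to show its scale:

#include <stdio.h>

int main(void)
{
    /* assumed example settings, not defaults */
    int max_async_io_prefetchers = 8;    /* prefetching backends            */
    int target_prefetch_pages    = 32;   /* prefetch distance per backend   */

    /* extra BAiocb LWLocks reserved by the change above: 8 * 32 = 256 */
    printf("extra BAiocb LWLocks: %d\n",
           max_async_io_prefetchers * target_prefetch_pages);
    return 0;
}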
/* proc.c needs one for each backend or auxiliary process */
numLocks += MaxBackends + NUM_AUXILIARY_PROCS;
--- src/backend/storage/lmgr/proc.c.orig 2014-06-25 16:37:59.461618894 -0400
+++ src/backend/storage/lmgr/proc.c 2014-06-25 18:10:51.204521645 -0400
@@ -52,6 +52,7 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
+extern pid_t this_backend_pid; /* pid of this backend */
/* GUC variables */
int DeadlockTimeout = 1000;
@@ -361,6 +362,7 @@ InitProcess(void)
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
+ this_backend_pid = getpid(); /* pid of this backend */
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
--- src/backend/access/heap/heapam.c.orig 2014-06-25 16:37:59.361618874 -0400
+++ src/backend/access/heap/heapam.c 2014-06-25 18:10:51.248521818 -0400
@@ -71,6 +71,28 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "executor/instrument.h"
+
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_heap_scans; /* boolean whether to prefetch non-bitmap heap scans */
+
+/* special values for scan->rs_prefetch_target indicating as follows : */
+#define PREFETCH_MAYBE 0xffffffff /* prefetch permitted but not yet in effect */
+#define PREFETCH_DISABLED 0xfffffffe /* prefetch disabled and not permitted */
+/* PREFETCH_WRAP_POINT indicates a prefetcher that has reached the point where the scan would wrap -
+** at this point the prefetcher runs on the spot until scan catches up.
+** This *must* be < maximum valid setting of target_prefetch_pages aka effective_io_concurrency.
+*/
+#define PREFETCH_WRAP_POINT 0x0fffffff
+
+#endif /* USE_PREFETCH */
+
/* GUC variable */
bool synchronize_seqscans = true;
@@ -115,6 +137,10 @@ static XLogRecPtr log_heap_new_cid(Relat
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
bool *copy);
+#ifdef USE_PREFETCH
+static void heap_unread_add(HeapScanDesc scan, BlockNumber blockno);
+static void heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno);
+#endif /* USE_PREFETCH */
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -292,9 +318,150 @@ initscan(HeapScanDesc scan, ScanKey key,
* Currently, we don't have a stats counter for bitmap heap scans (but the
* underlying bitmap index scans will be counted).
*/
- if (!scan->rs_bitmapscan)
+#ifdef USE_PREFETCH
+ /* by default, no prefetching on any scan */
+ scan->rs_prefetch_target = PREFETCH_DISABLED; /* tentatively disable */
+ scan->rs_pfchblock = 0; /* scanner will reset this to be ahead of scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)0; /* list of prefetched but unread blocknos */
+ scan->rs_Unread_Pfetched_next = 0; /* next unread blockno */
+ scan->rs_Unread_Pfetched_count = 0; /* number of valid unread blocknos */
+#endif /* USE_PREFETCH */
+ if (!scan->rs_bitmapscan) {
+
pgstat_count_heap_scan(scan->rs_rd);
+#ifdef USE_PREFETCH
+ /* bitmap scans do their own prefetching -
+ ** for others, set up prefetching now
+ */
+ if ( prefetch_heap_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(scan->rs_rd))
+ ) {
+ /* prefetch_dbOid may be set to a database Oid to specify only prefetch in that db */
+ if ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ ) {
+ scan->rs_prefetch_target = PREFETCH_MAYBE; /* permitted but let the scan decide */
+ }
+ else {
+ }
}
+#endif /* USE_PREFETCH */
+ }
+}
+
+#ifdef USE_PREFETCH
+/* add this blockno to list of prefetched and unread blocknos
+** use the slot identified by the ((next+count) modulo circumference) index if it is unused,
+** else search for the first available slot if there is one,
+** else error.
+*/
+static void
+heap_unread_add(HeapScanDesc scan, BlockNumber blockno)
+{
+ BlockNumber *available_P; /* where to store new blockno */
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next
+ + scan->rs_Unread_Pfetched_count; /* index of next unused slot */
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if (blockno != InvalidBlockNumber) {
+
+ /* ensure there is some room somewhere */
+ if (scan->rs_Unread_Pfetched_count < target_prefetch_pages) {
+
+ /* try the "next+count" one */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages; /* modulo circumference */
+ }
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ goto store_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ /* before storing this blockno,
+ ** since the next pointer did not locate an unused slot,
+ ** set it to one which is more likely to be so for the next time
+ */
+ scan->rs_Unread_Pfetched_next = Unread_Pfetched_index;
+ goto store_blockno;
+ }
+ }
+ }
+ }
+
+ /* if we reach here, either there was no available slot
+ ** or we thought there was one and didn't find any --
+ */
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("heap_unread_add overflowed list, cannot add blockno %u", blockno)));
+
+ return;
+
+ store_blockno:
+ *available_P = blockno;
+ scan->rs_Unread_Pfetched_count++; /* update count */
+
+ }
+
+ return;
+}
+
+/* remove specified blockno from list of prefetched and unread blocknos.
+** Usually this will be found at the rs_Unread_Pfetched_next item -
+** else search for it. If not found, ignore it - no error results.
+*/
+static void
+heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno)
+{
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next; /* index of next unread blockno */
+ BlockNumber *candidate_P; /* location of caller's blockno - maybe */
+ BlockNumber nextUnreadPfetched;
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if ( (blockno != InvalidBlockNumber)
+ && ( scan->rs_Unread_Pfetched_count > 0 ) /* if the list is not empty */
+ ) {
+
+ /* take modulo of the circumference.
+ ** actually rs_Unread_Pfetched_next should never exceed the circumference but check anyway.
+ */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages;
+}
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);
+ nextUnreadPfetched = *candidate_P;
+
+ if ( nextUnreadPfetched == blockno ) {
+ goto remove_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* candidate location */
+ if (*candidate_P == blockno) { /* found it */
+ goto remove_blockno;
+ }
+ }
+ }
+
+ remove_blockno:
+ *candidate_P = InvalidBlockNumber;
+
+ scan->rs_Unread_Pfetched_next = (Unread_Pfetched_index+1); /* update next pfchd unread */
+ if (scan->rs_Unread_Pfetched_next >= target_prefetch_pages) {
+ scan->rs_Unread_Pfetched_next = 0;
+ }
+ scan->rs_Unread_Pfetched_count--; /* update count */
+ }
+
+ return;
+}
+#endif /* USE_PREFETCH */
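The two functions above maintain what amounts to a small fixed-size ring of block numbers, with InvalidBlockNumber marking free slots. A simplified standalone sketch of that bookkeeping, covering only the common fast path (hypothetical names, no slow-path search, not patch code):

#include <stdio.h>

#define RING_SIZE     8            /* stand-in for target_prefetch_pages */
#define INVALID_BLOCK 0xFFFFFFFFu  /* stand-in for InvalidBlockNumber    */

static unsigned int ring[RING_SIZE];
static unsigned int next_slot = 0;  /* index of oldest unread entry */
static unsigned int count = 0;      /* number of occupied slots     */

static void ring_add(unsigned int blockno)
{
    if (count < RING_SIZE)
    {
        unsigned int ix = (next_slot + count) % RING_SIZE;  /* next free slot */
        ring[ix] = blockno;
        count++;
    }
}

static void ring_remove(unsigned int blockno)
{
    if (count > 0 && ring[next_slot] == blockno)  /* common case: oldest entry */
    {
        ring[next_slot] = INVALID_BLOCK;
        next_slot = (next_slot + 1) % RING_SIZE;
        count--;
    }
}

int main(void)
{
    unsigned int b;

    for (b = 100; b < 104; b++)
        ring_add(b);        /* prefetched blocks 100..103 */
    ring_remove(100);       /* block 100 has now been read */
    printf("%u unread prefetched blocks remain\n", count);   /* prints 3 */
    return 0;
}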
/*
* heapgetpage - subroutine for heapgettup()
@@ -304,7 +471,7 @@ initscan(HeapScanDesc scan, ScanKey key,
* which tuples on the page are visible.
*/
static void
-heapgetpage(HeapScanDesc scan, BlockNumber page)
+heapgetpage(HeapScanDesc scan, BlockNumber page , BlockNumber prefetchHWM)
{
Buffer buffer;
Snapshot snapshot;
@@ -314,6 +481,10 @@ heapgetpage(HeapScanDesc scan, BlockNumb
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
+#ifdef USE_PREFETCH
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+#endif /* USE_PREFETCH */
+
Assert(page < scan->rs_nblocks);
@@ -336,6 +507,98 @@ heapgetpage(HeapScanDesc scan, BlockNumb
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
+#ifdef USE_PREFETCH
+
+ heap_unread_subtract(scan, page);
+
+ /* maybe prefetch some pages starting with rs_pfchblock */
+ if (scan->rs_prefetch_target >= 0) { /* prefetching enabled on this scan ? */
+ int next_block_to_be_read = (page+1); /* next block to be read = lowest possible prefetchable block */
+ int num_to_pfch_this_time; /* eventually holds the number of blocks to prefetch now */
+ int prefetchable_range; /* size of the area ahead of the current prefetch position */
+
+ /* check if prefetcher reached wrap point and the scan has now wrapped */
+ if ( (page == 0) && (scan->rs_prefetch_target == PREFETCH_WRAP_POINT) ) {
+ scan->rs_prefetch_target = 1;
+ scan->rs_pfchblock = next_block_to_be_read;
+ } else
+ if (scan->rs_pfchblock < next_block_to_be_read) {
+ scan->rs_pfchblock = next_block_to_be_read; /* next block to be prefetched must be ahead of one we just read */
+ }
+
+ /* now we know where we would start prefetching -
+ ** next question - if this is a sync scan, ensure we do not prefetch behind the HWM
+ ** debatable whether to require strict inequality or >= - >= works better in practice
+ */
+ if ( (!scan->rs_syncscan) || (scan->rs_pfchblock >= prefetchHWM) ) {
+
+ /* now we know where we will start prefetching -
+ ** next question - how many?
+ ** apply two limits :
+ ** 1. target prefetch distance
+ ** 2. number of available blocks ahead of us
+ */
+
+ /* 1. target prefetch distance */
+ num_to_pfch_this_time = next_block_to_be_read + scan->rs_prefetch_target; /* page beyond prefetch target */
+ num_to_pfch_this_time -= scan->rs_pfchblock; /* convert to offset */
+
+ /* first do prefetching up to our current limit ...
+ ** highest page number that a scan (pre)-fetches is scan->rs_nblocks-1
+ ** note - prefetcher does not wrap a prefetch range -
+ ** instead just stop and then start again if and when main scan wraps
+ */
+ if (scan->rs_pfchblock <= scan->rs_startblock) { /* if on second leg towards startblock */
+ prefetchable_range = ((int)(scan->rs_startblock) - (int)(scan->rs_pfchblock));
+ }
+ else { /* on first leg towards nblocks */
+ prefetchable_range = ((int)(scan->rs_nblocks) - (int)(scan->rs_pfchblock));
+ }
+ if (prefetchable_range > 0) { /* if there's a range to prefetch */
+
+ /* 2. number of available blocks ahead of us */
+ if (num_to_pfch_this_time > prefetchable_range) {
+ num_to_pfch_this_time = prefetchable_range;
+ }
+ while (num_to_pfch_this_time-- > 0) {
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_pfchblock, scan->rs_strategy);
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ if (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) {
+ heap_unread_add(scan, scan->rs_pfchblock);
+ }
+ scan->rs_pfchblock++;
+ /* if syncscan and requested block was already in buffer pool,
+ ** this suggests that another scanner is ahead of us and we should advance
+ */
+ if ( (scan->rs_syncscan) && (PrefetchBufferRc & PREFTCHRC_BLK_ALREADY_PRESENT) ) {
+ scan->rs_pfchblock++;
+ num_to_pfch_this_time--;
+ }
+ }
+ }
+ else {
+ /* we must not modify scan->rs_pfchblock here
+ ** because it is needed for possible DiscardBuffer at end of scan ...
+ ** ... instead ...
+ */
+ scan->rs_prefetch_target = PREFETCH_WRAP_POINT; /* mark this prefetcher as waiting to wrap */
+ }
+
+ /* ... then adjust prefetching limit : by doubling on each iteration */
+ if (scan->rs_prefetch_target == 0) {
+ scan->rs_prefetch_target = 1;
+ }
+ else {
+ scan->rs_prefetch_target *= 2;
+ if (scan->rs_prefetch_target > target_prefetch_pages) {
+ scan->rs_prefetch_target = target_prefetch_pages;
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
+
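The rs_prefetch_target adjustment at the end of the block above ramps the prefetch distance up geometrically and clamps it. A small worked example of that schedule, assuming target_prefetch_pages = 32 (in core it is derived from effective_io_concurrency); not patch code:

#include <stdio.h>

int main(void)
{
    int target_prefetch_pages = 32;   /* assumed cap for this example */
    int distance = 1;                 /* rs_prefetch_target starts at 1 */
    int page;

    /* distance per heapgetpage() call: 1, 2, 4, 8, 16, 32, 32, ... */
    for (page = 0; page < 8; page++)
    {
        printf("page %d: prefetch distance %d\n", page, distance);
        distance *= 2;
        if (distance > target_prefetch_pages)
            distance = target_prefetch_pages;
    }
    return 0;
}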
if (!scan->rs_pageatatime)
return;
@@ -452,6 +715,10 @@ heapgettup(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+#ifdef USE_PREFETCH
+ int ix;
+#endif /* USE_PREFETCH */
/*
* calculate next starting lineoff, given scan direction
@@ -470,7 +737,25 @@ heapgettup(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineoff = FirstOffsetNumber; /* first offnum */
scan->rs_inited = true;
}
@@ -516,7 +801,7 @@ heapgettup(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -557,7 +842,7 @@ heapgettup(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -660,8 +945,12 @@ heapgettup(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+#ifdef USE_PREFETCH
+ prefetchHWM = scan->rs_pfchblock;
+#endif /* USE_PREFETCH */
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -671,6 +960,22 @@ heapgettup(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -678,7 +983,7 @@ heapgettup(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
@@ -727,6 +1032,10 @@ heapgettup_pagemode(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+#ifdef USE_PREFETCH
+ int ix;
+#endif /* USE_PREFETCH */
/*
* calculate next starting lineindex, given scan direction
@@ -745,7 +1054,25 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineindex = 0;
scan->rs_inited = true;
}
@@ -788,7 +1115,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -826,7 +1153,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -921,8 +1248,12 @@ heapgettup_pagemode(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+#ifdef USE_PREFETCH
+ prefetchHWM = scan->rs_pfchblock;
+#endif /* USE_PREFETCH */
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -932,6 +1263,22 @@ heapgettup_pagemode(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -939,7 +1286,7 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
dp = (Page) BufferGetPage(scan->rs_cbuf);
lines = scan->rs_ntuples;
@@ -1394,6 +1741,23 @@ void
heap_rescan(HeapScanDesc scan,
ScanKey key)
{
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1418,6 +1782,23 @@ heap_endscan(HeapScanDesc scan)
{
/* Note: no locking manipulations needed */
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1435,6 +1816,12 @@ heap_endscan(HeapScanDesc scan)
if (scan->rs_strategy != NULL)
FreeAccessStrategy(scan->rs_strategy);
+#ifdef USE_PREFETCH
+ if (scan->rs_Unread_Pfetched_base) {
+ pfree(scan->rs_Unread_Pfetched_base);
+ }
+#endif /* USE_PREFETCH */
+
if (scan->rs_temp_snap)
UnregisterSnapshot(scan->rs_snapshot);
@@ -1464,7 +1851,6 @@ heap_endscan(HeapScanDesc scan)
#define HEAPDEBUG_3
#endif /* !defined(HEAPDEBUGALL) */
-
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
@@ -6349,6 +6735,25 @@ heap_markpos(HeapScanDesc scan)
void
heap_restrpos(HeapScanDesc scan)
{
+
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* XXX no amrestrpos checking that ammarkpos called */
if (!ItemPointerIsValid(&scan->rs_mctid))
--- src/backend/access/heap/syncscan.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/heap/syncscan.c 2014-06-25 18:10:51.264521880 -0400
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locatio
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, Block
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumbe
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumbe
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNu
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
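The prefetchHWM handling added to ss_search() above boils down to a publish-or-adopt rule between backends sharing a sync scan: whoever has prefetched furthest publishes its position, everyone else adopts it. A standalone sketch of just that rule (hypothetical names, not patch code):

#include <stdio.h>

#define INVALID_BLOCK 0xFFFFFFFFu   /* stand-in for InvalidBlockNumber */

/* If the caller is ahead of the shared high-water mark, publish its position;
 * otherwise report the shared position back so the caller can skip ahead.
 */
static unsigned int
reconcile_hwm(unsigned int *shared_hwm, unsigned int mine)
{
    if (*shared_hwm == INVALID_BLOCK || *shared_hwm < mine)
        *shared_hwm = mine;          /* we are the front-runner: update HWM */
    else
        mine = *shared_hwm;          /* someone else is ahead: adopt theirs */
    return mine;
}

int main(void)
{
    unsigned int shared = INVALID_BLOCK;
    unsigned int a = reconcile_hwm(&shared, 100);   /* backend A publishes 100 */
    unsigned int b = reconcile_hwm(&shared, 40);    /* backend B adopts 100    */

    printf("A=%u B=%u shared=%u\n", a, b, shared);
    return 0;
}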
--- src/backend/access/index/indexam.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/index/indexam.c 2014-06-25 18:10:51.276521927 -0400
@@ -79,6 +79,55 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit);
+
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch bitmap heap scans */
+
+/* if specified block number is present in the prefetch array,
+** then either mark it as not to be discarded or evict it according to input param
+*/
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ /* no need to check for scan->pfch_next < prefetch_index_scans
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) == blocknumber) {
+ if (markit) {
+ /* mark it as not to be discarded */
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard &= ~PREFTCHRC_BUF_PIN_INCREASED;
+ } else {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)), (&(((scan->pfch_block_item_list)+pfchz)->pfch_blockid)));
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard = ((scan->pfch_block_item_list)+pfchz)->pfch_discard;
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx = pfchz; /* advance */
+ }
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
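The eviction path in index_mark_or_evict_block() above removes an entry from the small dense prefetch array by sliding the later entries left. A simplified standalone sketch of that step (hypothetical names; the real code also keeps pfch_next pointing at the right element while it shifts):

#include <stdio.h>

static unsigned int blocks[8] = { 10, 11, 12, 13 };  /* prefetched block numbers */
static unsigned int used = 4;                        /* entries currently in use */

static void evict(unsigned int blockno)
{
    unsigned int i, j;

    for (i = 0; i < used; i++)
    {
        if (blocks[i] == blockno)
        {
            for (j = i; j + 1 < used; j++)
                blocks[j] = blocks[j + 1];   /* shuffle followers to the left */
            used--;
            return;
        }
    }
}

int main(void)
{
    unsigned int i;

    evict(11);
    for (i = 0; i < used; i++)
        printf("%u ", blocks[i]);            /* prints: 10 12 13 */
    printf("\n");
    return 0;
}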
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -253,6 +302,11 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -277,6 +331,11 @@ index_beginscan_bitmap(Relation indexRel
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -311,6 +370,9 @@ index_beginscan_internal(Relation indexR
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -342,6 +404,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -373,10 +441,30 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (((scan->pfch_block_item_list)+pfchx)->pfch_discard) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)));
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -472,6 +560,12 @@ index_getnext_tid(IndexScanDesc scan, Sc
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -479,6 +573,11 @@ index_getnext_tid(IndexScanDesc scan, Sc
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -502,6 +601,10 @@ index_getnext_tid(IndexScanDesc scan, Sc
* enough information to do it efficiently in the general case.
* ----------------
*/
+#if defined(USE_PREFETCH) && defined(AVOID_CATALOG_MIGRATION_FOR_ASYNCIO)
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -509,16 +612,111 @@ index_fetch_heap(IndexScanDesc scan)
bool all_dead = false;
bool got_heap_tuple;
+
+
/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
if (!scan->xs_continue_hot)
{
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+
+ /* If the old block is different from new block, then evict old
+ ** block from prefetched array. It is arguable we should leave it
+ ** in the array because it's likely to remain in the buffer pool
+ ** for a while, but in that case , if we encounter the block
+ ** again, prefetching it again does no harm.
+ ** (and note that, if it's not pinned, prefetching it will try to
+ ** pin it since prefetch tries to bank a pin for a buffer in the buffer pool).
+ ** therefore it should usually win.
+ */
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < prefetch_index_scans) /* ensure there is an entry */
+ ) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(prev_buf) , 0);
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+ /* If the new block had been prefetched and pinned,
+ ** then mark that it no longer requires to be discarded.
+ ** Of course, we don't evict the entry,
+ ** because we want to remember that it was recently prefetched.
+ */
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 1);
+#endif /* USE_PREFETCH */
+
+ scan->heap_tids_fetched++;
+
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found; /* did we find the "next" heap tid in current index page */
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+
+ if (scan->do_prefetch) {
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple; /* is incorrect but avoids adding function to catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ if (RegProcedureIsValid(scan->indexRelation->rd_am->ampeeknexttuple)) {
+ GET_SCAN_PROCEDURE(ampeeknexttuple); /* is correct but requires adding function to catalog */
+ } else {
+ procedure = 0;
+ }
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
+ if ( procedure /* does the index access method support peektuple? */
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ && procedure->fn_addr /* procedure->fn_addr is non-null only if in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ ) {
+ int iterations = 1; /* how many iterations of prefetching shall we try -
+ ** if the number of used entries in the prefetch list is < target_prefetch_pages
+ ** then 2, else 1
+ ** this should result in gradually and smoothly increasing up to target_prefetch_pages
+ */
+ /* note we trust InitIndexScan verified this scan is forwards only and so set that */
+ if (scan->pfch_used < target_prefetch_pages) {
+ iterations = 2;
+ }
+ do {
+ found = DatumGetBool(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ btpeeknexttuple(scan) /* pass scan as direct parameter since we can't use fmgr because it is not in the catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ FunctionCall1(procedure, PointerGetDatum(scan)) /* use fmgr to call it because in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ );
+ if (found) {
+ /* btpeeknexttuple set pfch_next to point to the item in block_item_list to be prefetched */
+ PrefetchBufferRc = PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((&((scan->pfch_block_item_list + scan->pfch_next))->pfch_blockid)) , 0);
+ /* elog(LOG,"index_fetch_heap prefetched rel %u blockNum %u"
+ ,scan->heapRelation->rd_node.relNode ,BlockIdGetBlockNumber(scan->pfch_block_item_list + scan->pfch_next));
+ */
+
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ (scan->pfch_block_item_list + scan->pfch_next)->pfch_discard = (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED);
+
+
+ }
+ } while (--iterations > 0);
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
--- src/backend/access/index/genam.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/index/genam.c 2014-06-25 18:10:51.296522007 -0400
@@ -77,6 +77,12 @@ RelationGetIndexScan(Relation indexRelat
scan = (IndexScanDesc) palloc(sizeof(IndexScanDescData));
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
+
scan->heapRelation = NULL; /* may be set later */
scan->indexRelation = indexRelation;
scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */
@@ -139,6 +145,19 @@ RelationGetIndexScan(Relation indexRelat
void
IndexScanEnd(IndexScanDesc scan)
{
+#ifdef USE_PREFETCH
+ if (scan->do_prefetch) {
+ if ( (struct pfch_block_item*)0 != scan->pfch_block_item_list ) {
+ pfree(scan->pfch_block_item_list);
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+ }
+ if ( (struct pfch_index_pagelist*)0 != scan->pfch_index_page_list ) {
+ pfree(scan->pfch_index_page_list);
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
if (scan->keyData != NULL)
pfree(scan->keyData);
if (scan->orderByData != NULL)
--- src/backend/access/nbtree/nbtsearch.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/nbtree/nbtsearch.c 2014-06-25 18:10:51.328522131 -0400
@@ -23,13 +23,18 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_btree_heaps; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+extern unsigned int prefetch_sequential_index_scans; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+#endif /* USE_PREFETCH */
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
+static Buffer _bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -226,7 +231,11 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
continue;
}
@@ -1005,7 +1014,7 @@ _bt_first(IndexScanDesc scan, ScanDirect
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1040,6 +1049,8 @@ _bt_next(IndexScanDesc scan, ScanDirecti
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1052,11 +1063,56 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+#ifdef USE_PREFETCH
+ /* consider prefetching */
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on next page, providing :
+ ** EITHER . we were reading non-sequentially previously or are doing so for this block
+ ** OR . user explicitly specified to prefetch for sequential pattern
+ ** as it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
}
else
{
@@ -1065,11 +1121,56 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+#ifdef USE_PREFETCH
+ /* consider prefetching */
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+ /* start prefetch on next page, providing :
+ ** EITHER . we were reading non-sequentially previously or are doing so for this block
+ ** OR . user explicitly specified to prefetch for sequential pattern
+ ** as it may be counterproductive otherwise
+ */
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
}
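The "~94%" figure in the comments of both branches above comes from the integer test (seen - seen/16) <= fetched, i.e. fetched must be at least 15/16 = 93.75% of the TIDs seen. A tiny standalone check of that arithmetic:

#include <stdio.h>

int main(void)
{
    unsigned long seen    = 1000;
    unsigned long fetched = 938;     /* 93.8% of the TIDs seen were fetched */

    /* same integer test as above: passes once fetched/seen >= 15/16 (93.75%) */
    if ((seen - seen / 16) <= fetched)
        printf("prefetching heap pages (%lu/%lu fetched)\n", fetched, seen);
    else
        printf("not prefetching\n");
    return 0;
}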
/* OK, itemIndex says what to return */
@@ -1119,9 +1220,11 @@ _bt_readpage(IndexScanDesc scan, ScanDir
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1156,6 +1259,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1187,6 +1291,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1224,7 +1329,7 @@ _bt_saveitem(BTScanOpaque so, int itemIn
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1278,7 +1383,11 @@ _bt_steppage(IndexScanDesc scan, ScanDir
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ
+#ifdef USE_PREFETCH
+ ,scan->pfch_index_page_list
+#endif /* USE_PREFETCH */
+ );
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1287,9 +1396,22 @@ _bt_steppage(IndexScanDesc scan, ScanDir
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+#ifdef USE_PREFETCH
+ if ( prefetch && so->currPos.moreRight
+ /* start prefetch on next page, provided that :
+ ** EITHER . we're reading this block non-sequentially
+ ** OR . the user explicitly asked to prefetch even for a sequential pattern,
+ ** since prefetching a sequential pattern may otherwise be counterproductive
+ */
+ && (prefetch_sequential_index_scans || opaque->btpo_next != (blkno+1))
+ ) {
+ _bt_prefetchbuf(rel, opaque->btpo_next , &scan->pfch_index_page_list);
+ }
+#endif /* USE_PREFETCH */
break;
}
+ }
/* nope, keep going */
blkno = opaque->btpo_next;
}
@@ -1317,7 +1439,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ so->currPos.buf = _bt_walk_left(scan , rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -1332,14 +1454,60 @@ _bt_steppage(IndexScanDesc scan, ScanDir
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+#ifdef USE_PREFETCH
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and extend the prefetch window blindly
+ * downwards, two blocks at a time. This only works in our favor
+ * for index-only scans, by letting the kernel merge read requests,
+ * so we want to inflate target_prefetch_pages since merged
+ * back-sequential requests cost about the same as a single one
+ */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev , &scan->pfch_index_page_list);
+ }
+ }
+#endif /* USE_PREFETCH */
break;
}
}
}
+ }
return true;
}
@@ -1359,7 +1527,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
@@ -1387,7 +1555,11 @@ _bt_walk_left(Relation rel, Buffer buf)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ
+#ifdef USE_PREFETCH
+ , scan->pfch_index_page_list
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1631,7 +1803,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDir
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
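
A note on the heap-prefetch trigger in _bt_next above: the test
(heap_tids_seen - heap_tids_seen/16) <= heap_tids_fetched fires once at
least 15/16 (about 94%) of the TIDs seen so far have actually needed a heap
fetch, and only after a sample of 256 TIDs. A minimal standalone sketch of
that arithmetic, purely illustrative and not part of the patch:

#include <stdbool.h>

/* mirrors the trigger test in _bt_next: prefetch heap pages once the
 * observed heap-fetch ratio reaches roughly 15/16, after a 256-TID sample */
static bool
heap_prefetch_wanted(unsigned int tids_seen, unsigned int tids_fetched)
{
    if (tids_seen <= 256)
        return false;
    return (tids_seen - tids_seen / 16) <= tids_fetched;
}

For example, with 1000 TIDs seen, prefetching starts once 938 of them have
required heap fetches.
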
--- src/backend/access/nbtree/nbtinsert.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/nbtree/nbtinsert.c 2014-06-25 18:10:51.364522273 -0400
@@ -793,7 +793,11 @@ _bt_insertonpg(Relation rel,
{
Assert(!P_ISLEAF(lpageop));
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -972,7 +976,11 @@ _bt_split(Relation rel, Buffer buf, Buff
bool isleaf;
/* Acquire a new page to split into */
- rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1175,7 +1183,11 @@ _bt_split(Relation rel, Buffer buf, Buff
if (!P_RIGHTMOST(oopaque))
{
- sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
spage = BufferGetPage(sbuf);
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -1817,7 +1829,11 @@ _bt_finish_split(Relation rel, Buffer lb
Assert(P_INCOMPLETE_SPLIT(lpageop));
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
@@ -1829,7 +1845,11 @@ _bt_finish_split(Relation rel, Buffer lb
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1877,7 +1897,11 @@ _bt_getstackbuf(Relation rel, BTStack st
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2008,12 +2032,20 @@ _bt_newroot(Relation rel, Buffer lbuf, B
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
--- src/backend/access/nbtree/nbtpage.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/nbtree/nbtpage.c 2014-06-25 18:10:51.376522320 -0400
@@ -127,7 +127,11 @@ _bt_getroot(Relation rel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -153,7 +157,11 @@ _bt_getroot(Relation rel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -209,7 +217,11 @@ _bt_getroot(Relation rel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -350,7 +362,11 @@ _bt_gettrueroot(Relation rel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -436,7 +452,11 @@ _bt_getrootheight(Relation rel)
Page metapg;
BTPageOpaque metaopaque;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -561,6 +581,172 @@ _bt_log_reuse_page(Relation rel, BlockNu
END_CRIT_SECTION();
}
+#ifdef USE_PREFETCH
+/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ * and keep track of prefetched and unread blocknums in pagelist.
+ * input parameters :
+ * rel and blkno identify the block to be prefetched, as usual
+ * pfch_index_page_list_P points to the pointer anchoring the head of the index page list
+ * Since the pagelist is only an optimization,
+ * handle palloc failure by quietly skipping the bookkeeping.
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P)
+{
+
+ int rc = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_item* found_item = 0;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_plp = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_plp = *pfch_index_page_list_P;
+ }
+
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* prefetch an existing block of the relation,
+ ** but first check that it has not already been prefetched and is still unread
+ */
+ found_item = _bt_find_block(blkno , pfch_index_plp);
+ if ((struct pfch_index_item*)0 == found_item) { /* not found */
+
+ rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno , 0);
+
+ /* add the pagenum to the list, recording its discard status;
+ ** since this is only an optimization, ignore failures such as exceeding the allowed space
+ */
+ _bt_add_block( blkno , pfch_index_page_list_P , (uint32)(rc & PREFTCHRC_BUF_PIN_INCREASED));
+
+ }
+ }
+ return;
+}
+
+/* _bt_find_block finds the item referencing specified Block in index page list if present
+** and returns the pointer to the pfch_index_item if found, or null if not
+*/
+struct pfch_index_item*
+_bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+
+ struct pfch_index_item* found_item = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ int ix, tx;
+
+ pfch_index_plp = pfch_index_page_list;
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ ix = 0;
+ tx = pfch_index_plp->pfch_index_item_count;
+ while ( (ix < tx)
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ found_item = &pfch_index_plp->pfch_indexid[ix];
+ }
+ ix++;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+
+ return found_item;
+}
+
+/* _bt_add_block adds the specified Block to the index page list
+** and returns 0 if successful, non-zero if not
+*/
+int
+_bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status)
+{
+ int rc = 1;
+ int ix;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_pagelist* pfch_index_page_list_anchor; /* pointer to first chunk if any */
+ /* allow expansion of the pagelist to 16 chunks,
+ ** which accommodates backwards-sequential index scans,
+ ** where the scanner increases target_prefetch_pages by a factor of up to 16
+ ** (see the code in _bt_steppage).
+ ** note - this creates an undesirable weak dependency on that number in _bt_steppage,
+ ** but :
+ ** there is no disaster if the numbers disagree - just sub-optimal use of the list;
+ ** implementing a proper interface would require chunks of variable size,
+ ** which would require an extra size variable in each chunk
+ */
+ int num_chunks = 16;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_page_list_anchor = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_page_list_anchor = *pfch_index_page_list_P;
+ }
+ pfch_index_plp = pfch_index_page_list_anchor; /* pointer to current chunk */
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ if (ix < target_prefetch_pages) {
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = (ix+1);
+ rc = 0;
+ goto stored_pagenum;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ num_chunks--; /* keep track of number of chunks */
+ }
+
+ /* we did not find any free space in existing chunks -
+ ** create new chunk if within our limit and we have a pfch_index_page_list
+ */
+ if ( (num_chunks > 0) && ((struct pfch_index_pagelist*)0 != pfch_index_page_list_anchor) ) {
+ pfch_index_plp = (struct pfch_index_pagelist*)palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ if ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ pfch_index_plp->pfch_index_pagelist_next = pfch_index_page_list_anchor; /* old head of list is next after this */
+ pfch_index_plp->pfch_indexid[0].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[0].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = 1;
+ *pfch_index_page_list_P = pfch_index_plp; /* new head of list is the new chunk */
+ rc = 0;
+ }
+ }
+
+ stored_pagenum:;
+ return rc;
+}
+
+/* _bt_subtract_block removes a block from the prefetched-but-unread pagelist if present */
+void
+_bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+ struct pfch_index_pagelist* pfch_index_plp = pfch_index_page_list;
+ if ( (blkno != P_NEW) && (blkno != P_NONE) ) {
+ int ix , jx;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ /* move the last item to the current (now deleted) position and decrement count */
+ jx = (pfch_index_plp->pfch_index_item_count-1); /* index of last item ... */
+ if (jx > ix) { /* ... is not the current one so move is required */
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = pfch_index_plp->pfch_indexid[jx].pfch_blocknum;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = pfch_index_plp->pfch_indexid[jx].pfch_discard;
+ ix = jx;
+ }
+ pfch_index_plp->pfch_index_item_count = ix;
+ goto done_subtract;
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+ }
+ done_subtract: return;
+}
+#endif /* USE_PREFETCH */
+
/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
@@ -573,7 +759,11 @@ _bt_log_reuse_page(Relation rel, BlockNu
* _bt_checkpage to sanity-check the page (except in P_NEW case).
*/
Buffer
-_bt_getbuf(Relation rel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access
+#ifdef USE_PREFETCH
+ ,struct pfch_index_pagelist* pfch_index_page_list
+#endif /* USE_PREFETCH */
+ )
{
Buffer buf;
@@ -581,6 +771,12 @@ _bt_getbuf(Relation rel, BlockNumber blk
{
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
+
+#ifdef USE_PREFETCH
+ /* if the block is in the prefetched-but-unread pagelist, remove it */
+ _bt_subtract_block( blkno , pfch_index_page_list);
+#endif /* USE_PREFETCH */
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
}
@@ -702,6 +898,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
* bufmgr when one would do. However, now it's mainly just a notational
* convenience. The only case where it saves work over _bt_relbuf/_bt_getbuf
* is when the target page is the same one already in the buffer.
+ *
+ * If prefetching of index pages is ever changed to use this function,
+ * then it should be extended to take the index page list as a parameter
+ * and call _bt_subtract_block in the same way that _bt_getbuf does.
*/
Buffer
_bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
@@ -712,6 +912,7 @@ _bt_relandgetbuf(Relation rel, Buffer ob
if (BufferIsValid(obuf))
LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
return buf;
@@ -965,7 +1166,11 @@ _bt_is_page_halfdead(Relation rel, Block
BTPageOpaque opaque;
bool result;
- buf = _bt_getbuf(rel, blk, BT_READ);
+ buf = _bt_getbuf(rel, blk, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1069,7 +1274,11 @@ _bt_lock_branch_parent(Relation rel, Blo
Page lpage;
BTPageOpaque lopaque;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
@@ -1265,7 +1474,11 @@ _bt_pagedel(Relation rel, Buffer buf)
BTPageOpaque lopaque;
Page lpage;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
@@ -1340,7 +1553,11 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ buf = _bt_getbuf(rel, rightsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
}
return ndeleted;
@@ -1593,7 +1810,11 @@ _bt_unlink_halfdead_page(Relation rel, B
target = topblkno;
/* fetch the block number of the topmost parent's left sibling */
- buf = _bt_getbuf(rel, topblkno, BT_READ);
+ buf = _bt_getbuf(rel, topblkno, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
@@ -1632,7 +1853,11 @@ _bt_unlink_halfdead_page(Relation rel, B
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -1646,7 +1871,11 @@ _bt_unlink_halfdead_page(Relation rel, B
RelationGetRelationName(rel));
return false;
}
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
@@ -1701,7 +1930,11 @@ _bt_unlink_halfdead_page(Relation rel, B
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
@@ -1731,7 +1964,11 @@ _bt_unlink_halfdead_page(Relation rel, B
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
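
To make the pagelist bookkeeping in nbtpage.c above easier to follow: each
chunk filled by _bt_add_block is an unordered array of (blocknum, discard)
items, and _bt_subtract_block deletes by moving the chunk's last item into
the vacated slot rather than shifting. A minimal single-chunk sketch of that
idiom, with illustrative names and none of the patch's chunk chaining:

#include <stdbool.h>
#include <stdint.h>

#define LIST_CAP 32                    /* stands in for target_prefetch_pages */

typedef struct
{
    uint32_t blocknum[LIST_CAP];
    int      count;
} BlockList;

/* record a block number; if the list is full, just skip the bookkeeping */
static bool
list_add(BlockList *l, uint32_t blkno)
{
    if (l->count >= LIST_CAP)
        return false;
    l->blocknum[l->count++] = blkno;
    return true;
}

/* remove a block number if present: O(n) scan, O(1) unordered delete */
static void
list_remove(BlockList *l, uint32_t blkno)
{
    int ix;

    for (ix = 0; ix < l->count; ix++)
    {
        if (l->blocknum[ix] == blkno)
        {
            l->blocknum[ix] = l->blocknum[l->count - 1];
            l->count--;
            return;
        }
    }
}
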
--- src/backend/access/nbtree/nbtree.c.orig 2014-06-25 16:37:59.365618875 -0400
+++ src/backend/access/nbtree/nbtree.c 2014-06-25 18:10:51.388522367 -0400
@@ -30,6 +30,18 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; also the numeric size of pfch_block_item_list */
+#endif /* USE_PREFETCH */
+
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+);
/* Working state for btbuild and its callback */
typedef struct
@@ -332,6 +344,78 @@ btgettuple(PG_FUNCTION_ARGS)
}
/*
+ * btpeeknexttuple() -- peek at the next tuple whose heap block differs from every blocknum in pfch_block_item_list,
+ * without reading a new index page
+ * and without causing any side-effects such as altering values in control blocks.
+ * If such a tuple is found, store its blocknum in the next element of pfch_block_item_list.
+ * This function is useful only when postgresql is compiled with USE_PREFETCH.
+ */
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+)
+{
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+
+#ifdef USE_PREFETCH
+ /*
+ * If we've already initialized this scan, peek ahead within the
+ * currently cached page. If we haven't done so yet, bail out.
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop advances until we find a different heap block or reach the end of the cached index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < prefetch_index_scans) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= prefetch_index_scans) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((&((scan->pfch_block_item_list + scan->pfch_next)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+#endif /* USE_PREFETCH */
+ PG_RETURN_BOOL(res);
+}
+
+/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
Datum
@@ -425,6 +509,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -517,6 +607,23 @@ btendscan(PG_FUNCTION_ARGS)
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_plp;
+ int ix;
+
+ /* discard all prefetched but unread index pages listed in the pagelist */
+ pfch_index_plp = scan->pfch_index_page_list;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_discard) {
+ DiscardBuffer( scan->indexRelation , MAIN_FORKNUM , pfch_index_plp->pfch_indexid[ix].pfch_blocknum);
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+#endif /* USE_PREFETCH */
+
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
{
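
The slot selection in btpeeknexttuple above, which prefers an unused entry
of pfch_block_item_list and otherwise overwrites round-robin while growing
pfch_used, reads more easily in isolation. The following sketch is
illustrative only; capacity stands in for prefetch_index_scans:

/* pick the slot in which to record the next heap block to prefetch */
static unsigned short
choose_prefetch_slot(unsigned short *pfch_used, unsigned short *pfch_next,
                     unsigned short capacity)
{
    if (*pfch_used < capacity)
        *pfch_next = *pfch_used;        /* an unused entry exists: take it */
    else
    {
        (*pfch_next)++;                 /* all entries in use: overwrite round-robin */
        if (*pfch_next >= capacity)
            *pfch_next = 0;
    }
    if (*pfch_used <= *pfch_next)
        *pfch_used = *pfch_next + 1;    /* grow the used count as new slots are taken */
    return *pfch_next;
}
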
--- src/backend/nodes/tidbitmap.c.orig 2014-06-25 16:37:59.397618882 -0400
+++ src/backend/nodes/tidbitmap.c 2014-06-25 18:10:51.408522446 -0400
@@ -44,6 +44,9 @@
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "utils/hsearch.h"
+#ifdef USE_PREFETCH
+extern int target_prefetch_pages;
+#endif /* USE_PREFETCH */
/*
* The maximum number of tuples per page is not large (typically 256 with
@@ -572,7 +575,12 @@ tbm_begin_iterate(TIDBitmap *tbm)
* needs of the TBMIterateResult sub-struct.
*/
iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)
+#ifdef USE_PREFETCH
+ /* space for remembering every prefetched but unread blockno */
+ + (target_prefetch_pages * sizeof(BlockNumber))
+#endif /* USE_PREFETCH */
+ );
iterator->tbm = tbm;
/*
@@ -1020,3 +1028,70 @@ tbm_comparator(const void *left, const v
return 1;
return 0;
}
+
+#ifdef USE_PREFETCH
+void
+tbm_zero(TBMIterator *iterator) /* zero list of prefetched and unread blocknos */
+{
+ /* locate the list of prefetched but unread blocknos immediately following the array of offsets
+ ** and note that tbm_begin_iterate allocates space for (1 + MAX_TUPLES_PER_PAGE) offsets -
+ ** 1 included in struct TBMIterator and MAX_TUPLES_PER_PAGE additional
+ */
+ iterator->output.Unread_Pfetched_base = ((BlockNumber *)(&(iterator->output.offsets[MAX_TUPLES_PER_PAGE+1])));
+ iterator->output.Unread_Pfetched_next = iterator->output.Unread_Pfetched_count = 0;
+}
+
+void
+tbm_add(TBMIterator *iterator, BlockNumber blockno) /* add this blockno to list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next + iterator->output.Unread_Pfetched_count++;
+
+ if (iterator->output.Unread_Pfetched_count > target_prefetch_pages) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_add overflowed list cannot add blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index -= target_prefetch_pages;
+ *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index) = blockno;
+}
+
+void
+tbm_subtract(TBMIterator *iterator, BlockNumber blockno) /* remove this blockno from list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next++;
+ BlockNumber nextUnreadPfetched;
+
+ /* make a weak check that the next blockno is the one to be removed;
+ ** in case of disagreement we ignore the caller's blockno and remove the next one anyway,
+ ** which is really what the caller wants
+ */
+ if ( iterator->output.Unread_Pfetched_count == 0 ) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract empty list cannot subtract blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index = 0;
+ nextUnreadPfetched = *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index);
+ if ( ( nextUnreadPfetched != blockno )
+ && ( nextUnreadPfetched != InvalidBlockNumber ) /* don't report it if the block in the list was InvalidBlockNumber */
+ ) {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract will subtract blockno %d not %d",
+ nextUnreadPfetched, blockno)));
+ }
+ if (iterator->output.Unread_Pfetched_next >= target_prefetch_pages)
+ iterator->output.Unread_Pfetched_next = 0;
+ iterator->output.Unread_Pfetched_count--;
+}
+#endif /* USE_PREFETCH */
+
+TBMIterateResult *
+tbm_locate_IterateResult(TBMIterator *iterator)
+{
+ return &(iterator->output);
+}
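
At heart, tbm_zero, tbm_add and tbm_subtract above maintain a FIFO ring of
prefetched-but-unread block numbers sized to target_prefetch_pages: tbm_add
appends at (next + count) modulo the ring size, and tbm_subtract consumes
from next. A minimal standalone equivalent, omitting the patch's overflow
and mismatch reporting (illustrative only):

#include <stdint.h>

#define RING_CAP 32                    /* stands in for target_prefetch_pages */

typedef struct
{
    uint32_t     blocks[RING_CAP];
    unsigned int next;                 /* index of the oldest unread entry */
    unsigned int count;                /* number of prefetched-but-unread entries */
} PrefetchRing;

static void
ring_add(PrefetchRing *r, uint32_t blkno)    /* cf. tbm_add */
{
    unsigned int ix = r->next + r->count++;

    if (ix >= RING_CAP)
        ix -= RING_CAP;
    r->blocks[ix] = blkno;
}

static uint32_t
ring_take_oldest(PrefetchRing *r)            /* cf. tbm_subtract; caller ensures count > 0 */
{
    unsigned int ix = r->next++;

    if (r->next >= RING_CAP)
        r->next = 0;
    r->count--;
    return r->blocks[ix];
}
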
--- src/backend/utils/misc/guc.c.orig 2014-06-25 16:37:59.533618908 -0400
+++ src/backend/utils/misc/guc.c 2014-06-25 18:10:51.456522634 -0400
@@ -2259,6 +2259,25 @@ static struct config_int ConfigureNamesI
},
{
+ {"max_async_io_prefetchers",
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ PGC_USERSET,
+#else
+ PGC_INTERNAL,
+#endif
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum number of backend processes concurrently using asynchronous librt threads to prefetch pages into shared memory buffers."),
+ },
+ &max_async_io_prefetchers,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ -1, 0, 8192, /* boot val -1 indicates to initialize to something sensible during buf_init */
+#else
+ 0, 0, 0,
+#endif
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
--- src/backend/utils/mmgr/aset.c.orig 2014-06-25 16:37:59.533618908 -0400
+++ src/backend/utils/mmgr/aset.c 2014-06-25 18:10:51.484522744 -0400
@@ -733,6 +733,48 @@ AllocSetAlloc(MemoryContext context, Siz
*/
fidx = AllocSetFreeIndex(size);
chunk = set->freelist[fidx];
+#ifdef MEMORY_CONTEXT_CHECKING
+ /* an instance of a segfault caused by a rogue value in set->freelist[fidx]
+ ** has been seen - check for it with a crude sanity check based on neighbouring freelists :
+ ** if at least one neighbour is sufficiently close, pass, else fail
+ */
+ if (chunk != 0) {
+ int frx, nrx; /* frx is index, nrx is index of failing neighbour for errmsg */
+ for (nrx = -1, frx = 0; (frx < ALLOCSET_NUM_FREELISTS); frx++) {
+ if ( (frx != fidx) /* not the chosen one */
+ && ( ( (unsigned long)(set->freelist[frx]) ) != 0 ) /* not empty */
+ ) {
+ if ( ( (unsigned long)chunk < ( ( (unsigned long)(set->freelist[frx]) ) / 2 ) )
+ && ( ( (unsigned long)(set->freelist[frx]) ) < 0x4000000 )
+ /*** || ( (unsigned long)chunk > ( ( (unsigned long)(set->freelist[frx]) ) * 2 ) ) ***/
+ ) {
+ nrx = frx;
+ } else {
+ nrx = -1;
+ break;
+ }
+ }
+ }
+
+ if (nrx >= 0) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d compared with neighbour %p whose chunksize %d"
+ , chunk , fidx , set->freelist[nrx] , set->freelist[nrx]->size);
+ chunk = NULL;
+ }
+ }
+#else /* if not MEMORY_CONTEXT_CHECKING make very simple-minded check*/
+ if ( (chunk != 0) && ( (unsigned long)chunk < 0x40000 ) ) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d"
+ , chunk , fidx);
+ chunk = NULL;
+ }
+#endif
if (chunk != NULL)
{
Assert(chunk->size >= size);
--- src/include/executor/instrument.h.orig 2014-06-25 16:37:59.581618917 -0400
+++ src/include/executor/instrument.h 2014-06-25 18:10:51.516522869 -0400
@@ -28,8 +28,18 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+
+ long aio_read_noneed; /* # of prefetches for which no prefetch was needed as block already in buffer pool */
+ long aio_read_discrd; /* # of prefetches for which the prefetched block was discarded before being read */
+ long aio_read_forgot; /* # of prefetches for which the prefetch was forgotten and the block never read */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
} BufferUsage;
/* Flag bits included in InstrAlloc's instrument_options bitmask */
--- src/include/storage/bufmgr.h.orig 2014-06-25 16:37:59.593618920 -0400
+++ src/include/storage/bufmgr.h 2014-06-25 18:10:51.536522947 -0400
@@ -41,6 +41,7 @@ typedef enum
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
+ ,RBM_NOREAD_FOR_PREFETCH /* Don't read from disk, don't zero buffer, find buffer only */
} ReadBufferMode;
/* in globals.c ... this duplicates miscadmin.h */
@@ -57,6 +58,9 @@ extern int target_prefetch_pages;
extern PGDLLIMPORT char *BufferBlocks;
extern PGDLLIMPORT int32 *PrivateRefCount;
+/* in buf_async.c */
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
@@ -159,9 +163,15 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
- * prototypes for functions in bufmgr.c
+ * prototypes for external functions in bufmgr.c and buf_async.c
*/
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
+extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum , BufferAccessStrategy strategy);
+/* return code is an int bitmask : */
+#define PREFTCHRC_BUF_PIN_INCREASED 0x01 /* pin count on buffer has been increased by 1 */
+#define PREFTCHRC_BLK_ALREADY_PRESENT 0x02 /* block was already present in a buffer */
+
+extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
--- src/include/storage/lwlock.h.orig 2014-06-25 16:37:59.593618920 -0400
+++ src/include/storage/lwlock.h 2014-06-25 18:10:51.552523010 -0400
@@ -55,6 +55,21 @@ typedef struct LWLock
/* tail is undefined when head is NULL */
} LWLock;
+
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+/* Instantiations of the following structure are embedded in the BufferAiocb defined in buf_internals.h,
+ * but the struct is also needed elsewhere, such as in fd.c.
+ * Each chain item represents one aio_read whose waiter-lock is held eXclusive by the originator
+ * and will be released by the same originator after the completion signal has been received.
+ * Any other backend may wait Shared on the lock to wait for completion,
+ * *provided* that the BAiocb is still in use (at least one dependent on it).
+*/
+struct BAiocbIolock_chain_item { /* a simple struct for chaining awaiting-release LWLock ptrs together */
+ struct BAiocbIolock_chain_item volatile * volatile next; /* next chain_item */
+ LWLock volatile * volatile BAiocbIolock; /* waiter-lock to wait for async I/O to complete */
+};
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
+
/*
* Prior to PostgreSQL 9.4, every lightweight lock in the system was stored
* in a single array. For convenience and for compatibility with past
--- src/include/storage/smgr.h.orig 2014-06-25 16:37:59.593618920 -0400
+++ src/include/storage/smgr.h 2014-06-25 18:10:51.572523089 -0400
@@ -92,6 +92,20 @@ extern void smgrextend(SMgrRelation reln
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void smgrinitaio(int max_aio_threads, int max_aio_num);
+extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+extern void smgrcompleteaio( SMgrRelation reln, char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
@@ -118,6 +132,19 @@ extern void mdextend(SMgrRelation reln,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void mdinitaio(int max_aio_threads, int max_aio_num);
+extern void mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+extern void mdcompleteaio( char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
--- src/include/storage/fd.h.orig 2014-06-25 16:37:59.593618920 -0400
+++ src/include/storage/fd.h 2014-06-25 18:10:51.596523183 -0400
@@ -69,6 +69,19 @@ extern File PathNameOpenFile(FileName fi
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void FileInitaio(int max_aio_threads, int max_aio_num );
+extern int FileStartaio(File file, off_t offset, int amount , char *aiocbp
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+extern int FileCompleteaio( char *aiocbp , int normal_wait
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , void *BAiocbIolockaiocbp
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern int FileRead(File file, char *buffer, int amount);
extern int FileWrite(File file, char *buffer, int amount);
extern int FileSync(File file);
--- src/include/storage/buf_internals.h.orig 2014-06-25 16:37:59.593618920 -0400
+++ src/include/storage/buf_internals.h 2014-06-25 18:10:51.612523246 -0400
@@ -22,7 +22,9 @@
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Flags for buffer descriptors
@@ -38,8 +40,23 @@
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1 << 7) /* must write for checkpoint */
-#define BM_PERMANENT (1 << 8) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1 << 8) /* permanent relation (not unlogged) */
+#define BM_AIO_IN_PROGRESS (1 << 9) /* aio in progress */
+#define BM_AIO_PREFETCH_PIN_BANKED (1 << 10) /* pinned when prefetch issued
+ ** and this pin is banked - i.e.
+ ** redeemable by the next use by same task
+ ** note that for any one buffer, a pin can be banked
+ ** by at most one process globally,
+ ** that is, only one process may bank a pin on the buffer
+ ** and it may do so only once (may not be stacked)
+ */
+
+/*********
+For asynchronous aio-read prefetching, two golden rules concerning buffer pinning and buffer-header flags must be observed:
+ R1. a buffer marked as BM_AIO_IN_PROGRESS must be pinned by at least one backend
+ R2. a buffer marked as BM_AIO_PREFETCH_PIN_BANKED must be pinned by the backend identified by
+ (buf->flags & BM_AIO_IN_PROGRESS) ? (((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio) : (-(buf->freeNext))
+*********/
typedef bits16 BufFlags;
@@ -140,17 +157,88 @@ typedef struct sbufdesc
BufFlags flags; /* see bit definitions above */
uint16 usage_count; /* usage counter for clock sweep code */
unsigned refcount; /* # of backends holding pins on buffer */
- int wait_backend_pid; /* backend PID of pin-count waiter */
+ int wait_backend_pid; /* if flags & BM_PIN_COUNT_WAITER
+ ** then backend PID of pin-count waiter
+ ** else not set
+ */
slock_t buf_hdr_lock; /* protects the above fields */
int buf_id; /* buffer's index number (from 0) */
- int freeNext; /* link in freelist chain */
+ int volatile freeNext; /* overloaded and much-abused field :
+ ** EITHER
+ ** if >= 0
+ ** then link in freelist chain
+ ** OR
+ ** if < 0
+ ** then EITHER
+ ** if flags & BM_AIO_IN_PROGRESS
+ ** then negative of (the index of the aiocb in the BufferAiocbs array + 3)
+ ** else if flags & BM_AIO_PREFETCH_PIN_BANKED
+ ** then -(pid of task that issued aio_read and pinned buffer)
+ ** else one of the special values -1 or -2 listed below
+ */
LWLock *io_in_progress_lock; /* to wait for I/O to complete */
LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
+/* structures for control blocks for our implementation of async io */
+
+/* if USE_AIO_ATOMIC_BUILTIN_COMP_SWAP is not defined, the following struct is not put into use at runtime
+** but it is easier to let the compiler find the definition but hide the reference to aiocb
+** which is the only type it would not understand
+*/
+
+struct BufferAiocb {
+ struct BufferAiocb volatile * volatile BAiocbnext; /* next free entry or value of BAIOCB_OCCUPIED means in use */
+ struct sbufdesc volatile * volatile BAiocbbufh; /* there can be at most one BufferDesc marked BM_AIO_IN_PROGRESS
+ ** and using this BufferAiocb -
+ ** if there is one, BAiocbbufh points to it, else BAiocbbufh is zero
+ ** NOTE BAiocbbufh should be zero for every BufferAiocb on the free list
+ */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct aiocb volatile BAiocbthis; /* the aio library's control block for one async io */
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+struct BAiocbIolock_chain_item volatile BAiocbIolockItem; /* to wait for async I/O to complete
+ * BAiocbIolock is waited on only by processes other than
+ * originator of aio */
+#endif /* USE_AIO_SIGEVENT */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int volatile BAiocbDependentCount; /* count of tasks who depend on this BufferAiocb
+ ** in the sense that they are waiting for io completion.
+ ** only a Dependent may move the BufferAiocb onto the freelist
+ ** and only when that Dependent is the *only* Dependent (count == 1)
+ ** BAiocbDependentCount is protected by bufferheader spinlock
+ ** and must be updated only when that spinlock is held
+ */
+ pid_t volatile pidOfAio; /* pid of backend who issued an aio_read using this BAiocb -
+ ** this backend must have pinned the associated buffer.
+ */
+};
+
+#define BAIOCB_OCCUPIED 0x75f1 /* distinct indicator of a BufferAiocb.BAiocbnext that is NOT on free list */
+#define BAIOCB_FREE 0x7b9d /* distinct indicator of a BufferAiocb.BAiocbbufh that IS on free list */
+
+struct BAiocbAnchor { /* anchor for all control blocks pertaining to aio */
+ volatile struct BufferAiocb* BufferAiocbs; /* aiocbs ... */
+ volatile struct BufferAiocb* volatile FreeBAiocbs; /* ... and their free list */
+};
+
+/* values for BufCheckAsync input and retcode */
+#define BUF_INTENTION_WANT 1 /* wants the buffer, wait for in-progress aio and then pin */
+#define BUF_INTENTION_REJECT_KEEP_PIN -1 /* pin already held, do not unpin */
+#define BUF_INTENTION_REJECT_OBTAIN_PIN -2 /* obtain pin, caller wants it for same buffer */
+#define BUF_INTENTION_REJECT_FORGET -3 /* unpin and tell resource owner to forget */
+#define BUF_INTENTION_REJECT_NOADJUST -4 /* unpin and call ResourceOwnerForgetBuffer */
+#define BUF_INTENTION_REJECT_UNBANK -5 /* unpin only if pin banked by caller */
+
+#define BUF_INTENT_RC_CHANGED_TAG -5
+#define BUF_INTENT_RC_BADPAGE -4
+#define BUF_INTENT_RC_INVALID_AIO -3 /* invalid and aio was in progress */
+#define BUF_INTENT_RC_INVALID_NO_AIO -1 /* invalid and no aio was in progress */
+#define BUF_INTENT_RC_VALID 1
+
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
/*
@@ -159,6 +247,7 @@ typedef struct sbufdesc
*/
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+#define FREENEXT_BAIOCB_ORIGIN (-3)
/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
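
Because freeNext is so heavily overloaded above, a small decoder makes the
encoding easier to check. The constants mirror the definitions in this
header; the helper itself is only an illustration and is not part of the
patch:

#include <stdio.h>

#define FREENEXT_END_OF_LIST        (-1)
#define FREENEXT_NOT_IN_LIST        (-2)
#define FREENEXT_BAIOCB_ORIGIN      (-3)
#define BM_AIO_IN_PROGRESS          (1 << 9)
#define BM_AIO_PREFETCH_PIN_BANKED  (1 << 10)

static void
explain_freeNext(int freeNext, unsigned int flags)
{
    if (freeNext >= 0)
        printf("freelist link to buffer %d\n", freeNext);
    else if (flags & BM_AIO_IN_PROGRESS)
        printf("index %d into the BufferAiocbs array\n",
               FREENEXT_BAIOCB_ORIGIN - freeNext);
    else if (flags & BM_AIO_PREFETCH_PIN_BANKED)
        printf("pid %d banked a prefetch pin on this buffer\n", -freeNext);
    else if (freeNext == FREENEXT_END_OF_LIST)
        printf("at end of freelist\n");
    else if (freeNext == FREENEXT_NOT_IN_LIST)
        printf("not in freelist\n");
}
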
--- src/include/catalog/pg_am.h.orig 2014-06-25 16:37:59.577618917 -0400
+++ src/include/catalog/pg_am.h 2014-06-25 18:10:51.632523324 -0400
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
--- src/include/catalog/pg_proc.h.orig 2014-06-25 16:37:59.577618917 -0400
+++ src/include/catalog/pg_proc.h 2014-06-25 18:10:51.676523496 -0400
@@ -536,6 +536,12 @@ DESCR("convert float4 to int4");
DATA(insert OID = 330 ( btgettuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ btgettuple _null_ _null_ _null_ ));
DESCR("btree(internal)");
+
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+DATA(insert OID = 3255 ( btpeeknexttuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "2281" _null_ _null_ _null_ _null_ btpeeknexttuple _null_ _null_ _null_ ));
+DESCR("btree(internal)");
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
DATA(insert OID = 636 ( btgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ btgetbitmap _null_ _null_ _null_ ));
DESCR("btree(internal)");
DATA(insert OID = 331 ( btinsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ btinsert _null_ _null_ _null_ ));
--- src/include/pg_config_manual.h.orig 2014-06-25 16:37:59.589618919 -0400
+++ src/include/pg_config_manual.h 2014-06-25 18:10:51.700523591 -0400
@@ -138,9 +138,17 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * might in future be support for alternative low-level prefetch APIs.
+ * Update October 2013: there is now such a prefetch capability --
+ * async_io into postgres buffers - configuration parameter max_async_io_threads)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+#define USE_AIO_SIGEVENT 1
+/* AIO_SIGEVENT_SIGNALNUM is the signal used to indicate completion
+ * of an aio operation. Choose a signal that is not used elsewhere
+ * in postgresql and which can be caught by a signal handler.
+*/
+#define AIO_SIGEVENT_SIGNALNUM SIGIO
#define USE_PREFETCH
#endif
--- src/include/miscadmin.h.orig 2014-06-25 16:37:59.585618918 -0400
+++ src/include/miscadmin.h 2014-06-25 18:10:51.708523623 -0400
@@ -76,6 +76,9 @@
extern PGDLLIMPORT volatile bool InterruptPending;
extern PGDLLIMPORT volatile bool QueryCancelPending;
extern PGDLLIMPORT volatile bool ProcDiePending;
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+extern struct BAiocbIolock_chain_item volatile * volatile BAiocbIolock_anchor; /* anchor for chain of awaiting-release LWLock ptrs */
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
extern volatile bool ClientConnectionLost;
@@ -87,13 +90,27 @@ extern PGDLLIMPORT volatile uint32 CritS
/* in tcop/postgres.c */
extern void ProcessInterrupts(void);
+/* in storage/buffer/buf_init.c */
+extern void ProcessPendingBAiocbIolocks(void);
+
#ifndef WIN32
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+#define CHECK_FOR_INTERRUPTS() \
+do { \
+ if (BAiocbIolock_anchor != (struct BAiocbIolock_chain_item*)0) \
+ ProcessPendingBAiocbIolocks(); \
+ if (InterruptPending) \
+ ProcessInterrupts(); \
+} while(0)
+#else /* not defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
+
#define CHECK_FOR_INTERRUPTS() \
do { \
if (InterruptPending) \
ProcessInterrupts(); \
} while(0)
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
#else /* WIN32 */
#define CHECK_FOR_INTERRUPTS() \
--- src/include/access/nbtree.h.orig 2014-06-25 16:37:59.573618916 -0400
+++ src/include/access/nbtree.h 2014-06-25 18:10:51.716523654 -0400
@@ -19,6 +19,7 @@
#include "access/sdir.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "access/relscan.h"
#include "catalog/pg_index.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
@@ -524,6 +525,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -603,6 +605,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -655,7 +666,17 @@ extern Buffer _bt_getroot(Relation rel,
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access
+#ifdef USE_PREFETCH
+ , struct pfch_index_pagelist* pfch_index_page_list
+#endif /* USE_PREFETCH */
+ );
+#ifdef USE_PREFETCH
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P);
+extern struct pfch_index_item* _bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+extern int _bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status);
+extern void _bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+#endif /* USE_PREFETCH */
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
--- src/include/access/heapam.h.orig 2014-06-25 16:37:59.573618916 -0400
+++ src/include/access/heapam.h 2014-06-25 18:10:51.728523701 -0400
@@ -175,7 +175,7 @@ extern void heap_page_prune_execute(Buff
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
--- src/include/access/relscan.h.orig 2014-06-25 16:37:59.573618916 -0400
+++ src/include/access/relscan.h 2014-06-25 18:10:51.740523748 -0400
@@ -44,6 +44,24 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+
+ /* Unread_Pfetched is a "mostly" circular list of recently prefetched blocknos, of size target_prefetch_pages.
+ ** The index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read.
+ ** The count of unread blocks is in Unread_Pfetched_count (and this subset can wrap around).
+ ** "mostly" means that there may be gaps caused by storing entries for blocks which do not need to be discarded -
+ ** these are indicated by blockno = InvalidBlockNumber, and such slots are reused when found.
+ */
+ BlockNumber *rs_Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int rs_Unread_Pfetched_next; /* where the next unread blockno probably is relative to start --
+ ** this is only a hint which may be temporarily stale.
+ */
+ unsigned int rs_Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
+
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -55,6 +73,27 @@ typedef struct HeapScanDescData
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
+/* pfch_index_items track prefetched and unread index pages - chunks of blocknumbers are chained in singly-linked list from scan->pfch_index_item_list */
+struct pfch_index_item { /* index-relation BlockIds which we will/have prefetched */
+ BlockNumber pfch_blocknum; /* Blocknum which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+struct pfch_block_item {
+ struct BlockIdData pfch_blockid; /* BlockId which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+/* pfch_index_page_items track prefetched and unread index pages -
+** chunks of blocknumbers are chained backwards (newest first, oldest last)
+** in singly-linked list from scan->pfch_index_item_list
+*/
+struct pfch_index_pagelist { /* index-relation BlockIds which we will/have prefetched */
+ struct pfch_index_pagelist* pfch_index_pagelist_next; /* pointer to next chunk if any */
+ unsigned int pfch_index_item_count; /* number of used entries in this chunk */
+ struct pfch_index_item pfch_indexid[1]; /* in-line list of Blocknums which we will/have prefetched and whether to be discarded */
+};
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -75,8 +114,15 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_page_list; /* array of index-relation BlockIds which we will/have prefetched */
+ struct pfch_block_item* pfch_block_item_list; /* array of heap-relation BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
@@ -91,6 +137,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
--- src/include/nodes/tidbitmap.h.orig 2014-06-25 16:37:59.585618918 -0400
+++ src/include/nodes/tidbitmap.h 2014-06-25 18:10:51.752523795 -0400
@@ -41,6 +41,16 @@ typedef struct
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
+#ifdef USE_PREFETCH
+ /* Unread_Pfetched is a circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ */
+ BlockNumber *Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
OffsetNumber offsets[1]; /* VARIABLE LENGTH ARRAY */
} TBMIterateResult; /* VARIABLE LENGTH STRUCT */
@@ -62,5 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
extern void tbm_end_iterate(TBMIterator *iterator);
-
+extern void tbm_zero(TBMIterator *iterator); /* zero list of prefetched and unread blocknos */
+extern void tbm_add(TBMIterator *iterator, BlockNumber blockno); /* add this blockno to list of prefetched and unread blocknos */
+extern void tbm_subtract(TBMIterator *iterator, BlockNumber blockno); /* remove this blockno from list of prefetched and unread blocknos */
+extern TBMIterateResult *tbm_locate_IterateResult(TBMIterator *iterator); /* locate the TBMIterateResult of an iterator */
#endif /* TIDBITMAP_H */
--- src/include/utils/rel.h.orig 2014-06-25 16:37:59.601618921 -0400
+++ src/include/utils/rel.h 2014-06-25 18:10:51.760523826 -0400
@@ -61,6 +61,7 @@ typedef struct RelationAmInfo
FmgrInfo ammarkpos;
FmgrInfo amrestrpos;
FmgrInfo amcanreturn;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
--- src/include/pg_config.h.in.orig 2014-06-25 16:37:59.589618919 -0400
+++ src/include/pg_config.h.in 2014-06-25 18:10:51.768523858 -0400
@@ -1,4 +1,4 @@
-/* src/include/pg_config.h.in. Generated from configure.in by autoheader. */
+/* src/include/pg_config.h.in. Generated from - by autoheader. */
/* Define to the type of arg 1 of 'accept' */
#undef ACCEPT_TYPE_ARG1
@@ -751,6 +751,10 @@
/* Define to the appropriate snprintf format for unsigned 64-bit ints. */
#undef UINT64_FORMAT
+/* Define to select librt-style async io and the gcc atomic compare_and_swap.
+ */
+#undef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
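As background to the USE_AIO_ATOMIC_BUILTIN_COMP_SWAP symbol added above: the gcc builtin it refers to,
__sync_bool_compare_and_swap, atomically installs a new value only if the location still holds the expected
old value. The following is a minimal illustration only; the slot/owner names are invented for this sketch
and are not taken from the patch.

#include <stdbool.h>
#include <stddef.h>

struct sketch_slot
{
    void *owner;   /* NULL while free; set to the claimer's marker while in use */
};

/* Atomically claim the slot: succeeds only if owner was still NULL. */
static bool
sketch_try_claim(struct sketch_slot *slot, void *my_marker)
{
    return __sync_bool_compare_and_swap(&slot->owner, NULL, my_marker);
}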
My cut'n'pasting failed me at one point; corrected below.

Regarding the discussion about what the difference is between a synchronous read
and an asynchronous read as far as a non-originator waiting on it is concerned:
I thought a bit more about this. There are currently two differences,
one of which can easily be changed and one not so easy.

1) The current code, even with sigevent, still makes the non-originator waiter
call aio_error on the originator's aiocb to get the completion code.
For the sigevent variation, this is easily changed to have the originator always
call aio_error (from its CHECK_INTERRUPTS or from FileCompleteaio)
and store that in the BAiocb.
My idea for not doing that was that, by having the non-originator check the aiocb,
this would allow the waiter to proceed sooner. But for a different reason it actually
doesn't. (The non-originator must still wait for the LWlock release.)
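As a rough illustration of point 1), the originator could poll its own aiocb and publish the completion
code for waiters to read. This is only a sketch under that assumption; the struct and field names below
(BAiocbSketch, completion_code) are hypothetical and not the patch's actual BufferAiocb definition.

#include <aio.h>
#include <errno.h>

struct BAiocbSketch
{
    struct aiocb  aiocb;            /* POSIX aio control block, owned by the originator  */
    volatile int  completion_code;  /* EINPROGRESS until done, then 0 or an errno value  */
};

/* Originator-side check, e.g. run from its CHECK_FOR_INTERRUPTS or FileCompleteaio path.
 * Only the originator ever calls aio_error() on the aiocb; waiters read completion_code. */
static void
originator_record_completion(struct BAiocbSketch *ba)
{
    int rv = aio_error(&ba->aiocb);

    if (rv != EINPROGRESS)
        ba->completion_code = rv;
}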
2) Buffer pinning and returning the BufferAiocb to the free list
With synchronous IO, each backend that calls ReadBuffer must pin the buffer
early in the process.
With asynchronous IO, initially only the originator gets the pin
(and that is during PrefetchBuffer, not ReadBuffer).
When the aio completes and some backend checks that completion,
then that backend has the following responsibilities:
. pin the buffer if it did not already have one (from prefetch)
. if it was the last such backend to make that check
  (amongst the cohort waiting on it),
  then return the BufferAiocb to the free list
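A minimal sketch of those two responsibilities follows. Every name in it (sketch_pin_buffer,
waiter_cohort, and so on) is a stand-in invented for illustration, not the patch's actual
BufCheckAsync code, and the cohort decrement is shown without the synchronization the real
code would need.

#include <stdbool.h>

struct SketchAiocb
{
    int waiter_cohort;   /* backends still due to check this aio's completion */
};

static void sketch_pin_buffer(int buf) { (void) buf; }                   /* stand-in for pinning   */
static void sketch_release_aiocb(struct SketchAiocb *ba) { (void) ba; }  /* stand-in for free list */

static void
sketch_check_completion(struct SketchAiocb *ba, int buf, bool pinned_at_prefetch)
{
    /* 1) make sure this backend holds a pin before it uses the buffer */
    if (!pinned_at_prefetch)
        sketch_pin_buffer(buf);

    /* 2) the last backend of the waiting cohort returns the BufferAiocb to the free list */
    if (--ba->waiter_cohort == 0)
        sketch_release_aiocb(ba);
}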
I am attaching a new version of the patch for consideration in the current commit fest.
Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl
the method for handling aio completion using sigevent has been re-written to use
signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.
More details are in postgresql-prefetching-asyncio.README
(search for
"Checking AIO Completion"
)
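For readers who want the gist before opening the README: the sigevent mechanism referred to above is the
standard POSIX facility for having an aio_read deliver a signal on completion. The sketch below shows only
that standard setup (using SIGIO, matching the README's default AIO_SIGEVENT_SIGNALNUM); it is not the
patch's own code and omits all of its bookkeeping.

#include <aio.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* Start an asynchronous read that notifies this process with SIGIO when it completes. */
static int
sketch_start_signalled_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;

    cb->aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
    cb->aio_sigevent.sigev_signo           = SIGIO;
    cb->aio_sigevent.sigev_value.sival_ptr = cb;   /* lets the signal handler find the aiocb */

    return aio_read(cb);
}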
I also have worked my benchmark database and one application into a publishable state.
However, the database is in the form of a compressed pg_dump and is around
218 MB in size. I was hoping to come up with a generation program to load it
but have not had the time for that. Is there some place on postgresql.org for
such a large file? If not I will try to think of some place for it.
John Lumby
Attachments:
postgresql-9.5.140818.prefetching-asyncio.patch (application/octet-stream)
--- configure.in.orig 2014-08-18 14:10:36.737016150 -0400
+++ configure.in 2014-08-19 16:56:12.631193389 -0400
@@ -1763,6 +1763,12 @@ operating system; use --disable-thread-
fi
fi
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of the latter.
+PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" = x"yes"; then
+ AC_DEFINE(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP, 1, [Define to select librt-style async io and the gcc atomic compare_and_swap.])
+fi
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
--- contrib/pg_prewarm/pg_prewarm.c.orig 2014-08-18 14:10:36.757016243 -0400
+++ contrib/pg_prewarm/pg_prewarm.c 2014-08-19 16:56:12.947194521 -0400
@@ -159,7 +159,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
*/
for (block = first_block; block <= last_block; ++block)
{
- PrefetchBuffer(rel, forkNumber, block);
+ PrefetchBuffer(rel, forkNumber, block, 0);
++blocks_done;
}
#else
--- contrib/pg_stat_statements/pg_stat_statements--1.3.sql.orig 2014-08-19 10:17:32.814616339 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.3.sql 2014-08-19 16:56:12.995194693 -0400
@@ -0,0 +1,52 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_stat_statements VERSION '1.3'" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_stat_statements_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+-- Register a view on the function for ease of use.
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
+
+-- Don't want this to be available to non-superusers.
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
--- contrib/pg_stat_statements/Makefile.orig 2014-08-18 14:10:36.757016243 -0400
+++ contrib/pg_stat_statements/Makefile 2014-08-19 16:56:13.007194736 -0400
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
OBJS = pg_stat_statements.o $(WIN32RES)
EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
+DATA = pg_stat_statements--1.3.sql pg_stat_statements--1.2--1.3.sql \
+ pg_stat_statements--1.2.sql pg_stat_statements--1.1--1.2.sql \
pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
PGFILEDESC = "pg_stat_statements - execution statistics of SQL statements"
--- contrib/pg_stat_statements/pg_stat_statements.c.orig 2014-08-18 14:10:36.757016243 -0400
+++ contrib/pg_stat_statements/pg_stat_statements.c 2014-08-19 16:56:13.051194894 -0400
@@ -116,6 +116,7 @@ typedef enum pgssVersion
PGSS_V1_0 = 0,
PGSS_V1_1,
PGSS_V1_2
+ ,PGSS_V1_3
} pgssVersion;
/*
@@ -147,6 +148,16 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_discrd; /* # of prefetches for which buffer not subsequently read and therefore discarded */
+ int64 aio_read_forgot; /* # of prefetches for which buffer not subsequently read and then forgotten about */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb control block */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which in-progress aio cancelled and disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -274,6 +285,7 @@ void _PG_fini(void);
PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
+PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
PG_FUNCTION_INFO_V1(pg_stat_statements);
static void pgss_shmem_startup(void);
@@ -1025,7 +1037,25 @@ pgss_ProcessUtility(Node *parsetree, con
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage_start.aio_read_noneed;
+ bufusage.aio_read_discrd =
+ pgBufferUsage.aio_read_discrd - bufusage_start.aio_read_discrd;
+ bufusage.aio_read_forgot =
+ pgBufferUsage.aio_read_forgot - bufusage_start.aio_read_forgot;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage_start.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage_start.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage_start.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage_start.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage_start.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -1040,6 +1070,7 @@ pgss_ProcessUtility(Node *parsetree, con
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1223,6 +1254,16 @@ pgss_store(const char *query, uint32 que
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_discrd += bufusage->aio_read_discrd;
+ e->counters.aio_read_forgot += bufusage->aio_read_forgot;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1256,7 +1297,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
#define PG_STAT_STATEMENTS_COLS_V1_0 14
#define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS_V1_2 19
-#define PG_STAT_STATEMENTS_COLS 19 /* maximum of above */
+#define PG_STAT_STATEMENTS_COLS_V1_3 27
+#define PG_STAT_STATEMENTS_COLS 27 /* maximum of above */
/*
* Retrieve statement statistics.
@@ -1269,6 +1311,16 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
* function. Unfortunately we weren't bright enough to do that for 1.1.
*/
Datum
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+{
+ bool showtext = PG_GETARG_BOOL(0);
+
+ pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
+
+ return (Datum) 0;
+}
+
+Datum
pg_stat_statements_1_2(PG_FUNCTION_ARGS)
{
bool showtext = PG_GETARG_BOOL(0);
@@ -1357,6 +1409,10 @@ pg_stat_statements_internal(FunctionCall
if (api_version != PGSS_V1_2)
elog(ERROR, "incorrect number of output arguments");
break;
+ case PG_STAT_STATEMENTS_COLS_V1_3:
+ if (api_version != PGSS_V1_3)
+ elog(ERROR, "incorrect number of output arguments");
+ break;
default:
elog(ERROR, "incorrect number of output arguments");
}
@@ -1533,11 +1589,24 @@ pg_stat_statements_internal(FunctionCall
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
+
+ if (api_version >= PGSS_V1_3)
+ {
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_discrd);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_forgot);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+ }
}
Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
+ api_version == PGSS_V1_3 ? PG_STAT_STATEMENTS_COLS_V1_3 :
-1 /* fail if you forget to update this assert */ ));
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql.orig 2014-08-19 10:17:32.814616339 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql 2014-08-19 16:56:13.079194994 -0400
@@ -0,0 +1,51 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+ OUT userid oid,
+ OUT dbid oid,
+ OUT queryid bigint,
+ OUT query text,
+ OUT calls int8,
+ OUT total_time float8,
+ OUT rows int8,
+ OUT shared_blks_hit int8,
+ OUT shared_blks_read int8,
+ OUT shared_blks_dirtied int8,
+ OUT shared_blks_written int8,
+ OUT local_blks_hit int8,
+ OUT local_blks_read int8,
+ OUT local_blks_dirtied int8,
+ OUT local_blks_written int8,
+ OUT temp_blks_read int8,
+ OUT temp_blks_written int8,
+ OUT blk_read_time float8,
+ OUT blk_write_time float8
+ , OUT aio_read_noneed int8
+ , OUT aio_read_discrd int8
+ , OUT aio_read_forgot int8
+ , OUT aio_read_noblok int8
+ , OUT aio_read_failed int8
+ , OUT aio_read_wasted int8
+ , OUT aio_read_waited int8
+ , OUT aio_read_ontime int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_3'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE VIEW pg_stat_statements AS
+ SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
--- postgresql-prefetching-asyncio.README.orig 2014-08-19 10:17:32.814616339 -0400
+++ postgresql-prefetching-asyncio.README 2014-08-19 16:56:13.175195339 -0400
@@ -0,0 +1,618 @@
+Postgresql -- Extended Prefetching using Asynchronous IO
+============================================================
+
+Postgresql currently (9.3.4) provides a limited prefetching capability
+using posix_fadvise to give hints to the Operating System kernel
+about which pages it expects to read in the near future.
+This capability is used only during the heap-scan phase of bitmap-index scans.
+It is controlled via the effective_io_concurrency configuration parameter.
+
+This capability is now extended in two ways :
+ . use asynchronous IO into Postgresql shared buffers as an
+ alternative to posix_fadvise
+ . Implement prefetching in other types of scan :
+ . non-bitmap (i.e. simple) index scans - index pages
+ currently only for B-tree indexes.
+ (developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
+ . non-bitmap (i.e. simple) index scans - heap pages
+ currently only for B-tree indexes.
+ . simple heap scans
+
+Posix asynchronous IO is chosen as the function library for asynchronous IO,
+since this is well supported and also fits very well with the model of
+the prefetching process, particularly as regards checking for completion
+of an asynchronous read. On linux, Posix asynchronous IO is provided
+in the librt library. librt uses independently-schedulable threads to
+achieve the asynchronicity, rather than kernel functionality.
+
+In this implementation, use of asynchronous IO is limited to prefetching
+while performing one of the three types of scan
+ . B-tree bitmap index scan - heap pages (as already exists)
+ . B-tree non-bitmap (i.e. simple) index scans - index and heap pages
+ . simple heap scans
+on permanent relations. It is not used on temporary tables nor for writes.
+
+The advantages of Posix asynchronous IO into shared buffers
+compared to posix_fadvise are :
+ . Beneficial for non-sequential access patterns as well as sequential
+ . No restriction on the kinds of IO which can be used
+ (other kinds of asynchronous IO impose restrictions such as
+ buffer alignment, use of non-buffered IO).
+ . Does not interfere with standard linux kernel read-ahead functionality.
+ (It has been stated in
+ www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
+ that :
+ "the kernel stops doing read-ahead when a call to posix_fadvise comes.
+ I noticed the performance hit, and checked the kernel's code.
+ It effectively changes the prediction mode from sequential to fadvise,
+ negating the (assumed) kernel's prefetch logic")
+ . When the read request is issued after a prefetch has completed,
+ no delay associated with a kernel call to copy the page from
+ kernel page buffers into the Postgresql shared buffer,
+ since it is already there.
+ Also, in a memory-constrained environment, there is a greater
+ probability that the prefetched page will "stick" in memory
+ since the linux kernel victimizes the filesystem page cache in preference
+ to swapping out user process pages.
+ . Statistics on prefetch success can be gathered (see "Statistics" below)
+ which helps the administrator to tune the prefetching settings.
+
+These benefits are most likely to be obtained in a system whose usage profile
+(e.g. from iostat) shows:
+ . high IO wait from mostly-read activity
+ . disk access pattern is not entirely sequential
+ (so kernel readahead can't predict it but postgresql can)
+ . sufficient spare idle CPU to run the librt pthreads
+ or, stated another way, the CPU subsystem is relatively powerful
+ compared to the disk subsystem.
+In such ideal conditions, and with a workload with plenty of index scans,
+around 10% - 20% improvement in throughput has been achieved.
+In an admittedly extreme environment measured by this author, with a workload
+consisting of 8 client applications each running similar complex queries
+(same query structure but different predicates and constants),
+including 2 Bitmap Index Scans and 17 non-bitmap index scans,
+on a dual-core Intel laptop (4 hyperthreads) with the database on a single
+USB3-attached 500GB disk drive, and no part of the database in filesystem buffers
+initially, (filesystem freshly mounted), comparing unpatched build
+using posix_fadvise with effective_io_concurrency 4 against same build patched
+with async IO and effective_io_concurrency 4 and max_async_io_prefetchers 32,
+elapsed time repeatably improved from around 640-670 seconds to around 530-550 seconds,
+a 17% - 18% improvement.
+
+The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
+ . probably higher CPU utilization:
+ Firstly, the extra work performed by the librt threads adds CPU
+ overhead, and secondly, if the asynchronous prefetching is effective,
+ then it will deliver better (greater) overlap of CPU with IO, which
+ will reduce elapsed times and hence increase CPU utilization percentage
+ still more (during that shorter elapsed time).
+ . more context switching, because of the additional threads.
+
+
+Statistics:
+___________
+
+A number of additional statistics relating to effectiveness of asynchronous IO
+are provided as an extension of the existing pg_stat_statements loadable module.
+Refer to the appendix "Additional Supplied Modules" in the current
+PostgreSQL Documentation for details of this module.
+
+The following additional statistics are provided for asynchronous IO prefetching:
+
+ . aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
+ . aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
+ . aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
+ . aio_read_noblok : number of prefetches for which no available BufferAiocb control block
+ . aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
+ . aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
+ . aio_read_waited : number of aio reads for which disk block used but had to wait for it
+ . aio_read_ontime : number of aio reads for which disk block used and ready on time when requested
+
+Some of these are (hopefully) self-explanatory. Some additional notes:
+
+ . aio_read_discrd and aio_read_forgot :
+ prefetch was wasted work since the buffer was not subsequently read
+ The discrd case indicates that the scanner realized this and discarded the buffer,
+ whereas the forgot case indicates that the scanner did not realize it,
+ which should not normally occur.
+ A high number in either suggests lowering effective_io_concurrency.
+
+ . aio_read_noblok :
+ Any significant number in relation to all the other numbers indicates that
+ max_async_io_prefetchers should be increased.
+
+ . aio_read_waited :
+ The page was prefetched but the asynchronous read had not completed by the time the
+ scanner requested to read it. This causes extra overhead in waiting and indicates
+ prefetching is not providing much if any benefit.
+ The disk subsystem may be underpowered/overloaded in relation to the available CPU power.
+
+ . aio_read_ontime :
+ The page was prefetched and the asynchronous read had completed by the time the
+ scanner requested to read it. Optimal behaviour. If this number is large
+ in relation to all the other numbers except (possibly) aio_read_noneed,
+ then prefetching is working well.
+
+To create the extension with support for these additional statistics, use the following syntax:
+ CREATE EXTENSION pg_stat_statements VERSION '1.3'
+or, if you run the new code against an existing database which already has the extension
+( see installation and migration below ), you can
+ ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'
+
+A suggested set of commands for displaying these statistics might be :
+
+ /* OPTIONALLY */ DROP extension pg_stat_statements;
+ CREATE extension pg_stat_statements VERSION '1.3';
+ /* run your workload */
+ select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
+ , aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
+ from pg_stat_statements where shared_blks_read > 0;
+
+
+Installation and Build Configuration:
+_____________________________________
+
+1. First - a prerequisite:
+# as well as requiring all the usual package build tools such as gcc , make etc,
+# as described in the instructions for building postgresql,
+# the following is required :
+ gnu autoconf at version 2.69 :
+# run the following command
+autoconf -V
+# it *must* return
+autoconf (GNU Autoconf) 2.69
+
+2. If you don't have it or it is a different version,
+then you must obtain version 2.69 (which is the current version)
+from your distribution provider or from the gnu software download site.
+
+3. Also you must have the source tree for postgresql version 9.4 (development version).
+# all the following commands assume your current working directory is the top of the source tree.
+
+4. cd to top of source tree :
+# check it appears to be a postgresql source tree
+ls -ld configure.in src
+# should show both the file and the directory
+grep PostgreSQL COPYRIGHT
+# should show PostgreSQL Database Management System
+
+5. Apply the patch :
+patch -b -p0 -i <patch_file_path>
+# should report no errors, 48 files patched (see list at bottom of this README)
+# and all hunks applied
+# check the patch was applied to configure.in
+ls -ld configure.in.orig configure.in
+# should show both files
+
+6. Rebuild the configure script with the patched configure.in :
+mv configure configure.orig;
+autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
+ls -lrt configure.orig configure;
+
+7. run the new configure script :
+# if you have run configure before,
+# then you may first want to save existing config.status and config.log if they exist,
+# and then specify same configure flags and options as you specified before.
+# the patch does not alter or extend the set of configure options
+# if unsure, run ./configure --help
+# if still unsure, run ./configure
+./configure <other configure options as desired>
+
+
+
+8. now check that configure decided that this environment supports asynchronous IO :
+grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
+# it should show
+#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
+# if not, apparently your environment does not support asynch IO -
+# the config.log will show how it came to that conclusion,
+# also check for :
+# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
+# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
+# do not proceed without this define being set.
+
+9. do you want to use the new code on an existing cluster
+ that was created using the same code base but without the patch?
+ If so then run this nasty-looking command :
+ (cut-and-paste it into a terminal window or a shell-script file)
+ Otherwise continue to step 10.
+ see Migration note below for explanation.
+###############################################################################################
+ fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
+ bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
+ }; done;
+ if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
+ elif [[ -a $fl ]]; then {
+ mv $fl $bkfl && {
+ sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
+ str="diff -w $bkfl $fl";echo "$str"; eval "$str";
+ };
+ };
+ else echo "ooopppss $fl is missing";
+ fi;
+###############################################################################################
+# it should report something like
+diff -w Makefile.global.bak0 Makefile.global
+222c222
+< CFLAGS = XXXX
+---
+> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+# where XXXX is some set of flags
+
+
+10. now run the rest of the build process as usual -
+ follow instructions in file INSTALL if that file exists,
+ else e.g. run
+make && make install
+
+If the build fails with the following error:
+undefined reference to `aio_init'
+Then edit the following file
+src/include/pg_config_manual.h
+and add the following line at the bottom:
+
+#define DONT_HAVE_AIO_INIT
+
+and then run
+make clean && make && make install
+See notes to section Runtime Configuration below for more information on this.
+
+If you would like to use the sigevent mechanism for signalling completion
+of asynchronous io to non-originating backends, instead of the polling method,
+(see section Checking AIO Completion below)
+then add these lines to src/include/pg_config_manual.h
+
+#define USE_AIO_SIGEVENT 1
+#define AIO_SIGEVENT_SIGNALNUM SIGIO /* or signal num of your choice */
+
+Here's the context
+
+/*
+ * USE_PREFETCH code should be compiled only if we have a way to implement
+ * prefetching. (This is decoupled from USE_POSIX_FADVISE because there
+ * might in future be support for alternative low-level prefetch APIs --
+ * -- update October 2013 -- now there is such a new prefetch capability --
+ * async_io into postgres buffers - configuration parameter max_async_io_threads)
+ */
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+#define USE_AIO_SIGEVENT 1
+/* AIO_SIGEVENT_SIGNALNUM is the signal used to indicate completion
+ * of an aio operation. Choose a signal that is not used elsewhere
+ * in postgresql and which can be caught by signal handler.
+*/
+#define AIO_SIGEVENT_SIGNALNUM SIGIO
+#define USE_PREFETCH
+#endif
+
+
+
+
+Migration , Runtime Configuration, and Use:
+___________________________________________
+
+
+Database Migration:
+___________________
+
+The new prefetching code for non-bitmap index scans introduces a new btree-index
+function named btpeeknexttuple. The correct way to add such a function involves
+also adding it to the catalog as an internal function in pg_proc.
+However, this results in the new built code considering an existing database to be
+incompatible, i.e requiring backup on the old code and restore on the new.
+This is normal behaviour for migration to a new version of postgresql, and is
+also a valid way of migrating a database for use with this asynchronous IO feature,
+but in this case it may be inconvenient.
+
+As an alternative, the new code may be compiled with the macro define
+AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+which does what it says by not altering the catalog. The patched build can then
+be run against an existing database cluster initdb'd using the unpatched build.
+
+There are no known ill-effects of so doing, but :
+ . in any case, it is strongly suggested to make a backup of any precious database
+ before accessing it with a patched build
+ . be aware that if this asynchronous IO feature is eventually released as part of postgresql,
+ migration will probably be required anyway.
+
+This option to avoid catalog migration is intended as a convenience for a quick test,
+and also makes it easier to obtain performance comparisons on the same database.
+
+
+
+Runtime Configuration:
+______________________
+
+One new configuration parameter settable in postgresql.conf and
+in any other way as described in the postgresql documentation :
+
+max_async_io_prefetchers
+ Maximum number of background processes concurrently using asynchronous
+ librt threads to prefetch pages into shared memory buffers
+
+This number can be thought of as the maximum number
+of librt threads concurrently active, each working on a list of
+from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).
+
+In practice, this number simply controls how many prefetch requests in total
+may be active concurrently :
+ max_async_io_prefetchers * target_prefetch_pages ( see note 1)
+
+default is max_connections/6
+and recall that the default for max_connections is 100
+
+
+note 1 target_prefetch_pages is a number based on effective_io_concurrency, approximately n * ln(n)
+ where n is effective_io_concurrency
+
+note 2 Provided that the gnu extension to Posix AIO which provides the
+aio_init() function is present, then aio_init() is called
+to set the librt maximum number of threads to max_async_io_prefetchers,
+and to set the maximum number of concurrent aio read requests to the product of
+ max_async_io_prefetchers * target_prefetch_pages
+
+
+As well as this regular configuration parameter,
+there are several other parameters that can be set via environment variable.
+The reason why they are environment vars rather than regular configuration parameters
+is that it is not expected that they should need to be set, but they may be useful :
+ variable name values default meaning
+ PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
+ PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
+ also numeric size of list of prefetched blocks
+ also whether to prefetch forward-sequential-pattern index pages
+ PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
+ PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans
+
+
+The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
+It can be set to Y or N to control prefetching of non-bitmap index scans;
+But in addition it can be set to an integer, which both implies Y
+and also sets the size of a list used to remember prefetched but unread heap pages.
+This list is an optimization used to avoid re-prefetching and maximise the potential
+set of prefetchable blocks indexed by one index page.
+And if set to an integer, this integer may be followed by either ,Y or ,N
+to specify to prefetch index pages which are being accessed forward-sequentially.
+It has been found that prefetching is not of great benefit for this access pattern,
+and so it is not the default, but also does no harm (provided sufficient CPU capacity).
+
+
+
+Usage :
+______
+
+
+There are no changes in usage other than as noted under Configuration and Statistics.
+However, in order to assess benefit from this feature, it will be useful to
+understand the query access plans of your workload using EXPLAIN. Before doing that,
+make sure that statistics are up to date using ANALYZE.
+
+
+
+Internals:
+__________
+
+
+Internal changes span two areas and the interface between them :
+
+ . buffer manager layer
+ . programming interface for scanner to call buffer manager
+ . scanner layer
+
+ . buffer manager layer
+ ____________________
+
+ changes comprise :
+ . allocating, pinning , unpinning buffers
+ this is complex and discussed briefly below in "Buffer Management"
+ . acquiring and releasing a BufferAiocb, the control block
+ associated with a single aio_read, and checking for its completion
+ a new file, backend/storage/buffer/buf_async.c, provides three new functions,
+ BufStartAsync BufReleaseAsync BufCheckAsync
+ which handle this.
+ . calling librt asynch io functions
+ this follows the example of all other filesystem interfaces
+ and is straightforward.
+ two new functions are provided in fd.c:
+ FileStartaio FileCompleteaio
+ and corresponding interfaces in smgr.c
+
+ . programming interface for scanner to call buffer manager
+ ________________________________________________________
+ . calling interface for existing function PrefetchBuffer is modified :
+ . one new argument, BufferAccessStrategy strategy
+ . now returns an int return code which indicates :
+ whether pin count on buffer has been increased by 1
+ whether block was already present in a buffer
+ . new function DiscardBuffer
+ . discard buffer used for a previously prefetched page
+ which scanner decides it does not want to read.
+ . same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
+ . note - this is different from the existing function ReleaseBuffer
+ in that ReleaseBuffer takes a buffer_descriptor as argument
+ for a buffer which has been read, but has similar purpose.
+
+ . scanner layer
+ _____________
+ common to all scanners is that the scanner which wishes to prefetch must do two things:
+ . decide which pages to prefetch and call PrefetchBuffer to prefetch them
+ nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
+ . remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
+ removing each page from this list if and when it subsequently reads the page.
+ . at end of scan, call DiscardBuffer for every remembered (i.e. prefetched not unread) page
+ how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
+ . bitmap index scan - heap pages
+ . non-bitmap (i.e. simple) index scans - index pages
+ . non-bitmap (i.e. simple) index scans - heap pages
+ . simple heap scans
+ The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
+ . counted in aio_read_forgot (see "Statistics" above)
+ . may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
+ (the buffer is released at commit)
+ This does sometimes happen ...
+
+
+
+Buffer Management
+_________________
+
+With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
+but also every other part of buffer manager must know about the possibility that a buffer may be in
+a state of async_io_in_progress state and be prepared to determine the possible completion.
+That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
+Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
+namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
+and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+a separate set of shared control blocks, the BufferAiocb list -
+refer to include/storage/buf_internals.h
+Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
+Read the commentary for this function for more details.
+
+Checking AIO Completion
+_______________________
+There is a difference in how completion is checked depending on whether the backend doing the checking is :
+ . the same as the backend which started the asynch io (the "originator")
+ . a different backend ("non-originator")
+The "originator" case is most common and also simplest -
+ FileCompleteaio simply issues the appropriate aio_xxxx calls and suspends if not complete.
+The "non-originator" case is more complex and three methods are currently designed and implemented :
+ . polling the aiocb for completion
+ This is currently the default method, and the simplest,
+ but suffers from the CPU overhead of frequent polling.
+ . /* discarded approach .... LWlock method
+ use of LWlocks and sigevent to cause a signal to be delivered to the originator.
+ The originator locks eXclusive at aio start and releases after delivery of the signal.
+ The non-originator locks Shared and releases.
+ */
+ The above approach was discarded because :
+ A. it relied on the associated signal being delivered for every originated aio --
+ but it turns out that the kernel does not always deliver every
+ requested signal. This resulted in gradual depletion of aiocbs.
+ B. Furthermore, use of LWlocks for synchronization, although being consistent
+ with synchronous IO, prevents timely posting of waiters, because the LWLockRelease
+ call cannot be safely performed inside the signal handler and must be deferred to
+ the next CHECK_INTERRUPTS(), which might be delayed if e.g. the backend is itself
+ waiting on some LWlock.
+ . /* alternative approach ... */ sigevent + sigsuspend method
+ This method also uses sigevent to cause a signal to be delivered to the originator.
+ However, to overcome the problems associated with the discarded LWlock method,
+ signals are also used for wait/post:
+ A. Instead of the LWlock mechanism, waiters will wait by
+ chaining themselves onto a chain of PGPROCs anchored in the BAiocb,
+ and then calling sigsuspend() to wait to be posted by originator.
+ B. Each backend will keep track of which aios it has originated,
+ and, whenever any sigevent is delivered, check all outstanding aios
+ which it originated prior to and including the one associated with the sigevent.
+ For each completed aio, it will run the chain of waiter PROCs
+ and unchain and post each one by simple kill().
+ The same signal number can be used for this as for the sigevent,
+ by default SIGIO.
+ C. Note that although this method does not depend on every requested signal
+ being delivered, it *does* rely on the following :
+ for *every* originated aio_read, initiated at time T say,
+ at least one sigevent shall be delivered to originator at some time T+ > T
+
+
+Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
+and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
+When a backend BK2 requests ReadBuffer of a page for which asynch read is in progress,
+buffer manager has to determine which backend BK1 pinned this buffer during previous PrefetchBuffer,
+ and, for example, must not pin it a second time if BK2 is BK1.
+Information concerning which backend initiated the prefetch is held in the BufferAiocb.
+
+The trickiest case concerns the scenario in which :
+ . BK1 initiates prefetch and acquires a pin
+ . BK2 possibly waits for completion and then reads the buffer, and perhaps later on
+ releases it by ReleaseBuffer.
+ . Since the asynchronous IO is no longer in progress, there is no longer any
+ BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
+ "prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
+ . The solution to this problem is to invent the concept of a "banked" pin,
+ which is a pin obtained when prefetch was issued, identied as in "banked" status only if and when
+ the associated asynchronous IO terminates, and redeemable by the next use by same task,
+ either by ReadBuffer or DiscardBuffer.
+ The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
+ is stored in the buffer descriptor.
+ This is done without increasing size of the buffer descriptor, which is important since
+ there may be a very large number of these. This does overload the relevant field in the descriptor.
+ Refer to include/storage/buf_internals.h for more details
+ and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c
+
+______________________________________________________________________________
+The following 46 files are changed in this feature (output of the patch command) :
+
+patching file configure.in
+patching file contrib/pg_prewarm/pg_prewarm.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
+patching file contrib/pg_stat_statements/Makefile
+patching file contrib/pg_stat_statements/pg_stat_statements.c
+patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
+patching file postgresql-prefetching-asyncio.README
+patching file config/c-library.m4
+patching file src/backend/postmaster/postmaster.c
+patching file src/backend/executor/nodeBitmapHeapscan.c
+patching file src/backend/executor/nodeIndexscan.c
+patching file src/backend/executor/instrument.c
+patching file src/backend/storage/buffer/Makefile
+patching file src/backend/storage/buffer/bufmgr.c
+patching file src/backend/storage/buffer/buf_async.c
+patching file src/backend/storage/buffer/buf_init.c
+patching file src/backend/storage/smgr/md.c
+patching file src/backend/storage/smgr/smgr.c
+patching file src/backend/storage/file/fd.c
+patching file src/backend/storage/lmgr/proc.c
+patching file src/backend/access/heap/heapam.c
+patching file src/backend/access/heap/syncscan.c
+patching file src/backend/access/index/indexam.c
+patching file src/backend/access/index/genam.c
+patching file src/backend/access/nbtree/nbtsearch.c
+patching file src/backend/access/nbtree/nbtinsert.c
+patching file src/backend/access/nbtree/nbtpage.c
+patching file src/backend/access/nbtree/nbtree.c
+patching file src/backend/nodes/tidbitmap.c
+patching file src/backend/utils/misc/guc.c
+patching file src/backend/utils/mmgr/aset.c
+patching file src/include/executor/instrument.h
+patching file src/include/storage/bufmgr.h
+patching file src/include/storage/proc.h
+patching file src/include/storage/smgr.h
+patching file src/include/storage/fd.h
+patching file src/include/storage/buf_internals.h
+patching file src/include/catalog/pg_am.h
+patching file src/include/catalog/pg_proc.h
+patching file src/include/pg_config_manual.h
+patching file src/include/access/nbtree.h
+patching file src/include/access/heapam.h
+patching file src/include/access/relscan.h
+patching file src/include/nodes/tidbitmap.h
+patching file src/include/utils/rel.h
+patching file src/include/pg_config.h.in
+
+
+Future Possibilities:
+____________________
+
+There are several possible extensions of this feature :
+ . Extend prefetching of index scans to types of index
+ other than B-tree.
+ This should be fairly straightforward, but requires some
+ good base of benchmarkable workloads to prove the value.
+ . Investigate why asynchronous IO prefetching does not greatly
+ improve sequential relation heap scans and possibly find how to
+ achieve a benefit.
+ . Build knowledge of asynchronous IO prefetching into the
+ Query Planner costing.
+ This is far from straightforward. The Postgresql Query Planner's
+ costing model is based on resource consumption rather than elapsed time.
+ Use of asynchronous IO prefetching is intended to improve elapsed time
+ at the expense of (probably) higher resource consumption.
+ Although Costing understands about the reduced cost of reading buffered
+ blocks, it does not take asynchronicity or overlap of CPU with disk
+ into account. A naive approach might be to try to tweak the Query
+ Planner's Cost Constant configuration parameters
+ such as seq_page_cost , random_page_cost
+ but this is hazardous as explained in the Documentation.
+
+
+
+John Lumby, johnlumby(at)hotmail(dot)com
--- config/c-library.m4.orig 2014-08-18 14:10:36.733016131 -0400
+++ config/c-library.m4 2014-08-19 16:56:13.243195582 -0400
@@ -367,3 +367,152 @@ if test "$pgac_cv_type_locale_t" = 'yes
AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
[Define to 1 if `locale_t' requires <xlocale.h>.])
fi])])# PGAC_HEADER_XLOCALE
+
+
+# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
+# ---------------------------------------
+# test whether this system has both the librt-style async io and the gcc atomic compare_and_swap
+# and test operation of both,
+# including verifying that aio_error can retrieve completion status
+# of aio_read issued by a different process
+#
+AC_DEFUN([PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP],
+[AC_MSG_CHECKING([whether have both librt-style async io and the gcc atomic compare_and_swap])
+AC_CACHE_VAL(pgac_cv_aio_atomic_builtin_comp_swap,
+pgac_save_LIBS=$LIBS
+LIBS=" -lrt $pgac_save_LIBS"
+[AC_TRY_RUN([#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include "aio.h"
+#include <errno.h>
+
+char *shmem;
+
+/* returns rc of aio_read or -1 if some error */
+int
+processA(void)
+{
+ int fd , rc;
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ char *buf = shmem + sizeof(struct aiocb);
+
+ rc = fd = open("configure", O_RDONLY );
+ if (fd != -1) {
+
+ memset(aiocbp, 0, sizeof(struct aiocb));
+ aiocbp->aio_fildes = fd;
+ aiocbp->aio_offset = 0;
+ aiocbp->aio_buf = buf;
+ aiocbp->aio_nbytes = 8;
+ aiocbp->aio_reqprio = 0;
+ aiocbp->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+ rc = aio_read(aiocbp);
+ }
+ return rc;
+}
+
+/* returns result of aio_error - 0 if io completed successfully */
+int
+processB(void)
+{
+ struct aiocb *aiocbp = (struct aiocb *) shmem;
+ const struct aiocb * const pl[1] = { aiocbp };
+ int rv;
+ int returnCode;
+ struct timespec my_timeout = { 0 , 10000 };
+ int max_iters , max_polls;
+
+ rv = aio_error(aiocbp);
+ max_iters = 100;
+ while ( (max_iters-- > 0) && (rv == EINPROGRESS) ) {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ while ((returnCode < 0) && (EAGAIN == errno) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(pl , 1 , &my_timeout);
+ }
+ rv = aio_error(aiocbp);
+ }
+
+ return rv;
+}
+
+int main (int argc, char *argv[])
+ {
+ int rc;
+ int pidB;
+ int child_status;
+ struct aiocb volatile * first_aiocb;
+ struct aiocb volatile * second_aiocb;
+ struct aiocb volatile * my_aiocbp = (struct aiocb *)20000008;
+
+ first_aiocb = (struct aiocb *)20000008;
+ second_aiocb = (struct aiocb *)40000008;
+
+ /* first test -- __sync_bool_compare_and_swap
+ ** set zero as success if two comp-swaps both worked as expected -
+ ** first compares equal and swaps, second compares unequal
+ */
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ if (rc) {
+ rc = (__sync_bool_compare_and_swap (&my_aiocbp, first_aiocb, second_aiocb));
+ } else {
+ rc = -1;
+ }
+
+ if (rc == 0) {
+ /* second test -- process A start aio_read
+ ** and process B checks completion by polling
+ */
+ rc = -1; /* pessimistic */
+
+ shmem = mmap(NULL, sizeof(struct aiocb) + 2048,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
+ -1, 0);
+ if (shmem != MAP_FAILED) {
+
+ /*
+ * Start the I/O request in parent process, then fork and try to wait
+ * for it to finish from the child process.
+ */
+ rc = processA();
+ if (rc >= 0) {
+
+ rc = pidB = fork();
+ if (pidB != -1) {
+ if (pidB != 0) {
+ /* parent */
+ wait (&child_status);
+ if (WIFEXITED(child_status)) {
+ rc = WEXITSTATUS(child_status);
+ }
+ } else {
+ /* child */
+ rc = processB();
+ exit(rc);
+ }
+ }
+ }
+ }
+ }
+
+ return rc;
+}],
+[pgac_cv_aio_atomic_builtin_comp_swap=yes],
+[pgac_cv_aio_atomic_builtin_comp_swap=no],
+[pgac_cv_aio_atomic_builtin_comp_swap=cross])
+])dnl AC_CACHE_VAL
+AC_MSG_RESULT([$pgac_cv_aio_atomic_builtin_comp_swap])
+if test x"$pgac_cv_aio_atomic_builtin_comp_swap" != x"yes"; then
+LIBS=$pgac_save_LIBS
+fi
+])# PGAC_FUNC_AIO_ATOMIC_BUILTIN_COMP_SWAP
--- src/backend/postmaster/postmaster.c.orig 2014-08-18 14:10:36.925017033 -0400
+++ src/backend/postmaster/postmaster.c 2014-08-19 16:56:13.315195840 -0400
@@ -123,6 +123,11 @@
#include "storage/spin.h"
#endif
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+void ReportFreeBAiocbs(void);
+int CountInuseBAiocbs(void);
+extern int hwmBufferAiocbs; /* high water mark of in-use BufferAiocbs in pool */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Possible types of a backend. Beyond being the possible bkend_type values in
@@ -1489,9 +1494,15 @@ ServerLoop(void)
fd_set readmask;
int nSockets;
time_t now,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time,
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
last_touch_time;
last_touch_time = time(NULL);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ count_baiocb_time = time(NULL);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
nSockets = initMasks(&readmask);
@@ -1650,6 +1661,19 @@ ServerLoop(void)
last_touch_time = now;
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* maintain the hwm of used baiocbs every 10 seconds */
+ if ((now - count_baiocb_time) >= 10)
+ {
+ int inuseBufferAiocbs; /* current in-use BufferAiocbs in pool */
+ inuseBufferAiocbs = CountInuseBAiocbs();
+ if (inuseBufferAiocbs > hwmBufferAiocbs) {
+ hwmBufferAiocbs = inuseBufferAiocbs;
+ }
+ count_baiocb_time = now;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* If we already sent SIGQUIT to children and they are slow to shut
* down, it's time to send them SIGKILL. This doesn't happen
@@ -3440,6 +3464,9 @@ PostmasterStateMachine(void)
signal_child(PgStatPID, SIGQUIT);
}
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ReportFreeBAiocbs();
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
}
}
--- src/backend/executor/nodeBitmapHeapscan.c.orig 2014-08-18 14:10:36.869016769 -0400
+++ src/backend/executor/nodeBitmapHeapscan.c 2014-08-19 16:56:13.375196054 -0400
@@ -34,6 +34,8 @@
* ExecEndBitmapHeapScan releases all storage.
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "access/relscan.h"
#include "access/transam.h"
@@ -47,6 +49,10 @@
#include "utils/snapmgr.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching is to be done - 0 means all */
+extern unsigned int prefetch_bitmap_scans; /* whether to prefetch bitmap heap scans - 0 disables */
+#endif /* USE_PREFETCH */
static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
@@ -111,10 +117,21 @@ BitmapHeapNext(BitmapHeapScanState *node
node->tbmres = tbmres = NULL;
#ifdef USE_PREFETCH
- if (target_prefetch_pages > 0)
- {
+ if ( prefetch_bitmap_scans
+ && (target_prefetch_pages > 0)
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ /* sufficient number of blocks - at least twice the target_prefetch_pages */
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
node->prefetch_iterator = prefetch_iterator = tbm_begin_iterate(tbm);
node->prefetch_pages = 0;
+ if (prefetch_iterator) {
+ tbm_zero(prefetch_iterator); /* zero list of prefetched and unread blocknos */
+ }
node->prefetch_target = -1;
}
#endif /* USE_PREFETCH */
@@ -138,12 +155,14 @@ BitmapHeapNext(BitmapHeapScanState *node
}
#ifdef USE_PREFETCH
+ if (prefetch_iterator) {
if (node->prefetch_pages > 0)
{
/* The main iterator has closed the distance by one page */
node->prefetch_pages--;
+ tbm_subtract(prefetch_iterator, tbmres->blockno); /* remove this blockno from list of prefetched and unread blocknos */
}
- else if (prefetch_iterator)
+ else
{
/* Do not let the prefetch iterator get behind the main one */
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -151,6 +170,7 @@ BitmapHeapNext(BitmapHeapScanState *node
if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
elog(ERROR, "prefetch and main iterators are out of sync");
}
+ }
#endif /* USE_PREFETCH */
/*
@@ -239,16 +259,26 @@ BitmapHeapNext(BitmapHeapScanState *node
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
+ int PrefetchBufferRc; /* return value from PrefetchBuffer - refer to bufmgr.h */
+
if (tbmpre == NULL)
{
/* No more pages to prefetch */
- tbm_end_iterate(prefetch_iterator);
- node->prefetch_iterator = prefetch_iterator = NULL;
+ /* let ExecEndBitmapHeapScan terminate the prefetch_iterator
+ ** tbm_end_iterate(prefetch_iterator);
+ ** node->prefetch_iterator = NULL;
+ */
+ prefetch_iterator = NULL;
break;
}
node->prefetch_pages++;
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno , 0);
+ /* add this blockno to list of prefetched and unread blocknos
+ ** if pin count did not increase then indicate so in the Unread_Pfetched list
+ */
+ tbm_add(prefetch_iterator
+ ,( (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) ? tbmpre->blockno : InvalidBlockNumber ) );
}
}
#endif /* USE_PREFETCH */
@@ -482,12 +512,31 @@ ExecEndBitmapHeapScan(BitmapHeapScanStat
{
Relation relation;
HeapScanDesc scanDesc;
+ TBMIterator *prefetch_iterator;
/*
* extract information from the node
*/
relation = node->ss.ss_currentRelation;
scanDesc = node->ss.ss_currentScanDesc;
+ prefetch_iterator = node->prefetch_iterator;
+
+#ifdef USE_PREFETCH
+ /* before any other cleanup, discard any prefetched but unread buffers */
+ if (prefetch_iterator != NULL) {
+ TBMIterateResult *tbmpre = tbm_locate_IterateResult(prefetch_iterator);
+ BlockNumber *Unread_Pfetched_base = tbmpre->Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = tbmpre->Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = tbmpre->Unread_Pfetched_count;
+
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scanDesc->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* Free the exprcontext
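The bitmap heap scan changes above boil down to one piece of bookkeeping: remember every
block that was prefetched (and pinned) but not yet read, drop a block from that memory
when the main iterator reads it, and discard whatever is left when the scan ends. A
stripped-down, standalone illustration of that life cycle - the names below are the
sketch's own, not the patch's tbm_add/tbm_subtract/Unread_Pfetched API - might look like:

#include <stdio.h>

#define RING_SIZE 8                     /* stands in for target_prefetch_pages */

static unsigned int ring[RING_SIZE];
static unsigned int ring_next = 0;      /* oldest outstanding entry */
static unsigned int ring_count = 0;     /* number of outstanding entries */

/* remember a block we just prefetched; assumes at most RING_SIZE are
 * ever outstanding, which the caller is assumed to guarantee */
static void
remember_prefetched(unsigned int blockno)
{
    ring[(ring_next + ring_count) % RING_SIZE] = blockno;
    ring_count++;
}

/* the main iterator has caught up with the oldest prefetched block
 * (the patch removes a specific blockno via tbm_subtract; the sketch
 * just drops the oldest entry) */
static void
forget_oldest(void)
{
    if (ring_count > 0)
    {
        ring_next = (ring_next + 1) % RING_SIZE;
        ring_count--;
    }
}

/* at scan end, whatever is still outstanding must be discarded so the
 * banked pins are released - the real code calls DiscardBuffer here */
static void
drain_at_scan_end(void)
{
    while (ring_count > 0)
    {
        printf("DiscardBuffer(block %u)\n", ring[ring_next]);
        ring_next = (ring_next + 1) % RING_SIZE;
        ring_count--;
    }
}

int
main(void)
{
    remember_prefetched(10);
    remember_prefetched(11);
    remember_prefetched(12);
    forget_oldest();            /* block 10 was read normally */
    drain_at_scan_end();        /* blocks 11 and 12 are discarded */
    return 0;
}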
--- src/backend/executor/nodeIndexscan.c.orig 2014-08-18 14:10:36.869016769 -0400
+++ src/backend/executor/nodeIndexscan.c 2014-08-19 16:56:13.391196111 -0400
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching is to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* whether to prefetch index scans - 0 disables; also sizes the per-scan block item list */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,12 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+ /* note - at this point all scan control block resources have been freed by IndexScanEnd, called by index_endscan */
+
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +619,33 @@ ExecInitIndexScan(IndexScan *node, EStat
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ indexstate->iss_ScanDesc->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_block_item_list = (struct pfch_block_item*)0;
+ if ( prefetch_index_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_index_page_list = palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ indexstate->iss_ScanDesc->pfch_block_item_list = palloc( prefetch_index_scans * sizeof(struct pfch_block_item) );
+ if ( ( (struct pfch_index_pagelist*)0 != indexstate->iss_ScanDesc->pfch_index_page_list )
+ && ( (struct pfch_block_item*)0 != indexstate->iss_ScanDesc->pfch_block_item_list )
+ ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = prefetch_index_scans; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_pagelist_next = (struct pfch_index_pagelist*)0;
+ indexstate->iss_ScanDesc->pfch_index_page_list->pfch_index_item_count = 0;
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
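Both scan types gate the new behaviour on the same conditions: prefetching compiled in
and enabled for the scan type, target_prefetch_pages set, and an optional database OID
filter (plus, for bitmap heap scans only, a minimum relation size). A compact way to
picture that predicate - a sketch with invented names, not the patch's code:

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the patch's GUC-style variables (names are the sketch's own) */
static unsigned int prefetch_db_oid = 0;        /* 0 means "all databases" */
static unsigned int prefetch_this_scan_type = 1;
static int          target_prefetch_pages = 32;

/*
 * Should a scan on a relation of 'nblocks' blocks in database 'db_oid'
 * bother to set up prefetching?  Mirrors the shape of the tests in
 * BitmapHeapNext and ExecInitIndexScan (the relation-size check is only
 * applied in the bitmap heap case).
 */
static bool
prefetch_wanted(unsigned int db_oid, unsigned int nblocks)
{
    if (!prefetch_this_scan_type || target_prefetch_pages <= 0)
        return false;
    if (prefetch_db_oid != 0 && prefetch_db_oid != db_oid)
        return false;
    return nblocks > (unsigned int) (2 * target_prefetch_pages);
}

int
main(void)
{
    printf("small relation: %d\n", prefetch_wanted(16384, 40));
    printf("large relation: %d\n", prefetch_wanted(16384, 10000));
    return 0;
}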
--- src/backend/executor/instrument.c.orig 2014-08-18 14:10:36.869016769 -0400
+++ src/backend/executor/instrument.c 2014-08-19 16:56:13.415196197 -0400
@@ -41,6 +41,14 @@ InstrAlloc(int n, int instrument_options
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_discrd = 0;
+ instr[i].bufusage_start.aio_read_forgot = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -143,6 +151,16 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_discrd += add->aio_read_discrd - sub->aio_read_discrd;
+ dst->aio_read_forgot += add->aio_read_forgot - sub->aio_read_forgot;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
--- src/backend/storage/buffer/Makefile.orig 2014-08-18 14:10:36.933017070 -0400
+++ src/backend/storage/buffer/Makefile 2014-08-19 16:56:13.471196400 -0400
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o buf_async.o
include $(top_srcdir)/src/backend/common.mk
--- src/backend/storage/buffer/bufmgr.c.orig 2014-08-18 14:10:36.937017089 -0400
+++ src/backend/storage/buffer/bufmgr.c 2014-08-19 16:56:13.515196557 -0400
@@ -29,7 +29,7 @@
* buf_table.c -- manages the buffer lookup table
*/
#include "postgres.h"
-
+#include <sys/types.h>
#include <sys/file.h>
#include <unistd.h>
@@ -50,7 +50,6 @@
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
-
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -63,6 +62,17 @@
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+#if defined(USE_PREFETCH) && defined(USE_AIO_SIGEVENT)
+extern long num_startedaio;
+extern long num_cancelledaio;
+extern long num_signalledaio; /* count of number of signals delivered - counted by handlaiosignal */
+extern struct BAiocbIolock_chain_item volatile * volatile BAiocbIolock_anchor; /* anchor for chain of awaiting-release LWLock ptrs */
+#endif /* defined(USE_PREFETCH) && defined(USE_AIO_SIGEVENT) */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
#define DROP_RELS_BSEARCH_THRESHOLD 20
/* GUC variables */
@@ -78,26 +88,33 @@ bool track_io_timing = false;
*/
int target_prefetch_pages = 0;
-/* local state for StartBufferIO and related functions */
+/* local state for StartBufferIO and related functions
+** but ONLY for synchronous IO - not altered for aio
+*/
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+pid_t this_backend_pid = 0; /* pid of this backend */
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
-
-static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+extern int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+extern int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc, int intention
+ ,BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
- bool *hit);
-static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
-static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+ bool *hit , int index_for_aio);
+bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+void PinBuffer_Locked(volatile BufferDesc *buf);
+void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
-static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+static bool StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio );
+void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -106,25 +123,67 @@ static volatile BufferDesc *BufferAlloc(
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr);
+ int *foundPtr , int index_for_aio );
static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
- * This is named by analogy to ReadBuffer but doesn't actually allocate a
- * buffer. Instead it tries to ensure that a future ReadBuffer for the given
- * block will not be delayed by the I/O. Prefetching is optional.
+ * This is named by analogy to ReadBuffer but allocates a buffer only if using asynchronous I/O.
+ * Its purpose is to try to ensure that a future ReadBuffer for the given block
+ * will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
- */
-void
-PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
-{
+ *
+ * Originally the prefetch simply called posix_fadvise() to recommend read-ahead into kernel page cache.
+ * Extended to provide an alternative of issuing an asynchronous aio_read() to read into a buffer.
+ * This extension has an implication on how this bufmgr component manages concurrent requests
+ * for the same disk block.
+ *
+ * Synchronous IO (read()) does not provide a means for waiting on another task's read if in progress,
+ * and bufmgr implements its own scheme in StartBufferIO, WaitIO, and TerminateBufferIO.
+ *
+ * Asynchronous IO (aio_read()) provides a means for waiting on this or another task's read if in progress,
+ * namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
+ * are called as part of asynchronous prefetching, their role is limited to maintaining the buffer desc flags,
+ * and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
+ * a separate set of shared control blocks, the BufferAiocb list -
+ * refer to include/storage/buf_internals.h and storage/buffer/buf_init.c
+ *
+ * Another implication of asynchronous IO concerns buffer pinning.
+ * The buffer used for the prefetch is pinned before aio_read is issued.
+ * It is expected that the same task (and possibly others) will later ask to read the page
+ * and eventually release and unpin the buffer.
+ * However, if the task which issued the aio_read later decides not to read the page,
+ * and the return code indicates that the pin count was increased (PREFTCHRC_BUF_PIN_INCREASED, see below),
+ * then it *must* instead issue a DiscardBuffer() (see function later in this file)
+ * so that its pin is released.
+ * Therefore, each client which uses the PrefetchBuffer service must either always read all
+ * prefetched pages, or keep track of prefetched pages and discard unread ones at end of scan.
+ *
+ * return code: is an int bitmask defined in bufmgr.h
+ PREFTCHRC_BUF_PIN_INCREASED 0x01 pin count on buffer has been increased by 1
+ PREFTCHRC_BLK_ALREADY_PRESENT 0x02 block was already present in a buffer
+ *
+ * PREFTCHRC_BLK_ALREADY_PRESENT is a hint to caller that the prefetch may be unnecessary
+ */
+int
+PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy)
+{
+ Buffer buf_id; /* indicates buffer containing the requested block */
+ int PrefetchBufferRc = 0; /* return value as described above */
+ int PinCountOnEntry = 0; /* pin count on entry */
+ int PinCountdelta = 0; /* pin count delta increase */
+
+
#ifdef USE_PREFETCH
+
+ buf_id = -1;
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
@@ -147,7 +206,12 @@ PrefetchBuffer(Relation reln, ForkNumber
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
+ int BufStartAsyncrc = -1; /* retcode from BufStartAsync :
+ ** 0 if started successfully (which implies buffer was newly pinned )
+ ** -1 if failed for some reason
+ ** 1+PrivateRefCount if we found desired buffer in buffer pool
+ ** and we set it likewise if we find buffer in buffer pool
+ */
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
@@ -159,28 +223,121 @@ PrefetchBuffer(Relation reln, ForkNumber
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ if (buf_id >= 0) {
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ BufStartAsyncrc = 1 + PinCountOnEntry; /* indicate this backend's pin count - see above comment */
+ PrefetchBufferRc = PREFTCHRC_BLK_ALREADY_PRESENT; /* indicate buffer present */
+ } else {
+ PrefetchBufferRc = 0; /* indicate buffer not present */
+ }
LWLockRelease(newPartitionLock);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ not_in_buffers:
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (buf_id < 0) {
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* try using async aio_read with a buffer */
+ BufStartAsyncrc = BufStartAsync( reln, forkNum, blockNum , strategy );
+ if (BufStartAsyncrc < 0) {
+ pgBufferUsage.aio_read_noblok++;
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP so try the alternative that does not read the block into a postgresql buffer */
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ }
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
+ if ( (buf_id >= 0) || (BufStartAsyncrc >= 1) ) {
+ /* The block *is* in buffers. */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ pgBufferUsage.aio_read_noneed++;
+#ifndef USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT /* jury is out on whether the following wins but it ought to ... */
+ /*
+ ** If this backend already had pinned it,
+ ** or another backend had banked a pin on it,
+ ** or there is an IO in progress,
+ ** or it is not marked valid,
+ ** then do nothing.
+ ** Otherwise pin it and mark the buffer's pin as banked by this backend.
+ ** Note - it may or may not be pinned by another backend -
+ ** it is ok for us to bank a pin on it
+ ** *provided* the other backend did not bank its pin.
+ ** The reason for this is that the banked-pin indicator is global -
+ ** it can identify at most one process.
+ */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ if (BufStartAsyncrc == 1) { /* not pinned by me */
+ /* pgBufferUsage.aio_read_wasted++; overload counter - only for debugging */
+ /* note - all we can say with certainty is that the buffer is not pinned by me
+ ** we cannot be sure that it is still in buffer pool
+ ** so must go through the entire locking and searching all over again ...
*/
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = (Buffer)BufTableLookup(&newTag, newHash);
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ /* since the block is now present,
+ ** save the current pin count to ensure final delta is calculated correctly
+ */
+ PinCountOnEntry = PrivateRefCount[buf_id]; /* pin count on entry */
+ if ( PinCountOnEntry == 0) { /* paranoid check it's still not pinned by me */
+ volatile BufferDesc *buf_desc;
+
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ LockBufHdr(buf_desc);
+ if ( (buf_desc->flags & BM_VALID) /* buffer is valid */
+ && (!(buf_desc->flags & (BM_IO_IN_PROGRESS|BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))) /* buffer is not any of ... */
+ ) {
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* note - we can call PinBuffer_Locked with the BM_AIO_PREFETCH_PIN_BANKED flag set because it is not yet pinned by me */
+ buf_desc->freeNext = -(this_backend_pid); /* remember which pid banked it */
+ /* pgBufferUsage.aio_read_wasted--; overload counter - not wasted after all - only for debugging */
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ PinBuffer_Locked(buf_desc);
+ }
+ else {
+ UnlockBufHdr(buf_desc);
+ }
+ }
+ }
+ LWLockRelease(newPartitionLock);
+ /* although unlikely, maybe it was evicted while we were puttering about */
+ if (buf_id < 0) {
+ pgBufferUsage.aio_read_noneed--; /* back out the accounting */
+ goto not_in_buffers; /* and try again */
}
-#endif /* USE_PREFETCH */
}
+#endif /* USE_PREFETCH_BUT_DONT_PIN_ALREADY_PRESENT */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ }
+
+ if (buf_id >= 0) {
+ PinCountdelta = PrivateRefCount[buf_id] - PinCountOnEntry; /* pin count delta increase */
+ if ( (PinCountdelta < 0) || (PinCountdelta > 1) ) {
+ elog(ERROR,
+ "PrefetchBuffer #%d : incremented pin count by %d on bufdesc %p refcount %u localpins %d\n"
+ ,(buf_id+1) , PinCountdelta , &BufferDescriptors[buf_id] ,BufferDescriptors[buf_id].refcount , PrivateRefCount[buf_id]);
+}
+ } else
+ if (BufStartAsyncrc == 0) { /* aio started successfully (which implies buffer was newly pinned ) */
+ PinCountdelta = 1;
+ }
+
+ /* set final PrefetchBufferRc according to previous value */
+ PrefetchBufferRc |= PinCountdelta; /* set the PREFTCHRC_BUF_PIN_INCREASED bit */
+ }
+
+#endif /* USE_PREFETCH */
+
+ return PrefetchBufferRc; /* return value as described above */
+}
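The contract in the comment block above is worth restating from the caller's side: if
PrefetchBuffer reports PREFTCHRC_BUF_PIN_INCREASED, the caller must eventually pay that
pin back, either implicitly by reading the block or explicitly with DiscardBuffer. A
caller-side sketch (prefetch_then_settle and will_read_it are invented; the four-argument
PrefetchBuffer, PREFTCHRC_BUF_PIN_INCREASED and DiscardBuffer are the patch's):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch of the caller's obligation: any pin the prefetch banked must be
 * paid back, either by the eventual read or by DiscardBuffer.
 */
static void
prefetch_then_settle(Relation rel, BlockNumber blkno, bool will_read_it)
{
    int         rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno, NULL);

    if (will_read_it)
    {
        Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                             RBM_NORMAL, NULL);

        /* ... use the page; the banked pin is redeemed by the read ... */
        ReleaseBuffer(buf);
    }
    else if (rc & PREFTCHRC_BUF_PIN_INCREASED)
    {
        /* block abandoned: hand the banked pin back */
        DiscardBuffer(rel, MAIN_FORKNUM, blkno);
    }
}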
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -253,7 +410,7 @@ ReadBufferExtended(Relation reln, ForkNu
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
- forkNum, blockNum, mode, strategy, &hit);
+ forkNum, blockNum, mode, strategy, &hit , 0);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
@@ -281,7 +438,7 @@ ReadBufferWithoutRelcache(RelFileNode rn
Assert(InRecovery);
return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
- mode, strategy, &hit);
+ mode, strategy, &hit , 0);
}
@@ -289,15 +446,18 @@ ReadBufferWithoutRelcache(RelFileNode rn
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
+ * index_for_aio , if negative , is the negative of ( the index of the aiocb in the BufferAiocbs array + 3 )
+ * which is passed through to StartBufferIO
*/
-static Buffer
+Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
- BufferAccessStrategy strategy, bool *hit)
+ BufferAccessStrategy strategy, bool *hit , int index_for_aio )
{
volatile BufferDesc *bufHdr;
Block bufBlock;
bool found;
+ int allocrc; /* retcode from BufferAlloc */
bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
@@ -329,16 +489,40 @@ ReadBuffer_common(SMgrRelation smgr, cha
}
else
{
+ allocrc = mode; /* pass mode to BufferAlloc since it must not wait for async io if RBM_NOREAD_FOR_PREFETCH */
/*
* lookup the buffer. IO_IN_PROGRESS is set if the requested block is
* not currently in memory.
*/
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
- strategy, &found);
- if (found)
- pgBufferUsage.shared_blks_hit++;
+ strategy, &allocrc , index_for_aio );
+ if (allocrc < 0) {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ bufBlock = BufHdrGetBlock(bufHdr);
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page header in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rnode, forkNum))));
+ found = true;
+ }
+ else if (allocrc > 0) {
+ pgBufferUsage.shared_blks_hit++;
+ found = true;
+ }
+ else {
pgBufferUsage.shared_blks_read++;
+ found = false;
+ }
}
/* At this point we do NOT hold any locks. */
@@ -411,7 +595,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
Assert(bufHdr->flags & BM_VALID);
bufHdr->flags &= ~BM_VALID;
UnlockBufHdr(bufHdr);
- } while (!StartBufferIO(bufHdr, true));
+ } while (!StartBufferIO(bufHdr, true, 0));
}
}
@@ -431,6 +615,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+ if (mode != RBM_NOREAD_FOR_PREFETCH) {
if (isExtend)
{
/* new buffers are zero-filled */
@@ -500,6 +685,7 @@ ReadBuffer_common(SMgrRelation smgr, cha
VacuumPageMiss++;
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageMiss;
+ }
TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
@@ -521,21 +707,39 @@ ReadBuffer_common(SMgrRelation smgr, cha
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
- * The returned buffer is pinned and is already marked as holding the
- * desired page. If it already did have the desired page, *foundPtr is
- * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be used for any StartBufferIO performed by this routine.
+ * In this case, if block not found in buffer pool and we allocate a new buffer,
+ * then we must maintain the spinlock on the buffer and pass it back to caller.
+ *
+ * foundPtr is input and output :
+ * . input - indicates the read-buffer mode ( see bufmgr.h )
+ * . output - indicates the status of the buffer - see below
+ *
+ * Except for the case of RBM_NOREAD_FOR_PREFETCH and buffer is found,
+ * the returned buffer is pinned and is already marked as holding the
+ * desired page.
+ * If it already did have the desired page and page content is valid,
+ * *foundPtr is set to 1
+ * If it already did have the desired page and mode is RBM_NOREAD_FOR_PREFETCH
+ * and StartBufferIO returned false
+ * (meaning it could not initialise the buffer for aio)
+ * *foundPtr is set to 2
+ * If it already did have the desired page but page content is invalid,
+ * *foundPtr is set to -1
+ * this can happen only if the buffer was read by an async read
+ * and the aio is still in progress or pinned by the issuer of the startaio.
+ * Otherwise, *foundPtr is set to 0 and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
- *
- * No locks are held either at entry or exit.
+ * No locks are held either at entry or exit EXCEPT for case noted above
+ * of passing an empty buffer back to async io caller ( index_for_aio set ).
*/
static volatile BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
- bool *foundPtr)
+ int *foundPtr , int index_for_aio )
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
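To make the *foundPtr convention above a bit more concrete, here is how a caller could
interpret the four status values (illustration only; the real consumers are
ReadBuffer_common and the async start path, and the function name below is invented):

static const char *
explain_bufferalloc_status(int found)
{
    switch (found)
    {
        case -1:
            return "page present but content invalid: aio still in progress "
                   "or pinned by its issuer - caller treats it as a read error";
        case 0:
            return "buffer allocated and BM_IO_IN_PROGRESS set: caller must "
                   "perform the read";
        case 1:
            return "valid page already in the buffer: count a hit";
        case 2:
            return "RBM_NOREAD_FOR_PREFETCH and the buffer cannot be used: "
                   "abandon the prefetch";
        default:
            return "unexpected status";
    }
}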
@@ -547,6 +751,13 @@ BufferAlloc(SMgrRelation smgr, char relp
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ int IntentionBufferrc; /* retcode from BufCheckAsync */
+ bool StartBufferIOrc; /* retcode from StartBufferIO */
+ ReadBufferMode mode;
+
+
+ mode = *foundPtr;
+ *foundPtr = 0;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -561,21 +772,53 @@ BufferAlloc(SMgrRelation smgr, char relp
if (buf_id >= 0)
{
/*
- * Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * Found it.
*/
+ *foundPtr = 1;
buf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
+ /* If prefetch mode, then return immediately indicating found,
+ ** and NOTE in this case only, we did not pin buffer.
+ ** In theory we might try to check whether the buffer is valid, io in progress, etc
+ ** but in practice it is simpler to abandon the prefetch if the buffer exists
+ */
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ /* release the mapping lock and return */
LWLockRelease(newPartitionLock);
+ } else {
+ /* note that the current request is for same tag as the one associated with the aio -
+ ** so simply complete the aio and we have our buffer.
+ ** If an aio was started on this buffer,
+ ** check complete and wait for it if not.
+ ** And, if aio had been started, then the task
+ ** which issued the start aio already pinned it for this read,
+ ** so if that task was me and the aio was successful,
+ ** pass the current pin to this read without dropping and re-acquiring.
+ ** this is all done by BufCheckAsync
+ */
+ IntentionBufferrc = BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_WANT , strategy , index_for_aio , false , newPartitionLock );
- *foundPtr = TRUE;
+ /* check to see if the correct data has been loaded into the buffer. */
+ valid = (IntentionBufferrc == BUF_INTENT_RC_VALID);
- if (!valid)
- {
+ /* check for serious IO errors */
+ if (!valid) {
+ if ( (IntentionBufferrc != BUF_INTENT_RC_INVALID_NO_AIO)
+ && (IntentionBufferrc != BUF_INTENT_RC_INVALID_AIO)
+ ) {
+ *foundPtr = -1; /* inform caller of serious error */
+ }
+ else
+ if (IntentionBufferrc == BUF_INTENT_RC_INVALID_AIO) {
+ goto proceed_with_not_found; /* yes, I know, a goto ... think of it as a break out of the if */
+ }
+ }
+
+ /* BufCheckAsync pinned the buffer */
+ /* so can now release the mapping lock */
+ LWLockRelease(newPartitionLock);
+
+ if (!valid) {
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
@@ -583,19 +826,21 @@ BufferAlloc(SMgrRelation smgr, char relp
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(buf, true, index_for_aio))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ }
}
}
return buf;
}
+ proceed_with_not_found:
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
@@ -620,8 +865,10 @@ BufferAlloc(SMgrRelation smgr, char relp
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf->flags;
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ /* If an aio was started on this buffer,
+ ** check complete and cancel it if not.
+ */
+ BufCheckAsync(smgr , 0 , buf, BUF_INTENTION_REJECT_OBTAIN_PIN , 0 , index_for_aio, true , 0 );
/* Now it's safe to release the freelist lock */
if (lock_held)
@@ -792,13 +1039,18 @@ BufferAlloc(SMgrRelation smgr, char relp
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc)
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
- *foundPtr = FALSE;
+ *foundPtr = 0;
+ } else
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
}
}
@@ -861,10 +1113,17 @@ BufferAlloc(SMgrRelation smgr, char relp
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
- if (StartBufferIO(buf, true))
- *foundPtr = FALSE;
- else
- *foundPtr = TRUE;
+ StartBufferIOrc = StartBufferIO(buf, true , index_for_aio); /* retcode from StartBufferIO */
+ if (StartBufferIOrc) {
+ *foundPtr = 0;
+ } else {
+ if (mode == RBM_NOREAD_FOR_PREFETCH) {
+ UnpinBuffer(buf, true); /* must unpin if RBM_NOREAD_FOR_PREFETCH and found */
+ *foundPtr = 2; /* inform BufStartAsync that buffer must not be used */
+ } else {
+ *foundPtr = 1;
+ }
+ }
return buf;
}
@@ -971,6 +1230,10 @@ retry:
/*
* Insert the buffer at the head of the list of free buffers.
*/
+ /* avoid confusing freelist with strange-looking freeNext */
+ if (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN) { /* means was used for aiocb index */
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ }
StrategyFreeBuffer(buf);
}
@@ -1023,6 +1286,56 @@ MarkBufferDirty(Buffer buffer)
UnlockBufHdr(bufHdr);
}
+/* return the blocknum of the block in a buffer if it is valid
+** if a shared buffer, it must be pinned
+*/
+BlockNumber
+BlocknumOfBuffer(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ BlockNumber rc = 0;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc = bufHdr->tag.blockNum;
+ }
+
+ return rc;
+}
+
+/* report whether specified buffer contains same or different block
+** if a shared buffer, it must be pinned
+*/
+bool
+BlocknotinBuffer(Buffer buffer,
+ Relation relation,
+ BlockNumber blockNum)
+{
+ volatile BufferDesc *bufHdr;
+ bool rc = false;
+
+ if (BufferIsValid(buffer)) {
+ if (BufferIsLocal(buffer)) {
+ bufHdr = &LocalBufferDescriptors[-buffer - 1];
+ } else {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ }
+
+ rc =
+ ( (bufHdr->tag.blockNum != blockNum)
+ || (!(RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) ))
+ || (bufHdr->tag.forkNum != MAIN_FORKNUM)
+ );
+ }
+
+ return rc;
+}
+
/*
* ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
*
@@ -1041,18 +1354,18 @@ ReleaseAndReadBuffer(Buffer buffer,
Relation relation,
BlockNumber blockNum)
{
- ForkNumber forkNum = MAIN_FORKNUM;
volatile BufferDesc *bufHdr;
+ bool isDifferentBlock; /* requesting different block from that already in buffer ? */
if (BufferIsValid(buffer))
{
+ /* if a shared buff, we have pin, so it's ok to examine tag without spinlock */
+ isDifferentBlock = BlocknotinBuffer(buffer,relation,blockNum); /* requesting different block from that already in buffer ? */
if (BufferIsLocal(buffer))
{
Assert(LocalRefCount[-buffer - 1] > 0);
bufHdr = &LocalBufferDescriptors[-buffer - 1];
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ if (!isDifferentBlock)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
LocalRefCount[-buffer - 1]--;
@@ -1061,12 +1374,12 @@ ReleaseAndReadBuffer(Buffer buffer,
{
Assert(PrivateRefCount[buffer - 1] > 0);
bufHdr = &BufferDescriptors[buffer - 1];
- /* we have pin, so it's ok to examine tag without spinlock */
- if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
- bufHdr->tag.forkNum == forkNum)
+ BufCheckAsync(0 , relation , bufHdr , ( isDifferentBlock ? BUF_INTENTION_REJECT_FORGET
+ : BUF_INTENTION_REJECT_KEEP_PIN )
+ , 0 , 0 , false , 0 ); /* end any IO and maybe unpin */
+ if (!isDifferentBlock) {
return buffer;
- UnpinBuffer(bufHdr, true);
+ }
}
}
@@ -1091,11 +1404,12 @@ ReleaseAndReadBuffer(Buffer buffer,
* Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows
* some callers to avoid an extra spinlock cycle.
*/
-static bool
+bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
int b = buf->buf_id;
bool result;
+ bool pin_already_banked_by_me = 0; /* buffer is already pinned by me and redeemable */
if (PrivateRefCount[b] == 0)
{
@@ -1117,12 +1431,34 @@ PinBuffer(volatile BufferDesc *buf, Buff
else
{
/* If we previously pinned the buffer, it must surely be valid */
+ /* Err - is that really true? I don't think so:
+ ** what if I pin, start an IO that is still in progress, and then mistakenly pin again?
result = true;
+ */
+ LockBufHdr(buf);
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
}
+ }
+ result = (buf->flags & BM_VALID) != 0;
+ UnlockBufHdr(buf);
+ }
+
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
return result;
}
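One observation on the pin-banking changes in PinBuffer above and in PinBuffer_Locked,
UnpinBuffer and IncrBufferRefCount below: the test for "this buffer carries a pin banked
by my backend" is spelled out four times. If the patch were revised it could be
centralised in a small helper, roughly as below - PinIsBankedByMe is not in the patch,
it assumes the patch's BufferDesc fields and the BAiocbAnchr / FREENEXT_BAIOCB_ORIGIN /
this_backend_pid definitions, the caller must hold the buffer header spinlock, and
PinBuffer would additionally check its local PrivateRefCount:

#include "postgres.h"

#include "storage/buf_internals.h"

extern volatile struct BAiocbAnchor *BAiocbAnchr;
extern pid_t this_backend_pid;

/* does this buffer carry a pin banked by my backend? (sketch only) */
static bool
PinIsBankedByMe(volatile BufferDesc *buf)
{
    pid_t       banker;

    if (!(buf->flags & BM_AIO_PREFETCH_PIN_BANKED))
        return false;

    if (buf->flags & BM_AIO_IN_PROGRESS)
        banker = ((BAiocbAnchr->BufferAiocbs)
                  + (FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio;
    else
        banker = -(buf->freeNext);

    return banker == this_backend_pid;
}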
@@ -1139,19 +1475,36 @@ PinBuffer(volatile BufferDesc *buf, Buff
* to save a spin lock/unlock cycle, because we need to pin a buffer before
* its state can change under us.
*/
-static void
+void
PinBuffer_Locked(volatile BufferDesc *buf)
{
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (PrivateRefCount[b] == 0)
+ pin_already_banked_by_me = ( (PrivateRefCount[b] > 0) && (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ if (PrivateRefCount[b] == 0) {
buf->refcount++;
+ }
+ if (pin_already_banked_by_me) {
+ elog(LOG, "PinBuffer_Locked : on buffer %d with banked pin rel=%s, blockNum=%u, with flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[b]++;
Assert(PrivateRefCount[b] > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
}
+}
/*
* UnpinBuffer -- make buffer available for replacement.
@@ -1161,29 +1514,68 @@ PinBuffer_Locked(volatile BufferDesc *bu
* Most but not all callers want CurrentResourceOwner to be adjusted.
* Those that don't should pass fixOwner = FALSE.
*/
-static void
+void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
+
int b = buf->buf_id;
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
- if (fixOwner)
+ if (fixOwner) {
ResourceOwnerForgetBuffer(CurrentResourceOwner,
BufferDescriptorGetBuffer(buf));
+ }
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
{
+
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
LockBufHdr(buf);
+ /* this backend has released last pin - buffer should not have pin banked by me
+ ** and if AIO in progress then there should be another backend pin
+ */
+ pin_already_banked_by_me = ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS)
+ ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext))
+ ) == this_backend_pid )
+ );
+ if (pin_already_banked_by_me) {
+ /* this is a strange situation - caller had a banked pin (which callers are supposed not to know about)
+ ** but either discovered it had it or has over-counted how many pins it has
+ */
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the pin although it is now of no use since about to release */
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+
+ /* temporarily suppress logging error to avoid performance degradation -
+ ** either this task really does not need the buffer in which case the error is harmless
+ ** or a more severe error will be detected later (possibly immediately below)
+ elog(LOG, "UnpinBuffer : released last this-backend pin on buffer %d rel=%s, blockNum=%u, but had banked pin flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ */
+ }
+
/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;
+ if ( (buf->refcount == 0) && (buf->flags & BM_AIO_IN_PROGRESS) ) {
+
+ elog(ERROR, "UnpinBuffer : released last any-backend pin on buffer %d rel=%s, blockNum=%u, but AIO in progress flags %X refcount=%u"
+ ,buf->buf_id,relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum)
+ ,buf->tag.blockNum, buf->flags, buf->refcount);
+ }
+
+
/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
@@ -1658,6 +2050,7 @@ SyncOneBuffer(int buf_id, bool skip_rece
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
+
/*
* Check whether buffer needs writing.
*
@@ -1724,6 +2117,16 @@ void
InitBufferPoolBackend(void)
{
on_shmem_exit(AtProcExit_Buffers, 0);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* init the aio subsystem max number of threads and max number of requests
+ ** max number of threads <--> max_async_io_prefetchers
+ ** max number of requests <--> numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers)
+ ** there is no return code so we just hope.
+ */
+ smgrinitaio(max_async_io_prefetchers , numBufferAiocbs);
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
}
/*
@@ -1736,6 +2139,11 @@ AtProcExit_Buffers(int code, Datum arg)
AbortBufferIO();
UnlockBuffers();
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on BAiocb lock instead of polling */
+ elog(LOG,"backend %d started %ld aios , cancelled %ld , signalled %ld\n"
+ ,this_backend_pid,num_startedaio,num_cancelledaio ,num_signalledaio);
+#endif /* USE_AIO_SIGEVENT */
+
CheckForBufferLeaks();
/* localbuf.c needs a chance too */
@@ -1779,6 +2187,8 @@ PrintBufferLeakWarning(Buffer buffer)
char *path;
BackendId backend;
+
+
Assert(BufferIsValid(buffer));
if (BufferIsLocal(buffer))
{
@@ -1789,12 +2199,28 @@ PrintBufferLeakWarning(Buffer buffer)
else
{
buf = &BufferDescriptors[buffer - 1];
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* If reason that this buffer is pinned
+ ** is that it was prefetched with async_io
+ ** and never read or discarded, then omit the
+ ** warning, because this is expected in some
+ ** cases when a scan is closed abnormally.
+ ** Note that the buffer will be released soon by our caller.
+ */
+ if (buf->flags & BM_AIO_PREFETCH_PIN_BANKED) {
+ pgBufferUsage.aio_read_forgot++; /* account for it */
+ return;
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
loccount = PrivateRefCount[buffer - 1];
backend = InvalidBackendId;
}
+/* #if defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
/* theoretically we should lock the bufhdr here */
path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+
+
elog(WARNING,
"buffer refcount leak: [%03d] "
"(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1802,6 +2228,7 @@ PrintBufferLeakWarning(Buffer buffer)
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
pfree(path);
+/* #endif defined(DO_ISCAN_DISCARD_BUFFER) && defined(DO_HEAPSCAN_DISCARD_BUFFER) not yet working safely 130530 */
}
/*
@@ -1918,7 +2345,7 @@ FlushBuffer(volatile BufferDesc *buf, SM
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
- if (!StartBufferIO(buf, false))
+ if (!StartBufferIO(buf, false, 0))
return;
/* Setup error traceback support for ereport() */
@@ -2502,6 +2929,70 @@ FlushDatabaseBuffers(Oid dbid)
}
}
+#ifdef USE_PREFETCH
+/*
+ * DiscardBuffer -- discard shared buffer used for a previously
+ * prefetched but unread block of a relation
+ *
+ * If the buffer is found and pinned with a banked pin, then :
+ * . if AIO in progress, terminate AIO without waiting
+ * . if AIO had already completed successfully,
+ * then mark buffer valid (in case someone else wants it)
+ * . redeem the banked pin and unpin it.
+ *
+ * This function is similar in purpose to ReleaseBuffer (below)
+ * but sufficiently different that it is a separate function.
+ * Two important differences are :
+ * . caller identifies buffer by blocknumber, not buffer number
+ * . we unpin buffer *only* if the pin is banked,
+ * *never* if pinned but not banked.
+ * This is essential as caller may perform a sequence of
+ * SCAN1 . PrefetchBuffer (and remember block was prefetched)
+ * SCAN2 . ReadBuffer (but fails to connect this read to the prefetch by SCAN1)
+ * SCAN1 . DiscardBuffer (SCAN1 terminates early)
+ * SCAN2 . access tuples in buffer
+ * Clearly the Discard *must not* unpin the buffer since SCAN2 needs it!
+ *
+ *
+ * caller may pass InvalidBlockNumber as blockNum to mean do nothing
+ */
+void
+DiscardBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLockId newPartitionLock; /* buffer partition lock for it */
+ Buffer buf_id;
+ volatile BufferDesc *buf_desc;
+
+ if (!SmgrIsTemp(reln->rd_smgr)) {
+ Assert(RelationIsValid(reln));
+ if (BlockNumberIsValid(blockNum)) {
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If in buffers, proceed */
+ if (buf_id >= 0) {
+ buf_desc = &BufferDescriptors[buf_id]; /* found buffer descriptor */
+ BufCheckAsync(0 , reln, buf_desc , BUF_INTENTION_REJECT_UNBANK , 0 , 0 , false , 0); /* end the IO and unpin if banked */
+ pgBufferUsage.aio_read_discrd++; /* account for it */
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/*
* ReleaseBuffer -- release the pin on a buffer
*/
@@ -2510,26 +3001,23 @@ ReleaseBuffer(Buffer buffer)
{
volatile BufferDesc *bufHdr;
+
if (!BufferIsValid(buffer))
elog(ERROR, "bad buffer ID: %d", buffer);
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
if (BufferIsLocal(buffer))
{
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]--;
return;
}
-
- bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- PrivateRefCount[buffer - 1]--;
else
- UnpinBuffer(bufHdr, false);
+ {
+ bufHdr = &BufferDescriptors[buffer - 1];
+ BufCheckAsync(0 , 0 , bufHdr , BUF_INTENTION_REJECT_NOADJUST , 0 , 0 , false , 0 );
+ }
}
/*
@@ -2555,14 +3043,41 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
+ bool pin_already_banked_by_me = false; /* buffer is already pinned by me and redeemable */
+ volatile BufferDesc *buf; /* descriptor for a shared buffer */
+
Assert(BufferIsPinned(buffer));
+
+ if (!(BufferIsLocal(buffer))) {
+ buf = &BufferDescriptors[buffer - 1];
+ LockBufHdr(buf);
+ pin_already_banked_by_me =
+ ( (buf->flags & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (buf->flags & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio )
+ : (-(buf->freeNext)) ) == this_backend_pid )
+ );
+ }
+
+ if (!pin_already_banked_by_me) {
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
+ }
+
if (BufferIsLocal(buffer))
LocalRefCount[-buffer - 1]++;
- else
+ else {
+ if (pin_already_banked_by_me) {
+ buf->flags &= ~BM_AIO_PREFETCH_PIN_BANKED;
+ if (!(buf->flags & BM_AIO_IN_PROGRESS)) {
+ buf->freeNext = FREENEXT_NOT_IN_LIST; /* forget the bank client */
+ }
+ }
+ UnlockBufHdr(buf);
+ if (!pin_already_banked_by_me) {
PrivateRefCount[buffer - 1]++;
}
+ }
+}
/*
* MarkBufferDirtyHint
@@ -2984,61 +3499,138 @@ WaitIO(volatile BufferDesc *buf)
*
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
- * has already started I/O on this buffer then we will block on the
+ * has already started synchronous I/O on this buffer then we will block on the
* io_in_progress lock until he's done.
*
+ * if an async io is in progress and we are doing synchronous io,
+ * then readbuffer uses call to smgrcompleteaio to wait,
+ * and so we treat this request as if no io in progress
+ *
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
* so we can always tell if the work is already done.
*
+ * index_for_aio is an input parm which, if non-zero, identifies a BufferAiocb
+ * acquired by caller and to be attached to the buffer header for use with async io
+ *
* Returns TRUE if we successfully marked the buffer as I/O busy,
* FALSE if someone else already did the work.
*/
static bool
-StartBufferIO(volatile BufferDesc *buf, bool forInput)
+StartBufferIO(volatile BufferDesc *buf, bool forInput , int index_for_aio )
{
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ if (!index_for_aio)
Assert(!InProgressBuf);
for (;;)
{
+ if (!index_for_aio) {
/*
* Grab the io_in_progress lock so that other processes can wait for
* me to finish the I/O.
*/
LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
+ }
LockBufHdr(buf);
- if (!(buf->flags & BM_IO_IN_PROGRESS))
+ /* the following test is intended to distinguish between :
+ ** . buffer which :
+ ** . has io in progress
+ ** AND is not associated with a current or recent aio
+ ** . everything else
+ ** Here, "recent" means an aio marked by buf->freeNext <= FREENEXT_BAIOCB_ORIGIN but no longer in progress -
+ ** this situation arises when the aio has just been cancelled and this process now wishes to recycle the buffer.
+ ** In this case, the first such would-be recycler (i.e. me) must :
+ ** . avoid waiting for the cancelled aio to complete
+ ** . if not myself doing async read, then assume responsibility for posting other future readbuffers.
+ */
+ if ( (buf->flags & BM_AIO_IN_PROGRESS)
+ || (!(buf->flags & BM_IO_IN_PROGRESS))
+ )
break;
/*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
+ * The only way BM_IO_IN_PROGRESS without AIO in progress could be set when the io_in_progress
* lock isn't held is if the process doing the I/O is recovering from
* an error (see AbortBufferIO). If that's the case, we must wait for
* him to get unwedged.
*/
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
WaitIO(buf);
}
- /* Once we get here, there is definitely no I/O active on this buffer */
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* Once we get here, there is definitely no synchronous I/O active on this buffer
+ ** but if being asked to attach a BufferAiocb to the buf header,
+ ** then we must also check if there is any async io currently
+ ** in progress or pinned started by a different task.
+ */
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext);
+ if ( (buf->flags & (BM_AIO_IN_PROGRESS|BM_AIO_PREFETCH_PIN_BANKED))
+ && (buf->freeNext <= FREENEXT_BAIOCB_ORIGIN)
+ && (BAiocb->pidOfAio != this_backend_pid)
+ ) {
+ /* someone else already doing async I/O */
+ UnlockBufHdr(buf);
+ return false;
+ }
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
if (forInput ? (buf->flags & BM_VALID) : !(buf->flags & BM_DIRTY))
{
/* someone else already did the I/O */
UnlockBufHdr(buf);
+ if (!index_for_aio) {
LWLockRelease(buf->io_in_progress_lock);
+ }
return false;
}
buf->flags |= BM_IO_IN_PROGRESS;
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ if (index_for_aio) {
+ BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - index_for_aio);
+ /* insist that no other buffer is using this BufferAiocb for async IO */
+ if (BAiocb->BAiocbbufh == (struct sbufdesc *)0) {
+ BAiocb->BAiocbbufh = buf;
+ }
+ if (BAiocb->BAiocbbufh != buf) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block %p to be used by %p already in use by %p"
+ ,BAiocb ,buf , BAiocb->BAiocbbufh)));
+ }
+ /* note - there is no need to register self as a dependent of BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ buf->flags |= BM_AIO_IN_PROGRESS;
+ buf->freeNext = index_for_aio;
+ /* at this point, this buffer appears to have an in-progress aio_read,
+ ** and any other task which is able to look inside the buffer might try waiting on that aio -
+ ** except we have not yet issued the aio! So we must keep the buffer header locked
+ ** from here all the way back to the BufStartAsync caller
+ */
+ } else {
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
UnlockBufHdr(buf);
InProgressBuf = buf;
IsForInput = forInput;
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ }
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
return true;
}
@@ -3048,7 +3640,7 @@ StartBufferIO(volatile BufferDesc *buf,
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
+ * if no async IO is in progress, then we hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
@@ -3060,26 +3652,32 @@ StartBufferIO(volatile BufferDesc *buf,
* BM_IO_ERROR in a failure case. For successful completion it could
* be 0, or BM_VALID if we just finished reading in the page.
*/
-static void
+void
TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
int set_flag_bits)
{
- Assert(buf == InProgressBuf);
+ int flags_on_entry;
LockBufHdr(buf);
+ flags_on_entry = buf->flags;
+
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) )
+ Assert( buf == InProgressBuf );
+
Assert(buf->flags & BM_IO_IN_PROGRESS);
- buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
+ buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
+ if (! (flags_on_entry & BM_AIO_IN_PROGRESS) ) {
InProgressBuf = NULL;
-
LWLockRelease(buf->io_in_progress_lock);
}
+}
/*
* AbortBufferIO: Clean up any active buffer I/O after an error.
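The comment at the top of the new PrefetchBuffer notes that, unlike the synchronous path
(StartBufferIO/WaitIO), an aio_read gives any waiter a real kernel primitive to wait on.
As a refresher, outside of any PostgreSQL machinery, waiting on a POSIX aio request looks
roughly like this standalone program (same shape as the configure probe at the top of the
patch; whether a *different process* may wait on a shared-memory aiocb this way is exactly
what that probe tests):

/* cc -o aiowait aiowait.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char        buf[512];
    struct aiocb cb;
    const struct aiocb *list[1];
    int         status;
    ssize_t     nread;
    int         fd = open(argv[0], O_RDONLY);   /* any readable file will do */

    (void) argc;
    if (fd < 0)
        return 1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0)
        return 1;

    /*
     * Wait for completion: aio_error reports EINPROGRESS until the read
     * finishes, and aio_suspend blocks (with a short timeout) until it
     * might be done.
     */
    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
    {
        struct timespec t = {0, 10000};         /* 10 usec */

        (void) aio_suspend(list, 1, &t);
    }

    status = aio_error(&cb);
    nread = aio_return(&cb);
    printf("status %d, read %zd bytes\n", status, nread);
    close(fd);
    return 0;
}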
--- src/backend/storage/buffer/buf_async.c.orig 2014-08-19 10:17:32.822616336 -0400
+++ src/backend/storage/buffer/buf_async.c 2014-08-19 16:56:13.535196628 -0400
@@ -0,0 +1,1203 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_async.c
+ * buffer manager asynchronous disk read routines
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/buffer/buf_async.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * Principal entry points:
+ *
+ * BufStartAsync() -- start an asynchronous read of a block into a buffer and
+ * pin it so that no one can destroy it while this process is using it.
+ *
+ * BufCheckAsync() -- check completion of an asynchronous read
+ * and either claim buffer or discard it
+ *
+ * Private helper
+ *
+ * BufReleaseAsync() -- release the BAiocb resources used for an asynchronous read
+ *
+ * See also these files:
+ * bufmgr.c -- main buffer manager functions
+ * buf_init.c -- initialisation of resources
+ */
+#include "postgres.h"
+#include <sys/types.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "catalog/catalog.h"
+#include "common/relpath.h"
+#include "executor/instrument.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/smgr.h"
+#include "storage/standby.h"
+#include "utils/rel.h"
+#include "utils/resowner_private.h"
+
+/*
+ * GUC parameters
+ */
+int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
+#if defined(USE_AIO_SIGEVENT)
+long num_startedaio = 0;
+long num_cancelledaio = 0;
+long num_signalledaio = 0; /* count of signals delivered - incremented by BufHandlSigAsync */
+unsigned long BAiocbaioOrdinal = 0; /* ordinal of most recent aio originated by me */
+#endif /* USE_AIO_SIGEVENT */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr)))
+
+extern volatile struct BAiocbAnchor *BAiocbAnchr; /* anchor for all control blocks pertaining to aio */
+extern int numBufferAiocbs; /* total number of BufferAiocbs in pool */
+extern int maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+extern int maxRelBAiocbTries; /* max times we will try to release a BufferAiocb back to freelist */
+extern pid_t this_backend_pid; /* pid of this backend */
+
+extern bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
+extern void PinBuffer_Locked(volatile BufferDesc *buf);
+extern Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
+ ForkNumber forkNum, BlockNumber blockNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy,
+ bool *hit , int index_for_aio);
+extern void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
+extern void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
+ int set_flag_bits);
+int BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, volatile BufferDesc *buf_desc
+ ,int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock );
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy);
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+/* PoorMans atomics forward decls */
+static int PoorMLock(volatile int *intP);
+static int PoorMUnLock(volatile int *intP);
+static int PoorMIncr(volatile int *intP);
+static int PoorMDecr(volatile int *intP);
+void BufPostAioWaiters(struct BufferAiocb volatile * BAiocb);
+void BufHandlSigAsync(int sigsent, siginfo_t * siginfP, void *ucontext);
+int BufAWaitAioCompletion (struct BufferAiocb *BAiocb);
+#endif /* USE_AIO_SIGEVENT */
+
+static struct BufferAiocb volatile * cachedBAiocb = (struct BufferAiocb*)0; /* process-local chain of cached BufferAiocbs, reused when one cannot be returned to the shared freelist */
+
+/* BufReleaseAsync releases a BufferAiocb and returns 0 if successful else non-zero
+** it *must* be called :
+** EITHER with a valid BAiocb->BAiocbbufh -> buf_desc
+** and that buf_desc must be spin-locked
+** OR with BAiocb->BAiocbbufh == 0
+*/
+static int
+BufReleaseAsync(struct BufferAiocb volatile * BAiocb)
+{
+ int LockTries; /* max times we will try to release the BufferAiocb */
+ volatile struct BufferAiocb *BufferAiocbs;
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+
+ int failed = 1; /* by end of this function, non-zero will indicate if we failed to return the BAiocb */
+
+
+ if ( ( BAiocb == (struct BufferAiocb*)0 )
+ || ( BAiocb == (struct BufferAiocb*)BAIOCB_OCCUPIED )
+ || ( ((unsigned long)BAiocb) & 0x1 )
+ ) {
+ elog(ERROR,
+ "AIO control block corruption on release of aiocb %p - invalid BAiocb"
+ ,BAiocb);
+ }
+ else
+ if ( (0 == BAiocb->BAiocbDependentCount) /* no dependents */
+ && ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) /* not already on freelist */
+ ) {
+
+ if ((struct sbufdesc*)0 != BAiocb->BAiocbbufh) { /* if a buffer was attached */
+ volatile BufferDesc *buf_desc = BAiocb->BAiocbbufh;
+
+ /* spinlock held so instead of TerminateBufferIO(buf, false , 0); ... */
+ if (buf_desc->flags & BM_AIO_PREFETCH_PIN_BANKED) { /* if a pid banked the pin */
+ buf_desc->freeNext = -(BAiocb->pidOfAio); /* then remember which pid */
+ }
+ else if (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* disconnect BufferAiocb from buf_desc */
+ }
+ buf_desc->flags &= ~BM_AIO_IN_PROGRESS;
+ }
+
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* disconnect buf_desc from BufferAiocb */
+ BAiocb->pidOfAio = 0; /* clean */
+ LockTries = maxRelBAiocbTries; /* max times we will try to release the BufferAiocb */
+ do {
+ register long long int dividend , remainder;
+
+ /* retrieve old value of FreeBAiocbs */
+ BAiocb->BAiocbnext = oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+ /* this is a volatile value unprotected by any lock, so we must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+ ** to do so, verify by direct division that its address offset from the first control block
+ ** is an integral multiple of the control block size, with a quotient
+ ** in the range [ 0 , (numBufferAiocbs-1) ]
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ failed = (int)remainder;
+ if (!failed) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ failed = ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) );
+ if (!failed) {
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, BAiocb)) {
+ LockTries = 0; /* end the do loop */
+
+ goto cheering; /* can't simply break because then failed would be set incorrectly */
+ }
+ }
+ }
+ /* if we reach here, this attempt failed and "failed" holds a non-zero value */
+
+ cheering: ;
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while (LockTries-- > 0);
+
+ if (failed) {
+#ifdef LOG_RELBAIOCB_DEPLETION
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p unreleased after tries= %d\n"
+ ,BAiocb,maxRelBAiocbTries);
+#endif /* LOG_RELBAIOCB_DEPLETION */
+ }
+
+ }
+ else
+ elog(LOG,
+ "BufReleaseAsync:AIO control block %p either has dependents= %d or is already on freelist %p or has no buf_header %p\n"
+ ,BAiocb , BAiocb->BAiocbDependentCount , BAiocb->BAiocbnext , BAiocb->BAiocbbufh);
+ return failed;
+}
+
+/* try using asynchronous aio_read to prefetch into a buffer
+** return code :
+** 0 if started successfully
+** -1 if failed for some reason
+** 1+PrivateRefCount if we found desired buffer in buffer pool
+**
+** There is a harmless race condition here :
+** two different backends may both arrive here simultaneously
+** to prefetch the same buffer. This is not unlikely when a syncscan is in progress.
+** . One will acquire the buffer and issue the smgrstartaio
+** . The other will find the buffer on return from ReadBuffer_common with hit = true
+** Only the first task has a pin on the buffer since ReadBuffer_common knows not to get a pin
+** on a found buffer in prefetch mode.
+** Therefore - the second task must simply abandon the prefetch if it finds the buffer in the buffer pool.
+**
+** if we fail to acquire a BAiocb because of concurrent theft from the freelist by another backend,
+** retry up to maxGetBAiocbTries times provided that there actually was at least one BAiocb on the freelist.
+*/
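+/* Illustrative caller sketch (not part of this patch; the exact caller shape is an
+** assumption) showing how a PrefetchBuffer-style caller might interpret the return
+** codes documented above:
+**
+**   int rc = BufStartAsync(reln, forkNum, blockNum, strategy);
+**   if (rc == 0)
+**       ;   // aio_read issued; a pin has been "banked" for this backend's later read
+**   else if (rc > 0)
+**       ;   // block already resident; rc - 1 is this backend's PrivateRefCount, nothing to do
+**   else
+**       ;   // rc == -1: no BAiocb available or aio could not be started; fall back to an
+**           //           ordinary synchronous read when the block is actually needed
+*/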
+int
+BufStartAsync(Relation reln, ForkNumber forkNum, BlockNumber blockNum , BufferAccessStrategy strategy) {
+
+ int retcode = -1;
+ struct BufferAiocb volatile * prevBAiocb; /* previous BufferAiocb in cachedchain */
+ struct BufferAiocb volatile * BAiocb = (struct BufferAiocb*)0; /* BufferAiocb for use with aio */
+ int smgrstartaio_rc = -1; /* retcode from smgrstartaio */
+ bool do_unpin_buffer = false; /* unpin must be deferred until after buffer descriptor is unlocked */
+ Buffer buf_id;
+ bool hit = false;
+ volatile BufferDesc *buf_desc = (BufferDesc *)0;
+
+ int LockTries; /* max times we will try to get a free BufferAiocb */
+
+ struct BufferAiocb volatile * oldFreeBAiocb; /* old value of FreeBAiocbs */
+ struct BufferAiocb volatile * newFreeBAiocb; /* new value of FreeBAiocbs */
+
+
+ /* return immediately if no async io resources */
+ if (numBufferAiocbs > 0) {
+ buf_id = (Buffer)0;
+
+ if ( (struct BAiocbAnchor *)0 != BAiocbAnchr ) {
+
+ volatile struct BufferAiocb *BufferAiocbs;
+
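+ /* note (assumption - the struct definition is not shown in this hunk): treating &cachedBAiocb
+ ** as a BufferAiocb* only works if BAiocbnext is the first member of struct BufferAiocb,
+ ** so that prevBAiocb->BAiocbnext aliases cachedBAiocb itself
+ */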
+ prevBAiocb = (struct BufferAiocb*)(&cachedBAiocb); /* prev points to BAiocb */
+ BAiocb = cachedBAiocb; /* tentatively try the first if there is one */
+ if ((struct BufferAiocb*)0 != BAiocb) { /* any usable cached BufferAiocb ? */
+ prevBAiocb->BAiocbnext = BAiocb->BAiocbnext; /* unchain from cached chain */
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+ } else {
+
+ LockTries = maxGetBAiocbTries; /* max times we will try to get a free BufferAiocb */
+ do {
+ register long long int dividend = -1 , remainder;
+ /* check if we have a free BufferAiocb */
+
+ oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs; /* old value of FreeBAiocbs */
+
+
+ /* BAiocbAnchr->FreeBAiocbs is a volatile value unprotected by any lock,
+ ** and use of compare-and-swap to add and remove items from the list has
+ ** two potential pitfalls, both relating to the fact that we must
+ ** access data de-referenced from this pointer before the compare-and-swap.
+ ** 1) The value we load may be corrupt, e.g. mixture of bytes from
+ ** two different values, so must validate it;
+ ** safest is to verify that it is identical to one of the BufferAiocbs.
+ ** to do so, verify by direct division that its address offset from the
+ ** first control block is an integral multiple of the control block size,
+ ** with a quotient in the range [ 0 , (numBufferAiocbs-1) ]
+ ** Thus we completely prevent this pitfall.
+ ** 2) The content of the item's next pointer may have changed between the
+ ** time we de-reference it and the time of the compare-and-swap.
+ ** Thus even though the compare-and-swap succeeds, we might set the
+ ** new head of the freelist to an invalid value (either a free item
+ ** that is not the first in the free chain - resulting only in
+ ** loss of the orphaned free items, or, much worse, an in-use item).
+ ** In practice this is extremely unlikely because it implies a huge delay
+ ** within this window in this (current) process. Here are two scenarios:
+ ** legend:
+ ** P0 - this (current) process, P1, P2 , ... other processes
+ ** content of freelist shown as BAiocbAnchr->FreeBAiocbs -> first item -> 2nd item ...
+ ** @[X] means address of X
+ ** | timeline of window of exposure to problems
+ ** successive lines in chronological order content of freelist
+ ** 2.1 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 IS IN USE !! CORRUPT !!
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had become in-use during the window.
+ ** 2.2 P0 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** P0 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I0] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I1] F -> I0 -> I1 -> I2 -> I3 ...
+ ** | P1 swap-remove I0, place I1 at head of list F -> I1 -> I2 -> I3 ...
+ ** | P2 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I1] F -> I1 -> I2 -> I3 ...
+ ** | P2 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I2] F -> I1 -> I2 -> I3 ...
+ ** | P2 swap-remove I1, place I2 at head of list F -> I2 -> I3 ...
+ ** | P3 access oldFreeBAiocb = BAiocbAnchr->FreeBAiocbs = @[I2] F -> I2 -> I3 ...
+ ** | P3 access newFreeBAiocb = oldFreeBAiocb->BAiocbnext = @[I3] F -> I2 -> I3 ...
+ ** | P3 swap-remove I2, place I3 at head of list F -> I3 ...
+ ** | P2 complete aio, replace I1 at head of list F -> I1 -> I3 ...
+ ** | P3 complete aio, replace I2 at head of list F -> I2 -> I1 -> I3 ...
+ ** | P1 complete aio, replace I0 at head of list F -> I0 -> I2 -> I1 -> I3 ...
+ ** P0 swap-remove I0, place I1 at head of list F -> I1 -> I3 ... ! I2 is orphaned !
+ ** compare-and-swap succeeded but the value of newFreeBAiocb was stale
+ ** and had moved further down the free list during the window.
+ ** Unfortunately, we cannot prevent this pitfall but we can detect it (after the fact),
+ ** by checking that the next pointer of the item we have just removed for our use still points to the same item.
+ ** This test is not subject to any timing or uncertainty since :
+ ** . The fact that the compare-and-swap succeeded implies that the item we removed
+ ** was defintely on the freelist (at the head) when it was removed,
+ ** and therefore cannot be in use, and therefore its next pointer is no longer volatile.
+ ** . Although pointers of the anchor and items on the freelist are volatile,
+ ** the addresses of items never change - they are in an allocated array and never move.
+ ** E.g. in the above two scenarios, the test is that I0.next still -> I1,
+ ** and this is true if and only if the second item on the freelist is
+ ** still the same at the end of the window as it was at the start of the window.
+ ** Note that we do not insist that it did not change during the window,
+ ** only that it is still the correct new head of freelist.
+ ** If this test fails, we abort immediately as the subsystem is damaged and cannot be repaired.
+ ** Note that at least one aio must have been issued *and* completed during the window
+ ** for this to occur, and since the window is only a few machine instructions,
+ ** it is very unlikely in practice.
+ */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+ remainder = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ if (remainder == 0) {
+ dividend = ((long long int)oldFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+ }
+ if ( (remainder == 0)
+ && ( (dividend >= 0 ) && ( dividend < numBufferAiocbs) )
+ )
+ {
+ newFreeBAiocb = oldFreeBAiocb->BAiocbnext; /* tentative new value is second on free list */
+ /* Here we are in the exposure window referred to in the above comments,
+ ** so moving along rapidly ...
+ */
+ if (__sync_bool_compare_and_swap (&(BAiocbAnchr->FreeBAiocbs), oldFreeBAiocb, newFreeBAiocb)) { /* did we get it ? */
+ /* We have successfully swapped head of freelist pointed to by oldFreeBAiocb off the list;
+ ** Here we check that the item we just placed at head of freelist, pointed to by newFreeBAiocb,
+ ** is the right one
+ **
+ ** also check that the BAiocb we have acquired was not in use,
+ ** i.e. that scenario 2.1 above did not occur just before our compare-and-swap.
+ **
+ ** in one hypothetical case,
+ ** we can be certain that there is no corruption -
+ ** the case where newFreeBAiocb == 0 and oldFreeBAiocb->BAiocbnext != BAIOCB_OCCUPIED -
+ ** i.e. we have set the freelist to empty but we have a baiocb chained from ours.
+ ** in this case our comp_swap removed all BAiocbs from the list (including ours)
+ ** so the others chained from ours are either orphaned (no harm done)
+ ** or in use by another backend and will eventually be returned (fine).
+ */
+ if ((struct BufferAiocb *)0 == newFreeBAiocb) {
+ if ((struct BufferAiocb *)BAIOCB_OCCUPIED == oldFreeBAiocb->BAiocbnext) {
+ goto baiocb_corruption;
+ } else if ((struct BufferAiocb *)0 != oldFreeBAiocb->BAiocbnext) {
+ elog(LOG,
+ "AIO control block inconsistency on acquiring aiocb %p - its next free %p may be orphaned (no corruption has occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext);
+ }
+ } else {
+ /* case of newFreeBAiocb not null - so must check more carefully ... */
+ remainder = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ % (long long int)(sizeof(struct BufferAiocb));
+ dividend = ((long long int)newFreeBAiocb - (long long int)BufferAiocbs)
+ / (long long int)(sizeof(struct BufferAiocb));
+
+ if ( (newFreeBAiocb != oldFreeBAiocb->BAiocbnext)
+ || (remainder != 0)
+ || ( (dividend < 0 ) || ( dividend >= numBufferAiocbs) )
+ ) {
+ goto baiocb_corruption;
+ }
+ }
+ BAiocb = oldFreeBAiocb;
+ BAiocb->BAiocbnext = (struct BufferAiocb*)BAIOCB_OCCUPIED; /* mark as not on freelist */
+ BAiocb->BAiocbDependentCount = 0; /* no dependent yet */
+ BAiocb->BAiocbbufh = (struct sbufdesc*)0;
+ BAiocb->pidOfAio = 0;
+
+ LockTries = 0; /* end the do loop */
+
+ }
+ }
+
+ if ( LockTries > 1 ) {
+ sched_yield(); /* yield to another process, (hopefully a backend) */
+ }
+ } while ( ((struct BufferAiocb*)0 == BAiocb) /* did not get a BAiocb */
+ && ((struct BufferAiocb*)0 != oldFreeBAiocb) /* there was a free BAiocb */
+ && (LockTries-- > 0) /* told to retry */
+ );
+ }
+ }
+
+ if ( BAiocb != (struct BufferAiocb*)0 ) {
+ /* try an async io */
+ BAiocb->BAiocbthis.aio_fildes = -1; /* necessary to ensure any thief realizes aio not yet started */
+ BAiocb->pidOfAio = this_backend_pid;
+
+ /* now try to acquire a buffer :
+ ** note - ReadBuffer_common returns hit=true if the block is found in the buffer pool,
+ ** in which case there is no need to prefetch.
+ ** otherwise ReadBuffer_common pins returned buffer and calls StartBufferIO
+ ** and StartBufferIO :
+ ** . sets buf_desc->freeNext to negative of ( index of the aiocb in the BufferAiocbs array + 3 )
+ ** . sets BAiocb->BAiocbbufh -> buf_desc
+ ** and in this case the buffer spinlock is held.
+ ** This is essential as no other task must issue any intention with respect
+ ** to the buffer until we have started the aio_read.
+ ** Also note that ReadBuffer_common handles enlarging the ResourceOwner buffer list as needed
+ ** so we don't need to do that here
+ */
+ buf_id = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
+ forkNum, blockNum
+ ,RBM_NOREAD_FOR_PREFETCH /* tells ReadBuffer not to do any read, just alloc buf */
+ ,strategy , &hit , (FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))));
+ buf_desc = &BufferDescriptors[buf_id-1]; /* find buffer descriptor */
+
+ /* normally hit will be false as presumably it was not in the pool
+ ** when our caller looked - but it could be there now ...
+ */
+ if (hit) {
+ /* see earlier comments - we must abandon the prefetch */
+ retcode = 1 + PrivateRefCount[buf_id];
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ } else
+ if ( (buf_id > 0) && ((BufferDesc *)0 != buf_desc) && (buf_desc == BAiocb->BAiocbbufh) ) {
+ /* the buffer descriptor header lock should be held.
+ ** However, just to be safe, validate that we are still
+ ** the owner and no other task has stolen it.
+ */
+
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* ensure no banked pin */
+ /* there should not be any other pid waiting on this buffer
+ ** so check that neither BM_VALID nor BM_PIN_COUNT_WAITER is set
+ */
+ if ( ( !(buf_desc->flags & (BM_VALID|BM_PIN_COUNT_WAITER) ) )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) /* it is still mine */
+ && (-1 == BAiocb->BAiocbthis.aio_fildes) /* no thief stole it */
+ && (0 == BAiocb->BAiocbDependentCount) /* no dependent */
+ ) {
+ /* we have an empty buffer for our use */
+
+ BAiocb->BAiocbthis.aio_buf = (void *)(BufHdrGetBlock(buf_desc)); /* Location of actual buffer. */
+
+ /* note - there is no need to register self as a dependent of BAiocb
+ ** as we shall not unlock buf_desc before we free the BAiocb
+ */
+
+ /* smgrstartaio retcode is returned in smgrstartaio_rc -
+ ** it indicates whether started or not
+ */
+#ifdef USE_AIO_SIGEVENT
+ BAiocb->BAiocbaioOrdinal = BAiocbaioOrdinal;
+#endif /* USE_AIO_SIGEVENT */
+ smgrstartaio(reln->rd_smgr, forkNum, blockNum , (char *)&(BAiocb->BAiocbthis) , &smgrstartaio_rc
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ ,BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ );
+#if defined(USE_AIO_SIGEVENT)
+ BAiocbaioOrdinal++; /* ordinal of most recent aio originated by me */
+#endif /* USE_AIO_SIGEVENT */
+
+ if (smgrstartaio_rc == 0) {
+ retcode = 0;
+ buf_desc->flags |= BM_AIO_PREFETCH_PIN_BANKED; /* bank the pin for next use by this task */
+ /* we did not register self as a dependent of BAiocb so no need to unregister */
+ } else {
+ /* failed - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ /* spinlock held so instead of TerminateBufferIO(buf_desc, false , 0); ... */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS | BM_AIO_PREFETCH_PIN_BANKED | BM_VALID);
+ /* we did not register self as a dependent of BAiocb so no need to unregister */
+
+ /* return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+
+ pgBufferUsage.aio_read_failed++;
+ smgrstartaio_rc = 1; /* to distinguish from aio not even attempted */
+ }
+ }
+ else {
+ /* buffer was stolen or is in use by another task - return the BufferAiocb to the free list */
+ do_unpin_buffer = true; /* unpin must be deferred until after buffer descriptor is unlocked */
+ }
+
+ UnlockBufHdr(buf_desc);
+ if (do_unpin_buffer) {
+ if (smgrstartaio_rc >= 0) { /* if aio was attempted */
+ TerminateBufferIO(buf_desc, false , 0);
+ }
+ UnpinBuffer(buf_desc, true);
+ }
+ }
+ else {
+ BAiocb->BAiocbbufh = (struct sbufdesc *)0; /* indicator that we must release the BAiocb before exiting */
+ }
+
+ if ((struct sbufdesc*)0 == BAiocb->BAiocbbufh) { /* we did not associate a buffer */
+ /* so return the BufferAiocb to the free list */
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+ }
+ }
+
+ return retcode;
+
+ baiocb_corruption:;
+ elog(PANIC,
+ "AIO control block corruption on acquiring aiocb %p - its next free %p conflicts with new freelist pointer %p which may be invalid (corruption may have occurred)"
+ ,oldFreeBAiocb , oldFreeBAiocb->BAiocbnext , newFreeBAiocb);
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+/*
+ * BufCheckAsync -- act upon caller's intention regarding a shared buffer,
+ * primarily in connection with any async io in progress on the buffer.
+ * The intention argument has two main classes and some subvalues within those :
+ *   +ve  1                 want   <=> caller wants the buffer :
+ *                                     wait for any in-progress aio and then always pin
+ *   -ve  -1, -2, -3, ...   reject <=> caller does not want the buffer :
+ *        (see below)                  if there are no dependents, then cancel the aio,
+ *                                     and then optionally unpin
+ * Used when there may have been a previous fetch or prefetch.
+ *
+ * buffer is assumed to be an existing member of the shared buffer pool
+ * as returned by BufTableLookup.
+ * if AIO in progress, then :
+ * . terminate AIO, waiting for completion if +ve intention, else without waiting
+ * . if the AIO had already completed successfully, then mark buffer valid
+ * . pin/unpin as requested
+ *
+ * +ve intention indicates that buffer must be pinned :
+ * if the strategy parameter is null, then use the PinBuffer_Locked optimization
+ * to pin and unlock in one operation. But always update buffer usage count.
+ *
+ * -ve intention indicates whether and how to unpin :
+ * BUF_INTENTION_REJECT_KEEP_PIN -1 pin already held, do not unpin, (caller wants to keep it)
+ * BUF_INTENTION_REJECT_OBTAIN_PIN -2 obtain pin, caller wants it for same buffer
+ * BUF_INTENTION_REJECT_FORGET -3 unpin and tell resource owner to forget
+ * BUF_INTENTION_REJECT_NOADJUST -4 unpin and call ResourceOwnerForgetBuffer myself
+ * instead of telling UnpinBuffer to adjust CurrentResource owner
+ * (quirky simulation of ReleaseBuffer logic)
+ * BUF_INTENTION_REJECT_UNBANK -5 unpin only if pin banked by caller
+ * The behaviour for the -ve case is based on that of ReleaseBuffer, adding handling of async io.
+ *
+ * pin/unpin action must take account of whether this backend holds a "disposable" pin on the particular buffer.
+ * A "disposable" pin is a pin acquired by buffer manager without caller knowing, such as :
+ * when required to safeguard an async AIO - pin can be held across multiple bufmgr calls
+ * when required to safeguard waiting for an async AIO - pin acquired and released within this function
+ * if a disposable pin is held, then :
+ * if a new pin is requested, the disposable pin must be retained (redeemed) and any flags relating to it unset
+ * if an unpin is requested, then :
+ * if either no AIO in progress or this backend did not initiate the AIO
+ * then the disposable pin must be dropped (redeemed) and any flags relating to it unset
+ * else log warning and do nothing
+ * i.e. in either case, there is no longer a disposable pin after this function has completed.
+ * Note that if intention is BUF_INTENTION_REJECT_UNBANK,
+ * then caller expects there to be a disposable banked pin
+ * and if there isn't one, we do nothing
+ * for all other intentions, if there is no disposable pin, we pin/unpin normally.
+ *
+ * index_for_aio indicates the BAiocb to be used for next aio (see PrefetchBuffer)
+ * spinLockHeld indicates whether buffer header spinlock is held
+ * PartitionLock is the buffer partition lock to be used
+ *
+ * return code (meaningful ONLY if intention is +ve) indicates validity of buffer :
+ * -1 buffer is invalid and failed PageHeaderIsValid check
+ * 0 buffer is not valid
+ * 1 buffer is valid
+ * 2 buffer is valid but tag changed - (so content does not match the relation block that caller expects)
+ */
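+/* Illustrative caller sketch (an assumption about caller shape, not code from this
+** patch): a reader that found buf_desc via BufTableLookup and wants its contents
+** might call, with the partition lock held and the header spinlock not held:
+**
+**   rc = BufCheckAsync(smgr, reln, buf_desc,
+**                      1,                  // +ve intention: want the buffer, wait and pin
+**                      strategy,
+**                      0,                  // no BAiocb reserved for a follow-on aio
+**                      false,              // header spinlock not held
+**                      PartitionLock);
+**   if (rc == BUF_INTENT_RC_VALID) ...            // contents usable
+**   else if (rc == BUF_INTENT_RC_CHANGED_TAG) ... // valid page but not the block we wanted
+**   else if (rc == BUF_INTENT_RC_BADPAGE) ...     // aio read produced an invalid page
+**   else ...                                      // not valid; perform a synchronous read
+*/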
+int
+BufCheckAsync(SMgrRelation caller_smgr, Relation caller_reln, BufferDesc volatile * buf_desc, int intention , BufferAccessStrategy strategy , int index_for_aio , bool spinLockHeld , LWLockId PartitionLock )
+{
+
+ int retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ bool valid = false;
+
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ int smgrcompleteaio_rc; /* retcode from smgrcompleteaio */
+ SMgrRelation smgr = caller_smgr;
+ int BAiocbDependentCount_after_aio_finished = -1; /* for debugging - can be printed in gdb */
+ BufferTag origTag = buf_desc->tag; /* original identity of selected buffer */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ BufFlags flags_on_entry; /* for debugging - can be printed in gdb */
+ int freeNext_on_entry; /* for debugging - can be printed in gdb */
+ bool disposable_pin = false; /* this backend had a disposable pin on entry or pins the buffer while waiting for aio_read to complete */
+ bool pin_already_banked_by_me; /* buffer is already pinned by me and redeemable */
+
+ int aio_successful = -1; /* did the aio_read succeed ? -1 = no aio, 0 unsuccessful , 1 successful */
+ int local_intention = intention; /* copy of intention which in one special case below may be set differently to intention */
+
+ if (!spinLockHeld) {
+ /* lock buffer header */
+ LockBufHdr(buf_desc);
+ }
+
+ flags_on_entry = buf_desc->flags;
+ freeNext_on_entry = buf_desc->freeNext;
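+ /* a pin is "banked" for this backend if BM_AIO_PREFETCH_PIN_BANKED is set and the
+ ** owning pid matches ours; that pid is read from the BAiocb while an aio is still
+ ** in progress, or from the negated freeNext once BufReleaseAsync has detached the BAiocb
+ */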
+ pin_already_banked_by_me =
+ ( (flags_on_entry & BM_AIO_PREFETCH_PIN_BANKED)
+ && ( ( (flags_on_entry & BM_AIO_IN_PROGRESS) ? ( ((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - freeNext_on_entry))->pidOfAio )
+ : (-(freeNext_on_entry)) ) == this_backend_pid )
+ );
+
+ if (pin_already_banked_by_me) {
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* but do we actually have a pin ?? */
+ /* this is an anomalous situation - somehow our disposable pin was lost without us noticing
+ ** if AIO is in progress and we started it,
+ ** then this is disastrous - two backends might both issue IO on same buffer
+ ** otherwise, it is harmless, and simply means we have no disposable pin,
+ ** but we must update flags to "notice" the fact now
+ */
+ if (flags_on_entry & BM_AIO_IN_PROGRESS) {
+ elog(ERROR, "BufCheckAsync : AIO control block issuer of aio_read lost pin with BM_AIO_IN_PROGRESS on buffer %d rel=%s, blockNum=%u, flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ } else {
+ elog(LOG, "BufCheckAsync : AIO control block issuer of aio_read lost pin on buffer %d rel=%s, blockNum=%u, with flags 0x%X refcount=%u intention= %d"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, flags_on_entry, buf_desc->refcount ,intention);
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* since AIO not in progress, disconnect the buffer from banked pin */
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ pin_already_banked_by_me = false;
+ }
+ } else {
+ disposable_pin = true;
+ }
+ }
+
+ /* the case of BUF_INTENTION_REJECT_UNBANK is handled specially :
+ ** if this backend has a banked pin, then proceed just as for BUF_INTENTION_REJECT_FORGET
+ ** else the call is a no-op -- unlock buf header and return immediately
+ */
+
+ if (intention == BUF_INTENTION_REJECT_UNBANK) {
+ if (pin_already_banked_by_me) {
+ local_intention = BUF_INTENTION_REJECT_FORGET;
+ } else {
+ goto unlock_buf_header; /* code following the unlock will do nothing since local_intention still set to BUF_INTENTION_REJECT_UNBANK */
+ }
+ }
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ /* we do not expect that BM_AIO_IN_PROGRESS is set without freeNext identifying the BAiocb */
+ if ( (buf_desc->flags & BM_AIO_IN_PROGRESS) && (buf_desc->freeNext == FREENEXT_NOT_IN_LIST) ) {
+
+ elog(ERROR, "BufCheckAsync : found BM_AIO_IN_PROGRESS without a BAiocb on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ }
+ /* check whether aio in progress */
+ if ( ( (struct BAiocbAnchor *)0 != BAiocbAnchr )
+ && (buf_desc->flags & BM_AIO_IN_PROGRESS)
+ && (buf_desc->freeNext <= FREENEXT_BAIOCB_ORIGIN) /* has a valid BAiocb */
+ && ((FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext) < numBufferAiocbs) /* double-check */
+ ) { /* this is aio */
+ struct BufferAiocb volatile * BAiocb = (BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf_desc->freeNext); /* BufferAiocb associated with this aio */
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == BAiocb->BAiocbnext) { /* ensure BAiocb is occupied */
+ aio_successful = 0; /* tentatively the aio_read did not succeed */
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+
+ if (smgr == NULL) {
+ if (caller_reln == NULL) {
+ smgr = smgropen(buf_desc->tag.rnode, InvalidBackendId);
+ } else {
+ smgr = caller_reln->rd_smgr;
+ }
+ }
+
+ /* assert that this AIO is not using the same BufferAiocb as the one caller asked us to use */
+ if ((index_for_aio < 0) && (index_for_aio == buf_desc->freeNext)) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("AIO control block index %d to be used by %p already in use by %p"
+ ,index_for_aio, buf_desc, BAiocb->BAiocbbufh)));
+ }
+
+ /* Call smgrcompleteaio only if either we want the buffer or there are no dependents.
+ ** In the remaining case (reject while there are dependents),
+ ** one of the dependents will do it.
+ */
+ if ( (local_intention > 0) || (0 == BAiocb->BAiocbDependentCount) ) {
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ int cancelled = 0; /* is io to be cancelled ? */
+#endif /* USE_AIO_SIGEVENT */
+ if (local_intention > 0) {
+ /* wait for the in-progress aio and then pin.
+ ** If I did not issue the aio and do not already hold a pin,
+ ** pin now, before waiting, so the buffer cannot become unpinned while I wait.
+ ** We may have to wait for the io to complete,
+ ** so release the buf header lock so that others may also wait here.
+ */
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ smgrcompleteaio_rc = 16;
+ while (!(PoorMIncr(&(BAiocb->BAiocbDependentCount)))) { /* register self as dependent */
+ if (smgrcompleteaio_rc--) {
+ elog(LOG, "BufCheckAsync : yielding %d to increment BAiocb %p dependent",smgrcompleteaio_rc,BAiocb);
+ }
+ sched_yield();
+ }
+#else
+ BAiocb->BAiocbDependentCount++; /* register self as dependent */
+#endif /* USE_AIO_SIGEVENT */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) { /* if this buffer not pinned by me */
+ disposable_pin = true; /* this backend has pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ } else {
+ UnlockBufHdr(buf_desc);
+ }
+ LWLockRelease(PartitionLock);
+
+ smgrcompleteaio_rc = 1 /* tell smgrcompleteaio to wait */
+ + ( BAiocb->pidOfAio == this_backend_pid ); /* and whether I initiated the aio */
+ } else {
+ smgrcompleteaio_rc = 0; /* tell smgrcompleteaio to cancel */
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ /* because aio is to be cancelled , remember to post suspended waiters after, because no signal will be sent */
+ cancelled = 1;
+#endif /* USE_AIO_SIGEVENT */
+ }
+
+ smgrcompleteaio( smgr , (char *)&(BAiocb->BAiocbthis) , &smgrcompleteaio_rc
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ , ((int (*)(char *))(&BufAWaitAioCompletion))
+ , ((char *)BAiocb)
+#endif /* USE_AIO_SIGEVENT */
+ );
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ /* if aio was cancelled , then post suspended waiters because no signal will be sent */
+ if (cancelled) {
+ BufPostAioWaiters(BAiocb);
+ }
+#endif /* USE_AIO_SIGEVENT */
+ if ( (smgrcompleteaio_rc == 0) || (smgrcompleteaio_rc == 1) ) {
+ aio_successful = 1;
+ }
+
+ /* statistics */
+ if (local_intention > 0) {
+ if (smgrcompleteaio_rc == 0) {
+ /* completed successfully and did not have to wait */
+ pgBufferUsage.aio_read_ontime++;
+ } else if (smgrcompleteaio_rc == 1) {
+ /* completed successfully and did have to wait */
+ pgBufferUsage.aio_read_waited++;
+ } else {
+ /* bad news - read failed and so buffer not usable
+ ** the buffer is still pinned so unpin and proceed with "not found" case
+ */
+ pgBufferUsage.aio_read_failed++;
+ }
+
+ /* regain locks and handle the validity of the buffer and intention regarding it */
+ LWLockAcquire(PartitionLock, LW_SHARED);
+ LockBufHdr(buf_desc);
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ smgrcompleteaio_rc = 16;
+ while (!(PoorMDecr(&(BAiocb->BAiocbDependentCount)))) { /* unregister self as dependent */
+ if (smgrcompleteaio_rc--) {
+ elog(LOG, "BufCheckAsync : yielding %d to decrement BAiocb %p dependent",smgrcompleteaio_rc,BAiocb);
+ }
+ sched_yield();
+ }
+#else
+ BAiocb->BAiocbDependentCount--; /* unregister self as dependent */
+#endif /* USE_AIO_SIGEVENT */
+ } else {
+ pgBufferUsage.aio_read_wasted++; /* regardless of whether aio_successful */
+ }
+
+
+ if (local_intention > 0) {
+ /* verify the buffer is still ours and has same identity
+ ** There is one slightly tricky point here -
+ ** if there are other dependents, then each of them will perform this same check
+ ** this is unavoidable as the correct setting of retcode and the BM_VALID flag
+ ** is required by each dependent, so we may not leave it to the last one to do it.
+ ** It should not do any harm and easier to let them all do it than try to avoid.
+ */
+ if ((FREENEXT_BAIOCB_ORIGIN - (BAiocb - (BAiocbAnchr->BufferAiocbs))) == buf_desc->freeNext) { /* it is still mine */
+
+ if (aio_successful) {
+ /* validate page header. If valid, then mark the buffer as valid */
+ if (PageIsVerified((Page)(BufHdrGetBlock(buf_desc)) , ((BAiocb->BAiocbthis).aio_offset/BLCKSZ))) {
+ buf_desc->flags |= BM_VALID;
+ if (BUFFERTAGS_EQUAL(origTag , buf_desc->tag)) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else {
+ retcode = BUF_INTENT_RC_CHANGED_TAG;
+ }
+ } else {
+ retcode = BUF_INTENT_RC_BADPAGE;
+ }
+ }
+ }
+ }
+
+ BAiocbDependentCount_after_aio_finished = BAiocb->BAiocbDependentCount;
+
+ /* if no dependents, then disconnect the BAiocb and update buffer header */
+ if (BAiocbDependentCount_after_aio_finished == 0 ) {
+
+ /* return the BufferAiocb to the free list */
+ buf_desc->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR | BM_AIO_IN_PROGRESS);
+
+ if (
+ BufReleaseAsync(BAiocb)
+ ) { /* failed ? */
+ BAiocb->BAiocbnext = cachedBAiocb; /* then ... */
+ cachedBAiocb = BAiocb; /* ... cache it */
+ }
+ }
+
+ }
+ }
+ } /* end this is aio */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+ /* At this point, buffer header spinlock is held in all cases.
+ ** 1. it was either acquired by the caller and held on entry, or we acquired it early in this function
+ ** 2. if (local_intention <= 0) it was never released
+ ** 3. if (local_intention > 0) and an aio was in progress, it was released and re-acquired around the call to smgrcompleteaio
+ */
+ /* note whether buffer is valid before unlocking spinlock */
+ valid = ((buf_desc->flags & BM_VALID) != 0);
+
+ /* if there was a disposable pin on entry to this function (i.e. marked in buffer flags)
+ ** then unmark it - refer to prologue comments talking about :
+ ** if a disposable pin is held, then :
+ ** ...
+ ** i.e. in either case, there is no longer a disposable pin after this function has completed.
+ */
+ if (pin_already_banked_by_me) {
+ buf_desc->flags &= ~BM_AIO_PREFETCH_PIN_BANKED; /* redeem the banked pin */
+ /* if AIO not in progress, then disconnect the buffer from BAiocb and/or banked pin */
+ if (!(buf_desc->flags & BM_AIO_IN_PROGRESS)) {
+ buf_desc->freeNext = FREENEXT_NOT_IN_LIST; /* and forget the bank client */
+ }
+ /********** for debugging *****************
+ else elog(LOG, "BufCheckAsync : found BM_AIO_IN_PROGRESS when redeeming banked pin on buffer %d rel=%s, blockNum=%u, flags %X refcount=%u"
+ ,buf_desc->buf_id,relpathbackend(buf_desc->tag.rnode, InvalidBackendId, buf_desc->tag.forkNum)
+ ,buf_desc->tag.blockNum, buf_desc->flags, buf_desc->refcount);
+ ********** for debugging *****************/
+ }
+
+ /* If we are to obtain new pin, then use pin optimization - pin and unlock.
+ ** However, if the caller is the same backend who issued the aio_read,
+ ** then he ought to have obtained the pin at that time and must not acquire
+ ** a "second" one since this is logically the same read - he would have obtained
+ ** a single pin if using synchronous read and we emulate that behaviour.
+ ** It's important to understand that the caller is not aware that he already obtained a pin -
+ ** because calling PrefetchBuffer did not imply a pin -
+ ** so we must track that via the pidOfAio field in the BAiocb.
+ ** And to add one further complication :
+ ** we assume that although PrefetchBuffer pinned the buffer,
+ ** it did not increment the usage count.
+ ** (because it called PinBuffer_Locked which does not do that)
+ ** so in this case, we must increment the usage count without double-pinning.
+ ** yes, it's ugly - and there's a goto!
+ */
+ if ( (local_intention > 0)
+ || (local_intention == BUF_INTENTION_REJECT_OBTAIN_PIN)
+ ) {
+
+ /* Make sure we will have room to remember the buffer pin */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* here we really want a version of PinBuffer_Locked which updates usage count ... */
+ if ( (PrivateRefCount[buf_desc->buf_id] == 0) /* if this buffer not previously pinned by me */
+ || pin_already_banked_by_me /* or I had a disposable pin on entry */
+ ) {
+ if (strategy == NULL)
+ {
+ if (buf_desc->usage_count < BM_MAX_USAGE_COUNT)
+ buf_desc->usage_count++;
+ }
+ else
+ {
+ if (buf_desc->usage_count == 0)
+ buf_desc->usage_count = 1;
+ }
+ }
+
+ /* now pin buffer unless we have a disposable */
+ if (!disposable_pin) { /* this backend neither banked pin for aio nor pinned the buffer while waiting for aio_read to complete */
+ PinBuffer_Locked(buf_desc);
+ goto unlocked_it;
+ }
+ else
+ /* if this task previously issued the aio or pinned the buffer while waiting for aio_read to complete
+ ** and aio was unsuccessful, then release the pin
+ */
+ if ( disposable_pin
+ && (aio_successful == 0) /* aio_read failed ? */
+ ) {
+ UnlockBufHdr(buf_desc); /* release spinlock since UnpinBuffer acquires it - annoying but infrequent case */
+ UnpinBuffer(buf_desc, true);
+ goto unlocked_it;
+ }
+ }
+
+ unlock_buf_header:
+ UnlockBufHdr(buf_desc);
+ unlocked_it:
+
+ /* now do any requested pin (if not done immediately above) or unpin/forget */
+ if (local_intention == BUF_INTENTION_REJECT_KEEP_PIN) {
+ /* the caller is supposed to hold a pin already so there should be nothing to do ... */
+ if (PrivateRefCount[buf_desc->buf_id] == 0) {
+ elog(LOG, "request to keep pin on unpinned buffer %d",buf_desc->buf_id);
+
+ valid = PinBuffer(buf_desc, strategy);
+ }
+ }
+ else
+ if ( ( (local_intention == BUF_INTENTION_REJECT_FORGET)
+ || (local_intention == BUF_INTENTION_REJECT_NOADJUST)
+ )
+ && (PrivateRefCount[buf_desc->buf_id] > 0) /* if this buffer was previously pinned by me ... */
+ ) {
+
+ if (local_intention == BUF_INTENTION_REJECT_FORGET) {
+ UnpinBuffer(buf_desc, true); /* ... then release the pin */
+ } else
+ if (local_intention == BUF_INTENTION_REJECT_NOADJUST) {
+ /* following code moved from ReleaseBuffer :
+ ** not sure why we can't simply UnpinBuffer(buf_desc, true)
+ ** but better leave it the way it was
+ */
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf_desc));
+ if (PrivateRefCount[buf_desc->buf_id] > 1) {
+ PrivateRefCount[buf_desc->buf_id]--;
+ } else {
+ UnpinBuffer(buf_desc, false);
+ }
+ }
+ }
+
+ /* if retcode has not been set to one of the unusual conditions
+ ** namely failed header validity or tag changed
+ ** then the setting of valid takes precedence
+ ** over whatever retcode may be currently set to.
+ */
+ if ( ( (retcode == BUF_INTENT_RC_INVALID_NO_AIO) || (retcode == BUF_INTENT_RC_INVALID_AIO) ) && valid) {
+ retcode = BUF_INTENT_RC_VALID;
+ } else
+ if ((retcode == BUF_INTENT_RC_VALID) && (!valid)) {
+ if (aio_successful == -1) { /* aio not attempted */
+ retcode = BUF_INTENT_RC_INVALID_NO_AIO;
+ } else {
+ retcode = BUF_INTENT_RC_INVALID_AIO;
+ }
+ }
+
+ return retcode;
+}
+
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+
+/* PoorMans Atomic functions :
+** these functions operate on a signed int as follows :
+** suppose current value of int is v
+** . lock if v > 0 then compswap int to -v else fail
+** . unlock if v < 0 then compswap int to -v (i.e. abs(v)) else fail
+** . increment if v >= 0 then compswap int to (v+1) else fail
+** . decrement if v > 0 then compswap int to (v-1) else fail
+**
+** thus the lock prevents int from being incremented or decremented
+**
+** in all cases :
+** there is no retrying within these functions; the caller must retry if desired
+** the return code comes from either the test described above or from compswap
+** ( non-zero <=> success , 0 <=> fail )
+** although not policed, caller who gets lock should release it very soon after
+** and should not wait in any way while lock is held.
+*/
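+/* Usage sketch (illustrative only, mirroring BufPostAioWaiters below): a waker can
+** freeze the dependent count while it walks the waiter chain, so that no dependent
+** can register or unregister concurrently:
+**
+**   if (PoorMLock(&BAiocb->BAiocbDependentCount))      // count flips to -count; Incr/Decr now fail
+**   {
+**       ... walk BAiocb->BAiocbWaiterchain ...
+**       while (!PoorMUnLock(&BAiocb->BAiocbDependentCount))
+**           sched_yield();                             // must restore the positive count
+**   }
+*/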
+
+static int PoorMLock(volatile int *intP) {
+ int rc = *intP;
+ if (rc > 0) {
+ rc = (__sync_bool_compare_and_swap (intP, rc, (-rc))); /* swap to -ve */
+ } else {
+ rc = 0; /* fail */
+ }
+ return rc;
+}
+
+static int PoorMUnLock(volatile int *intP) {
+ int rc = *intP;
+ if (rc < 0) {
+ rc = (__sync_bool_compare_and_swap (intP, rc, (-rc))); /* swap to +ve */
+ } else {
+ rc = 0; /* fail */
+ }
+ return rc;
+}
+
+static int PoorMIncr(volatile int *intP) {
+ int rc = *intP;
+ if (rc >= 0) {
+ rc = (__sync_bool_compare_and_swap (intP, rc, (rc+1))); /* swap to v+1 */
+ } else {
+ rc = 0; /* fail */
+ }
+ return rc;
+}
+
+static int PoorMDecr(volatile int *intP) {
+ int rc = *intP;
+ if (rc > 0) {
+ rc = (__sync_bool_compare_and_swap (intP, rc, (rc-1))); /* swap to v-1 */
+ } else {
+ rc = 0; /* fail */
+ }
+ return rc;
+}
+
+/* the following functions break the smgr / md / fd layering by calling aio_error directly
+** but this is a lesser evil than locating them in fd.c since they refer heavily to
+** buffer-manager async-io signalling/completion constructs.
+*/
+#include "aio.h"
+/* post backends waiting on BAiocb
+** may be called either from signal-handler context or from mainline context
+*/
+void
+BufPostAioWaiters(struct BufferAiocb volatile * BAiocb)
+{
+ int cs_rc;
+ volatile struct PGPROC *BAiocbWaiterProc; /* backend waiting for aio completion */
+ volatile struct PGPROC *BAiocbWaiterProcnext; /* next backend waiting for aio completion */
+
+ cs_rc = PoorMLock( &(BAiocb->BAiocbDependentCount) ); /* lock the BAiocb dependents */
+ if (cs_rc) { /* locked? note locked implies BAiocbDependentCount was > 0 */
+ /* check whether actually complete and get status.
+ ** The correct method is to call smgrcompleteaio
+ ** but that function might wait which we must not do,
+ ** so we must call aio_error directly and bypass the layers.
+ */
+ cs_rc = aio_error((const struct aiocb *)(&(BAiocb->BAiocbthis)));
+ if (cs_rc != EINPROGRESS) { /* any rc other than EINPROGRESS implies it is not in progress */
+ /* run the chain and wake up any waiters */
+ /* note - waiter may be me, in which case omit sending another signal.
+ */
+ BAiocbWaiterProc = BAiocb->BAiocbWaiterchain; /* current head of chain */
+ while (BAiocbWaiterProc != (struct PGPROC *)0) { /* not reached end */
+ BAiocbWaiterProcnext = BAiocbWaiterProc->BAiocbWaiterLink;
+ cs_rc = (__sync_bool_compare_and_swap (&(BAiocb->BAiocbWaiterchain), BAiocbWaiterProc, BAiocbWaiterProcnext)); /* swap it off chain */
+ if (cs_rc) {
+ if (this_backend_pid != BAiocbWaiterProc->pid) {
+ kill(BAiocbWaiterProc->pid , AIO_SIGEVENT_SIGNALNUM);
+ }
+ BAiocbWaiterProc = BAiocb->BAiocbWaiterchain; /* new head of chain */
+ } else {
+ elog(LOG, "BufPostAioWaiters : my pid %d unable to remove waiter pid %d from BAiocb %p dept count= %d"
+ ,this_backend_pid,BAiocbWaiterProc->pid,BAiocb,BAiocb->BAiocbDependentCount);
+ break; /* abandon ship if we could not swap it off cleanly */
+ }
+ }
+ }
+ /* whatever else we achieve, we *must* unlock the BAiocbDependentCount
+ ** it would be a bug if the following PoorMUnLock fails
+ ** since no other process can modify it while locked,
+ ** but we must check that.
+ */
+ cs_rc = PoorMUnLock( &(BAiocb->BAiocbDependentCount) ); /* unlock the BAiocb dependents */
+ while (!cs_rc) {
+ sched_yield();
+ cs_rc = PoorMUnLock( &(BAiocb->BAiocbDependentCount) ); /* unlock the BAiocb dependents */
+ }
+ }
+}
+/* handle the AIO SIGNAL sent by the kernel, or by another backend, for a completion.
+** . if sent by another backend, then do nothing, as its sole purpose is to wake me up
+** . if sent by kernel as a result of a sigqueue by aio completion, then :
+** for each BAiocb,
+** poor-man lock the BAiocb against change in dependents (including being released),
+** check if io is complete (successful or otherwise)
+** then check if there are any waiters and if so signal them.
+** Note - logically, each backend should operate only on BAiocbs associated with
+** ios which that same backend originated.
+** In principle there is no reason to insist on that, and checking all of them
+** could post a waiter sooner; experimentally, however, posting only waiters on
+** aios I originated turned out to be better, so that is the default
+** (see AIO_PRIMISCUOUS_POSTING below).
+*/
+void BufHandlSigAsync(int sigsent, siginfo_t * siginfP, void *ucontext)
+{
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb *BAiocb; /* this BAiocb */
+ int ix;
+
+ if (sigsent == AIO_SIGEVENT_SIGNALNUM) { /* related to aio completion ... */
+ if ( (siginfP != (siginfo_t *)0) && (siginfP->si_code == SI_ASYNCIO) ) { /* ... from sigqueue */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs; /* address base of the BAiocbs */
+ __sync_fetch_and_add (&num_signalledaio, 1); /* accounting */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ BAiocb = (BufferAiocbs+ix); /* this BAiocb */
+#ifndef AIO_PRIMISCUOUS_POSTING
+ /* post only if this aio was originated by me
+ ** and is no later than the one to which this signal applies
+ */
+ if ( (BAiocb->pidOfAio == this_backend_pid) && (BAiocb->BAiocbaioOrdinal <= (unsigned long)(siginfP->si_value.sival_ptr)) )
+#endif /* AIO_PRIMISCUOUS_POSTING */
+ BufPostAioWaiters(BAiocb);
+ }
+ }
+ }
+}
+
+/* function to await completion if non-originator
+* called as a callback from FileCompleteaio
+* returns the aio_error return code from completion
+*/
+extern sigset_t blockaiosighandlrmask; /* defined in buf_init.c */
+int BufAWaitAioCompletion (struct BufferAiocb *BAiocb) /* pointer to BAIOcb */
+{
+
+ int onwaitchain = 0; /* is my PROC on the waiter chain ? */
+ int aio_errno;
+ struct aiocb volatile * my_aiocbp = &(BAiocb->BAiocbthis);
+ struct PGPROC volatile ** BAiocbWaiterchainP = &(BAiocb->BAiocbWaiterchain); /* anchor of chain of backends waiting for aio completion */
+ struct PGPROC * FirstWaiterProc; /* PROC of first backend waiting for aio completion */
+ sigset_t tempmask;
+ volatile struct PGPROC *BAiocbWaiterProc; /* backend waiting for aio completion */
+ volatile struct PGPROC *BAiocbWaiterProcnext; /* next backend waiting for aio completion */
+
+ sigprocmask(SIG_BLOCK,&blockaiosighandlrmask,&tempmask); /* temporarily block signal AIO_SIGEVENT_SIGNALNUM */
+ aio_errno = aio_error((const struct aiocb *)my_aiocbp);
+ while (aio_errno == EINPROGRESS) { /* check still in progress */
+ if (!onwaitchain) {
+ /* place self on chain of waiters */
+ FirstWaiterProc = (struct PGPROC *)(*BAiocbWaiterchainP); /* dereference current head of chain */
+ MyProc->BAiocbWaiterLink = FirstWaiterProc; /* and attach this current chain to mine */
+ onwaitchain = __sync_bool_compare_and_swap (BAiocbWaiterchainP, FirstWaiterProc, MyProc); /* swap it onto chain */
+ if (onwaitchain) { /* successfully swapped onto chain */
+ sigsuspend(&tempmask); /* unblock signal AIO_SIGEVENT_SIGNALNUM and wait for it */
+ } else {
+ /* failed to place myself on waiter chain so yield and try again */
+ sigprocmask(SIG_SETMASK,&tempmask,0); /* normal mask allows AIO_SIGEVENT_SIGNALNUM */
+ sched_yield();
+ sigprocmask(SIG_BLOCK,&blockaiosighandlrmask,&tempmask); /* temporarily block signal AIO_SIGEVENT_SIGNALNUM again */
+ }
+ }
+ aio_errno = aio_error((const struct aiocb *)my_aiocbp); /* and check once again */
+ }
+ sigprocmask(SIG_SETMASK,&tempmask,0); /* restore original mask which allows AIO_SIGEVENT_SIGNALNUM */
+
+ /* ensure my PROC is no longer on wait chain -
+ * signal-handler may or may not have done so.
+ * note there is no race condition with sig handler
+ * because sig handler swaps the PROC off the waiter chain *before* signalling me
+ */
+ while (onwaitchain) { /* my PROC may still be on the waiter chain */
+ onwaitchain = 0; /* tentatively my PROC is no longer on the waiter chain */
+ BAiocbWaiterProc = BAiocb->BAiocbWaiterchain; /* current head of chain */
+ while (BAiocbWaiterProc != (struct PGPROC *)0) { /* not reached end */
+ BAiocbWaiterProcnext = BAiocbWaiterProc->BAiocbWaiterLink;
+ if (BAiocbWaiterProc == MyProc) { /* is this me ? */
+ onwaitchain = ( (__sync_bool_compare_and_swap (&(BAiocb->BAiocbWaiterchain), BAiocbWaiterProc, BAiocbWaiterProcnext)) /* swap it off chain */
+ ? 0 /* my PROC is no longer on the waiter chain */
+ : 1 /* my PROC is on the waiter chain */
+ );
+ }
+ BAiocbWaiterProc = BAiocbWaiterProcnext;
+ }
+ }
+
+ return aio_errno;
+}
+#endif /* USE_AIO_SIGEVENT */
--- src/backend/storage/buffer/buf_init.c.orig 2014-08-18 14:10:36.933017070 -0400
+++ src/backend/storage/buffer/buf_init.c 2014-08-19 16:56:13.567196743 -0400
@@ -13,15 +13,122 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/types.h>
+#include <unistd.h>
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
-
+#include <stdlib.h> /* for getenv() */
+#include <errno.h> /* for strtoul() */
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+#include <sched.h>
+#endif /* defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT) */
BufferDesc *BufferDescriptors;
char *BufferBlocks;
-int32 *PrivateRefCount;
+int32 *PrivateRefCount; /* array of counts per buffer of how many times this task has pinned this buffer */
+
+volatile struct BAiocbAnchor *BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+
+int CountInuseBAiocbs(void); /* keep compiler happy */
+void ReportFreeBAiocbs(void); /* keep compiler happy */
+
+extern int MaxConnections; /* max number of client connections which postmaster will allow */
+int numBufferAiocbs = 0; /* total number of BufferAiocbs in pool (0 <=> no async io) */
+int hwmBufferAiocbs = 0; /* high water mark of in-use BufferAiocbs in pool
+ ** (not required to be accurate; maintained for us on a best-effort basis by the postmaster)
+ */
+#ifdef USE_PREFETCH
+unsigned int prefetch_dbOid = 0; /* database oid of relations on which prefetching is to be done - 0 means all */
+unsigned int prefetch_bitmap_scans = 1; /* boolean whether to prefetch bitmap heap scans */
+unsigned int prefetch_heap_scans = 0; /* boolean whether to prefetch non-bitmap heap scans */
+unsigned int prefetch_sequential_index_scans = 0; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+unsigned int prefetch_index_scans = 256; /* whether to prefetch non-bitmap index scans; also the numeric size of pfch_list */
+unsigned int prefetch_btree_heaps = 1; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+#endif /* USE_PREFETCH */
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+int maxGetBAiocbTries = 1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = 1; /* max times we will try to release a BufferAiocb back to freelist */
+
+/* locking protocol for manipulating the BufferAiocbs and FreeBAiocbs list :
+** 1. ownership of a BufferAiocb :
+** to gain ownership of a BufferAiocb, a task must
+** EITHER remove it from FreeBAiocbs (it is now temporary owner and no other task can find it)
+** if decision is to attach it to a buffer descriptor header, then
+** . lock the buffer descriptor header
+** . check NOT flags & BM_AIO_IN_PROGRESS
+** . attach to buffer descriptor header
+** . increment the BufferAiocb.dependent_count
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to unlock
+** OR locate it by dereferencing the pointer in a buffer descriptor,
+** in which case :
+** . lock the buffer descriptor header
+** . check flags & BM_AIO_IN_PROGRESS
+** . increment the BufferAiocb.dependent_count
+** . if decision is to return to FreeBAiocbs,
+** then (with buffer descriptor header still locked)
+** . turn off BM_AIO_IN_PROGRESS
+** . IF the BufferAiocb.dependent_count == 1 (I am sole dependent)
+** . THEN
+** . . decrement the BufferAiocb.dependent_count
+** . return to FreeBAiocbs (see below)
+** . unlock the buffer descriptor header
+** and ownership scope is from lock to either return to FreeBAiocbs or unlock
+** 2. adding and removing from FreeBAiocbs :
+** two alternative methods - controlled by conditional macro definition LOCK_BAIOCB_FOR_GET_REL
+** 2.1 LOCK_BAIOCB_FOR_GET_REL is defined - use a lock
+** . lock BufFreelistLock exclusive
+** . add / remove from FreeBAiocbs
+** . unlock BufFreelistLock exclusive
+** advantage of this method - never fails to add or remove
+** 2.2 LOCK_BAIOCB_FOR_GET_REL is not defined - use compare_and_swap
+** . retrieve the current Freelist pointer and validate
+** . compare_and_swap on/off the FreeBAiocbs list
+** advantage of this method - never waits
+** disadvantage - the compare_and_swap may fail;
+** to avoid losing a free BAiocb in that case, save it in a process-local cache and reuse it
+*/
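(Aside for reviewers: a minimal stand-alone sketch of the compare_and_swap push/pop that method 2.2 describes, using the same GCC builtin. The names node_t, freelist_head, freelist_push and freelist_pop are illustrative only - they are not the patch's BufferAiocb machinery.)

    #include <stddef.h>

    typedef struct node_t { struct node_t *next; } node_t;

    static node_t * volatile freelist_head;    /* shared LIFO freelist head */

    static void
    freelist_push(node_t *n)
    {
        node_t *old;
        do {
            old = freelist_head;        /* snapshot the current head */
            n->next = old;              /* link ourselves in front of it */
        } while (!__sync_bool_compare_and_swap(&freelist_head, old, n));
    }

    static node_t *
    freelist_pop(void)
    {
        node_t *old;
        do {
            old = freelist_head;        /* snapshot the current head */
            if (old == NULL)
                return NULL;            /* freelist empty */
        } while (!__sync_bool_compare_and_swap(&freelist_head, old, old->next));
        return old;                     /* we are now the sole owner */
    }

A bare CAS pop like this is exposed to the ABA problem if nodes can be recycled concurrently, and each CAS can fail under contention; the patch's approach, as I read it, is to bound the retries (maxGetBAiocbTries / maxRelBAiocbTries) and fall back to the process-local cache rather than spin.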
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+extern long num_startedaio;
+extern long num_signalledaio;
+
+/* structures relating to keeping track of aios originated by this backend */
+/* BAiocbIostatusArray is a circular array of aio operations past and present;
+ * each item is a 32-bit unsigned int container consisting of two fields
+ * . unsigned short status - one of three states :
+ * . not_in_progress no aio in progress, io lock not held
+ * . locked_in_progress aio in progress, io lock held eXclusive
+ * . locked_pending_release aio completion signalled, io lock held eXclusive
+ * . unsigned short BAiocb_index - the array index of the associated BAiocb when io lock held
+ *
+ * The active items in the array correspond to all aio_reads issued by this backend
+ * since the earliest io-locked aio. BAiocbIostatusEarliest indexes this entry.
+ * Every aio operation originated by this backend is assigned a monotonically increasing sequence number
+ * which, modulo the size of the array, identifies the array item representing that aio.
+*/
+unsigned int volatile * BAiocbIostatusArray = (unsigned int *)0;
+unsigned long BAiocbNextaioOrdinal = 0; /* ordinal number of each aio in the sequence */
+unsigned long BAiocbEarliestiolockedOrdinal = 0; /* ordinal number of earliest io-locked aio */
+unsigned long BAiocbIostatusDimension = 0; /* number of items in array */
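(Aside: a rough illustration - mine, not the patch's code - of how the two 16-bit fields can be packed into the 32-bit container and how an aio ordinal maps onto a slot of the circular array. Field widths and state values here are assumptions.)

    #include <stdint.h>

    #define IOSTAT_NOT_IN_PROGRESS        0u  /* no aio in progress, io lock not held */
    #define IOSTAT_LOCKED_IN_PROGRESS     1u  /* aio in progress, io lock held eXclusive */
    #define IOSTAT_LOCKED_PENDING_RELEASE 2u  /* completion signalled, io lock still held */

    static inline uint32_t
    iostatus_pack(uint16_t status, uint16_t baiocb_index)
    {
        return ((uint32_t) status << 16) | baiocb_index;
    }

    static inline uint16_t iostatus_status(uint32_t item) { return (uint16_t) (item >> 16); }
    static inline uint16_t iostatus_index(uint32_t item)  { return (uint16_t) (item & 0xffff); }

    /* ordinal -> array slot; the dimension need not be a power of two */
    static inline unsigned long
    iostatus_slot(unsigned long ordinal, unsigned long dimension)
    {
        return ordinal % dimension;
    }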
+
+ /* BufHandlSigAsync is invoked whenever an aio completion signal is delivered.
+ * siginfP->si_value.sival_ptr identifies the completed aio, whose
+ * non-originator waiters we are to release
+ */
+void BufHandlSigAsync(int sigsent, siginfo_t * siginfP, void *ucontext);
+sigset_t blockaiosighandlrmask; /* sig mask to block AIO signal temporarily */
+#endif /* USE_AIO_SIGEVENT */
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdef'd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ struct BAiocbAnchor dummy_BAiocbAnchr = { (struct BufferAiocb*)0 , (struct BufferAiocb*)0 };
+int maxGetBAiocbTries = -1; /* max times we will try to get a free BufferAiocb */
+int maxRelBAiocbTries = -1; /* max times we will try to release a BufferAiocb back to freelist */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Data Structures:
@@ -73,7 +180,20 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ , foundAiocbs
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ ;
+#if defined(USE_PREFETCH) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
+ char *envvarpointer = (char *)0; /* might point to an environment variable string */
+ char *charptr;
+#endif /* USE_PREFETCH || USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+#ifdef USE_AIO_SIGEVENT /* signal waiters instead of letting them poll */
+/* signal environment to invoke BufHandlSigAsync */
+struct sigaction chsigaction , olsigAIOaction;
+#endif /* USE_AIO_SIGEVENT */
BufferDescriptors = (BufferDesc *)
ShmemInitStruct("Buffer Descriptors",
@@ -83,6 +203,146 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ BAiocbAnchr = (struct BAiocbAnchor *)0; /* anchor for all control blocks pertaining to aio */
+ if (max_async_io_prefetchers < 0) { /* negative value indicates to initialize to something sensible during buf_init */
+ max_async_io_prefetchers = MaxConnections/6; /* default allows for average of MaxConnections/6 concurrent prefetchers - reasonable ??? */
+ }
+
+ if ((target_prefetch_pages > 0) && (max_async_io_prefetchers > 0)) {
+ int ix;
+ volatile struct BufferAiocb *BufferAiocbs;
+ volatile struct BufferAiocb * volatile FreeBAiocbs;
+
+ numBufferAiocbs = (target_prefetch_pages*max_async_io_prefetchers); /* target_prefetch_pages per prefetcher */
+ BAiocbAnchr = (struct BAiocbAnchor *)
+ ShmemInitStruct("Buffer Aiocbs",
+ sizeof(struct BAiocbAnchor) + (numBufferAiocbs * sizeof(struct BufferAiocb)), &foundAiocbs);
+ if (BAiocbAnchr) {
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs = (struct BufferAiocb*)(((char *)BAiocbAnchr) + sizeof(struct BAiocbAnchor));
+ FreeBAiocbs = (struct BufferAiocb*)0;
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbnext = FreeBAiocbs; /* init the free list, last one -> 0 */
+ (BufferAiocbs+ix)->BAiocbbufh = (struct sbufdesc*)0;
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0;
+ (BufferAiocbs+ix)->pidOfAio = 0;
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ (BufferAiocbs+ix)->BAiocbWaiterchain = (struct PGPROC *)0; /* initially not on a chain */
+#endif /* USE_AIO_SIGEVENT */
+ FreeBAiocbs = (BufferAiocbs+ix);
+
+ }
+ BAiocbAnchr->FreeBAiocbs = FreeBAiocbs;
+ envvarpointer = getenv("PG_MAX_GET_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxGetBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+ envvarpointer = getenv("PG_MAX_REL_BAIOCB_TRIES");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ maxRelBAiocbTries = strtol(envvarpointer, 0, 10);
+ }
+
+#ifdef USE_AIO_SIGEVENT /* signal waiters instead of letting them poll */
+ /* set our signal environment to invoke BufHandlSigAsync */
+ sigfillset(&blockaiosighandlrmask);
+ chsigaction.sa_sigaction = (void(*)(int, siginfo_t *, void *ucontext))&BufHandlSigAsync; /* call for siginfo */
+ chsigaction.sa_mask = blockaiosighandlrmask;
+ chsigaction.sa_flags = SA_SIGINFO; /* call for siginfo */
+ sigaction(AIO_SIGEVENT_SIGNALNUM, &chsigaction, &olsigAIOaction);
+#endif /* USE_AIO_SIGEVENT */
+ }
+ }
+#else /* not USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ /* this dummy structure is to ensure that references to these fields in other bufmgr runtime code
+ ** that is not conditionally ifdef'd on USE_AIO_ATOMIC_BUILTIN_COMP_SWAP compiles and runs correctly
+ */
+ BAiocbAnchr = &dummy_BAiocbAnchr;
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
+#ifdef USE_PREFETCH
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BITMAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_bitmap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_ISCAN");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_index_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_index_scans = 1;
+ } else
+ if ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) ) {
+ prefetch_index_scans = strtol(envvarpointer, &charptr, 10);
+ if (charptr && (',' == *charptr)) { /* optional sequential prefetch in index scans */
+ charptr++; /* following the comma ... */
+ if ( ('Y' == *charptr) || ('y' == *charptr) || ('1' == *charptr) ) {
+ prefetch_sequential_index_scans = 1;
+ }
+ }
+ }
+ /* if prefetching for ISCAN, then we require size of pfch_list to be at least target_prefetch_pages */
+ if ( (prefetch_index_scans > 0)
+ && (prefetch_index_scans < target_prefetch_pages)
+ ) {
+ prefetch_index_scans = target_prefetch_pages;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_BTREE");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_btree_heaps = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_btree_heaps = 1;
+ }
+ }
+ envvarpointer = getenv("PG_TRY_PREFETCHING_FOR_HEAP");
+ if (envvarpointer != (char *)0) {
+ if ( ('N' == *envvarpointer) || ('n' == *envvarpointer) ) {
+ prefetch_heap_scans = 0;
+ } else
+ if ( ('Y' == *envvarpointer) || ('y' == *envvarpointer) ) {
+ prefetch_heap_scans = 1;
+ }
+ }
+ envvarpointer = getenv("PG_PREFETCH_DBOID");
+ if ( (envvarpointer != (char *)0)
+ && ( ('1' <= *envvarpointer) && ('9' >= *envvarpointer) )
+ ) {
+ errno = 0; /* required in order to distinguish error from 0 */
+ prefetch_dbOid = (unsigned int)strtoul((const char *)envvarpointer, 0, 10);
+ if (errno) {
+ prefetch_dbOid = 0;
+ }
+ }
+ elog(LOG, "prefetching initialised with target_prefetch_pages= %d "
+ ", max_async_io_prefetchers= %d implying aio concurrency= %d "
+ ", prefetching_for_bitmap= %s "
+ ", prefetching_for_heap= %s "
+ ", prefetching_for_iscan= %d with sequential_index_page_prefetching= %s "
+ ", prefetching_for_btree= %s"
+ ,target_prefetch_pages ,max_async_io_prefetchers ,numBufferAiocbs
+ ,(prefetch_bitmap_scans ? "Y" : "N")
+ ,(prefetch_heap_scans ? "Y" : "N")
+ ,prefetch_index_scans
+ ,(prefetch_sequential_index_scans ? "Y" : "N")
+ ,(prefetch_btree_heaps ? "Y" : "N")
+ );
+#endif /* USE_PREFETCH */
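(Aside: the environment-variable handling above follows one convention throughout - a leading 'Y'/'N' selects a boolean, a leading digit 1-9 selects a number, and anything else leaves the compiled-in default alone. A compact sketch of that convention, with a hypothetical helper name:)

    #include <stdlib.h>
    #include <errno.h>

    /* Illustrative helper, not part of the patch: parse "Y"/"N" or a decimal value
    ** from an environment variable, leaving *value untouched when unset/unrecognised.
    */
    static void
    parse_env_flag_or_number(const char *name, unsigned int *value)
    {
        const char *s = getenv(name);

        if (s == NULL)
            return;                         /* unset: keep the compiled-in default */
        if (*s == 'Y' || *s == 'y')
            *value = 1;
        else if (*s == 'N' || *s == 'n')
            *value = 0;
        else if (*s >= '1' && *s <= '9')
        {
            char *end;

            errno = 0;                      /* distinguish "0 returned" from an error */
            *value = (unsigned int) strtoul(s, &end, 10);
            if (errno != 0)
                *value = 0;
        }
    }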
+
+
if (foundDescs || foundBufs)
{
/* both should be present or neither */
@@ -176,3 +436,82 @@ BufferShmemSize(void)
return size;
}
+
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/* imprecise count of number of in-use BAiocbs at any time
+ * we scan the array read-only without latching so are subject to unstable result
+ * (but since the array is in well-known contiguous storage,
+ * we are not subject to segmentation violation)
+ * This function may be called at any time and just does its best;
+ * it returns the count of in-use entries it observed.
+ */
+int
+CountInuseBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ int count = 0;
+ int ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->BufferAiocbs; /* start of list */
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((struct BufferAiocb*)BAIOCB_OCCUPIED == (BAiocb+ix)->BAiocbnext) { /* not on freelist ? */
+ count++;
+ }
+ }
+ }
+ return count;
+}
+
+/*
+ * report how many free BAiocbs at shutdown
+ * DO NOT call this while backends are actively working!!
+ * this report is useful when the compare_and_swap method is used (see above)
+ * as it can be used to deduce how many BAiocbs were in process-local caches -
+ * (original_number_on_freelist_at_startup - this_reported_number_at_shutdown)
+ */
+void
+ReportFreeBAiocbs(void)
+{
+ volatile struct BufferAiocb *BAiocb;
+ volatile struct BufferAiocb *BufferAiocbs;
+ int count = 0;
+ int fx , ix;
+
+
+ if (BAiocbAnchr != (struct BAiocbAnchor *)0 ) {
+ BAiocb = BAiocbAnchr->FreeBAiocbs; /* start of free list */
+ BufferAiocbs = BAiocbAnchr->BufferAiocbs;
+
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ (BufferAiocbs+ix)->BAiocbDependentCount = 0; /* use this as marker for finding it on freelist */
+ }
+ for (fx = (numBufferAiocbs-1); ( (fx>=0) && ( BAiocb != (struct BufferAiocb*)0 ) ); fx--) {
+
+ /* check if it is a valid BufferAiocb */
+ for (ix = (numBufferAiocbs-1); ix>=0; ix--) {
+ if ((BufferAiocbs+ix) == BAiocb) { /* is it this one ? */
+ break;
+ }
+ }
+ if (ix >= 0) {
+ if (BAiocb->BAiocbDependentCount) { /* seen it already ? */
+ elog(LOG, "ReportFreeBAiocbs closed cycle on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ BAiocb->BAiocbDependentCount = 1; /* use this as marker for finding it on freelist */
+ count++;
+ BAiocb = BAiocb->BAiocbnext;
+ } else {
+ elog(LOG, "ReportFreeBAiocbs invalid item on AIO control block freelist %p"
+ ,BAiocb);
+ fx = 0; /* give up at this point */
+ }
+ }
+ }
+ elog(LOG, "ReportFreeBAiocbs AIO control block list : poolsize= %d in-use-hwm= %d final-free= %d" ,numBufferAiocbs , hwmBufferAiocbs , count);
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
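(Aside: the freelist walk in ReportFreeBAiocbs doubles as a cycle check by reusing BAiocbDependentCount as a visited marker. The same idea in isolation, with illustrative types - fnode and count_free_detecting_cycle are not names from the patch:)

    /* Walk a singly linked freelist whose nodes all come from pool[0..n-1],
    ** using a per-node scratch field as a "visited" marker to detect a cycle.
    */
    typedef struct fnode { struct fnode *next; int visited; } fnode;

    static int
    count_free_detecting_cycle(fnode *head, fnode *pool, int n)
    {
        int i, count = 0;

        for (i = 0; i < n; i++)
            pool[i].visited = 0;             /* clear markers first */
        while (head != NULL)
        {
            if (head->visited)
                return -1;                   /* closed cycle detected */
            head->visited = 1;
            count++;
            head = head->next;
        }
        return count;
    }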
--- src/backend/storage/smgr/md.c.orig 2014-08-18 14:10:36.941017108 -0400
+++ src/backend/storage/smgr/md.c 2014-08-19 16:56:13.603196872 -0400
@@ -664,6 +664,80 @@ mdprefetch(SMgrRelation reln, ForkNumber
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * mdinitaio() -- init the aio subsystem max number of threads and max number of requests
+ */
+void
+mdinitaio(int max_aio_threads, int max_aio_num)
+{
+ FileInitaio( max_aio_threads, max_aio_num );
+}
+
+/*
+ * mdstartaio() -- start aio read of the specified block of a relation
+ */
+void
+mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , unsigned long BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+#ifdef USE_PREFETCH
+ off_t seekpos;
+ MdfdVec *v;
+ int local_retcode;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+
+ seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ local_retcode = FileStartaio(v->mdfd_vfd, seekpos, BLCKSZ , aiocbp
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ );
+ if (retcode) {
+ *retcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+
+
+/*
+ * mdcompleteaio() -- complete aio read of the specified block of a relation
+ * on entry, *inoutcode should indicate :
+ * . non-0 <=> check if complete and wait if not
+ * . 0 <=> cancel io immediately
+ */
+void
+mdcompleteaio( char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+#ifdef USE_PREFETCH
+ int local_retcode;
+
+ local_retcode = FileCompleteaio(aiocbp, (inoutcode ? *inoutcode : 0)
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BufAWaitAioCompletion
+ , BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ );
+ if (inoutcode) {
+ *inoutcode = local_retcode;
+ }
+#endif /* USE_PREFETCH */
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
+
/*
* mdread() -- Read the specified block from a relation.
*/
--- src/backend/storage/smgr/smgr.c.orig 2014-08-18 14:10:36.941017108 -0400
+++ src/backend/storage/smgr/smgr.c 2014-08-19 16:56:13.623196943 -0400
@@ -49,6 +49,21 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ void (*smgr_initaio) (int max_aio_threads, int max_aio_num);
+ void (*smgr_startaio) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , unsigned long BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ );
+ void (*smgr_completeaio) ( char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
@@ -66,7 +81,11 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+ mdprefetch
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ ,mdinitaio, mdstartaio, mdcompleteaio
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ , mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
mdpreckpt, mdsync, mdpostckpt
}
};
@@ -612,6 +631,53 @@ smgrprefetch(SMgrRelation reln, ForkNumb
(*(smgrsw[reln->smgr_which].smgr_prefetch)) (reln, forknum, blocknum);
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * smgrinitaio() -- initialize the aio subsystem max number of threads and max number of requests
+ */
+void
+smgrinitaio(int max_aio_threads, int max_aio_num)
+{
+ (*(smgrsw[0].smgr_initaio)) ( max_aio_threads, max_aio_num );
+}
+
+/*
+ * smgrstartaio() -- Initiate aio read of the specified block of a relation.
+ */
+void
+smgrstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , unsigned long BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ (*(smgrsw[reln->smgr_which].smgr_startaio)) (reln, forknum, blocknum , aiocbp , retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ );
+}
+
+/*
+ * smgrcompleteaio() -- Complete aio read of the specified block of a relation.
+ */
+void
+smgrcompleteaio(SMgrRelation reln, char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ (*(smgrsw[reln->smgr_which].smgr_completeaio)) ( aiocbp , inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , BufAWaitAioCompletion
+ , BAiocbP
+#endif /* USE_AIO_SIGEVENT */
+ );
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+
/*
* smgrread() -- read a particular block from a relation into the supplied
* buffer.
--- src/backend/storage/file/fd.c.orig 2014-08-18 14:10:36.937017089 -0400
+++ src/backend/storage/file/fd.c 2014-08-19 16:56:13.655197058 -0400
@@ -77,6 +77,9 @@
#include "utils/guc.h"
#include "utils/resowner_private.h"
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* We must leave some file descriptors free for system(), the dynamic loader,
@@ -123,6 +126,10 @@ int max_files_per_process = 1000;
*/
int max_safe_fds = 32; /* default if not changed */
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+extern long num_startedaio;
+extern long num_cancelledaio;
+#endif /* USE_AIO_SIGEVENT */
/* Debugging.... */
@@ -1239,6 +1246,10 @@ FileClose(File file)
* We could add an implementation using libaio in the future; but note that
* this API is inappropriate for libaio, which wants to have a buffer provided
* to read into.
+ * Also note that a new, different implementation of asynchronous prefetch
+ * using librt, not libaio, is provided by the two functions following this one,
+ * FileStartaio and FileCompleteaio. These also require a buffer to be provided
+ * to read into, which the new async_io support supplies.
*/
int
FilePrefetch(File file, off_t offset, int amount)
@@ -1266,6 +1277,207 @@ FilePrefetch(File file, off_t offset, in
#endif
}
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+/*
+ * FileInitaio - initialize the aio subsystem max number of threads and max number of requests
+ * input parms
+ * max_aio_threads; maximum number of threads
+ * max_aio_num; maximum number of concurrent aio read requests
+ *
+ * on linux, the man page for the librt implementation of aio_init() says :
+ * This function is a GNU extension.
+ * If your posix aio does not have it, then add the following line to
+ * src/include/pg_config_manual.h
+ * #define DONT_HAVE_AIO_INIT
+ * to render it as a no-op
+ */
+void
+FileInitaio(int max_aio_threads, int max_aio_num )
+{
+#ifndef DONT_HAVE_AIO_INIT
+ struct aioinit aioinit_struct; /* structure to pass to aio_init */
+
+ aioinit_struct.aio_threads = max_aio_threads; /* maximum number of threads */
+ aioinit_struct.aio_num = max_aio_num; /* maximum number of concurrent aio read requests */
+ aioinit_struct.aio_idle_time = 1; /* we don't want to alter this, but aio_init does not ignore it, so set it to the default */
+ aio_init(&aioinit_struct);
+#endif /* ndef DONT_HAVE_AIO_INIT */
+ return;
+}
+
+/*
+ * FileStartaio - initiate asynchronous read of a given range of the file.
+ * The logical seek position is unaffected.
+ *
+ * use standard posix aio (librt)
+ * ASSUME BufferAiocb.aio_buf already set to -> buffer by caller
+ * return 0 if successfully started, else non-zero
+ */
+int
+FileStartaio(File file, off_t offset, int amount , char *aiocbp
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ , unsigned long BAiocbaioOrdinal /* ordinal number of this aio in backend's sequence */
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ int returnCode;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartaio: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode >= 0) {
+
+ my_aiocbp->aio_fildes = VfdCache[file].fd;
+ my_aiocbp->aio_lio_opcode = LIO_READ;
+ my_aiocbp->aio_nbytes = amount;
+ my_aiocbp->aio_offset = offset;
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ my_aiocbp->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
+ my_aiocbp->aio_sigevent.sigev_signo = AIO_SIGEVENT_SIGNALNUM;
+ my_aiocbp->aio_sigevent.sigev_value.sival_ptr = (void *)BAiocbaioOrdinal; /* ordinal number of this aio */
+#endif /* USE_AIO_SIGEVENT */
+ returnCode = aio_read(my_aiocbp);
+#ifdef USE_AIO_SIGEVENT /* signal non-originator waiters instead of letting them poll */
+ if (returnCode == 0) {
+ num_startedaio++;
+ }
+#endif /* USE_AIO_SIGEVENT */
+ }
+
+ return returnCode;
+}
+
+/*
+ * FileCompleteaio - complete asynchronous aio read
+ * normal_wait indicates whether to cancel or wait -
+ * -1 <=> check - don't wait or cancel, but report whether complete
+ * 0 <=> cancel
+ * 1 <=> wait by polling the aiocb or waiting on signal i.e. non-originator
+ * 2 <=> wait by suspending on the aiocb i.e. originator
+ *
+ * use standard posix aio (librt)
+ * return 0 if successful and did not have to wait,
+ * 1 if successful and had to wait,
+ * else 0xff
+ */
+int
+FileCompleteaio( char *aiocbp , int normal_wait
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ )
+{
+ int returnCode;
+ int aio_errno;
+ struct aiocb *my_aiocbp = (struct aiocb *)aiocbp;
+ const struct aiocb *cblist[1];
+ int fd;
+ struct timespec my_timeout = { 0 , 10000 };
+ struct timespec *suspend_timeout_P; /* the timeout actually used depending on normal_wait */
+ int max_polls;
+
+ fd = my_aiocbp->aio_fildes;
+ cblist[0] = my_aiocbp;
+
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ /* note that aio_error returns 0 if op already completed successfully */
+
+ /* first handle normal case of waiting for op to complete */
+ if (normal_wait > 0) {
+ /* if told not to poll, then specify no timeout */
+ suspend_timeout_P = ( (normal_wait == 1) ? &my_timeout : (struct timespec *)0 );
+
+ while (aio_errno == EINPROGRESS) {
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ if (normal_wait == 1) { /* told to wait on signal */
+ returnCode = aio_errno = (*BufAWaitAioCompletion)(BAiocbP); /* wait for completion */
+ } else {
+#endif /* USE_AIO_SIGEVENT */
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ while ( (returnCode < 0) && (max_polls-- > 0)
+ && ((EAGAIN == errno) || (EINTR == errno))
+ ) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_suspend(cblist , 1 , suspend_timeout_P);
+ }
+
+ returnCode = aio_errno = aio_error(my_aiocbp);
+
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ }
+#endif /* USE_AIO_SIGEVENT */
+ /* now return_code is from aio_error */
+ if (returnCode == 0) {
+ returnCode = 1; /* successful but had to wait */
+ }
+ }
+ if (aio_errno) {
+ elog(LOG, "FileCompleteaio: normal_wait= %d fd= %d aio_errno= %d", normal_wait, fd, aio_errno);
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on signal instead of polling */
+ /* if this bkend is the originator, and therefore has the responsibility
+ ** to post suspended waiters, then do so now.
+ ** 140814 : Actually best not to do so :
+ ** because mutually redundant with the same functionality in BufHandlSigAsync.
+ ** at one time we thought that it was necessary to do so here in the mistaken belief
+ ** that there could be cases where
+ ** . an aio is started
+ ** . no signal is subsequently delivered until after the *next* aio is started
+ ** but in fact we have found that, after any aio_read is started,
+ ** at least one signal will be delivered after it completes
+ ** even if no subsequent aio_read is started.
+ if ( (normal_wait == 2) ** originator **
+ ** there are several places where originator might post suspended waiters :
+ * . BufCheckAsync if cancelling aio
+ * . here
+ * . in BufHandlSigAsync ( called if signal is delivered )
+ * The last two appear to be mutually redundant,
+ * but signals are not always delivered
+ * Fortunately it does no harm if both cases execute simultaneously
+ * because each serializes on the waiters list
+ **
+ ) {
+ BufPostAioWaiters(BAiocbaiolockP);
+ }
+ */
+#endif /* USE_AIO_SIGEVENT */
+ } else { /* check or cancel */
+ if (normal_wait == 0) { /* cancel */
+ if (aio_errno == EINPROGRESS) {
+ do {
+ max_polls = 256;
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ while ((returnCode == AIO_NOTCANCELED) && (max_polls-- > 0)) {
+ my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;
+ returnCode = aio_cancel(fd, my_aiocbp);
+ }
+ returnCode = aio_errno = aio_error(my_aiocbp);
+ } while (aio_errno == EINPROGRESS);
+ returnCode = 0xff; /* unsuccessful */
+ }
+ }
+ if (returnCode != 0)
+ returnCode = 0xff; /* unsuccessful */
+ }
+
+ DO_DB(elog(LOG, "FileCompleteaio: %d %d",
+ fd, returnCode));
+
+ return returnCode;
+}
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
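(Aside, for readers unfamiliar with the librt calls used above: a stand-alone sketch of the same aio_read / SIGEV_SIGNAL / aio_suspend / aio_error / aio_return cycle. SIGUSR1 stands in for AIO_SIGEVENT_SIGNALNUM, /etc/hosts for a relation segment, and error handling is abbreviated; on older glibc, link with -lrt.)

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t io_done = 0;

    static void
    on_aio_signal(int signo, siginfo_t *info, void *ucontext)
    {
        /* info->si_value.sival_ptr carries whatever tag the submitter stored
        ** (an ordinal number in the patch) */
        (void) signo; (void) info; (void) ucontext;
        io_done = 1;
    }

    int
    main(void)
    {
        static char      buf[8192];
        struct aiocb     cb;
        const struct aiocb *list[1];
        struct sigaction sa;
        int              fd = open("/etc/hosts", O_RDONLY);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_aio_signal;
        sa.sa_flags = SA_SIGINFO;               /* handler receives siginfo */
        sigaction(SIGUSR1, &sa, NULL);

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;
        cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
        cb.aio_sigevent.sigev_signo = SIGUSR1;
        cb.aio_sigevent.sigev_value.sival_ptr = (void *) (unsigned long) 42;  /* tag */

        if (aio_read(&cb) != 0)
            return 1;

        list[0] = &cb;
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);         /* originator-style wait */

        printf("read %zd bytes, signal seen = %d\n", aio_return(&cb), (int) io_done);
        close(fd);
        return 0;
    }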
+
int
FileRead(File file, char *buffer, int amount)
{
--- src/backend/storage/lmgr/proc.c.orig 2014-08-18 14:10:36.941017108 -0400
+++ src/backend/storage/lmgr/proc.c 2014-08-19 16:56:13.695197201 -0400
@@ -52,6 +52,7 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
+extern pid_t this_backend_pid; /* pid of this backend */
/* GUC variables */
int DeadlockTimeout = 1000;
@@ -361,6 +362,7 @@ InitProcess(void)
MyPgXact->xid = InvalidTransactionId;
MyPgXact->xmin = InvalidTransactionId;
MyProc->pid = MyProcPid;
+ this_backend_pid = getpid(); /* pid of this backend */
/* backendId, databaseId and roleId will be filled in later */
MyProc->backendId = InvalidBackendId;
MyProc->databaseId = InvalidOid;
@@ -536,6 +538,9 @@ InitAuxiliaryProcess(void)
MyProc->lwWaiting = false;
MyProc->lwWaitMode = 0;
MyProc->lwWaitLink = NULL;
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+ MyProc->BAiocbWaiterLink = NULL;
+#endif
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
#ifdef USE_ASSERT_CHECKING
--- src/backend/access/heap/heapam.c.orig 2014-08-18 14:10:36.841016638 -0400
+++ src/backend/access/heap/heapam.c 2014-08-19 16:56:13.795197558 -0400
@@ -71,6 +71,28 @@
#include "utils/syscache.h"
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "executor/instrument.h"
+
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_heap_scans; /* boolean whether to prefetch non-bitmap heap scans */
+
+/* special values for scan->rs_prefetch_target indicating as follows : */
+#define PREFETCH_MAYBE 0xffffffff /* prefetch permitted but not yet in effect */
+#define PREFETCH_DISABLED 0xfffffffe /* prefetch disabled and not permitted */
+/* PREFETCH_WRAP_POINT indicates a prefetcher that has reached the point where the scan would wrap -
+** at this point the prefetcher runs on the spot until the scan catches up.
+** This *must* exceed the maximum valid setting of target_prefetch_pages aka effective_io_concurrency,
+** so that it can never be mistaken for a real prefetch target.
+*/
+#define PREFETCH_WRAP_POINT 0x0fffffff
+
+#endif /* USE_PREFETCH */
+
/* GUC variable */
bool synchronize_seqscans = true;
@@ -115,6 +137,10 @@ static XLogRecPtr log_heap_new_cid(Relat
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
bool *copy);
+#ifdef USE_PREFETCH
+static void heap_unread_add(HeapScanDesc scan, BlockNumber blockno);
+static void heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno);
+#endif /* USE_PREFETCH */
/*
* Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -292,9 +318,150 @@ initscan(HeapScanDesc scan, ScanKey key,
* Currently, we don't have a stats counter for bitmap heap scans (but the
* underlying bitmap index scans will be counted).
*/
- if (!scan->rs_bitmapscan)
+#ifdef USE_PREFETCH
+ /* by default, no prefetching on any scan */
+ scan->rs_prefetch_target = PREFETCH_DISABLED; /* tentatively disable */
+ scan->rs_pfchblock = 0; /* scanner will reset this to be ahead of scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)0; /* list of prefetched but unread blocknos */
+ scan->rs_Unread_Pfetched_next = 0; /* next unread blockno */
+ scan->rs_Unread_Pfetched_count = 0; /* number of valid unread blocknos */
+#endif /* USE_PREFETCH */
+ if (!scan->rs_bitmapscan) {
+
pgstat_count_heap_scan(scan->rs_rd);
+#ifdef USE_PREFETCH
+ /* bitmap scans do their own prefetching -
+ ** for others, set up prefetching now
+ */
+ if ( prefetch_heap_scans
+ && (target_prefetch_pages > 0)
+ && (!RelationUsesLocalBuffers(scan->rs_rd))
+ ) {
+ /* prefetch_dbOid may be set to a database Oid to specify only prefetch in that db */
+ if ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == scan->rs_rd->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ ) {
+ scan->rs_prefetch_target = PREFETCH_MAYBE; /* permitted but let the scan decide */
+ }
+ else {
+ }
}
+#endif /* USE_PREFETCH */
+ }
+}
+
+#ifdef USE_PREFETCH
+/* add this blockno to the list of prefetched and unread blocknos.
+** use the slot identified by the ((next+count) modulo circumference) index if it is unused,
+** else search for the first available slot if there is one,
+** else report the overflow and drop the blockno.
+*/
+static void
+heap_unread_add(HeapScanDesc scan, BlockNumber blockno)
+{
+ BlockNumber *available_P; /* where to store new blockno */
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next
+ + scan->rs_Unread_Pfetched_count; /* index of next unused slot */
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if (blockno != InvalidBlockNumber) {
+
+ /* ensure there is some room somewhere */
+ if (scan->rs_Unread_Pfetched_count < target_prefetch_pages) {
+
+ /* try the "next+count" one */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages; /* modulo circumference */
+ }
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ goto store_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ available_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* where to store new blockno */
+ if (*available_P == InvalidBlockNumber) { /* unused */
+ /* before storing this blockno,
+ ** since the next pointer did not locate an unused slot,
+ ** set it to one which is more likely to be so for the next time
+ */
+ scan->rs_Unread_Pfetched_next = Unread_Pfetched_index;
+ goto store_blockno;
+ }
+ }
+ }
+ }
+
+ /* if we reach here, either there was no available slot
+ ** or we thought there was one and didn't find any
+ */
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("heap_unread_add overflowed list cannot add blockno %d", blockno)));
+
+ return;
+
+ store_blockno:
+ *available_P = blockno;
+ scan->rs_Unread_Pfetched_count++; /* update count */
+
+ }
+
+ return;
+}
+
+/* remove the specified blockno from the list of prefetched and unread blocknos.
+** Usually this will be found at the rs_Unread_Pfetched_next item -
+** else search for it. If not found, ignore it - no error results.
+*/
+static void
+heap_unread_subtract(HeapScanDesc scan, BlockNumber blockno)
+{
+ unsigned int Unread_Pfetched_index = scan->rs_Unread_Pfetched_next; /* index of next unread blockno */
+ BlockNumber *candidate_P; /* location of callers blockno - maybe */
+ BlockNumber nextUnreadPfetched;
+
+ /* caller is not supposed to pass InvalidBlockNumber but check anyway */
+ if ( (blockno != InvalidBlockNumber)
+ && ( scan->rs_Unread_Pfetched_count > 0 ) /* if the list is not empty */
+ ) {
+
+ /* take modulo of the circumference.
+ ** actually rs_Unread_Pfetched_next should never exceed the circumference but check anyway.
+ */
+ if (Unread_Pfetched_index >= target_prefetch_pages) {
+ Unread_Pfetched_index -= target_prefetch_pages;
+ }
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index);
+ nextUnreadPfetched = *candidate_P;
+
+ if ( nextUnreadPfetched == blockno ) {
+ goto remove_blockno;
+ } else {
+ /* slow-search the entire list */
+ for (Unread_Pfetched_index = 0; Unread_Pfetched_index < target_prefetch_pages; Unread_Pfetched_index++) {
+ candidate_P = (scan->rs_Unread_Pfetched_base + Unread_Pfetched_index); /* candidate location of caller's blockno */
+ if (*candidate_P == blockno) { /* found it */
+ goto remove_blockno;
+ }
+ }
+ return; /* not found - ignore it, no error results */
+ }
+
+ remove_blockno:
+ *candidate_P = InvalidBlockNumber;
+
+ scan->rs_Unread_Pfetched_next = (Unread_Pfetched_index+1); /* update next pfchd unread */
+ if (scan->rs_Unread_Pfetched_next >= target_prefetch_pages) {
+ scan->rs_Unread_Pfetched_next = 0;
+ }
+ scan->rs_Unread_Pfetched_count--; /* update count */
+ }
+
+ return;
+}
+#endif /* USE_PREFETCH */
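(Aside: a stand-alone model - not the patch's types - of the fixed-size list of prefetched-but-unread block numbers maintained by the two helpers above: a bounded set over a circular index in which an unused slot is marked with a sentinel. LIST_SIZE stands in for target_prefetch_pages.)

    #include <stdio.h>

    #define INVALID_BLOCK ((unsigned) ~0u)
    #define LIST_SIZE     8

    static unsigned list[LIST_SIZE];
    static unsigned next_slot = 0;          /* index of the oldest unread entry */
    static unsigned count = 0;              /* number of valid entries */

    static int
    unread_add(unsigned blockno)
    {
        unsigned ix = (next_slot + count) % LIST_SIZE;

        if (count >= LIST_SIZE)
            return 0;                       /* overflow: caller reports and drops it */
        if (list[ix] != INVALID_BLOCK)      /* preferred slot taken: slow-search */
            for (ix = 0; ix < LIST_SIZE && list[ix] != INVALID_BLOCK; ix++)
                ;
        if (ix >= LIST_SIZE)
            return 0;
        list[ix] = blockno;
        count++;
        return 1;
    }

    static void
    unread_subtract(unsigned blockno)
    {
        unsigned ix = next_slot % LIST_SIZE;

        if (count == 0)
            return;
        if (list[ix] != blockno)            /* usual case is the oldest entry */
        {
            for (ix = 0; ix < LIST_SIZE && list[ix] != blockno; ix++)
                ;
            if (ix >= LIST_SIZE)
                return;                     /* not found: ignore, as in the patch */
        }
        list[ix] = INVALID_BLOCK;
        next_slot = (ix + 1) % LIST_SIZE;
        count--;
    }

    int
    main(void)
    {
        unsigned i;

        for (i = 0; i < LIST_SIZE; i++)
            list[i] = INVALID_BLOCK;
        unread_add(101); unread_add(102); unread_add(103);
        unread_subtract(102);               /* out-of-order removal is tolerated */
        printf("still unread: %u entries\n", count);
        return 0;
    }

The sentinel plus a separate count lets add and subtract tolerate out-of-order removal (a scan can read or discard a block that was not the oldest prefetched one) without shuffling the array.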
/*
* heapgetpage - subroutine for heapgettup()
@@ -304,7 +471,7 @@ initscan(HeapScanDesc scan, ScanKey key,
* which tuples on the page are visible.
*/
static void
-heapgetpage(HeapScanDesc scan, BlockNumber page)
+heapgetpage(HeapScanDesc scan, BlockNumber page , BlockNumber prefetchHWM)
{
Buffer buffer;
Snapshot snapshot;
@@ -314,6 +481,10 @@ heapgetpage(HeapScanDesc scan, BlockNumb
OffsetNumber lineoff;
ItemId lpp;
bool all_visible;
+#ifdef USE_PREFETCH
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+#endif /* USE_PREFETCH */
+
Assert(page < scan->rs_nblocks);
@@ -336,6 +507,98 @@ heapgetpage(HeapScanDesc scan, BlockNumb
RBM_NORMAL, scan->rs_strategy);
scan->rs_cblock = page;
+#ifdef USE_PREFETCH
+
+ heap_unread_subtract(scan, page);
+
+ /* maybe prefetch some pages starting with rs_pfchblock */
+ if (scan->rs_prefetch_target >= 0) { /* prefetching enabled on this scan ? */
+ int next_block_to_be_read = (page+1); /* next block to be read = lowest possible prefetchable block */
+ int num_to_pfch_this_time; /* eventually holds the number of blocks to prefetch now */
+ int prefetchable_range; /* size of the area ahead of the current prefetch position */
+
+ /* check if prefetcher reached wrap point and the scan has now wrapped */
+ if ( (page == 0) && (scan->rs_prefetch_target == PREFETCH_WRAP_POINT) ) {
+ scan->rs_prefetch_target = 1;
+ scan->rs_pfchblock = next_block_to_be_read;
+ } else
+ if (scan->rs_pfchblock < next_block_to_be_read) {
+ scan->rs_pfchblock = next_block_to_be_read; /* next block to be prefetched must be ahead of one we just read */
+ }
+
+ /* now we know where we would start prefetching -
+ ** next question - if this is a sync scan, ensure we do not prefetch behind the HWM
+ ** debatable whether to require strict inequality or >= - >= works better in practice
+ */
+ if ( (!scan->rs_syncscan) || (scan->rs_pfchblock >= prefetchHWM) ) {
+
+ /* now we know where we will start prefetching -
+ ** next question - how many?
+ ** apply two limits :
+ ** 1. target prefetch distance
+ ** 2. number of available blocks ahead of us
+ */
+
+ /* 1. target prefetch distance */
+ num_to_pfch_this_time = next_block_to_be_read + scan->rs_prefetch_target; /* page beyond prefetch target */
+ num_to_pfch_this_time -= scan->rs_pfchblock; /* convert to offset */
+
+ /* first do prefetching up to our current limit ...
+ ** highest page number that a scan (pre)-fetches is scan->rs_nblocks-1
+ ** note - prefetcher does not wrap a prefetch range -
+ ** instead just stop and then start again if and when main scan wraps
+ */
+ if (scan->rs_pfchblock <= scan->rs_startblock) { /* if on second leg towards startblock */
+ prefetchable_range = ((int)(scan->rs_startblock) - (int)(scan->rs_pfchblock));
+ }
+ else { /* on first leg towards nblocks */
+ prefetchable_range = ((int)(scan->rs_nblocks) - (int)(scan->rs_pfchblock));
+ }
+ if (prefetchable_range > 0) { /* if there's a range to prefetch */
+
+ /* 2. number of available blocks ahead of us */
+ if (num_to_pfch_this_time > prefetchable_range) {
+ num_to_pfch_this_time = prefetchable_range;
+ }
+ while (num_to_pfch_this_time-- > 0) {
+ PrefetchBufferRc = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, scan->rs_pfchblock, scan->rs_strategy);
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ if (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED) {
+ heap_unread_add(scan, scan->rs_pfchblock);
+ }
+ scan->rs_pfchblock++;
+ /* if syncscan and requested block was already in buffer pool,
+ ** this suggests that another scanner is ahead of us and we should advance
+ */
+ if ( (scan->rs_syncscan) && (PrefetchBufferRc & PREFTCHRC_BLK_ALREADY_PRESENT) ) {
+ scan->rs_pfchblock++;
+ num_to_pfch_this_time--;
+ }
+ }
+ }
+ else {
+ /* we must not modify scan->rs_pfchblock here
+ ** because it is needed for possible DiscardBuffer at end of scan ...
+ ** ... instead ...
+ */
+ scan->rs_prefetch_target = PREFETCH_WRAP_POINT; /* mark this prefetcher as waiting to wrap */
+ }
+
+ /* ... then adjust prefetching limit : by doubling on each iteration */
+ if (scan->rs_prefetch_target == 0) {
+ scan->rs_prefetch_target = 1;
+ }
+ else {
+ scan->rs_prefetch_target *= 2;
+ if (scan->rs_prefetch_target > target_prefetch_pages) {
+ scan->rs_prefetch_target = target_prefetch_pages;
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
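(Aside: the ramp-up policy applied above - start at 1, double each time the scan consumes a page, clamp at target_prefetch_pages - in isolation, with hypothetical names:)

    /* Illustrative only: exponential ramp-up of the per-scan prefetch distance. */
    static int
    advance_prefetch_target(int current_target, int max_target)
    {
        if (current_target <= 0)
            return 1;                    /* first page read: start prefetching gently */
        current_target *= 2;             /* double each time we consume a page */
        if (current_target > max_target)
            current_target = max_target; /* clamp at the configured prefetch distance */
        return current_target;
    }

Starting small and doubling means a scan that stops after a few pages never floods the aio queue, while a long scan quickly reaches the full prefetch distance.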
+
+
if (!scan->rs_pageatatime)
return;
@@ -452,6 +715,10 @@ heapgettup(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+#ifdef USE_PREFETCH
+ int ix;
+#endif /* USE_PREFETCH */
/*
* calculate next starting lineoff, given scan direction
@@ -470,7 +737,25 @@ heapgettup(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineoff = FirstOffsetNumber; /* first offnum */
scan->rs_inited = true;
}
@@ -516,7 +801,7 @@ heapgettup(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -557,7 +842,7 @@ heapgettup(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -660,8 +945,12 @@ heapgettup(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+#ifdef USE_PREFETCH
+ prefetchHWM = scan->rs_pfchblock;
+#endif /* USE_PREFETCH */
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -671,6 +960,22 @@ heapgettup(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -678,7 +983,7 @@ heapgettup(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
@@ -727,6 +1032,10 @@ heapgettup_pagemode(HeapScanDesc scan,
OffsetNumber lineoff;
int linesleft;
ItemId lpp;
+ BlockNumber prefetchHWM = 0; /* HWM of prefetch BlockNum for all participants in same sync scan */
+#ifdef USE_PREFETCH
+ int ix;
+#endif /* USE_PREFETCH */
/*
* calculate next starting lineindex, given scan direction
@@ -745,7 +1054,25 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
page = scan->rs_startblock; /* first page */
- heapgetpage(scan, page);
+#ifdef USE_PREFETCH
+ /* decide if we shall do prefetching : only if :
+ ** . prefetching enabled for this scan
+ ** . not a bitmap scan (which do their own)
+ ** . sufficient number of blocks - at least twice the target_prefetch_pages
+ */
+ if ( (!scan->rs_bitmapscan)
+ && (scan->rs_prefetch_target >= PREFETCH_MAYBE)
+ && (scan->rs_nblocks > (2*target_prefetch_pages))
+ ) {
+ scan->rs_prefetch_target = 1; /* do prefetching on forward non-bitmap scan */
+ scan->rs_Unread_Pfetched_base = (BlockNumber *)palloc(sizeof(BlockNumber)*target_prefetch_pages);
+ /* Initialise the list */
+ for (ix = 0; ix < target_prefetch_pages; ix++) {
+ *(scan->rs_Unread_Pfetched_base + ix) = InvalidBlockNumber;
+ }
+ }
+#endif /* USE_PREFETCH */
+ heapgetpage(scan, page, prefetchHWM);
lineindex = 0;
scan->rs_inited = true;
}
@@ -788,7 +1115,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = scan->rs_startblock - 1;
else
page = scan->rs_nblocks - 1;
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
}
else
{
@@ -826,7 +1153,7 @@ heapgettup_pagemode(HeapScanDesc scan,
page = ItemPointerGetBlockNumber(&(tuple->t_self));
if (page != scan->rs_cblock)
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
/* Since the tuple was previously fetched, needn't lock page here */
dp = (Page) BufferGetPage(scan->rs_cbuf);
@@ -921,8 +1248,12 @@ heapgettup_pagemode(HeapScanDesc scan,
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
- if (scan->rs_syncscan)
- ss_report_location(scan->rs_rd, page);
+ if (scan->rs_syncscan) {
+#ifdef USE_PREFETCH
+ prefetchHWM = scan->rs_pfchblock;
+#endif /* USE_PREFETCH */
+ ss_report_location(scan->rs_rd, page, &prefetchHWM);
+ }
}
/*
@@ -932,6 +1263,22 @@ heapgettup_pagemode(HeapScanDesc scan,
{
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
scan->rs_cbuf = InvalidBuffer;
scan->rs_cblock = InvalidBlockNumber;
tuple->t_data = NULL;
@@ -939,7 +1286,7 @@ heapgettup_pagemode(HeapScanDesc scan,
return;
}
- heapgetpage(scan, page);
+ heapgetpage(scan, page, prefetchHWM);
dp = (Page) BufferGetPage(scan->rs_cbuf);
lines = scan->rs_ntuples;
@@ -1394,6 +1741,23 @@ void
heap_rescan(HeapScanDesc scan,
ScanKey key)
{
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1418,6 +1782,23 @@ heap_endscan(HeapScanDesc scan)
{
/* Note: no locking manipulations needed */
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
/*
* unpin scan buffers
*/
@@ -1435,6 +1816,12 @@ heap_endscan(HeapScanDesc scan)
if (scan->rs_strategy != NULL)
FreeAccessStrategy(scan->rs_strategy);
+#ifdef USE_PREFETCH
+ if (scan->rs_Unread_Pfetched_base) {
+ pfree(scan->rs_Unread_Pfetched_base);
+ }
+#endif /* USE_PREFETCH */
+
if (scan->rs_temp_snap)
UnregisterSnapshot(scan->rs_snapshot);
@@ -1464,7 +1851,6 @@ heap_endscan(HeapScanDesc scan)
#define HEAPDEBUG_3
#endif /* !defined(HEAPDEBUGALL) */
-
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
@@ -6364,6 +6750,25 @@ heap_markpos(HeapScanDesc scan)
void
heap_restrpos(HeapScanDesc scan)
{
+
+
+#ifdef USE_PREFETCH
+ if ( (scan->rs_pfchblock > 0) && (scan->rs_cblock != InvalidBlockNumber) ) {
+ BlockNumber *Unread_Pfetched_base = scan->rs_Unread_Pfetched_base;
+ unsigned int Unread_Pfetched_next = scan->rs_Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count = scan->rs_Unread_Pfetched_count;
+
+ /* we have prefetched but not read the range from cblock+1 to pfchblock-1 if that is non-empty */
+ while ((Unread_Pfetched_count--) > 0) {
+ DiscardBuffer( scan->rs_rd, MAIN_FORKNUM, *(Unread_Pfetched_base+Unread_Pfetched_next));
+ heap_unread_subtract(scan, *(Unread_Pfetched_base+Unread_Pfetched_next)); /* remove this blockno from list of prefetched and unread blocknos */
+ Unread_Pfetched_next++;
+ if (Unread_Pfetched_next >= target_prefetch_pages)
+ Unread_Pfetched_next = 0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* XXX no amrestrpos checking that ammarkpos called */
if (!ItemPointerIsValid(&scan->rs_mctid))
--- src/backend/access/heap/syncscan.c.orig 2014-08-18 14:10:36.841016638 -0400
+++ src/backend/access/heap/syncscan.c 2014-08-19 16:56:13.835197702 -0400
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locatio
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, Block
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumbe
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumbe
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNu
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
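(Aside: the prefetchHWM handshake added to ss_search() is a read-modify-max under the existing SyncScanLock - a caller offers its own prefetch position and either raises the shared high-water mark or adopts the higher one. A sketch with illustrative names, assuming the caller already holds the protecting lock:)

    #define INVALID_BLOCK ((unsigned) ~0u)

    static unsigned shared_prefetch_hwm = INVALID_BLOCK;

    /* Returns the position the caller should continue prefetching from. */
    static unsigned
    reconcile_prefetch_hwm(unsigned my_position)
    {
        if (shared_prefetch_hwm == INVALID_BLOCK || shared_prefetch_hwm < my_position)
            shared_prefetch_hwm = my_position;   /* I am the furthest-ahead prefetcher */
        else
            my_position = shared_prefetch_hwm;   /* someone else is ahead: adopt theirs */
        return my_position;
    }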
--- src/backend/access/index/indexam.c.orig 2014-08-18 14:10:36.845016657 -0400
+++ src/backend/access/index/indexam.c 2014-08-19 16:56:13.855197774 -0400
@@ -79,6 +79,55 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit);
+
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; also sizes the pfch_list */
+
+/* if specified block number is present in the prefetch array,
+** then either mark it as not to be discarded or evict it according to input param
+*/
+void index_mark_or_evict_block(IndexScanDesc scan , BlockNumber blocknumber , int markit)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ /* no need to check for scan->pfch_next < prefetch_index_scans
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) == blocknumber) {
+ if (markit) {
+ /* mark it as not to be discarded */
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard &= ~PREFTCHRC_BUF_PIN_INCREASED;
+ } else {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)), (&(((scan->pfch_block_item_list)+pfchz)->pfch_blockid)));
+ ((scan->pfch_block_item_list)+pfchx)->pfch_discard = ((scan->pfch_block_item_list)+pfchz)->pfch_discard;
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx = pfchz; /* advance */
+ }
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
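(Aside: the eviction path of index_mark_or_evict_block() is essentially "remove one entry from a small array by shuffling the tail left while a cursor follows the element it pointed at". The same operation in isolation, with hypothetical names:)

    /* Illustrative only: evict items[victim] from a small in-order array,
    ** keeping a separate "next" cursor pointing at the same logical element.
    */
    static void
    evict_entry(unsigned *items, unsigned short *used, unsigned short *next,
                unsigned short victim)
    {
        unsigned short i;

        if (victim >= *used)
            return;
        for (i = victim; i + 1 < *used; i++)
        {
            items[i] = items[i + 1];     /* shift the survivor left */
            if (*next == i + 1)
                *next = i;               /* cursor follows the element it pointed at */
        }
        (*used)--;
    }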
+
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -253,6 +302,11 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -277,6 +331,11 @@ index_beginscan_bitmap(Relation indexRel
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -311,6 +370,9 @@ index_beginscan_internal(Relation indexR
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -342,6 +404,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -373,10 +441,30 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct pfch_block_item*)0 != scan->pfch_block_item_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (((scan->pfch_block_item_list)+pfchx)->pfch_discard) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)));
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -472,6 +560,12 @@ index_getnext_tid(IndexScanDesc scan, Sc
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 0);
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -479,6 +573,11 @@ index_getnext_tid(IndexScanDesc scan, Sc
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -502,6 +601,10 @@ index_getnext_tid(IndexScanDesc scan, Sc
* enough information to do it efficiently in the general case.
* ----------------
*/
+#if defined(USE_PREFETCH) && defined(AVOID_CATALOG_MIGRATION_FOR_ASYNCIO)
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -509,16 +612,111 @@ index_fetch_heap(IndexScanDesc scan)
bool all_dead = false;
bool got_heap_tuple;
+
+
/* We can skip the buffer-switching logic if we're in mid-HOT chain. */
if (!scan->xs_continue_hot)
{
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+
+	/* If the old block is different from the new block, evict the old
+	** block from the prefetched array. Arguably we should leave it in
+	** the array, because it's likely to remain in the buffer pool for
+	** a while; but if we do evict it and encounter the block again,
+	** prefetching it again does no harm (and note that, if it's no
+	** longer pinned, the prefetch will try to pin it again, since
+	** prefetch banks a pin on a buffer already in the buffer pool).
+	** Therefore evicting should usually win.
+	*/
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < prefetch_index_scans) /* ensure there is an entry */
+ ) {
+ index_mark_or_evict_block(scan , BlocknumOfBuffer(prev_buf) , 0);
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+		/* If the new block had been prefetched and pinned,
+		** then mark that it no longer needs to be discarded.
+		** We deliberately do not evict the entry,
+		** because we want to remember that it was recently prefetched.
+		*/
+		index_mark_or_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf) , 1);
+#endif /* USE_PREFETCH */
+
+ scan->heap_tids_fetched++;
+
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found; /* did we find the "next" heap tid in current index page */
+ int PrefetchBufferRc; /* indicates whether requested prefetch block already in a buffer and if pin count on buffer has been increased */
+
+ if (scan->do_prefetch) {
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple; /* is incorrect but avoids adding function to catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ if (RegProcedureIsValid(scan->indexRelation->rd_am->ampeeknexttuple)) {
+ GET_SCAN_PROCEDURE(ampeeknexttuple); /* is correct but requires adding function to catalog */
+ } else {
+ procedure = 0;
+ }
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
+ if ( procedure /* does the index access method support peektuple? */
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ && procedure->fn_addr /* procedure->fn_addr is non-null only if in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ ) {
+				int iterations = 1; /* how many prefetches to attempt this call:
+				** 2 while the prefetch list holds fewer than target_prefetch_pages
+				** entries, else 1; this ramps the list up gradually and smoothly
+				** to target_prefetch_pages
+				*/
+				/* note: we trust InitIndexScan verified this scan is forward-only and set do_prefetch accordingly */
+ if (scan->pfch_used < target_prefetch_pages) {
+ iterations = 2;
+ }
+ do {
+ found = DatumGetBool(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+						btpeeknexttuple(scan) /* pass scan as a direct parameter since we can't use fmgr because the function is not in the catalog */
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ FunctionCall1(procedure, PointerGetDatum(scan)) /* use fmgr to call it because in catalog */
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ );
+ if (found) {
+ /* btpeeknexttuple set pfch_next to point to the item in block_item_list to be prefetched */
+ PrefetchBufferRc = PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((&((scan->pfch_block_item_list + scan->pfch_next))->pfch_blockid)) , 0);
+ /* elog(LOG,"index_fetch_heap prefetched rel %u blockNum %u"
+ ,scan->heapRelation->rd_node.relNode ,BlockIdGetBlockNumber(scan->pfch_block_item_list + scan->pfch_next));
+ */
+
+ /* if pin acquired on buffer, then remember in case of future Discard */
+ (scan->pfch_block_item_list + scan->pfch_next)->pfch_discard = (PrefetchBufferRc & PREFTCHRC_BUF_PIN_INCREASED);
+
+
+ }
+ } while (--iterations > 0);
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
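
[Editorial aside, not part of the patch: the peek-and-prefetch loop above boils down to
"peek at the next heap block the index would return, remember it, and prefetch it, issuing
two per fetch while ramping up". The following standalone model shows that shape with a
fake block stream; TARGET_PREFETCH, peek_next_new_block and the other names are
illustrative only and no PostgreSQL API is used:

    #include <stdio.h>
    #include <stdbool.h>

    #define TARGET_PREFETCH 4
    #define N_UPCOMING 6

    static unsigned int upcoming[N_UPCOMING] = {100, 100, 101, 102, 102, 103};
    static int peek_pos = 0;
    static unsigned int remembered[TARGET_PREFETCH];
    static int remembered_used = 0;

    /* Peek at the next heap block the scan would visit that is not already
     * remembered; returns false when the (fake) index page is exhausted. */
    static bool peek_next_new_block(unsigned int *blk)
    {
        for (; peek_pos < N_UPCOMING; peek_pos++)
        {
            bool seen = false;

            for (int i = 0; i < remembered_used; i++)
                if (remembered[i] == upcoming[peek_pos])
                    seen = true;
            if (!seen)
            {
                *blk = upcoming[peek_pos++];
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        /* Ramp up: two prefetches per heap fetch until the remembered list
         * is full, then one, mirroring the "iterations" logic above. */
        for (int fetch = 0; fetch < 4; fetch++)
        {
            int iterations = (remembered_used < TARGET_PREFETCH) ? 2 : 1;

            do
            {
                unsigned int blk;

                if (peek_next_new_block(&blk) && remembered_used < TARGET_PREFETCH)
                {
                    remembered[remembered_used++] = blk;
                    printf("fetch %d: prefetch block %u\n", fetch, blk);
                }
            } while (--iterations > 0);
        }
        return 0;
    }
]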
--- src/backend/access/index/genam.c.orig 2014-08-18 14:10:36.845016657 -0400
+++ src/backend/access/index/genam.c 2014-08-19 16:56:13.883197874 -0400
@@ -77,6 +77,12 @@ RelationGetIndexScan(Relation indexRelat
scan = (IndexScanDesc) palloc(sizeof(IndexScanDescData));
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+#endif /* USE_PREFETCH */
+
scan->heapRelation = NULL; /* may be set later */
scan->indexRelation = indexRelation;
scan->xs_snapshot = InvalidSnapshot; /* caller must initialize this */
@@ -139,6 +145,19 @@ RelationGetIndexScan(Relation indexRelat
void
IndexScanEnd(IndexScanDesc scan)
{
+#ifdef USE_PREFETCH
+ if (scan->do_prefetch) {
+ if ( (struct pfch_block_item*)0 != scan->pfch_block_item_list ) {
+ pfree(scan->pfch_block_item_list);
+ scan->pfch_block_item_list = (struct pfch_block_item*)0;
+ }
+ if ( (struct pfch_index_pagelist*)0 != scan->pfch_index_page_list ) {
+ pfree(scan->pfch_index_page_list);
+ scan->pfch_index_page_list = (struct pfch_index_pagelist*)0;
+ }
+ }
+#endif /* USE_PREFETCH */
+
if (scan->keyData != NULL)
pfree(scan->keyData);
if (scan->orderByData != NULL)
--- src/backend/access/nbtree/nbtsearch.c.orig 2014-08-18 14:10:36.845016657 -0400
+++ src/backend/access/nbtree/nbtsearch.c 2014-08-19 16:56:13.911197974 -0400
@@ -23,13 +23,18 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_btree_heaps; /* boolean whether to prefetch heap pages in _bt_next for non-bitmap index scans */
+extern unsigned int prefetch_sequential_index_scans; /* boolean whether to prefetch sequential-access non-bitmap index scans */
+#endif /* USE_PREFETCH */
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
+static Buffer _bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -226,7 +231,11 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
continue;
}
@@ -1005,7 +1014,7 @@ _bt_first(IndexScanDesc scan, ScanDirect
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1040,6 +1049,8 @@ _bt_next(IndexScanDesc scan, ScanDirecti
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1052,11 +1063,56 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+#ifdef USE_PREFETCH
+ /* consider prefetching */
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+					/* start prefetch on this heap block, provided that
+					** EITHER we have been reading non-sequentially (previously or for this block)
+					** OR the user explicitly asked to prefetch sequential patterns too,
+					** since doing so may be counterproductive otherwise
+					*/
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
}
else
{
@@ -1065,11 +1121,56 @@ _bt_next(IndexScanDesc scan, ScanDirecti
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+#ifdef USE_PREFETCH
+ /* consider prefetching */
+ if (prefetch_btree_heaps && (target_prefetch_pages > 0)) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) { /* if not a repetition of previous block */
+					/* start prefetch on this heap block, provided that
+					** EITHER we have been reading non-sequentially (previously or for this block)
+					** OR the user explicitly asked to prefetch sequential patterns too,
+					** since doing so may be counterproductive otherwise
+					*/
+ nonsequential = (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1));
+ if (prefetch_sequential_index_scans || nonsequential) {
+ _bt_prefetchbuf(scan->heapRelation, blkno , &scan->pfch_index_page_list );
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
}
/* OK, itemIndex says what to return */
@@ -1119,9 +1220,11 @@ _bt_readpage(IndexScanDesc scan, ScanDir
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+	 * corresponding need for the left-link for stepping, since splits always
+	 * go right; we do save it, however, for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1156,6 +1259,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1187,6 +1291,7 @@ _bt_readpage(IndexScanDesc scan, ScanDir
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1224,7 +1329,7 @@ _bt_saveitem(BTScanOpaque so, int itemIn
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1278,7 +1383,11 @@ _bt_steppage(IndexScanDesc scan, ScanDir
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ
+#ifdef USE_PREFETCH
+ ,scan->pfch_index_page_list
+#endif /* USE_PREFETCH */
+ );
/* check for deleted page */
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1287,9 +1396,22 @@ _bt_steppage(IndexScanDesc scan, ScanDir
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+#ifdef USE_PREFETCH
+ if ( prefetch && so->currPos.moreRight
+			/* start prefetch on the next index page, provided that
+			** EITHER we're reading non-sequentially for this block
+			** OR the user explicitly asked to prefetch sequential patterns too,
+			** since doing so may be counterproductive otherwise
+			*/
+ && (prefetch_sequential_index_scans || opaque->btpo_next != (blkno+1))
+ ) {
+ _bt_prefetchbuf(rel, opaque->btpo_next , &scan->pfch_index_page_list);
+ }
+#endif /* USE_PREFETCH */
break;
}
+ }
/* nope, keep going */
blkno = opaque->btpo_next;
}
@@ -1317,7 +1439,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ so->currPos.buf = _bt_walk_left(scan , rel, so->currPos.buf);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -1332,14 +1454,60 @@ _bt_steppage(IndexScanDesc scan, ScanDir
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+#ifdef USE_PREFETCH
+ if (prefetch && so->currPos.moreLeft) {
+			/* detect back-sequential runs and blindly grow the prefetch window
+			 * downwards, 2 blocks at a time. This only works in our favor
+			 * for index-only scans, by letting the kernel merge read requests,
+			 * so we inflate target_prefetch_pages here, since merged
+			 * back-sequential requests are about as expensive as a single one.
+			 */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos-- , &scan->pfch_index_page_list);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev , &scan->pfch_index_page_list);
+ }
+ }
+#endif /* USE_PREFETCH */
break;
}
}
}
+ }
return true;
}
@@ -1359,7 +1527,7 @@ _bt_steppage(IndexScanDesc scan, ScanDir
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(IndexScanDesc scan, Relation rel, Buffer buf)
{
Page page;
BTPageOpaque opaque;
@@ -1387,7 +1555,11 @@ _bt_walk_left(Relation rel, Buffer buf)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ
+#ifdef USE_PREFETCH
+ , scan->pfch_index_page_list
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1631,7 +1803,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDir
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
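
[Editorial aside, not part of the patch: the back-sequential window logic in _bt_steppage
can be modelled in isolation. The sketch below uses an illustrative BACK_PREFETCH_PAGES
constant (the patch derives it as min(target_prefetch_pages * 16, 64)) and prints which
block numbers would be prefetched as the scan walks left; back_run and back_pos stand in
for so->backSeqRun and so->backSeqPos:

    #include <stdio.h>

    #define BACK_PREFETCH_PAGES 8

    int main(void)
    {
        unsigned int back_run = 0;      /* so->backSeqRun */
        unsigned int back_pos = 0;      /* so->backSeqPos */

        /* Walk left through blocks 100, 99, 98, ... as a backward index-only
         * scan would, issuing up to two prefetches per step once the run is
         * long enough, so the window keeps sliding ahead of the scan. */
        for (unsigned int blkno = 100; blkno > 90; blkno--)
        {
            back_pos = (back_run == 0) ? blkno - 1 : back_pos;
            back_run++;

            if (back_pos > 0 && (blkno - back_pos) <= BACK_PREFETCH_PAGES)
            {
                printf("step at %u: prefetch %u\n", blkno, back_pos--);
                if (back_run >= BACK_PREFETCH_PAGES
                    && back_pos > 0
                    && (blkno - back_pos) <= BACK_PREFETCH_PAGES)
                    printf("step at %u: prefetch %u\n", blkno, back_pos--);
            }
        }
        return 0;
    }
]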
--- src/backend/access/nbtree/nbtinsert.c.orig 2014-08-18 14:10:36.845016657 -0400
+++ src/backend/access/nbtree/nbtinsert.c 2014-08-19 16:56:13.947198104 -0400
@@ -793,7 +793,11 @@ _bt_insertonpg(Relation rel,
{
Assert(!P_ISLEAF(lpageop));
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -972,7 +976,11 @@ _bt_split(Relation rel, Buffer buf, Buff
bool isleaf;
/* Acquire a new page to split into */
- rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
/*
* origpage is the original page to be split. leftpage is a temporary
@@ -1175,7 +1183,11 @@ _bt_split(Relation rel, Buffer buf, Buff
if (!P_RIGHTMOST(oopaque))
{
- sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
spage = BufferGetPage(sbuf);
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -1817,7 +1829,11 @@ _bt_finish_split(Relation rel, Buffer lb
Assert(P_INCOMPLETE_SPLIT(lpageop));
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rpage = BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
@@ -1829,7 +1845,11 @@ _bt_finish_split(Relation rel, Buffer lb
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1877,7 +1897,11 @@ _bt_getstackbuf(Relation rel, BTStack st
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2008,12 +2032,20 @@ _bt_newroot(Relation rel, Buffer lbuf, B
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
--- src/backend/access/nbtree/nbtpage.c.orig 2014-08-18 14:10:36.845016657 -0400
+++ src/backend/access/nbtree/nbtpage.c 2014-08-19 16:56:13.967198176 -0400
@@ -127,7 +127,11 @@ _bt_getroot(Relation rel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -153,7 +157,11 @@ _bt_getroot(Relation rel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -209,7 +217,11 @@ _bt_getroot(Relation rel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+ rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
@@ -350,7 +362,11 @@ _bt_gettrueroot(Relation rel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -436,7 +452,11 @@ _bt_getrootheight(Relation rel)
Page metapg;
BTPageOpaque metaopaque;
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
metad = BTPageGetMeta(metapg);
@@ -561,6 +581,172 @@ _bt_log_reuse_page(Relation rel, BlockNu
END_CRIT_SECTION();
}
+#ifdef USE_PREFETCH
+/*
+ *	_bt_prefetchbuf() -- Prefetch a buffer by block number
+ *		and keep track of prefetched-but-unread blocknums in the pagelist.
+ *		Input parameters:
+ *		rel and blkno identify the block to be prefetched, as usual
+ *		pfch_index_page_list_P points to the pointer anchoring the head of the index page list
+ *		Since the pagelist is only an optimization,
+ *		allocation failure is handled by quietly skipping the bookkeeping.
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P)
+{
+
+ int rc = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_item* found_item = 0;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_plp = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_plp = *pfch_index_page_list_P;
+ }
+
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+		/* prefetch an existing block of the relation,
+		** but first check that it is not already on the prefetched-but-unread list
+		*/
+ found_item = _bt_find_block(blkno , pfch_index_plp);
+ if ((struct pfch_index_item*)0 == found_item) { /* not found */
+
+ rc = PrefetchBuffer(rel, MAIN_FORKNUM, blkno , 0);
+
+			/* add the block number to the list, recording its discard status;
+			** since this is only an optimization, ignore failures such as exceeding the allowed space
+			*/
+ _bt_add_block( blkno , pfch_index_page_list_P , (uint32)(rc & PREFTCHRC_BUF_PIN_INCREASED));
+
+ }
+ }
+ return;
+}
+
+/* _bt_find_block finds the item referencing specified Block in index page list if present
+** and returns the pointer to the pfch_index_item if found, or null if not
+*/
+struct pfch_index_item*
+_bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+
+ struct pfch_index_item* found_item = 0;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ int ix, tx;
+
+ pfch_index_plp = pfch_index_page_list;
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ ix = 0;
+ tx = pfch_index_plp->pfch_index_item_count;
+ while ( (ix < tx)
+ && ( (struct pfch_index_item*)0 == found_item)
+ ) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+ found_item = &pfch_index_plp->pfch_indexid[ix];
+ }
+ ix++;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+
+ return found_item;
+}
+
+/* _bt_add_block adds the specified Block to the index page list
+** and returns 0 if successful, non-zero if not
+*/
+int
+_bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status)
+{
+ int rc = 1;
+ int ix;
+ struct pfch_index_pagelist* pfch_index_plp; /* pointer to current chunk */
+ struct pfch_index_pagelist* pfch_index_page_list_anchor; /* pointer to first chunk if any */
+	/* allow the pagelist to expand to 16 chunks,
+	** which accommodates backwards-sequential index scans,
+	** where the scanner increases target_prefetch_pages by a factor of up to 16
+	** (see the code in _bt_steppage).
+	** Note: this creates an undesirable weak dependency on that number in _bt_steppage,
+	** but there is no disaster if the numbers disagree - just sub-optimal use of the list.
+	** Implementing a proper interface would require chunks of variable size,
+	** which would require an extra size field in each chunk.
+	*/
+ int num_chunks = 16;
+
+ if ((struct pfch_index_pagelist**)0 == pfch_index_page_list_P) {
+ pfch_index_page_list_anchor = (struct pfch_index_pagelist*)0;
+ } else {
+ pfch_index_page_list_anchor = *pfch_index_page_list_P;
+ }
+ pfch_index_plp = pfch_index_page_list_anchor; /* pointer to current chunk */
+
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ if (ix < target_prefetch_pages) {
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = (ix+1);
+ rc = 0;
+ goto stored_pagenum;
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ num_chunks--; /* keep track of number of chunks */
+ }
+
+ /* we did not find any free space in existing chunks -
+ ** create new chunk if within our limit and we have a pfch_index_page_list
+ */
+ if ( (num_chunks > 0) && ((struct pfch_index_pagelist*)0 != pfch_index_page_list_anchor) ) {
+ pfch_index_plp = (struct pfch_index_pagelist*)palloc( sizeof(struct pfch_index_pagelist) + ( (target_prefetch_pages-1) * sizeof(struct pfch_index_item) ) );
+ if ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ pfch_index_plp->pfch_index_pagelist_next = pfch_index_page_list_anchor; /* old head of list is next after this */
+ pfch_index_plp->pfch_indexid[0].pfch_blocknum = blkno;
+ pfch_index_plp->pfch_indexid[0].pfch_discard = discard_status;
+ pfch_index_plp->pfch_index_item_count = 1;
+			*pfch_index_page_list_P = pfch_index_plp; /* new head of list is the new chunk */
+ rc = 0;
+ }
+ }
+
+ stored_pagenum:;
+ return rc;
+}
+
+/* _bt_subtract_block removes a block from the prefetched-but-unread pagelist if present */
+void
+_bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list)
+{
+ struct pfch_index_pagelist* pfch_index_plp = pfch_index_page_list;
+ if ( (blkno != P_NEW) && (blkno != P_NONE) ) {
+ int ix , jx;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_blocknum == blkno) {
+				/* move the last item to the current (now deleted) position and decrement count */
+ jx = (pfch_index_plp->pfch_index_item_count-1); /* index of last item ... */
+ if (jx > ix) { /* ... is not the current one so move is required */
+ pfch_index_plp->pfch_indexid[ix].pfch_blocknum = pfch_index_plp->pfch_indexid[jx].pfch_blocknum;
+ pfch_index_plp->pfch_indexid[ix].pfch_discard = pfch_index_plp->pfch_indexid[jx].pfch_discard;
+ ix = jx;
+ }
+ pfch_index_plp->pfch_index_item_count = ix;
+ goto done_subtract;
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+ }
+ done_subtract: return;
+}
+#endif /* USE_PREFETCH */
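
[Editorial aside, not part of the patch: the pagelist maintained by _bt_add_block /
_bt_find_block / _bt_subtract_block is a chain of fixed-capacity chunks with O(1) removal
by overwriting the hole with the chunk's last entry. A standalone model of just that
removal step (Chunk, CHUNK_CAP and subtract_block are illustrative names, not patch
symbols):

    #include <stdio.h>

    #define CHUNK_CAP 4   /* stand-in for target_prefetch_pages */

    typedef struct Chunk {
        struct Chunk *next;
        int count;
        unsigned int blocknum[CHUNK_CAP];
    } Chunk;

    /* Delete blkno by overwriting it with the chunk's last entry,
     * the same constant-time removal _bt_subtract_block uses. */
    static void subtract_block(Chunk *c, unsigned int blkno)
    {
        for (; c != NULL; c = c->next)
            for (int i = 0; i < c->count; i++)
                if (c->blocknum[i] == blkno)
                {
                    c->blocknum[i] = c->blocknum[c->count - 1];
                    c->count--;
                    return;
                }
    }

    int main(void)
    {
        Chunk c2 = { NULL, 2, {40, 41} };
        Chunk c1 = { &c2, 4, {10, 11, 12, 13} };

        subtract_block(&c1, 11);
        for (Chunk *c = &c1; c != NULL; c = c->next)
            for (int i = 0; i < c->count; i++)
                printf("%u ", c->blocknum[i]);   /* prints: 10 13 12 40 41 */
        printf("\n");
        return 0;
    }
]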
+
/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
@@ -573,7 +759,11 @@ _bt_log_reuse_page(Relation rel, BlockNu
* _bt_checkpage to sanity-check the page (except in P_NEW case).
*/
Buffer
-_bt_getbuf(Relation rel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access
+#ifdef USE_PREFETCH
+ ,struct pfch_index_pagelist* pfch_index_page_list
+#endif /* USE_PREFETCH */
+ )
{
Buffer buf;
@@ -581,6 +771,12 @@ _bt_getbuf(Relation rel, BlockNumber blk
{
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
+
+#ifdef USE_PREFETCH
+ /* if the block is in the prefetched-but-unread pagelist, remove it */
+ _bt_subtract_block( blkno , pfch_index_page_list);
+#endif /* USE_PREFETCH */
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
}
@@ -702,6 +898,10 @@ _bt_getbuf(Relation rel, BlockNumber blk
* bufmgr when one would do. However, now it's mainly just a notational
* convenience. The only case where it saves work over _bt_relbuf/_bt_getbuf
* is when the target page is the same one already in the buffer.
+ *
+ * If prefetching of index pages is ever changed to use this function,
+ * then it should be extended to take the index_page_list as a parameter
+ * and call _bt_subtract_block in the same way that _bt_getbuf does.
*/
Buffer
_bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
@@ -712,6 +912,7 @@ _bt_relandgetbuf(Relation rel, Buffer ob
if (BufferIsValid(obuf))
LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
+
LockBuffer(buf, access);
_bt_checkpage(rel, buf);
return buf;
@@ -965,7 +1166,11 @@ _bt_is_page_halfdead(Relation rel, Block
BTPageOpaque opaque;
bool result;
- buf = _bt_getbuf(rel, blk, BT_READ);
+ buf = _bt_getbuf(rel, blk, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1069,7 +1274,11 @@ _bt_lock_branch_parent(Relation rel, Blo
Page lpage;
BTPageOpaque lopaque;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
@@ -1265,7 +1474,11 @@ _bt_pagedel(Relation rel, Buffer buf)
BTPageOpaque lopaque;
Page lpage;
- lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
lpage = BufferGetPage(lbuf);
lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/*
@@ -1340,7 +1553,11 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ buf = _bt_getbuf(rel, rightsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
}
return ndeleted;
@@ -1593,7 +1810,11 @@ _bt_unlink_halfdead_page(Relation rel, B
target = topblkno;
/* fetch the block number of the topmost parent's left sibling */
- buf = _bt_getbuf(rel, topblkno, BT_READ);
+ buf = _bt_getbuf(rel, topblkno, BT_READ
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
@@ -1632,7 +1853,11 @@ _bt_unlink_halfdead_page(Relation rel, B
LockBuffer(leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -1646,7 +1871,11 @@ _bt_unlink_halfdead_page(Relation rel, B
RelationGetRelationName(rel));
return false;
}
- lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
@@ -1701,7 +1930,11 @@ _bt_unlink_halfdead_page(Relation rel, B
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (opaque->btpo_prev != target)
@@ -1731,7 +1964,11 @@ _bt_unlink_halfdead_page(Relation rel, B
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE
+#ifdef USE_PREFETCH
+ , (struct pfch_index_pagelist*)0
+#endif /* USE_PREFETCH */
+ );
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
--- src/backend/access/nbtree/nbtree.c.orig 2014-08-18 14:10:36.845016657 -0400
+++ src/backend/access/nbtree/nbtree.c 2014-08-19 16:56:13.979198218 -0400
@@ -29,6 +29,18 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_index_scans; /* whether to prefetch non-bitmap index scans; also the numeric size of the pfch_block_item_list */
+#endif /* USE_PREFETCH */
+
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+);
/* Working state for btbuild and its callback */
typedef struct
@@ -324,6 +336,78 @@ btgettuple(PG_FUNCTION_ARGS)
}
/*
+ *	btpeeknexttuple() -- peek at the next tuple whose heap block differs from every blocknum in pfch_block_item_list,
+ *			without reading a new index page
+ *			and without causing any side-effects such as altering values in control blocks.
+ *			If found, store the blocknum in the next element of pfch_block_item_list.
+ *			This function is useful only when postgresql is compiled with USE_PREFETCH.
+ */
+Datum
+btpeeknexttuple(
+#ifdef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan
+#else /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ PG_FUNCTION_ARGS
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+)
+{
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+
+#ifdef USE_PREFETCH
+ /*
+ * If we've already initialized this scan, we can just advance it in
+ * the appropriate direction. If we haven't done so yet, bail out
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop handles advancing till we find different data block or end of index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals((&(((scan->pfch_block_item_list)+pfchx)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < prefetch_index_scans) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= prefetch_index_scans) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((&((scan->pfch_block_item_list + scan->pfch_next)->pfch_blockid)) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+#endif /* USE_PREFETCH */
+ PG_RETURN_BOOL(res);
+}
+
+/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
Datum
@@ -417,6 +501,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -509,6 +599,23 @@ btendscan(PG_FUNCTION_ARGS)
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_plp;
+ int ix;
+
+ /* discard all prefetched but unread index pages listed in the pagelist */
+ pfch_index_plp = scan->pfch_index_page_list;
+ while ( (struct pfch_index_pagelist*)0 != pfch_index_plp ) {
+ ix = pfch_index_plp->pfch_index_item_count;
+ while (ix-- > 0) {
+ if (pfch_index_plp->pfch_indexid[ix].pfch_discard) {
+ DiscardBuffer( scan->indexRelation , MAIN_FORKNUM , pfch_index_plp->pfch_indexid[ix].pfch_blocknum);
+ }
+ }
+ pfch_index_plp = pfch_index_plp->pfch_index_pagelist_next;
+ }
+#endif /* USE_PREFETCH */
+
/* we aren't holding any read locks, but gotta drop the pins */
if (BTScanPosIsValid(so->currPos))
{
--- src/backend/nodes/tidbitmap.c.orig 2014-08-18 14:10:36.877016807 -0400
+++ src/backend/nodes/tidbitmap.c 2014-08-19 16:56:14.063198519 -0400
@@ -44,6 +44,9 @@
#include "nodes/bitmapset.h"
#include "nodes/tidbitmap.h"
#include "utils/hsearch.h"
+#ifdef USE_PREFETCH
+extern int target_prefetch_pages;
+#endif /* USE_PREFETCH */
/*
* The maximum number of tuples per page is not large (typically 256 with
@@ -572,7 +575,12 @@ tbm_begin_iterate(TIDBitmap *tbm)
* needs of the TBMIterateResult sub-struct.
*/
iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
- MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+ MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)
+#ifdef USE_PREFETCH
+ /* space for remembering every prefetched but unread blockno */
+ + (target_prefetch_pages * sizeof(BlockNumber))
+#endif /* USE_PREFETCH */
+ );
iterator->tbm = tbm;
/*
@@ -1020,3 +1028,70 @@ tbm_comparator(const void *left, const v
return 1;
return 0;
}
+
+#ifdef USE_PREFETCH
+void
+tbm_zero(TBMIterator *iterator) /* zero list of prefetched and unread blocknos */
+{
+ /* locate the list of prefetched but unread blocknos immediately following the array of offsets
+ ** and note that tbm_begin_iterate allocates space for (1 + MAX_TUPLES_PER_PAGE) offsets -
+ ** 1 included in struct TBMIterator and MAX_TUPLES_PER_PAGE additional
+ */
+ iterator->output.Unread_Pfetched_base = ((BlockNumber *)(&(iterator->output.offsets[MAX_TUPLES_PER_PAGE+1])));
+ iterator->output.Unread_Pfetched_next = iterator->output.Unread_Pfetched_count = 0;
+}
+
+void
+tbm_add(TBMIterator *iterator, BlockNumber blockno) /* add this blockno to list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next + iterator->output.Unread_Pfetched_count++;
+
+ if (iterator->output.Unread_Pfetched_count > target_prefetch_pages) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_add overflowed list cannot add blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index -= target_prefetch_pages;
+ *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index) = blockno;
+}
+
+void
+tbm_subtract(TBMIterator *iterator, BlockNumber blockno) /* remove this blockno from list of prefetched and unread blocknos */
+{
+ unsigned int Unread_Pfetched_index = iterator->output.Unread_Pfetched_next++;
+ BlockNumber nextUnreadPfetched;
+
+	/* make a weak check that the next blockno is the one to be removed;
+	** in case of disagreement, we ignore the caller's blockno and remove the next one anyway,
+	** which is really what the caller wants
+	*/
+ if ( iterator->output.Unread_Pfetched_count == 0 ) {
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract empty list cannot subtract blockno %d", blockno)));
+ }
+
+ if (Unread_Pfetched_index >= target_prefetch_pages)
+ Unread_Pfetched_index = 0;
+ nextUnreadPfetched = *(iterator->output.Unread_Pfetched_base + Unread_Pfetched_index);
+ if ( ( nextUnreadPfetched != blockno )
+	  && ( nextUnreadPfetched != InvalidBlockNumber ) /* don't report it if the block in the list was InvalidBlockNumber */
+ ) {
+ ereport(NOTICE,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("tbm_subtract will subtract blockno %d not %d",
+ nextUnreadPfetched, blockno)));
+ }
+ if (iterator->output.Unread_Pfetched_next >= target_prefetch_pages)
+ iterator->output.Unread_Pfetched_next = 0;
+ iterator->output.Unread_Pfetched_count--;
+}
+#endif /* USE_PREFETCH */
+
+TBMIterateResult *
+tbm_locate_IterateResult(TBMIterator *iterator)
+{
+ return &(iterator->output);
+}
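
[Editorial aside, not part of the patch: tbm_zero/tbm_add/tbm_subtract implement a small
FIFO ring of prefetched-but-unread block numbers carved out of the iterator allocation.
A standalone model of the ring arithmetic (Ring, RING_CAP, ring_add and ring_subtract are
illustrative names; the patch's wrap checks on Unread_Pfetched_next are folded into the
two helpers):

    #include <stdio.h>

    #define RING_CAP 4   /* stand-in for target_prefetch_pages */

    typedef struct {
        unsigned int blocks[RING_CAP];
        unsigned int next;    /* oldest unread entry (Unread_Pfetched_next)  */
        unsigned int count;   /* entries in flight   (Unread_Pfetched_count) */
    } Ring;

    static void ring_add(Ring *r, unsigned int blkno)
    {
        unsigned int i = r->next + r->count++;

        if (i >= RING_CAP)
            i -= RING_CAP;
        r->blocks[i] = blkno;
    }

    /* Pop the oldest entry; in the patch this is the block the caller is
     * about to read, so the two normally agree. */
    static unsigned int ring_subtract(Ring *r)
    {
        unsigned int blkno = r->blocks[r->next];

        if (++r->next >= RING_CAP)
            r->next = 0;
        r->count--;
        return blkno;
    }

    int main(void)
    {
        Ring r = { {0}, 0, 0 };
        unsigned int a, b;

        ring_add(&r, 7);
        ring_add(&r, 9);
        a = ring_subtract(&r);
        b = ring_subtract(&r);
        printf("%u %u\n", a, b);   /* prints: 7 9 */
        return 0;
    }
]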
--- src/backend/utils/misc/guc.c.orig 2014-08-18 14:10:37.009017427 -0400
+++ src/backend/utils/misc/guc.c 2014-08-19 16:56:14.119198719 -0400
@@ -2258,6 +2258,25 @@ static struct config_int ConfigureNamesI
},
{
+ {"max_async_io_prefetchers",
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ PGC_USERSET,
+#else
+ PGC_INTERNAL,
+#endif
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Maximum number of background processes concurrently using asynchronous librt threads to prefetch pages into shared memory buffers."),
+ },
+ &max_async_io_prefetchers,
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ -1, 0, 8192, /* boot val -1 indicates to initialize to something sensible during buf_init */
+#else
+ 0, 0, 0,
+#endif
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
--- src/backend/utils/mmgr/aset.c.orig 2014-08-18 14:10:37.009017427 -0400
+++ src/backend/utils/mmgr/aset.c 2014-08-19 16:56:14.179198935 -0400
@@ -733,6 +733,48 @@ AllocSetAlloc(MemoryContext context, Siz
*/
fidx = AllocSetFreeIndex(size);
chunk = set->freelist[fidx];
+#ifdef MEMORY_CONTEXT_CHECKING
+	/* an instance of a segfault caused by a rogue value in set->freelist[fidx]
+	** has been seen - check for it using a crude sanity check based on its neighbours:
+	** if at least one neighbour is sufficiently close, then pass, else fail
+	*/
+ if (chunk != 0) {
+ int frx, nrx; /* frx is index, nrx is index of failing neighbour for errmsg */
+ for (nrx = -1, frx = 0; (frx < ALLOCSET_NUM_FREELISTS); frx++) {
+ if ( (frx != fidx) /* not the chosen one */
+ && ( ( (unsigned long)(set->freelist[frx]) ) != 0 ) /* not empty */
+ ) {
+ if ( ( (unsigned long)chunk < ( ( (unsigned long)(set->freelist[frx]) ) / 2 ) )
+ && ( ( (unsigned long)(set->freelist[frx]) ) < 0x4000000 )
+ /*** || ( (unsigned long)chunk > ( ( (unsigned long)(set->freelist[frx]) ) * 2 ) ) ***/
+ ) {
+ nrx = frx;
+ } else {
+ nrx = -1;
+ break;
+ }
+ }
+ }
+
+ if (nrx >= 0) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d compared with neighbour %p whose chunksize %d"
+ , chunk , fidx , set->freelist[nrx] , set->freelist[nrx]->size);
+ chunk = NULL;
+ }
+ }
+#else /* if not MEMORY_CONTEXT_CHECKING make very simple-minded check*/
+ if ( (chunk != 0) && ( (unsigned long)chunk < 0x40000 ) ) {
+ /* this must be a rogue value - put this list in the garbage */
+ /* build message but be careful to avoid recursively triggering same fault */
+ set->freelist[fidx] = NULL; /* mark this list empty */
+ elog(WARNING, "detected rogue value %p in freelist index %d"
+ , chunk , fidx);
+ chunk = NULL;
+ }
+#endif
if (chunk != NULL)
{
Assert(chunk->size >= size);
--- src/include/executor/instrument.h.orig 2014-08-18 14:10:37.057017653 -0400
+++ src/include/executor/instrument.h 2014-08-19 16:56:14.427199824 -0400
@@ -28,8 +28,18 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
+
+ long aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_discrd; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_forgot; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
} BufferUsage;
/* Flag bits included in InstrAlloc's instrument_options bitmask */
--- src/include/storage/bufmgr.h.orig 2014-08-18 14:10:37.065017690 -0400
+++ src/include/storage/bufmgr.h 2014-08-19 16:56:14.455199924 -0400
@@ -41,6 +41,7 @@ typedef enum
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
RBM_NORMAL_NO_LOG /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
+ ,RBM_NOREAD_FOR_PREFETCH /* Don't read from disk, don't zero buffer, find buffer only */
} ReadBufferMode;
/* in globals.c ... this duplicates miscadmin.h */
@@ -57,6 +58,9 @@ extern int target_prefetch_pages;
extern PGDLLIMPORT char *BufferBlocks;
extern PGDLLIMPORT int32 *PrivateRefCount;
+/* in buf_async.c */
+extern int max_async_io_prefetchers; /* Maximum number of backends using asynchronous librt threads to read pages into our buffers */
+
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
@@ -159,9 +163,15 @@ extern PGDLLIMPORT int32 *LocalRefCount;
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
/*
- * prototypes for functions in bufmgr.c
+ * prototypes for external functions in bufmgr.c and buf_async.c
*/
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
+extern int PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum , BufferAccessStrategy strategy);
+/* return code is an int bitmask : */
+#define PREFTCHRC_BUF_PIN_INCREASED 0x01 /* pin count on buffer has been increased by 1 */
+#define PREFTCHRC_BLK_ALREADY_PRESENT 0x02 /* block was already present in a buffer */
+
+extern void DiscardBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
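
[Editorial aside, not part of the patch: the reworked PrefetchBuffer now returns a bitmask,
and a caller that sees PREFTCHRC_BUF_PIN_INCREASED is expected to DiscardBuffer the block
later if it never reads it. A standalone toy illustrating only the flag-testing convention
(fake_prefetch is a stand-in; the bit values are copied from the declaration above):

    #include <stdio.h>

    /* Mirror of the patch's return-code bits (values copied from bufmgr.h). */
    #define PREFTCHRC_BUF_PIN_INCREASED   0x01
    #define PREFTCHRC_BLK_ALREADY_PRESENT 0x02

    /* Fake prefetch: pretend even-numbered blocks were already resident,
     * so a pin gets banked on them. */
    static int fake_prefetch(unsigned int blkno)
    {
        return (blkno % 2 == 0)
            ? (PREFTCHRC_BLK_ALREADY_PRESENT | PREFTCHRC_BUF_PIN_INCREASED)
            : 0;
    }

    int main(void)
    {
        for (unsigned int blkno = 1; blkno <= 4; blkno++)
        {
            int rc = fake_prefetch(blkno);

            /* The caller only needs to remember a later "discard" (unpin)
             * when the prefetch banked a pin on a resident buffer. */
            int must_discard = (rc & PREFTCHRC_BUF_PIN_INCREASED) != 0;

            printf("block %u: discard later? %s\n",
                   blkno, must_discard ? "yes" : "no");
        }
        return 0;
    }
]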
--- src/include/storage/proc.h.orig 2014-08-18 14:10:37.065017690 -0400
+++ src/include/storage/proc.h 2014-08-19 16:56:14.487200038 -0400
@@ -106,6 +106,10 @@ struct PGPROC
uint8 lwWaitMode; /* lwlock mode being waited for */
struct PGPROC *lwWaitLink; /* next waiter for same LW lock */
+#if defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP) && defined(USE_AIO_SIGEVENT)
+ struct PGPROC *BAiocbWaiterLink; /* chain of backends waiting for aio completion */
+#endif
+
/* Info about lock the process is currently waiting for, if any. */
/* waitLock and waitProcLock are NULL if not currently waiting. */
LOCK *waitLock; /* Lock object we're sleeping on ... */
--- src/include/storage/smgr.h.orig 2014-08-18 14:10:37.069017709 -0400
+++ src/include/storage/smgr.h 2014-08-19 16:56:14.507200109 -0400
@@ -92,6 +92,21 @@ extern void smgrextend(SMgrRelation reln
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void smgrinitaio(int max_aio_threads, int max_aio_num);
+extern void smgrstartaio(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , unsigned long BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ );
+extern void smgrcompleteaio( SMgrRelation reln, char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
@@ -118,6 +133,20 @@ extern void mdextend(SMgrRelation reln,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void mdinitaio(int max_aio_threads, int max_aio_num);
+extern void mdstartaio(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum , char *aiocbp , int *retcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , unsigned long BAiocbaioOrdinal
+#endif /* USE_AIO_SIGEVENT */
+ );
+extern void mdcompleteaio( char *aiocbp , int *inoutcode
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
--- src/include/storage/fd.h.orig 2014-08-18 14:10:37.065017690 -0400
+++ src/include/storage/fd.h 2014-08-19 16:56:14.547200253 -0400
@@ -69,6 +69,20 @@ extern File PathNameOpenFile(FileName fi
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+extern void FileInitaio(int max_aio_threads, int max_aio_num );
+extern int FileStartaio(File file, off_t offset, int amount , char *aiocbp
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , unsigned long BAiocbaioOrdinal /* ordinal number of this aio in backend's sequence */
+#endif /* USE_AIO_SIGEVENT */
+ );
+extern int FileCompleteaio( char *aiocbp , int normal_wait
+#ifdef USE_AIO_SIGEVENT /* non-originator waiters wait on lock instead of polling */
+ , int (*BufAWaitAioCompletion)(char *) /* function to await completion if non-originator */
+ , char *BAiocbP /* pointer to BAIOcb */
+#endif /* USE_AIO_SIGEVENT */
+ );
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
extern int FileRead(File file, char *buffer, int amount);
extern int FileWrite(File file, char *buffer, int amount);
extern int FileSync(File file);
--- src/include/storage/buf_internals.h.orig 2014-08-18 14:10:37.065017690 -0400
+++ src/include/storage/buf_internals.h 2014-08-19 16:56:14.579200367 -0400
@@ -22,7 +22,9 @@
#include "storage/smgr.h"
#include "storage/spin.h"
#include "utils/relcache.h"
-
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+#include "aio.h"
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
/*
* Flags for buffer descriptors
@@ -38,8 +40,23 @@
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1 << 7) /* must write for checkpoint */
-#define BM_PERMANENT (1 << 8) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1 << 8) /* permanent relation (not unlogged) */
+#define BM_AIO_IN_PROGRESS (1 << 9) /* aio in progress */
+#define BM_AIO_PREFETCH_PIN_BANKED (1 << 10) /* pinned when prefetch issued
+ ** and this pin is banked - i.e.
+ ** redeemable by the next use by same task
+ ** note that for any one buffer, a pin can be banked
+ ** by at most one process globally,
+ ** that is, only one process may bank a pin on the buffer
+ ** and it may do so only once (may not be stacked)
+ */
+
+/*********
+For aio-read prefetching, two golden rules concerning buffer pinning and buffer-header flags must be observed:
+ R1. a buffer marked as BM_AIO_IN_PROGRESS must be pinned by at least one backend
+ R2. a buffer marked as BM_AIO_PREFETCH_PIN_BANKED must be pinned by the backend identified by
+     (buf->flags & BM_AIO_IN_PROGRESS) ? (((BAiocbAnchr->BufferAiocbs)+(FREENEXT_BAIOCB_ORIGIN - buf->freeNext))->pidOfAio) : (-(buf->freeNext))
+*********/
typedef bits16 BufFlags;
@@ -140,17 +157,87 @@ typedef struct sbufdesc
BufFlags flags; /* see bit definitions above */
uint16 usage_count; /* usage counter for clock sweep code */
unsigned refcount; /* # of backends holding pins on buffer */
- int wait_backend_pid; /* backend PID of pin-count waiter */
+ int wait_backend_pid; /* if flags & BM_PIN_COUNT_WAITER
+ ** then backend PID of pin-count waiter
+ ** else not set
+ */
slock_t buf_hdr_lock; /* protects the above fields */
-
int buf_id; /* buffer's index number (from 0) */
- int freeNext; /* link in freelist chain */
+ int volatile freeNext; /* overloaded and much-abused field :
+ ** EITHER
+ ** if >= 0
+ ** then link in freelist chain
+ ** OR
+ ** if < 0
+ ** then EITHER
+ ** if flags & BM_AIO_IN_PROGRESS
+ ** then negative of (the index of the aiocb in the BufferAiocbs array + 3)
+ ** else if flags & BM_AIO_PREFETCH_PIN_BANKED
+ ** then -(pid of task that issued aio_read and pinned buffer)
+ ** else one of the special values -1 or -2 listed below
+ */
LWLock *io_in_progress_lock; /* to wait for I/O to complete */
LWLock *content_lock; /* to lock access to buffer contents */
} BufferDesc;
+/* structures for control blocks for our implementation of async io */
+
+/* if USE_AIO_ATOMIC_BUILTIN_COMP_SWAP is not defined, the following struct is not put into use at runtime
+** but it is easier to let the compiler find the definition but hide the reference to aiocb
+** which is the only type it would not understand
+*/
+
+struct BufferAiocb {
+ struct BufferAiocb volatile * volatile BAiocbnext; /* next free entry or value of BAIOCB_OCCUPIED means in use */
+ struct sbufdesc volatile * volatile BAiocbbufh; /* there can be at most one BufferDesc marked BM_AIO_IN_PROGRESS
+ ** and using this BufferAiocb -
+ ** if there is one, BAiocbbufh points to it, else BAiocbbufh is zero
+ ** NOTE BAiocbbufh should be zero for every BufferAiocb on the free list
+ */
+#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+ struct aiocb volatile BAiocbthis; /* the aio library's control block for one async io */
+#ifdef USE_AIO_SIGEVENT
+ struct PGPROC volatile * BAiocbWaiterchain; /* chain of backends waiting for aio completion */
+ unsigned long BAiocbaioOrdinal; /* ordinal of most recent aio originated by me */
+#endif /* USE_AIO_SIGEVENT */
+#endif /* USE_AIO_ATOMIC_BUILTIN_COMP_SWAP */
+ int volatile BAiocbDependentCount; /* count of tasks who depend on this BufferAiocb
+ ** in the sense that they are waiting for io completion.
+ ** only a Dependent may move the BufferAiocb onto the freelist
+ ** and only when that Dependent is the *only* Dependent (count == 1)
+ ** BAiocbDependentCount is protected by bufferheader spinlock
+ ** and must be updated only when that spinlock is held
+ ** negative value indicates briefly locked (may not be updated)
+ */
+ pid_t volatile pidOfAio; /* pid of backend who issued an aio_read using this BAiocb -
+ ** this backend must have pinned the associated buffer.
+ */
+};
+
+#define BAIOCB_OCCUPIED 0x75f1 /* distinct indicator of a BufferAiocb.BAiocbnext that is NOT on free list */
+#define BAIOCB_FREE 0x7b9d /* distinct indicator of a BufferAiocb.BAiocbbufh that IS on free list */
+
+struct BAiocbAnchor { /* anchor for all control blocks pertaining to aio */
+ volatile struct BufferAiocb* BufferAiocbs; /* aiocbs ... */
+ volatile struct BufferAiocb* volatile FreeBAiocbs; /* ... and their free list */
+};
+
+/* values for BufCheckAsync input and retcode */
+#define BUF_INTENTION_WANT 1 /* wants the buffer, wait for in-progress aio and then pin */
+#define BUF_INTENTION_REJECT_KEEP_PIN -1 /* pin already held, do not unpin */
+#define BUF_INTENTION_REJECT_OBTAIN_PIN -2 /* obtain pin, caller wants it for same buffer */
+#define BUF_INTENTION_REJECT_FORGET -3 /* unpin and tell resource owner to forget */
+#define BUF_INTENTION_REJECT_NOADJUST -4 /* unpin and call ResourceOwnerForgetBuffer */
+#define BUF_INTENTION_REJECT_UNBANK -5 /* unpin only if pin banked by caller */
+
+#define BUF_INTENT_RC_CHANGED_TAG -5
+#define BUF_INTENT_RC_BADPAGE -4
+#define BUF_INTENT_RC_INVALID_AIO -3 /* invalid and aio was in progress */
+#define BUF_INTENT_RC_INVALID_NO_AIO -1 /* invalid and no aio was in progress */
+#define BUF_INTENT_RC_VALID 1
+
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
/*
@@ -159,6 +246,7 @@ typedef struct sbufdesc
*/
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+#define FREENEXT_BAIOCB_ORIGIN (-3)
/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
--- src/include/catalog/pg_am.h.orig 2014-08-18 14:10:37.049017615 -0400
+++ src/include/catalog/pg_am.h 2014-08-19 16:56:14.611200482 -0400
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 6 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
--- src/include/catalog/pg_proc.h.orig 2014-08-18 14:10:37.053017634 -0400
+++ src/include/catalog/pg_proc.h 2014-08-19 16:56:14.667200683 -0400
@@ -536,6 +536,12 @@ DESCR("convert float4 to int4");
DATA(insert OID = 330 ( btgettuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ btgettuple _null_ _null_ _null_ ));
DESCR("btree(internal)");
+
+#ifndef AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
+DATA(insert OID = 3256 ( btpeeknexttuple PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "2281" _null_ _null_ _null_ _null_ btpeeknexttuple _null_ _null_ _null_ ));
+DESCR("btree(internal)");
+#endif /* not AVOID_CATALOG_MIGRATION_FOR_ASYNCIO */
+
DATA(insert OID = 636 ( btgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ btgetbitmap _null_ _null_ _null_ ));
DESCR("btree(internal)");
DATA(insert OID = 331 ( btinsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ btinsert _null_ _null_ _null_ ));
--- src/include/pg_config_manual.h.orig 2014-08-18 14:10:37.061017672 -0400
+++ src/include/pg_config_manual.h 2014-08-19 16:56:14.723200883 -0400
@@ -138,9 +138,11 @@
/*
* USE_PREFETCH code should be compiled only if we have a way to implement
* prefetching. (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * might in future be support for alternative low-level prefetch APIs --
+ * -- update October 2013 -- now there is such a new prefetch capability --
+ * async_io into postgres buffers - configuration parameter max_async_io_threads)
*/
-#ifdef USE_POSIX_FADVISE
+#if defined(USE_POSIX_FADVISE) || defined(USE_AIO_ATOMIC_BUILTIN_COMP_SWAP)
#define USE_PREFETCH
#endif
--- src/include/access/nbtree.h.orig 2014-08-18 14:10:37.045017596 -0400
+++ src/include/access/nbtree.h 2014-08-19 16:56:14.759201013 -0400
@@ -19,6 +19,7 @@
#include "access/sdir.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
+#include "access/relscan.h"
#include "catalog/pg_index.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
@@ -524,6 +525,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -603,6 +605,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -655,7 +666,17 @@ extern Buffer _bt_getroot(Relation rel,
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access
+#ifdef USE_PREFETCH
+ , struct pfch_index_pagelist* pfch_index_page_list
+#endif /* USE_PREFETCH */
+ );
+#ifdef USE_PREFETCH
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P);
+extern struct pfch_index_item* _bt_find_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+extern int _bt_add_block(BlockNumber blkno , struct pfch_index_pagelist** pfch_index_page_list_P , uint32 discard_status);
+extern void _bt_subtract_block(BlockNumber blkno , struct pfch_index_pagelist* pfch_index_page_list);
+#endif /* USE_PREFETCH */
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
--- src/include/access/heapam.h.orig 2014-08-18 14:10:37.045017596 -0400
+++ src/include/access/heapam.h 2014-08-19 16:56:14.787201113 -0400
@@ -175,7 +175,7 @@ extern void heap_page_prune_execute(Buff
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
--- src/include/access/relscan.h.orig 2014-08-18 14:10:37.049017615 -0400
+++ src/include/access/relscan.h 2014-08-19 16:56:14.823201242 -0400
@@ -44,6 +44,24 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+
+ /* Unread_Pfetched is a "mostly" circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ ** "mostly" means that there may be gaps caused by storing entries for blocks which do not need to be discarded -
+ ** these are indicated by blockno = InvalidBlockNumber, and these slots are reused when found.
+ */
+ BlockNumber *rs_Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int rs_Unread_Pfetched_next; /* where the next unread blockno probably is relative to start --
+ ** this is only a hint which may be temporarily stale.
+ */
+ unsigned int rs_Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
+
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -55,6 +73,27 @@ typedef struct HeapScanDescData
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
} HeapScanDescData;
+/* pfch_index_items track prefetched and unread index pages - chunks of blocknumbers are chained in singly-linked list from scan->pfch_index_item_list */
+struct pfch_index_item { /* index-relation BlockIds which we will/have prefetched */
+ BlockNumber pfch_blocknum; /* Blocknum which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+struct pfch_block_item {
+ struct BlockIdData pfch_blockid; /* BlockId which we will/have prefetched */
+ uint32 pfch_discard; /* whether block is to be discarded when scan is closed */
+};
+
+/* pfch_index_page_items track prefetched and unread index pages -
+** chunks of blocknumbers are chained backwards (newest first, oldest last)
+** in singly-linked list from scan->pfch_index_item_list
+*/
+struct pfch_index_pagelist { /* index-relation BlockIds which we will/have prefetched */
+ struct pfch_index_pagelist* pfch_index_pagelist_next; /* pointer to next chunk if any */
+ unsigned int pfch_index_item_count; /* number of used entries in this chunk */
+ struct pfch_index_item pfch_indexid[1]; /* in-line list of Blocknums which we will/have prefetched and whether to be discarded */
+};
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -75,8 +114,15 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct pfch_index_pagelist* pfch_index_page_list; /* array of index-relation BlockIds which we will/have prefetched */
+ struct pfch_block_item* pfch_block_item_list; /* array of heap-relation BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
@@ -91,6 +137,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
--- src/include/nodes/tidbitmap.h.orig 2014-08-18 14:10:37.061017672 -0400
+++ src/include/nodes/tidbitmap.h 2014-08-19 16:56:14.863201385 -0400
@@ -41,6 +41,16 @@ typedef struct
int ntuples; /* -1 indicates lossy result */
bool recheck; /* should the tuples be rechecked? */
/* Note: recheck is always true if ntuples < 0 */
+#ifdef USE_PREFETCH
+ /* Unread_Pfetched is a circular list of recently prefetched blocknos of size target_prefetch_pages
+ ** the index of the first unread block is held in Unread_Pfetched_next
+ ** and is advanced when a block is read
+ ** the count of number of unread blocks is in Unread_Pfetched_count (and this subset can wrap around)
+ */
+ BlockNumber *Unread_Pfetched_base; /* where the list of prefetched but unread blocknos starts */
+ unsigned int Unread_Pfetched_next; /* where the next unread blockno is relative to start */
+ unsigned int Unread_Pfetched_count; /* number of valid unread blocknos in list */
+#endif /* USE_PREFETCH */
OffsetNumber offsets[1]; /* VARIABLE LENGTH ARRAY */
} TBMIterateResult; /* VARIABLE LENGTH STRUCT */
@@ -62,5 +72,8 @@ extern bool tbm_is_empty(const TIDBitmap
extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
extern void tbm_end_iterate(TBMIterator *iterator);
-
+extern void tbm_zero(TBMIterator *iterator); /* zero list of prefetched and unread blocknos */
+extern void tbm_add(TBMIterator *iterator, BlockNumber blockno); /* add this blockno to list of prefetched and unread blocknos */
+extern void tbm_subtract(TBMIterator *iterator, BlockNumber blockno); /* remove this blockno from list of prefetched and unread blocknos */
+extern TBMIterateResult *tbm_locate_IterateResult(TBMIterator *iterator); /* locate the TBMIterateResult of an iterator */
#endif /* TIDBITMAP_H */
--- src/include/utils/rel.h.orig 2014-08-18 14:10:37.069017709 -0400
+++ src/include/utils/rel.h 2014-08-19 16:56:14.891201485 -0400
@@ -61,6 +61,7 @@ typedef struct RelationAmInfo
FmgrInfo ammarkpos;
FmgrInfo amrestrpos;
FmgrInfo amcanreturn;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
--- src/include/pg_config.h.in.orig 2014-08-18 14:10:37.061017672 -0400
+++ src/include/pg_config.h.in 2014-08-19 16:56:14.915201571 -0400
@@ -1,4 +1,4 @@
-/* src/include/pg_config.h.in. Generated from configure.in by autoheader. */
+/* src/include/pg_config.h.in. Generated from - by autoheader. */
/* Define to the type of arg 1 of 'accept' */
#undef ACCEPT_TYPE_ARG1
@@ -747,6 +747,10 @@
/* Define to the appropriate snprintf format for unsigned 64-bit ints. */
#undef UINT64_FORMAT
+/* Define to select librt-style async io and the gcc atomic compare_and_swap.
+ */
+#undef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
+
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
On 08/20/2014 12:17 AM, John Lumby wrote:
I am attaching a new version of the patch for consideration in the current commit fest.
Thanks for working on this!
Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190@phx.gbl
the method for handling aio completion using sigevent has been re-written to use
signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.
ISTM the patch is still allocating stuff in shared memory that really
doesn't belong there. Namely, the BufferAiocb structs. Or at least parts
of it; there's also a waiter queue there which probably needs to live in
shared memory, but the rest of it does not.
At least BufAWaitAioCompletion is still calling aio_error() on an AIO
request that might've been initiated by another backend. That's not OK.
Please write the patch without atomic CAS operation. Just use a
spinlock. There's a patch in the commitfest to add support for that, but
it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
ifdefs make the patch more difficult to read. Same with all the other
#ifdefs; please just pick a method that works.
Also, please split prefetching of regular index scans into a separate
patch. It's orthogonal to doing async I/O; we could prefetch regular
index scans with posix_fadvise too, and AIO should be useful for
prefetching other stuff.
- Heikki
On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Also, please split prefetching of regular index scans into a separate patch.
It's orthogonal to doing async I/O; we could prefetch regular index scans
with posix_fadvise too, and AIO should be useful for prefetching other
stuff.
That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.
I don't have an archive link at hand atm, but I could produce one later.
Thanks for the replies and thoughts.
On 08/19/14 18:27, Heikki Linnakangas wrote:
On 08/20/2014 12:17 AM, John Lumby wrote:
I am attaching a new version of the patch for consideration in the
current commit fest.
Thanks for working on this!
Relative to the one I submitted on 25 June in
BAY175-W412FF89303686022A9F16AA3190@phx.gbl
the method for handling aio completion using sigevent has been
re-written to use signals exclusively rather than a composite of
signals and LWlocks, and this has fixed the problem I mentioned
before with the LWlock method.
ISTM the patch is still allocating stuff in shared memory that really
doesn't belong there. Namely, the BufferAiocb structs. Or at least
parts of it; there's also a waiter queue there which probably needs to
live in shared memory, but the rest of it does not.
Actually, the reason the BufferAiocb (the postgresql block corresponding
to the aio's aiocb) must be located in shared memory is that, as you
said, it acts as the anchor for the waiter list.
See further comment below.
At least BufAWaitAioCompletion is still calling aio_error() on an AIO
request that might've been initiated by another backend. That's not OK.
Yes, you are right, and I agree with this one -
I will add an aio_error_return_code field in the BufferAiocb,
and only the originator will set this from the real aiocb.
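To make that concrete, here is a minimal sketch of the intended protocol,
reusing the BAiocbthis field from the patch; the struct and function names
below are hypothetical, and the point is only that non-originators test a
published return code instead of calling aio_error() on someone else's aiocb:

#include <aio.h>
#include <errno.h>
#include <stdbool.h>

/* sketch only: a trimmed-down BufferAiocb with the proposed field added */
struct BufferAiocbSketch
{
    struct aiocb    BAiocbthis;              /* aio control block, owned by the originator */
    volatile int    aio_error_return_code;   /* EINPROGRESS until the originator
                                              * publishes the final aio_error() result */
};

/* originator only: it issued the aio_read, so it may legally poll its own aiocb */
static void
OriginatorPublishAioResult(struct BufferAiocbSketch *bcb)
{
    int         rc = aio_error(&bcb->BAiocbthis);

    if (rc != EINPROGRESS)
        bcb->aio_error_return_code = rc;     /* now visible to every backend */
}

/* any backend: decide completion from the shared field, never from the aiocb */
static bool
AioHasCompleted(struct BufferAiocbSketch *bcb)
{
    return bcb->aio_error_return_code != EINPROGRESS;
}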
Please write the patch without atomic CAS operation. Just use a spinlock.
Umm, this is a new criticism I think. I use CAS for things other
than locking,
such as add/remove from shared queue. I suppose maybe a spinlock on
the entire queue
can be used equivalently, but with more code (extra confusion) and
worse performance
(coarser serialization). What is your objection to using gcc's
atomic ops? Portability?
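For what it's worth, here is a minimal sketch of the kind of CAS-based
free-list add/remove being described, using the gcc __sync builtins; the
names are made up, and a production version would also have to deal with
the ABA hazard on the pop path:

/* a shared LIFO free list manipulated with gcc atomic builtins (sketch only) */
typedef struct FreeNode
{
    struct FreeNode *next;
} FreeNode;

static FreeNode *volatile free_head;    /* lives in shared memory */

static void
freelist_push(FreeNode *node)
{
    FreeNode   *old;

    do
    {
        old = free_head;
        node->next = old;
    } while (!__sync_bool_compare_and_swap(&free_head, old, node));
}

static FreeNode *
freelist_pop(void)
{
    FreeNode   *old;

    do
    {
        old = free_head;
        if (old == NULL)
            return NULL;                /* list empty */
        /* NB: a real implementation must also guard against ABA here */
    } while (!__sync_bool_compare_and_swap(&free_head, old, old->next));

    return old;
}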
There's a patch in the commitfest to add support for that,
Sorry, support for what? There are already spinlocks in postgresql --
do you mean some new kind? Please point me at it with a hackers msgid
or something.
but it's not committed yet, and all those
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult
to read. Same with all the other #ifdefs; please just pick a method
that works.
Ok, yes, the ifdefs are unpleasant. I will do something about that.
Ideally they would be entirely confined to header files, with only macro
invocations in the C files - maybe I can do that. And eventually, when
the dust has settled, eliminate the obsolete ifdef blocks altogether.
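One way that could look, sketched with a hypothetical macro name
(DiscardBuffer is the new function mentioned further down): the ifdef lives
only in the header, and the C files call the macro unconditionally:

/* sketch: the ifdef is confined to a header; .c files just write
** AIO_DISCARD_BUFFER(blockid) with no conditional compilation of their own
*/
#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
#define AIO_DISCARD_BUFFER(blockid)     DiscardBuffer(blockid)
#else
#define AIO_DISCARD_BUFFER(blockid)     ((void) 0)  /* compiles away when AIO is off */
#endif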
Also, please split prefetching of regular index scans into a separate
patch. It's orthogonal to doing async I/O;
actually not completely orthogonal, see next
we could prefetch regular index scans with posix_fadvise too, and AIO
should be useful for prefetching other stuff.
On 08/19/14 19:10, Claudio Freire wrote:
On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Also, please split prefetching of regular index scans into a separate patch. ...
That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.
I don't have an archive link at hand atm, but I could produce one later.
Several people have asked to split this patch into several smaller ones
and I have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate
patch from the aio code would mean that the scanner patch becomes
dependent on the aio patch. They are not quite orthogonal.
The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use. We can get around it by providing a stub for that
function in the scanner patch, but then there is some risk of someone
getting the wrong version of that function in their build. It just adds
yet more complexity and breakage opportunities.
One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory:
I explained above that the BufferAiocb must be in shared memory for
wait/post.
The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler, and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional
allocation scheme is needed (a fixed number? palloc()'ing extra ones as
needed? ...), which adds complexity, and probably some inefficiency,
since a shared pool is usually more efficient (allows a higher maximum
per process etc), and there would have to be some pointer de-referencing
from BufferAiocb to aiocb, adding some (small) overhead.
I understood your objection to the use of shared memory as being that you
don't want a non-originator to access the originator's aiocb using
aio_xxx calls, and, with the one change I've said I will do above (put
the aio_error retcode in the BufferAiocb), I will have achieved that
requirement. I am hoping this answers your objections concerning shared
memory.
On 08/25/2014 12:49 AM, johnlumby wrote:
On 08/19/14 18:27, Heikki Linnakangas wrote:
Please write the patch without atomic CAS operation. Just use a spinlock.
Umm, this is a new criticism I think.
Yeah. Be prepared that new issues will crop up as the patch gets slimmer
and easier to review :-). Right now there's still so much chaff that
it's difficult to see the wheat.
I use CAS for things other
than locking,
such as add/remove from shared queue. I suppose maybe a spinlock on
the entire queue
can be used equivalently, but with more code (extra confusion) and
worse performance
(coarser serialization). What is your objection to using gcc's
atomic ops? Portability?
Yeah, portability.
Atomic ops might make sense, but at this stage it's important to slim
down the patch to the bare minimum, so that it's easier to review. Once
the initial patch is in, you can improve it with additional patches.
There's a patch in the commitfest to add support for that,
sorry, support for what? There are already spinlocks in postgresql,
you mean some new kind? please point me at it with hacker msgid or
something.
Atomic ops: https://commitfest.postgresql.org/action/patch_view?id=1314
Once that's committed, you can use the new atomic ops in your patch. But
until then, stick to spinlocks.
On 08/19/14 19:10, Claudio Freire wrote:
On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Also, please split prefetching of regular index scans into a separate patch. ...
That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.
I don't have an archive link at hand atm, but I could produce one later.
Several people have asked to split this patch into several smaller ones
and I have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate
patch from the aio code would mean that the scanner patch becomes
dependent on the aio patch. They are not quite orthogonal.
Right now, please focus on the main AIO patch. That should show a
benefit for bitmap heap scans too, so to demonstrate the benefits of
AIO, you don't need to prefetch regular index scans. The main AIO patch
can be written, performance tested, and reviewed without caring about
regular index scans at all.
The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use. We can get around it by providing a stub for that
function in the scanner patch, but then there is some risk of someone
getting the wrong version of that function in their build. It just adds
yet more complexity and breakage opportunities.
Regardless of the regular index scans, we'll need to discuss the new API
of PrefetchBuffer and DiscardBuffer.
It would be simpler for the callers if you didn't require the
DiscardBuffer calls. I think it would be totally feasible to write the
patch that way. Just drop the buffer pin after the asynchronous read
finishes. When the caller actually needs the page, it will call
ReadBuffer which will pin it again. You'll get a little bit more bufmgr
traffic that way, but I think it'll be fine in practice.
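A sketch of the caller pattern that suggestion implies, assuming the AIO
completion path drops its own pin so that no DiscardBuffer call is needed;
the wrapper function below is made up, the bufmgr calls are the existing ones:

#include "postgres.h"
#include "storage/bufmgr.h"

/* hypothetical wrapper: issue a prefetch hint for a block we will want soon,
** then read the block we need right now; no DiscardBuffer bookkeeping needed
** because the AIO completion path is assumed to drop its own pin
*/
static void
scan_block_with_prefetch(Relation rel, BlockNumber next_blkno, BlockNumber ahead_blkno)
{
    Buffer      buf;

    /* with AIO this would start an aio_read into a shared buffer */
    PrefetchBuffer(rel, MAIN_FORKNUM, ahead_blkno);

    /* a cheap bufmgr hit if the prefetch already completed,
     * a normal synchronous read otherwise */
    buf = ReadBuffer(rel, next_blkno);

    /* ... examine the page here ... */

    ReleaseBuffer(buf);
}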
One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory:
I explained above that the BufferAiocb must be in shared memory for
wait/post.
The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler, and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional
allocation scheme is needed (a fixed number? palloc()'ing extra ones as
needed? ...), which adds complexity,
Yep, use palloc or a fixed pool. There's nothing wrong with that.
and probably some inefficiency, since a shared pool is usually more
efficient (allows a higher maximum per process etc), and there would
have to be some pointer de-referencing from BufferAiocb to aiocb,
adding some (small) overhead.
I think you're falling into the domain of premature optimization. A few
pointer dereferences are totally insignificant, and the amount of memory
you're saving pales in comparison to other similar non-shared pools and
caches we have (catalog caches, for example). And on the other side of
the coin, with a shared pool you'll waste memory when async I/O is not
used (e.g because everything fits in RAM), and when it is used, you'll
have more contention on locks and cache lines when multiple processes
use the same objects.
The general guideline in PostgreSQL is that everything is
backend-private, except structures used to communicate between backends.
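To make the "palloc or a fixed pool" alternative concrete, here is a minimal
sketch of a backend-private aiocb pool (all names and the pool size are
hypothetical); the shared BufferAiocb would then carry only the waiter anchor
and the published aio_error result:

#include <aio.h>
#include "postgres.h"
#include "utils/memutils.h"

#define PRIVATE_AIOCB_POOL_SIZE 32      /* hypothetical per-backend limit */

static struct aiocb *private_aiocb_pool = NULL;
static bool private_aiocb_used[PRIVATE_AIOCB_POOL_SIZE];

/* hand out an unused backend-private aiocb, or NULL if the pool is exhausted
** (in which case the caller would simply fall back to a synchronous read) */
static struct aiocb *
get_private_aiocb(void)
{
    int         i;

    if (private_aiocb_pool == NULL)
        private_aiocb_pool = (struct aiocb *)
            MemoryContextAllocZero(TopMemoryContext,
                                   PRIVATE_AIOCB_POOL_SIZE * sizeof(struct aiocb));

    for (i = 0; i < PRIVATE_AIOCB_POOL_SIZE; i++)
    {
        if (!private_aiocb_used[i])
        {
            private_aiocb_used[i] = true;
            return &private_aiocb_pool[i];
        }
    }
    return NULL;
}

static void
put_private_aiocb(struct aiocb *acb)
{
    private_aiocb_used[acb - private_aiocb_pool] = false;
}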
- Heikki
I am new to bg_workers so this may be my user error,
but when I build and run the contrib/worker_spi
extension, I find that:
. starting postgres with the extension named in shared_preload_libraries:
its _PG_init is invoked as expected but no process is started -
it is as though RegisterBackgroundWorker did nothing
. creating the extension and then
psql ... "select worker_spi_launch(2);" :
I see
28409 28288 ? 463508 00:05 00:00:00 0.0 postgres: bgworker: worker 2
as expected.
Is there maybe some bug in postmaster's processing of
workers marked as start_at = BgWorkerStart_RecoveryFinished
in 9.4.4?
Cheers, John
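For reference, a minimal sketch of the static registration path in question,
as of 9.4; worker_main and the bgw_name are placeholders for the extension's
real entry point, and RegisterBackgroundWorker() only takes effect here
because _PG_init() runs at postmaster start when the library is listed in
shared_preload_libraries:

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

void        _PG_init(void);
extern void worker_main(Datum main_arg);    /* assumed to be defined elsewhere */

void
_PG_init(void)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "demo worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = worker_main;          /* ok because the library is preloaded */

    RegisterBackgroundWorker(&worker);
}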
On Tue, Jun 16, 2015 at 3:58 AM, John Lumby <johnlumby@hotmail.com> wrote:
I am new to bg_workers so this may be my user error,
but when I build and run the contrib/worker_spi
extension, I find that:
. starting postgres with the extension named in shared_preload_libraries:
its _PG_init is invoked as expected but no process is started -
it is as though RegisterBackgroundWorker did nothing
. creating the extension and then
psql ... "select worker_spi_launch(2);" :
I see
28409 28288 ? 463508 00:05 00:00:00 0.0 postgres: bgworker: worker 2
as expected.
Is there maybe some bug in postmaster's processing of
workers marked as start_at = BgWorkerStart_RecoveryFinished
in 9.4.4?
Not that I know of. I am using static background workers even with 9.4
clusters and they work as expected. Giving a try with worker_spi on
9.4, I see no problems as well:
$ ps ux | grep bgworker
michael 3906 0.0 0.1 2594780 7456 ?? Ss 8:21AM 0:00.02
postgres: bgworker: worker 1
michael 3905 0.0 0.1 2594780 6996 ?? Ss 8:21AM 0:00.02
postgres: bgworker: worker 2
$ psql -c 'show shared_preload_libraries'
shared_preload_libraries
--------------------------
worker_spi
(1 row)
Regards,
--
Michael
Michael, thanks for checking,
I tried it again today and it is now working so I must have forgotten something.
John