old synchronized scan patch

Started by Jeff Davis, over 19 years ago, 47 messages, pgsql-hackers
#1 Jeff Davis
pgsql@j-davis.com

Now that 8.3 is open, I was considering a revival of this old patch:

http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php

I could probably clean it up with a little help from someone on this
list.

Advantages of this patch: If multiple seq scans are going in parallel on
the same table, it can drastically increase cache hit rate by
synchronizing the scans. In theory, it should also prevent the problem
where sequential scans can turn into random access from the disk.

Disadvantages:
* sequential scans no longer return results in a deterministic order. As
I understand it, this has effects on the regression tests and possibly a
few other locations in the code.
* While it should improve performance quite a lot in its intended use
case, that use case is quite narrow: you need to be running sequential
scans that are reading tuples from the same table at the same time. That
means the table can't fit into memory, but must be small enough that it
is sane to be executing multiple sequential scans on it in parallel.

Is there some interest in this patch? How would I go about proving
whether it's useful enough or not?

Regards,
Jeff Davis

#2 Luke Lonergan
llonergan@greenplum.com
In reply to: Jeff Davis (#1)
Re: old synchronized scan patch

Jeff,

On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote:

Now that 8.3 is open, I was considering a revival of this old patch:

http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php

I could probably clean it up with a little help from someone on this
list.

Advantages of this patch: If multiple seq scans are going in parallel on
the same table, it can drastically increase cache hit rate by
synchronizing the scans. In theory, it should also prevent the problem
where sequential scans can turn into random access from the disk.

Disadvantages:
* sequential scans no longer return results in a deterministic order. As
I understand it, this has effects on the regression tests and possibly a
few other locations in the code.
* While it should improve performance quite a lot in its intended use
case, that use case is quite narrow: you need to be running sequential
scans that are reading tuples from the same table at the same time. That
means the table can't fit into memory, but must be small enough that it
is sane to be executing multiple sequential scans on it in parallel.

Is there some interest in this patch?

Yes.

For a scenario of interest let's assume:
1 - the Postgres cache is "scan resistant" such that a seq scan of a large
table is not mapped into the block cache.
2 - the O/S cache is not "scan resistant" so that it will map all blocks
read through the cache
3 - there is I/O prefetch in the OS that keeps the I/O subsystem efficient
under multi-user sequential scanning of different tables/files

With (2), if you issue 5 queries that all seq scan a table that is much
larger than memory, they will complete in nearly the time of one query.
I've tested this up to 1,000 simultaneous queries on Linux and get huge
benefits. This scenario is only sometimes helpful in real production usage.

Given (2), I think sync scan would only have a multi-user benefit when the
scans occur in periods separated by more than the amount of time it takes to
scan a cache size worth of data from disk. For example: If there is 16GB of
RAM, the OS I/O cache might have room for 14GB of blocks. On a good (ahem)
system it might take 14 seconds to scan that amount of data into cache from
disk (yes, 1GB/s). If a new backend begins scanning after that 14 seconds,
it won't benefit from the data in the cache because the new data from disk
replaces the oldest data in cache and so on.

Since we're talking O(seconds) separation between user queries, I think we
could have a benefit.

Where I think sync scan could have a big benefit is for multi-user business
intelligence workloads where there are a few huge fact tables of interest to
a wide audience. Example: 5 business analysts come to work at 9AM and start
ad-hoc queries expected to run in about 15 minutes each. Each query
sequential scans a 10 billion row fact table once, which takes 10 minutes of
the query runtime. With sync scan the last one completes in 35 minutes.
Without sync scan the last completes in 75 minutes. In this case sync scan
significantly improves the experience of 5 people.
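As a rough check on those numbers (my own toy model, not taken from the scenario above: assume that without sync scan the five 15-minute queries effectively serialize end to end, and that with sync scan one shared 10-minute pass over the fact table feeds all five queries while the remaining 5 minutes of per-query work still serializes):

```python
# Toy model of the 5-analyst scenario; the serialization assumptions are mine.
SCAN_MIN = 10    # minutes of sequential scan per query
OTHER_MIN = 5    # minutes of non-scan work per query
QUERIES = 5

# Without sync scan: assume the concurrent scans thrash the disk badly
# enough that the queries effectively run one after another.
without_sync = QUERIES * (SCAN_MIN + OTHER_MIN)

# With sync scan: a single shared pass over the fact table, plus the
# non-scan work of each query running in turn.
with_sync = SCAN_MIN + QUERIES * OTHER_MIN

print(without_sync, with_sync)
```

Under those assumptions the last query finishes at 75 and 35 minutes respectively, matching the figures above.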

How would I go about proving whether it's useful enough or not?

Can you run the above scenario on a table whose size is ten times the memory
on the machine? As a simple starting point, a simple "SELECT COUNT(*) FROM
BIGTABLE" should be sufficient, but the scans need to be separated by enough
time to invalidate the OS I/O cache.

- Luke

#3 Jeff Davis
pgsql@j-davis.com
In reply to: Luke Lonergan (#2)
Re: old synchronized scan patch

On Mon, 2006-12-04 at 15:03 -0800, Luke Lonergan wrote:

Jeff,

Now that 8.3 is open, I was considering a revival of this old patch:

http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php

I could probably clean it up with a little help from someone on this
list.

Is there some interest in this patch?

Yes.

<snip>

Where I think sync scan could have a big benefit is for multi-user business
intelligence workloads where there are a few huge fact tables of interest to
a wide audience. Example: 5 business analysts come to work at 9AM and start
ad-hoc queries expected to run in about 15 minutes each. Each query
sequential scans a 10 billion row fact table once, which takes 10 minutes of
the query runtime. With sync scan the last one completes in 35 minutes.
Without sync scan the last completes in 75 minutes. In this case sync scan
significantly improves the experience of 5 people.

Thank you for your input.

How would I go about proving whether it's useful enough or not?

Can you run the above scenario on a table whose size is ten times the memory
on the machine? As a simple starting point, a simple "SELECT COUNT(*) FROM
BIGTABLE" should be sufficient, but the scans need to be separated by enough
time to invalidate the OS I/O cache.

I'll try to run a test like that this week. I will be doing this on my
home hardware (bad, consumer-grade stuff), so if I gave you a patch
against HEAD could you test it against some more real hardware (and
data)?

To open up the implementation topic:

My current patch starts a new sequential scan on a given relation at the
page of an already-running scan. It makes no guarantees that the scans
stay together, but in practice I don't think they deviate much. To try
to enforce synchronization of scanning I fear would do more harm than
good. Thoughts?

Also, it's more of a "hint" system that uses a direct mapping of the
relation's Oid to hold the position of the scan. That means that, in rare
cases, the page offset could be wrong, in which case it will degenerate
to the current performance characteristics with no cost. The benefit of
doing it this way is that it's simple code, with essentially no
performance penalty or additional locking. Also, I can use a fixed
amount of shared memory (1 page is about right).

Regards,
Jeff Davis

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Luke Lonergan (#2)
Re: old synchronized scan patch

"Luke Lonergan" <llonergan@greenplum.com> writes:

On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote:

Now that 8.3 is open, I was considering a revival of this old patch:
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php

Where I think sync scan could have a big benefit is for multi-user business
intelligence workloads where there are a few huge fact tables of interest to
a wide audience. Example: 5 business analysts come to work at 9AM and start
ad-hoc queries expected to run in about 15 minutes each.

The problem I've got with this is that the use-case seems a bit narrow
(maybe not to Greenplum, but compared to the general Postgres audience)
whereas the proposed patch is invasive in terms of changing
well-established behavior, and what's worse, potentially *reducing*
performance for average workloads. In particular I'm concerned about
the shared-memory area where the scan status is stored becoming a
contention bottleneck. It would presumably have access demand
comparable to the bufmgr, ie one hit per page scanned, and we already
know that it's extremely hard to keep bufmgr from being a major source
of contention.

One thing you might consider doing is to have the patch do nothing when
scanning tables that are less than, perhaps, 10x shared_buffers long.
This would at least avoid the regression-test-breakage problem (at the
cost of not having the patch tested at all by said tests :-().

Anyway I think the major stumbling block is to be able to show that the
patch has only negligible performance impact in cases where it's not
able to provide a benefit.

regards, tom lane

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#4)
Re: old synchronized scan patch

On Mon, 2006-12-04 at 18:45 -0500, Tom Lane wrote:

"Luke Lonergan" <llonergan@greenplum.com> writes:

On 12/4/06 11:07 AM, "Jeff Davis" <pgsql@j-davis.com> wrote:

Now that 8.3 is open, I was considering a revival of this old patch:
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php

Where I think sync scan could have a big benefit is for multi-user business
intelligence workloads where there are a few huge fact tables of interest to
a wide audience. Example: 5 business analysts come to work at 9AM and start
ad-hoc queries expected to run in about 15 minutes each.

Anyway I think the major stumbling block is to be able to show that the
patch has only negligible performance impact in cases where it's not
able to provide a benefit.

My patch was specifically designed with this in mind, and unless I'm
missing something, should have negligible performance impact.

The initial implementation directly maps the Oid of the relation to a
location in a shared-memory table (not table as in relation, but table
as in area of memory). Then, it stores the page offset for that relation
each time it fetches a page for that relation in a sequential scan. When
a scan starts, it checks that number to see if it's in the valid range
for the relation's file, and if it is, that's where it starts the scan
in the file. If it's not in the valid range, it starts at offset 0.

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

In the case of a collision in the in-memory table, or if the offset in
the relation overflows the data type used to store it, it should be no
worse than current behavior.

Regards,
Jeff Davis

#6 Luke Lonergan
llonergan@greenplum.com
In reply to: Jeff Davis (#3)
Re: old synchronized scan patch

Jeff,

On 12/4/06 3:38 PM, "Jeff Davis" <pgsql@j-davis.com> wrote:

I'll try to run a test like that this week. I will be doing this on my
home hardware (bad, consumer-grade stuff), so if I gave you a patch
against HEAD could you test it against some more real hardware (and
data)?

Sure.

To open up the implementation topic:

My current patch starts a new sequential scan on a given relation at the
page of an already-running scan. It makes no guarantees that the scans
stay together, but in practice I don't think they deviate much. To try
to enforce synchronization of scanning I fear would do more harm than
good. Thoughts?

I think this is good enough - a "background scanner" approach would be the
logical alternative, but may not be necessary. I suspect you are correct
about the scans being nearly synced.

This may not be the case if and when we implement a priority based
scheduler, but in that case we're already managing the throughput based on
content anyway.

Also, it's more of a "hint" system that uses a direct mapping of the
relation's Oid to hold the position of the scan. That means that, in rare
cases, the page offset could be wrong, in which case it will degenerate
to the current performance characteristics with no cost. The benefit of
doing it this way is that it's simple code, with essentially no
performance penalty or additional locking. Also, I can use a fixed
amount of shared memory (1 page is about right).

If I understand correctly, Tom's concern is that this page is potentially
accessed once for every page read and may consequently become very hot. How
do you manage the scan position so this doesn't happen? Do we have any
readahead in the I/O layer? There is certainly readahead in the OS I/O
cache, but it's dynamic and we don't know the block position...

- Luke

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Luke Lonergan (#6)
Re: old synchronized scan patch

On Mon, 2006-12-04 at 16:47 -0800, Luke Lonergan wrote:

Jeff,

My current patch starts a new sequential scan on a given relation at the
page of an already-running scan. It makes no guarantees that the scans
stay together, but in practice I don't think they deviate much. To try
to enforce synchronization of scanning I fear would do more harm than
good. Thoughts?

I think this is good enough - a "background scanner" approach would be the
logical alternative, but may not be necessary. I suspect you are correct
about the scans being nearly synced.

A background scanner requires synchronizing the backends, which can
cause all kinds of bad performance problems. Otherwise, how would you
ensure that ALL the backends that need a page get it before the train
moves on? I don't think it's necessary or desirable to have a background
scanner.

This may not be the case if and when we implement a priority based
scheduler, but in that case we're already managing the throughput based on
content anyway.

Right. I don't see how you'd be able to get the data to a backend that
needs it without running that backend. If it's a priority scheduler, you
may not run that backend.

Also, it's more of a "hint" system that uses a direct mapping of the
relation's Oid to hold the position of the scan. That means that, in rare
cases, the page offset could be wrong, in which case it will degenerate
to the current performance characteristics with no cost. The benefit of
doing it this way is that it's simple code, with essentially no
performance penalty or additional locking. Also, I can use a fixed
amount of shared memory (1 page is about right).

If I understand correctly, Tom's concern is that this page is potentially
accessed once for every page read and may consequently become very hot. How
do you manage the scan position so this doesn't happen? Do we have any
readahead in the I/O layer? There is certainly readahead in the OS I/O
cache, but it's dynamic and we don't know the block position...

My proposal is hint-based, and I consider any reads coming from that
data page to be untrusted. I can just statically map the Oid of the
relation onto a location in the page (which may or may not collide with
another Oid). The advantage here is that I don't need to lock. There's
no contention because every access is just reading a word or two from
the shared page (or writing a word or two). Of course, it must be a
static data structure because it can't contain pointers. But 8 kB should
be enough to hold information on plenty of interesting relations.

I don't trust the data because collisions are possible. If the number is
obviously wrong (higher than number of pages in relation file), a new
scan starts at offset 0. If it's wrong but that can't be detected, it
will start at that location anyway, which is OK because that arbitrary
value is no worse than the arbitrary value of 0.

The whole premise of my implementation is:
(1) Get 99% of the benefit
(2) Pay near-zero performance penalty, even in worst case.
(3) Simple code changes

If those 3 things aren't true, let me know.

The way I see it, the only real cost is that it may break things that
assume deterministic sequential scans.

Regards,
Jeff Davis

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#5)
Re: old synchronized scan patch

Jeff Davis <pgsql@j-davis.com> writes:

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

If you think that, you need not bother to submit the patch. (Hint:
as soon as you consider more than one table at a time, it doesn't work,
even ignoring the question of inconsistent reads.)

regards, tom lane

#9 Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#8)
Re: old synchronized scan patch

On Mon, 2006-12-04 at 21:46, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

If you think that, you need not bother to submit the patch. (Hint:
as soon as you consider more than one table at a time, it doesn't work,
even ignoring the question of inconsistent reads.)

Why does it not work?

Are you suggesting that another backend can somehow see only some bits
of the page number being written?

What problems do you see in the multiple-table case?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#10 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Jeff Davis (#3)
Re: old synchronized scan patch

To open up the implementation topic:

My current patch starts a new sequential scan on a given relation at the
page of an already-running scan.

I think it should start earlier to make use of the already-cached part
of the table. It is hard to determine, or guess, how much is still in
cache, though. As a start you could maybe use some (small, to be safe)
fraction of effective_cache_size or shared_buffers.

Andreas

#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Hannu Krosing (#9)
Re: old synchronized scan patch

Hannu Krosing wrote:

On Mon, 2006-12-04 at 21:46, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

If you think that, you need not bother to submit the patch. (Hint:
as soon as you consider more than one table at a time, it doesn't work,
even ignoring the question of inconsistent reads.)

Why does it not work?

Are you suggesting that another backend can somehow see only some bits
of the page number being written?

What problems do you see in the multiple-table case?

You need to manage adding and removing relations from the shared memory
structure. Which typically needs locking.

Assuming that relations are added or removed relatively seldom, you
might get away with a table of (Oid, BlockNumber) pairs, working around
the fact that the table might get messed up every now and then, and when
it does, you'll lose the benefits until it gets corrected. But it seems
really messy to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12 Hannu Krosing
hannu@tm.ee
In reply to: Heikki Linnakangas (#11)
Re: old synchronized scan patch

On Tue, 2006-12-05 at 10:38, Heikki Linnakangas wrote:

Hannu Krosing wrote:

On Mon, 2006-12-04 at 21:46, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

If you think that, you need not bother to submit the patch. (Hint:
as soon as you consider more than one table at a time, it doesn't work,
even ignoring the question of inconsistent reads.)

Why does it not work?

Are you suggesting that another backend can somehow see only some bits
of the page number being written?

What problems do you see in the multiple-table case?

You need to manage adding and removing relations from the shared memory
structure. Which typically needs locking.

My impression was that Jeff planned to do simple hashing to one of N
values and to write the current page number there, maybe not even every
page number but every M-th page number.

Assuming that writing a 4-byte page number is an atomic operation, there
is no need for locking.

Assuming that relations are added or removed relatively seldom, you
might get away with a table of (Oid, BlockNumber) pairs, working around
the fact that the table might get messed up every now and then, and when
it does, you'll lose the benefits until it gets corrected. But it seems
really messy to me.

Just storing page number at offset is cheap (and imho nice and clean
too).

The worst that can happen is a hash collision, in which case you lose
the benefits of sync scans, but you won't degrade compared to non-sync
scans.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#13 Florian Pflug
fgp@phlo.org
In reply to: Hannu Krosing (#12)
Re: old synchronized scan patch

Hannu Krosing wrote:

The worst that can happen is a hash collision, in which case you lose
the benefits of sync scans, but you won't degrade compared to non-sync
scans.

But it could cause "mysterious" performance regressions, no?
Imagine that your app includes two large tables, which are both
scanned frequently. Suppose that synchronized scanning gives this
use case a noticeable performance boost. Now, you dump and reload
your schema, and suddenly the hashes of the Oids of those tables
collide. You perceive a noticeable drop in performance that you
can neither explain nor fix without a rather deep understanding
of Postgres internals.

greetings, Florian Pflug

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian Pflug (#13)
Re: old synchronized scan patch

"Florian G. Pflug" <fgp@phlo.org> writes:

Hannu Krosing wrote:

The worst that can happen is a hash collision, in which case you lose
the benefits of sync scans, but you won't degrade compared to non-sync
scans.

But it could cause "mysterious" performance regressions, no?

There are other issues for the "no lock" approach that Jeff proposes.
Suppose that we have three or four processes that are actually doing
synchronized scans of the same table. They will have current block
numbers that are similar but probably not identical. They will all be
scribbling on the same hashtable location. So if another process comes
along to join the "pack", it might get the highest active block number,
or the lowest, or something in between. Even discounting the possibility
that it gets random bits because it managed to read the value
non-atomically, how well is that really going to work?

Another thing that we have to consider is that the actual block read
requests will likely be distributed among the "pack leaders", rather
than all being issued by one process. AFAIK this will destroy the
kernel's ability to do read-ahead, because it will fail to recognize
that sequential reads are being issued --- any single process is *not*
reading sequentially, and I think that read-ahead scheduling is
generally driven off single-process behavior rather than the emergent
behavior of the whole system. (Feel free to contradict me if you've
actually read any kernel code that does this.) It might still be better
than unsynchronized reads, but it'd be leaving a lot on the table.

regards, tom lane

#15 Florian Pflug
fgp@phlo.org
In reply to: Tom Lane (#14)
Re: old synchronized scan patch

Tom Lane wrote:

"Florian G. Pflug" <fgp@phlo.org> writes:

Hannu Krosing wrote:

The worst that can happen is a hash collision, in which case you lose
the benefits of sync scans, but you won't degrade compared to non-sync
scans.

But it could cause "mysterious" performance regressions, no?

There are other issues for the "no lock" approach that Jeff proposes.
Suppose that we have three or four processes that are actually doing
synchronized scans of the same table. They will have current block
numbers that are similar but probably not identical. They will all be
scribbling on the same hashtable location. So if another process comes
along to join the "pack", it might get the highest active block number,
or the lowest, or something in between. Even discounting the possibility
that it gets random bits because it managed to read the value
non-atomically, how well is that really going to work?

Could that be addressed by something along the lines of
"Don't update the block number if the new value is between
the current number and the current number minus update_count/2"
(update_count being the number of blocks a backend reads before
updating the hashtable again)? At least this would prevent
needless updates to nearly the same block number, which I think would
help NUMA-style architectures like Opteron systems.
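Something like this sketch, perhaps (UPDATE_COUNT and the helper name are made up for illustration):

```python
UPDATE_COUNT = 16  # blocks a backend reads between hashtable updates

def should_update(current_hint, new_block):
    # Skip the store when the new position is at most half an update
    # interval behind the value already published, so pack members that
    # are reading nearly the same blocks don't keep dirtying the shared
    # cache line with near-identical values.
    lower = current_hint - UPDATE_COUNT // 2
    return not (lower <= new_block <= current_hint)
```

A backend slightly behind the published position stays quiet; one that has moved ahead (or fallen far behind) publishes its new position.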

Another thing that we have to consider is that the actual block read
requests will likely be distributed among the "pack leaders", rather
than all being issued by one process. AFAIK this will destroy the
kernel's ability to do read-ahead, because it will fail to recognize
that sequential reads are being issued --- any single process is *not*
reading sequentially, and I think that read-ahead scheduling is
generally driven off single-process behavior rather than the emergent
behavior of the whole system. (Feel free to contradict me if you've
actually read any kernel code that does this.) It might still be better
than unsynchronized reads, but it'd be leaving a lot on the table.

I don't see why a single process wouldn't be reading sequentially. As far
as I understood the original proposal, the current block number from the
hashtable is only used as a starting point for sequential scans. After that,
each backend reads sequentially until the end of the table, I believe, no?

greetings, Florian Pflug

#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Florian Pflug (#15)
Re: old synchronized scan patch

Florian G. Pflug wrote:

Tom Lane wrote:

There are other issues for the "no lock" approach that Jeff proposes.
Suppose that we have three or four processes that are actually doing
synchronized scans of the same table. They will have current block
numbers that are similar but probably not identical. They will all be
scribbling on the same hashtable location. So if another process comes
along to join the "pack", it might get the highest active block number,
or the lowest, or something in between. Even discounting the possibility
that it gets random bits because it managed to read the value
non-atomically, how well is that really going to work?

It might in fact work quite well. Assuming that all the active blocks
are in memory, the process that joins the pack will quickly catch up
with the rest. I'd like to see some testing to be sure, though...

Another thing that we have to consider is that the actual block read
requests will likely be distributed among the "pack leaders", rather
than all being issued by one process. AFAIK this will destroy the
kernel's ability to do read-ahead, because it will fail to recognize
that sequential reads are being issued --- any single process is *not*
reading sequentially, and I think that read-ahead scheduling is
generally driven off single-process behavior rather than the emergent
behavior of the whole system. (Feel free to contradict me if you've
actually read any kernel code that does this.) It might still be better
than unsynchronized reads, but it'd be leaving a lot on the table.

I don't see why a single process wouldn't be reading sequentially. As far
as I understood the original proposal, the current block number from the
hashtable is only used as a starting point for sequential scans. After that,
each backend reads sequentially until the end of the table, I believe, no?

When the read is satisfied from the shared memory cache, it won't make it
to the kernel. From the kernel's point of view, the pattern looks
something like this:

A 1--4--7-9
B -2---6---
C --3-5--8-

where letters denote processes, and numbers are block numbers read, and
time goes from left to right. When you look at one process at a time,
the pattern looks random, though it's constantly moving forward. It's
only when you look at all the processes together that you see that it's
sequential.
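A quick simulation makes the same point (a sketch; I'm assuming the pack simply round-robins which member issues each real read, which is roughly what the diagram shows):

```python
import itertools

def simulate(num_procs, num_blocks):
    # Whichever pack member reaches an uncached block first issues the
    # real read; model that as a round-robin assignment of blocks.
    readers = {}
    procs = itertools.cycle(range(num_procs))
    for block in range(num_blocks):
        readers[block] = next(procs)
    # Group by process: each process's own read list has gaps, even
    # though the union of all reads is perfectly sequential.
    return {p: [b for b, r in readers.items() if r == p]
            for p in range(num_procs)}

print(simulate(3, 9))
```

Each process's list strides through the file (blocks 0, 3, 6 for the first process, and so on), which is exactly the per-process pattern a kernel read-ahead heuristic would see.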

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#16)
Re: old synchronized scan patch

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Florian G. Pflug wrote:

I don't see why a single process wouldn't be reading sequentially. As far
as I understood the original proposal, the current block number from the
hashtable is only used as a starting point for sequential scans. After that,
each backend reads sequentially until the end of the table, I believe, no?

When the read is satisfied from the shared memory cache, it won't make it
to the kernel.

Right, and the *whole point* of this proposal is that only one of the N
processes doing a synchronized scan actually does a read of any
particular block. The problem is that they're not synchronized well
enough to ensure that it's always the same one.

It strikes me that there's still another thing we'd have to deal with
to make this work nicely. If you have N processes doing a synchronized
scan, then each block that reaches shared memory is going to be hit N
times in fairly short succession --- which is going to be enough to
convince the bufmgr to keep it in memory for awhile. Thus a
synchronized seqscan is likely to end up flushing buffer cache in a way
that independent seqscans could not.

This could probably be dealt with by changing the ReadBuffer API to
allow the caller to say "don't increment the refcount on this page",
or some such. But it's some more work to be done if we intend to
take this idea seriously.

regards, tom lane

#18 Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#8)
Re: old synchronized scan patch

On Mon, 2006-12-04 at 21:46 -0500, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

If you think that, you need not bother to submit the patch. (Hint:
as soon as you consider more than one table at a time, it doesn't work,
even ignoring the question of inconsistent reads.)

I believe I accounted for that case correctly. In the unlikely event
that it gets a wrong hint value from the table, it would either:

(1) See that the value is larger than rs_nblocks and start from 0 like
normal
(2) Not know that the value is wrong because it is a valid block for
that relation, and it would use it anyway.

In the case of #2, everything should be fine, because an arbitrary value
is no worse than the arbitrary value of 0 we use right now.

I could always store the Oid in the table also, so that #2 wouldn't
happen unless the table was truncated (by TRUNCATE, CLUSTER, or VACUUM
FULL) after a scan.

Regards,
Jeff Davis

#19 Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#11)
Re: old synchronized scan patch

On Tue, 2006-12-05 at 10:38 +0000, Heikki Linnakangas wrote:

Hannu Krosing wrote:

On Mon, 2006-12-04 at 21:46, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Since I am not storing any pointers, and since the information is only
really a hint, I don't need to do any locking on that page.

If you think that, you need not bother to submit the patch. (Hint:
as soon as you consider more than one table at a time, it doesn't work,
even ignoring the question of inconsistent reads.)

Why does it not work?

Are you suggesting that another backend can somehow see only some bits
of the page number being written?

What problems do you see in the multiple-table case?

You need to manage adding and removing relations from the shared memory
structure. Which typically needs locking.

Assuming that relations are added or removed relatively seldom, you
might get away with a table of (Oid, BlockNumber) pairs, working around
the fact that the table might get messed up every now and then, and when
it does, you'll lose the benefits until it gets corrected. But it seems
really messy to me.

I don't think it's messy if we think of it as a hint system. I think my
problem is that I am calling it Synchronized Scanning, which is a
misnomer because I am not enforcing the synchronization. I call it that
because that's what Simon Riggs called it back when I first suggested
it, but in reality it's just a hint system.

I see that right now we start every scan at 0, which is a completely
arbitrary value. I am just trying to add a little intelligence to it
with the hint, even if sometimes (in very rare circumstances) the
intelligence turns out to be just as dumb as picking the value 0.

That being said, I could just lock the hint table (the shared-memory hint
table, not the relation) and only update the hint every K pages, as Neil
Conway suggested when I first proposed it. If we find a K small enough
that the feature remains useful, but large enough that we're sure there
won't be contention, this might be a good option. However, I don't know
that it would eliminate the contention: if K is a constant (rather
than random), the backends would still all want to update that shared
memory table at the same time.

Regards,
Jeff Davis

#20Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#14)
Re: old synchronized scan patch

On Tue, 2006-12-05 at 10:49 -0500, Tom Lane wrote:

"Florian G. Pflug" <fgp@phlo.org> writes:

Hannu Krosing wrote:

The worst that can happen is a hash collision, in which case you lose
the benefits of sync scans, but you won't degrade compared to non-sync
scans.

But it could cause "mysterious" performance regressions, no?

There are other issues for the "no lock" approach that Jeff proposes.
Suppose that we have three or four processes that are actually doing
synchronized scans of the same table. They will have current block
numbers that are similar but probably not identical. They will all be
scribbling on the same hashtable location. So if another process comes
along to join the "pack", it might get the highest active block number,
or the lowest, or something in between. Even discounting the possibility
that it gets random bits because it managed to read the value
non-atomically, how well is that really going to work?

That's an empirical question. I expect it will work, since any active
scan will have a significant cache trail behind it.

Another thing that we have to consider is that the actual block read
requests will likely be distributed among the "pack leaders", rather
than all being issued by one process. AFAIK this will destroy the
kernel's ability to do read-ahead, because it will fail to recognize
that sequential reads are being issued --- any single process is *not*
reading sequentially, and I think that read-ahead scheduling is
generally driven off single-process behavior rather than the emergent
behavior of the whole system. (Feel free to contradict me if you've
actually read any kernel code that does this.) It might still be better
than unsynchronized reads, but it'd be leaving a lot on the table.

That's a very interesting point. I had assumed read-ahead was at the
kernel level without really caring what processes issued the requests,
but it may be system-dependent. I think that's what the elevator (or I/O
scheduler, or whatever it's called) is supposed to do. I'll see if I can
find some relevant source code in a Linux or FreeBSD box.

The controller certainly wouldn't care about process IDs, however.

Regards,
Jeff Davis

#21Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#17)
#22Jeff Davis
pgsql@j-davis.com
In reply to: Florian Pflug (#13)
#23Jeff Davis
pgsql@j-davis.com
In reply to: Zeugswetter Andreas SB SD (#10)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#21)
#25Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#25)
#27Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeff Davis (#19)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#26)
#29Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#28)
#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#30)
#32Jeff Davis
pgsql@j-davis.com
In reply to: Jim Nasby (#27)
#33Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#30)
#34Michael Glaesemann
grzm@seespotcode.net
In reply to: Jeff Davis (#33)
#35Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#31)
#36Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#31)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#36)
#38Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#37)
#39Csaba Nagy
nagy@ecircle-ag.com
In reply to: Tom Lane (#37)
#40Florian Pflug
fgp@phlo.org
In reply to: Tom Lane (#37)
#41Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#28)
#42Bruce Momjian
bruce@momjian.us
In reply to: Csaba Nagy (#39)
#43Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Luke Lonergan (#2)
#44Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Bruce Momjian (#41)
#45Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#37)
#46Jeff Davis
pgsql@j-davis.com
In reply to: Jim Nasby (#38)
#47Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Heikki Linnakangas (#43)