Tuplesort merge pre-reading

Started by Heikki Linnakangas, over 9 years ago. 61 messages, pgsql-hackers.
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

While reviewing Peter's latest round of sorting patches, and trying to
understand the new "batch allocation" mechanism, I started to wonder how
useful the pre-reading in the merge stage is in the first place.

I'm talking about the code that reads a bunch of tuples from each tape, loading
them into the memtuples array. That code was added by Tom Lane, back in
1999:

commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Oct 30 17:27:15 1999 +0000

Further performance improvements in sorting: reduce number of
comparisons
during initial run formation by keeping both current run and next-run
tuples in the same heap (yup, Knuth is smarter than I am). And, during
merge passes, make use of available sort memory to load multiple tuples
from any one input 'tape' at a time, thereby improving locality of
access to the temp file.

So apparently there was a benefit back then, but is it still worthwhile?
The LogicalTape code buffers one block at a time anyway; how much gain are
we getting from parsing the tuples into SortTuple format in batches?
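The shape of the change being questioned can be sketched outside of PostgreSQL: a k-way merge only ever needs the current frontmost tuple from each tape, so a one-tuple-per-tape slot plus a heap is enough. This is a hypothetical Python sketch of that idea, not the tuplesort.c code; the function name and structure are illustrative only:

```python
import heapq

def merge_tapes(tapes):
    # K-way merge that keeps just one "current" tuple per input tape,
    # instead of pre-reading a batch from every tape into a shared array.
    # Each tape is any iterable yielding values in sorted order.
    iters = [iter(t) for t in tapes]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)

    result = []
    while heap:
        value, i = heapq.heappop(heap)
        result.append(value)
        nxt = next(iters[i], None)  # refill only the slot we consumed
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return result

print(merge_tapes([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
```

The point of the sketch is that correctness never requires more than one in-memory tuple per tape; pre-reading batches is purely an I/O locality optimization.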

I wrote a quick patch to test that, attached. It seems to improve
performance, at least in this small test case:

create table lotsofints(i integer);
insert into lotsofints select random() * 1000000000.0 from
generate_series(1, 10000000);
vacuum freeze;

select count(*) FROM (select * from lotsofints order by i) t;

On my laptop, with default work_mem=4MB, that select takes 7.8 s on
unpatched master, and 6.2 s with the attached patch.

So, at least in some cases, the pre-loading hurts. I think we should get
rid of it. This patch probably needs some polishing: I replaced the
batch allocations with a simpler scheme with a buffer to hold just a
single tuple for each tape, and that might need some more work to allow
downsizing those buffers if you have a few very large tuples in an
otherwise narrow table. And perhaps we should free and reallocate a
smaller memtuples array for the merging, now that we're not making use
of the whole of it. And perhaps we should teach LogicalTape to use
larger buffers, if we can't rely on the OS to do the buffering for us.
But overall, this seems to make the code both simpler and faster.

Am I missing something?

- Heikki

Attachments:

0001-Don-t-bother-to-pre-read-tuples-into-slots-during-me.patch (application/x-patch, +80/-408)
#2 Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#1)
Re: Tuplesort merge pre-reading

On Tue, Sep 6, 2016 at 5:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I wrote a quick patch to test that, attached. It seems to improve
performance, at least in this small test case:

create table lotsofints(i integer);
insert into lotsofints select random() * 1000000000.0 from
generate_series(1, 10000000);
vacuum freeze;

select count(*) FROM (select * from lotsofints order by i) t;

On my laptop, with default work_mem=4MB, that select takes 7.8 s on
unpatched master, and 6.2 s with the attached patch.

The benefits have a lot to do with OS read-ahead, and efficient use of
memory bandwidth during the merge, where we want to access the caller
tuples sequentially per tape (i.e. that's what the batch memory stuff
added -- it also made much better use of available memory). Note that
I've been benchmarking the parallel CREATE INDEX patch on a server
with many HDDs, since sequential performance is mostly all that
matters. I think that in 1999, the preloading had a lot more to do
with logtape.c's ability to aggressively recycle blocks during merges,
such that the total storage overhead does not exceed the original size
of the caller tuples (plus what it calls "trivial bookkeeping
overhead" IIRC). That's less important these days, but still matters
some (it's more of an issue when you can't complete the sort in one
pass, which is rare these days).

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#2)
Re: Tuplesort merge pre-reading

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).

--
Peter Geoghegan


#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Geoghegan (#3)
Re: Tuplesort merge pre-reading

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, like was the
purpose of the old pre-reading code. But simpler. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.

I wrote a little testing toolkit, see third patch. I'm not proposing to
commit that, but that's what I used for testing. It creates four tables,
about 1GB in size each (it also creates smaller and larger tables, but I
used the "medium" sized ones for these tests). Two of the tables contain
integers, and two contain text strings. Two of the tables are completely
ordered, two are in random order. To measure, it runs ORDER BY queries
on the tables, with different work_mem settings.

Attached are the full results. In summary, these patches improve
performance in some of the tests, and are a wash on others. The patches
help in particular in the randomly ordered cases, with up to 512 MB of
work_mem.

For example, with work_mem=256MB, which is enough to get a single merge
pass:

with patches:

ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:

ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(The same queries were run three times in a row; that's what the three
numbers on each row mean. The differences between the numbers on the
same row are just noise.)

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

I'm not 100% sure I'm accounting for all the memory correctly. But I
didn't touch the way the initial quicksort works, nor the way the runs
are built. And the merge passes don't actually need or benefit from a
lot of memory, so I doubt it's very sensitive to that.

In this patch, the memory available for the read buffers is just divided
evenly across maxTapes. The buffers for the tapes that are not currently
active are wasted. It could be made smarter, by freeing all the
currently-unused buffers for tapes that are not active at the moment.
Might do that later, but this is what I'm going to benchmark for now. I
don't think adding buffers is helpful beyond a certain point, so this is
probably good enough in practice. Although it would be nice to free the
memory we don't need earlier, in case there are other processes that
could make use of it.
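The even division described above amounts to something like the following. This is a hypothetical sketch with made-up names (`per_tape_buffer` is not a real function in the patch); the real accounting in tuplesort.c and logtape.c differs in detail:

```python
BLCKSZ = 8192  # PostgreSQL's default block size

def per_tape_buffer(avail_mem, max_tapes):
    # Divide the memory reserved for read buffers evenly across all
    # maxTapes tapes, rounded down to whole blocks, with a one-block
    # floor so every tape can at least read something.
    share = avail_mem // max_tapes
    blocks = max(1, share // BLCKSZ)
    return blocks * BLCKSZ

# e.g. 4 MB of buffer memory split over 7 tapes:
print(per_tape_buffer(4 * 1024 * 1024, 7))
```

As the message notes, buffers computed this way for inactive tapes are simply wasted under this scheme; a smarter version would redistribute them to the currently active tapes.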

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).

Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just
double-buffering.

- Heikki

Attachments:

0001-Don-t-bother-to-pre-read-tuples-into-SortTuple-slots.patch (text/x-diff, +223/-667)
0002-Use-larger-read-buffers-in-logtape.patch (text/x-diff, +147/-24)
0003-Add-sorting-test-suite.patch (text/x-diff, +521/-1)
results-master.txt (text/plain)
results-patched.txt (text/plain)
#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#4)
Re: Tuplesort merge pre-reading

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, like was the
purpose of the old pre-reading code. But simpler. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

It looks like there was a regression when sorting random text, with 256
MB work_mem. I suspect that was a fluke - I only ran these tests once
because they took so long. But I don't know for sure.

Claudio, if you could also repeat the tests you ran on Peter's patch set
on the other thread, with these patches, that'd be nice. These patches
are effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few
comment fixes, and a change to the 2nd patch to not allocate tape
buffers for tapes that were completely unused.

- Heikki

Attachments:

results-large-master.txt (text/plain)
results-large-patched.txt (text/plain)
0001-Don-t-bother-to-pre-read-tuples-into-SortTuple-slots.patch (text/x-diff, +226/-678)
0002-Use-larger-read-buffers-in-logtape.patch (text/x-diff, +162/-27)
#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#5)
Re: Tuplesort merge pre-reading

On 09/09/2016 02:13 PM, Heikki Linnakangas wrote:

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, like was the
purpose of the old pre-reading code. But simpler. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

I took a different tack to demonstrate the I/O pattern effects. I
added some instrumentation code to logtape.c, that prints a line to a
file whenever it reads a block, with the block number. I ran the same
query with master and with these patches, and plotted the access pattern
with gnuplot.
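For anyone who wants to eyeball such a trace numerically rather than with gnuplot, a crude sequentiality metric over the logged block numbers could look like this. This is a hypothetical helper, not part of the attached instrumentation patch; it assumes the trace is simply one block number per read, in order:

```python
def sequential_fraction(blocks):
    # Fraction of reads whose block number immediately follows the
    # previous one -- a rough proxy for how sequential the I/O is.
    if len(blocks) < 2:
        return 1.0
    seq = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1)
    return seq / (len(blocks) - 1)

print(sequential_fraction([0, 1, 2, 10, 11]))  # 3 of 4 steps are sequential
```

A higher fraction on the patched trace than on master would quantify the visual impression from the plots.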

I'm happy with what it looks like. We are in fact getting a more
sequential access pattern with these patches, because we're not
expanding the pre-read tuples into SortTuples. Keeping densely-packed
blocks in memory, instead of SortTuples, allows caching more data overall.

Attached is the patch I used to generate these traces, the gnuplot
script, and traces I got from sorting a 1 GB table of random
integers, with work_mem=16MB.

Note that in the big picture, what appear to be individual dots, are
actually clusters of a bunch of dots. So the access pattern is a lot
more sequential than it looks like at first glance, with or without
these patches. The zoomed-in pictures show that. If you want to inspect
these in more detail, I recommend running gnuplot in interactive mode,
so that you can zoom in and out easily.

I'm happy with the amount of testing I've done now, and the results.
Does anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

- Heikki

Attachments:

logtape-trace-patched-16MB (text/plain)
trace-logtape.patch (text/x-diff, +37/-0)
logtape-master.png (image/png)
logtape-master-zoomed.png (image/png)
logtape-patched.png (image/png)
logtape-patched-zoomed.png (image/png)
logtape.plot (text/plain)
logtape-trace-master-16MB (text/plain)
#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#5)
Re: Tuplesort merge pre-reading

(Resending with compressed files, to get below the message size limit of
the list)

On 09/09/2016 02:13 PM, Heikki Linnakangas wrote:

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, like was the
purpose of the old pre-reading code. But simpler. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

I took a different tack to demonstrate the I/O pattern effects. I
added some instrumentation code to logtape.c, that prints a line to a
file whenever it reads a block, with the block number. I ran the same
query with master and with these patches, and plotted the access pattern
with gnuplot.

I'm happy with what it looks like. We are in fact getting a more
sequential access pattern with these patches, because we're not
expanding the pre-read tuples into SortTuples. Keeping densely-packed
blocks in memory, instead of SortTuples, allows caching more data overall.

Attached is the patch I used to generate these traces, the gnuplot
script, and traces I got from sorting a 1 GB table of random
integers, with work_mem=16MB.

Note that in the big picture, what appear to be individual dots, are
actually clusters of a bunch of dots. So the access pattern is a lot
more sequential than it looks like at first glance, with or without
these patches. The zoomed-in pictures show that. If you want to inspect
these in more detail, I recommend running gnuplot in interactive mode,
so that you can zoom in and out easily.

I'm happy with the amount of testing I've done now, and the results.
Does anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

- Heikki

Attachments:

trace-logtape.patch (text/x-diff, +37/-0)
logtape-master.png (image/png)
logtape-master-zoomed.png (image/png)
logtape-patched.png (image/png)
logtape-patched-zoomed.png (image/png)
logtape.plot (text/plain)
logtape-trace-master-16MB.xz (application/x-xz)
logtape-trace-patched-16MB.xz (application/x-xz)
#8 Greg Stark
stark@mit.edu
In reply to: Heikki Linnakangas (#7)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm happy with what it looks like. We are in fact getting a more sequential
access pattern with these patches, because we're not expanding the pre-read
tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
SortTuples, allows caching more data overall.

Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.

--
greg


#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Stark (#8)
Re: Tuplesort merge pre-reading

On 09/09/2016 03:25 PM, Greg Stark wrote:

On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm happy with what it looks like. We are in fact getting a more sequential
access pattern with these patches, because we're not expanding the pre-read
tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
SortTuples, allows caching more data overall.

Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.

The first patch removed prefetching, but the second patch re-introduced
it, in a different form. The prefetching is now done in logtape.c, by
reading multiple pages at a time. The on-tape representation of tuples
is more compact than having them in memory as SortTuples, so you can fit
more data in memory overall, which makes the access pattern more sequential.

There's one difference between these approaches that I didn't point out
earlier: We used to prefetch tuples from each *run*, and stopped
pre-reading when we reached the end of the run. Now that we're doing the
prefetching as raw tape blocks, we don't stop at run boundaries. I don't
think that makes any big difference one way or another, but I thought
I'd mention it.
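In outline, the logtape.c side of this amounts to a read layer like the following: callers still fetch one tuple's worth of bytes at a time, but the tape refills its buffer in large multi-block reads that the OS can recognize as sequential. This is a hypothetical Python sketch of the buffering idea only, not the C implementation; the class name and block counts are illustrative:

```python
import io

class BufferedTape:
    # Wraps a file-like object; refills in large multi-block chunks so
    # tuple-at-a-time callers still produce big sequential reads.
    def __init__(self, f, buffer_blocks=32, block_size=8192):
        self.f = f
        self.chunk_size = buffer_blocks * block_size
        self.buf = b""
        self.pos = 0

    def read(self, n):
        # Refill with one big sequential read whenever the buffer runs low.
        while len(self.buf) - self.pos < n:
            more = self.f.read(self.chunk_size)
            if not more:
                break  # EOF: return whatever is left
            self.buf = self.buf[self.pos:] + more
            self.pos = 0
        out = self.buf[self.pos:self.pos + n]
        self.pos += len(out)
        return out

tape = BufferedTape(io.BytesIO(b"abcdefgh" * 1000))
print(tape.read(4))
```

Note that, as described above, a reader like this does not stop refilling at run boundaries; it just keeps pulling raw blocks.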

- Heikki


#10 Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#6)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 4:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm happy with the amount of testing I've done now, and the results. Does
anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

I'll try to look at this properly tomorrow. Currently still working
away at creating a new revision of my sorting patchset. Obviously this
is interesting, but it raises certain questions for the parallel
CREATE INDEX patch in particular that I'd like to get straight, aside
from everything else.

I've been using an AWS d2.4xlarge instance for testing all my recent
sort patches, with 16 vCPUs, 122 GiB RAM, 12 x 2 TB disks. It worked
well to emphasize I/O throughput and parallelism over latency. I'd
like to investigate how this pre-reading stuff does there. I recall
that for one very large case, it took a full minute to do just the
first round of preloading during the leader's final merge (this was
with something like 50GB of maintenance_work_mem). So, it will be
interesting.

BTW, noticed a typo here:

+ * track memory usage of indivitual tuples.

--
Peter Geoghegan


#11 Peter Geoghegan
pg@heroku.com
In reply to: Greg Stark (#8)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 5:25 AM, Greg Stark <stark@mit.edu> wrote:

Wow, this is really cool. We should do something like this for query
execution too.

We should certainly do this for tuplestore.c, too. I've been meaning
to adapt it to use batch memory. I did look at it briefly, and recall
that it was awkward because a surprisingly large number of callers
want to have memory that they can manage independently of
the lifetime of their tuplestore.

--
Peter Geoghegan


#12 Claudio Freire
klaussfreire@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so


#13 Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#12)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so

It seems both 1 and 1+2 break make check.

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Peter's patch needs some rebasing on top of those but nothing major.


#14 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Claudio Freire (#13)
Re: Tuplesort merge pre-reading

On 09/10/2016 04:21 AM, Claudio Freire wrote:

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so

Thanks!

It seems both 1 and 1+2 break make check.

Oh. Works for me. What's the failure you're getting?

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Yes. I have been testing without Peter's first patch, with just the two
patches I posted. But it should work together (and does, I just tested)
with that one as well.

- Heikki


#15 Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#14)
Re: Tuplesort merge pre-reading

On Sat, Sep 10, 2016 at 12:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Yes. I have been testing without Peter's first patch, with just the two
patches I posted. But it should work together (and does, I just tested) with
that one as well.

You're going to need to rebase, since the root displace patch is based
on top of my patch 0002-*, not Heikki's alternative. But, that should
be very straightforward.

--
Peter Geoghegan


#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#14)
Re: Tuplesort merge pre-reading

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.

- Heikki

Attachments:

0001-Change-the-way-pre-reading-in-external-sort-s-merge-.patch (text/x-diff, +389/-731)
#17 Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#16)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.

I think I have my head fully around this now. For some reason, I
initially thought that this patch was a great deal more radical than
it actually is. (Like Greg, I somehow initially thought that you were
rejecting the idea of batch memory in general, and somehow (over)
leveraging the filesystem cache. I think I misunderstood your remarks
when we talked on IM about it early on.)

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).

I mention this because I think that cost_sort() should be updated to
consider sequential I/O the norm, alongside this patch of yours (your
patch strengthens the argument for that general idea; see [1]:
/messages/by-id/CAM3SWZQLP6e=1si1NcQjYft7R-VYpprrf_i59tZOZX5m7VFK-w@mail.gmail.com). The reason
that this new approach produces more sequential I/O, apart from the
simple fact that you effectively have much more memory available and
so fewer rounds of preloading, is that the indirection block stuff can
make I/O less sequential in order to support eager reclamation of
space. For example, maybe there is interleaving of blocks as logtape.c
manages to reclaim blocks in the event of multiple merge steps. I care
about that second factor a lot more now than I would have a year ago,
when a final on-the-fly merge generally avoids multiple passes (and
associated logtape.c block fragmentation), because parallel CREATE
INDEX is usually affected by that (workers will often want to do their
own merge ahead of the leader's final merge), and because I want to
cap the number of tapes used, which will make multiple passes a bit
more common in practice.

I was always suspicious of the fact that memtuples is so large during
merging, and had a vague plan to fix that (although I was the one
responsible for growing memtuples even more for the merge in 9.6, that
was just because under the status quo of having many memtuples during
the merge, the size of memtuples should at least be in balance with
remaining memory for caller tuples -- it wasn't an endorsement of the
status quo). However, it never occurred to me to do that by pushing
batch memory into the head of logtape.c, which now seems like an
excellent idea.

To summarize my understanding of this patch: If it wasn't for my work
on parallel CREATE INDEX, I would consider this patch to give us only
a moderate improvement to user-visible performance, since it doesn't
really help memory rich cases very much (cases that are not likely to
have much random I/O anyway). In that universe, I'd be more
appreciative of the patch as a simplifying refactoring exercise. While
it's nice to get a boost for more memory constrained cases, it's not a
huge win. However, that's not the universe we live
in -- I *am* working on parallel CREATE INDEX, of course. And so, I
really am very excited about this patch, because it really helps with
the particular challenges I have there, even though it's fair to
assume that we have a reasonable amount of memory available when
parallelism is used. If workers are going to do their own merges, as
they often will, then multiple merge pass cases become far more
important, and any additional I/O is a real concern, *especially*
additional random I/O (say from logtape.c fragmentation). The patch
directly addresses that, which is great. Your patch, alongside my
just-committed patch concerning how we maintain the heap invariant,
together attack the merge bottleneck from two different directions:
they address I/O costs, and CPU costs, respectively.

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.

* I think that logtape.c header comments are needed for this. Maybe
that's where you should point out that memory access is largely
sequential. But it's surely true that logtape.c needs to take
responsibility for being the place where the vast majority of memory
is allocated during merging.

* I think you should move "bool *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.

* You didn't carry over my 0002-* batch memory patch modifications to
comments, even though you should have in a few cases. There remains
some references in comments to batch memory, as a thing exclusively
usable by final on-the-fly merges. That's not true anymore -- it's
usable by non-final merges, too. For example, readtup_alloc() still
references the final on-the-fly merge.

* You also fail to take credit in the commit message for making batch
memory usable when returning caller tuples to callers that happen to
request "randomAccess" (So, I guess the aforementioned comments above
routines like readtup_alloc() shouldn't even refer to merging, unless
it's to say that non-final merges are not supported due to their
unusual requirements). My patch didn't go that far (I only addressed
the final merge itself, not the subsequent access to tuples when
reading from that materialized final output tape by TSS_SORTEDONTAPE
case). But, that's actually really useful for randomAccess callers,
above and beyond what I proposed (which in any case was mostly written
with parallel workers in mind, which never do TSS_SORTEDONTAPE
processing).

* Furthermore, readtup_alloc() will not just be called in WRITETUP()
routines -- please update comments.

* There is a very subtle issue here:

+   /*
+    * We no longer need a large memtuples array, only one slot per tape. Shrink
+    * it, to make the memory available for other use. We only need one slot per
+    * tape.
+    */
+   pfree(state->memtuples);
+   FREEMEM(state, state->memtupsize * sizeof(SortTuple));
+   state->memtupsize = state->maxTapes;
+   state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+   USEMEM(state, state->memtupsize * sizeof(SortTuple));

The FREEMEM() call needs to count the chunk overhead in both cases. In
short, I think you need to copy the GetMemoryChunkSpace() stuff that
you see within grow_memtuples().
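
Concretely, the accounting could mirror grow_memtuples(). This is a
sketch only, not a tested fix; it assumes the existing tuplesort.c
FREEMEM()/USEMEM() macros and memutils' GetMemoryChunkSpace(), which
counts the palloc chunk header that a bare sizeof() calculation misses:

```c
/*
 * Sketch: shrink memtuples to one slot per tape, keeping the memory
 * accounting in balance with what was actually charged at palloc time.
 */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
pfree(state->memtuples);
state->memtupsize = state->maxTapes;
state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
```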

* Whitespace issue here:

@@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
#endif

/*
-    * Decrease availMem to reflect the space needed for tape buffers; but
+    * Decrease availMem to reflect the space needed for tape buffers, when
+    * writing the initial runs; but
* don't decrease it to the point that we have no room for tuples. (That
* case is only likely to occur if sorting pass-by-value Datums; in all
* other scenarios the memtuples[] array is unlikely to occupy more than
@@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
state->tapeset = LogicalTapeSetCreate(maxTapes);

* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful, for multiple merge
passes, but the situation has notably changed for it.

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+       /* Allocate a read buffer */
+       if (lt->buffer)
+           pfree(lt->buffer);
+       lt->buffer = palloc(lt->read_buffer_size);
+       lt->buffer_size = lt->read_buffer_size;

* Typo:

+
+   /*
+    * from this point on, we no longer use the usemem()/lackmem() mechanism to
+    * track memory usage of indivitual tuples.
+    */
+   state->batchused = true;

* Please make this use the ".., + 1023" natural rounding trick that is
used in the similar traces that are removed:

+#ifdef TRACE_SORT
+   if (trace_sort)
+       elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
+            (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
+#endif

* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):

+   /*
+    * Use all the spare memory we have available for read buffers. Divide it
+    * memory evenly among all the tapes.
+    */
+   avail_blocks = state->availMem / BLCKSZ;
+   per_tape = avail_blocks / maxTapes;
+   cutoff = avail_blocks % maxTapes;
+   if (per_tape == 0)
+   {
+       per_tape = 1;
+       cutoff = 0;
+   }
+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+   }

In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend
defensive code also be added. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, since, as we saw recently, especially since grow_memtuples()
will not actually have the chance to save us here (if there is a bug
along these lines, let's at least make the right "can't happen error"
complaint to user when it pops up).

* It looks like your patch makes us less eager about freeing per-tape
batch memory, now held as preload buffer space within logtape.c.

ISTM that there should be some way to have the "tape exhausted" code
path within tuplesort_gettuple_common() (as well as the similar spot
within mergeonerun()) instruct logtape.c that we're done with that
tape. In other words, when mergeprereadone() (now renamed to
mergereadnext()) detects the tape is exhausted, it should have
logtape.c free its huge tape buffer immediately. Think about cases
where input is presorted, and earlier tapes can be freed quite early
on. It's worth keeping that behavior (you removed the old way that this
happened, through mergebatchfreetape()).

That's all I have right now. I like the direction this is going in,
but I think this needs more polishing.

[1]: /messages/by-id/CAM3SWZQLP6e=1si1NcQjYft7R-VYpprrf_i59tZOZX5m7VFK-w@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan


#18Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

* Please make this use the ".., + 1023" natural rounding trick that is
used in the similar traces that are removed:

+#ifdef TRACE_SORT
+   if (trace_sort)
+       elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
+            (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
+#endif

Also, please remove the int cast, and use INT64_FORMAT. Again, this
should match existing trace_sort traces concerning batch memory.
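
Putting both requests together, the trace might look something like
this (a sketch only; the variable names follow the quoted hunk, and the
exact message wording would need to match the traces being removed):

```c
#ifdef TRACE_SORT
	if (trace_sort)
		elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers in %d tapes, " INT64_FORMAT " KB per tape",
			 (state->availMem + 1023) / 1024, maxTapes,
			 (((int64) per_tape * BLCKSZ) + 1023) / 1024);
#endif
```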

--
Peter Geoghegan


#19Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+       /* Allocate a read buffer */
+       if (lt->buffer)
+           pfree(lt->buffer);
+       lt->buffer = palloc(lt->read_buffer_size);
+       lt->buffer_size = lt->read_buffer_size;

Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocHugeSize). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.
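
For instance, the allocation in LogicalTapeRewind() might become
something like the following (a sketch, assuming read_buffer_size and
buffer_size are widened to Size first; MemoryContextAllocHuge() is the
existing memutils entry point that permits allocations past
MaxAllocSize):

```c
/* Allocate a read buffer; may legitimately exceed MaxAllocSize */
if (lt->buffer)
	pfree(lt->buffer);
lt->buffer = (char *) MemoryContextAllocHuge(CurrentMemoryContext,
											 lt->read_buffer_size);
lt->buffer_size = lt->read_buffer_size;
```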

More generally, I think you also need to explain in comments why there
is a "read_buffer_size" field in addition to the "buffer_size" field.
--
Peter Geoghegan


#20Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+   }

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

(Looks again...)

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

+   /*
+    * If we had fewer runs than tapes, refund buffers for tapes that were never
+    * allocated.
+    */
+   maxTapes = state->maxTapes;
+   if (state->currentRun < maxTapes)
+   {
+       FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
+       maxTapes = state->currentRun;
+   }

I find this confusing, and think it's probably buggy. I don't think
you should have a local variable called maxTapes that you modify at
all, since state->maxTapes is supposed to not change once established.
You can't use state->currentRun like that, either, I think, because
it's the high watermark number of runs (quicksorted runs), not runs
after any particular merge phase, where we end up with fewer runs as
they're merged (we must also consider dummy runs to get this) -- we
want something like activeTapes. cf. the code you removed for the
beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
there are multiple merge passes, which suggests all this code really
has no business being in mergeruns() (it should instead be in
beginmerge(), or code that beginmerge() reliably calls).

Immediately afterwards, you do this:

+   /* Initialize the merge tuple buffer arena.  */
+   state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+   state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+   state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+   USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);

The fact that you size the buffer based on "maxTapes + 1" also
suggests a problem. I think that the code looks like this because it
must instruct logtape.c that the destTape tape requires some buffer
(iff there is to be a non-final merge). Is that right? I hope that you
don't give the typically unused destTape tape a full share of batch
memory all the time (the same applies to any other
inactive-at-final-merge tapes).

--
Peter Geoghegan

