tableam vs. TOAST
In a nearby thread[1], Ashwin Agrawal complained that there is no way
for a table AM to get rid of the TOAST table that the core system thinks
should be created. To that I added a litany of complaints of my own,
including...
- the core system decides whether or not a TOAST table is needed based
on criteria that are very much heap-specific,
- the code for reading and writing values stored in a TOAST table is
heap-specific, and
- the core system assumes that you want to use the same table AM for
the main table and the toast table, but you might not (e.g. you might
want to use the regular old heap for the latter).
Attached is a series of patches which try to improve things in this
area. Except possibly for 0001, this is v13 material; see discussion
on the other thread. These likely need some additional work, but I've
done enough with them that I thought it would be worth publishing them
at this stage, because it seems that I'm not the only one thinking
about the problems that exist in this general area. Here is an
overview:
0001 moves the needs_toast_table() calculation below the table AM
layer. That allows a table AM to decide for itself whether it wants a
TOAST table. The most obvious way in which a table AM might want to
be different from what core expects is to decide that the answer is
always "no," which it can do if it has some other method of storing
large values or doesn't wish to support them. Another possibility is
that it wants logic that is basically similar to the heap, but with a
different size threshold because its tuple format is different. There
are probably other possibilities.
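The shape of that per-AM decision can be sketched in a standalone form. All names below are hypothetical stand-ins, not the actual tableam callback signature; the real callback operates on a Relation and the genuine heap threshold is TOAST_TUPLE_THRESHOLD:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the real catalog structures. */
typedef struct DemoColumn
{
	size_t		max_width;		/* worst-case on-disk width of this column */
	bool		varlena;		/* can this column be toasted at all? */
} DemoColumn;

/* Illustrative threshold, analogous in spirit to TOAST_TUPLE_THRESHOLD. */
#define DEMO_TOAST_THRESHOLD 2000

/*
 * Heap-like policy: a TOAST table is wanted if the worst-case tuple
 * width exceeds the threshold and at least one column is toastable.
 */
static bool
demo_heaplike_needs_toast_table(const DemoColumn *cols, int ncols)
{
	size_t		width = 0;
	bool		has_varlena = false;

	for (int i = 0; i < ncols; i++)
	{
		width += cols[i].max_width;
		if (cols[i].varlena)
			has_varlena = true;
	}
	return has_varlena && width > DEMO_TOAST_THRESHOLD;
}

/* An AM with its own big-value storage can simply always answer "no". */
static bool
demo_no_toast_needs_toast_table(const DemoColumn *cols, int ncols)
{
	(void) cols;
	(void) ncols;
	return false;
}
```

With the callback below the table AM layer, an AM can supply either policy, or a heap-like one with its own threshold tuned to its tuple format.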
0002 breaks tuptoaster.c into three separate files. It just does code
movement; no functional changes. The three pieces are detoast.c,
which handles detoasting of toast values and inspection of the sizes
of toasted datums; heaptoast.c, which keeps all the functions that are
intrinsically heap-specific; and toast_internals.c, which is intended
to have a very limited audience. A nice fringe benefit of this stuff
is that a lot of other files that currently have to include tuptoaster.h
and thus htup_details.h no longer do.
0003 creates a new file toast_helper.c which is intended to help table
AMs implement insertion and deletion of toast table rows. Most of the
AM-independent logic from the functions remaining in heaptoast.c is
moved to this file. This leaves ~600 of the original ~2400
lines from tuptoaster.c as heap-specific logic, but a new heap AM
actually wouldn't need all of that stuff, because some of the logic
here is in support of stuff like record types, which use HeapTuple
internally and will continue to do so even if those record types are
stored in some other kind of table.
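The AM-independent driver loop that such a helper factors out can be sketched in miniature. This is a toy model with invented names and made-up sizes, not the toast_helper API itself; it only illustrates the "shrink the biggest attribute until the tuple fits" algorithm that the helper encapsulates:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of one toastable attribute. */
typedef struct DemoAttr
{
	size_t		size;
	bool		compressed;
	bool		externalized;
} DemoAttr;

#define DEMO_MAX_TUPLE 100		/* pretend maximum in-line tuple size */

static size_t
demo_tuple_size(const DemoAttr *attrs, int n)
{
	size_t		total = 0;

	for (int i = 0; i < n; i++)
		total += attrs[i].size;
	return total;
}

/* Pick the largest attribute not yet moved out of line, or -1. */
static int
demo_find_biggest(const DemoAttr *attrs, int n)
{
	int			best = -1;
	size_t		best_size = 0;

	for (int i = 0; i < n; i++)
		if (!attrs[i].externalized && attrs[i].size > best_size)
		{
			best = i;
			best_size = attrs[i].size;
		}
	return best;
}

/*
 * Skeleton of the driver loop: first compress, then externalize, the
 * biggest attribute, repeatedly, until the tuple fits in line.
 */
static void
demo_toast_tuple(DemoAttr *attrs, int n)
{
	while (demo_tuple_size(attrs, n) > DEMO_MAX_TUPLE)
	{
		int			i = demo_find_biggest(attrs, n);

		if (i < 0)
			break;				/* nothing left to shrink */
		if (!attrs[i].compressed)
		{
			attrs[i].compressed = true;
			attrs[i].size /= 2;	/* pretend 2:1 compression */
		}
		else
		{
			attrs[i].externalized = true;
			attrs[i].size = 18;	/* pretend size of a toast pointer */
		}
	}
}
```

The heap-specific ~600 lines are then mostly about building the final HeapTuple, while a loop of this shape is what any AM can share.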
0004 allows TOAST tables to be implemented using a table AM other than
heap. In a certain sense this is the opposite of 0003. 0003 is
intended to help people who are implementing a new kind of main table,
whereas 0004 is intended to help people implementing a new kind of
TOAST table. It teaches the code that inserts, deletes, and retrieves
TOAST rows to use slots, and it makes some efficiency improvements in
the hopes of offsetting any performance loss from so doing. See
commit message and/or patch for full details.
I believe that with all of these changes it should be pretty
straightforward for a table AM that wants to use itself to store TOAST
data to do so, or to delegate that task back to, say, the regular heap.
I haven't really validated that yet, but plan to do so.
In addition to what's in this patch set, I believe that we should
probably rename some of these functions and macros, so that the
heap-specific ones have heap-specific names and the generic ones
don't, but I haven't gone through all of that yet. The existing
patches try to choose good names for the new things they add, but they
don't rename any of the existing stuff. I also think we should
consider removing TOAST_MAX_CHUNK_SIZE from the control file, both
because I'm not sure anybody's really using the ability to vary that
for anything and because that solution doesn't seem entirely sensible
in a world of multiple AMs. However, that is a debatable change, so
maybe others will disagree.
[1]: /messages/by-id/CALfoeitE+P8UGii8=BsGQLpHch2EZWJhq4M+D-jfaj8YCa_FSw@mail.gmail.com
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v1-0004-Allow-TOAST-tables-to-be-implemented-using-table-.patch (+273/-148)
v1-0001-tableam-Move-heap-specific-logic-from-needs_toast.patch (+92/-48)
v1-0003-Create-an-API-for-inserting-and-deleting-rows-in-.patch (+490/-358)
v1-0002-Split-tuptoaster.c-into-three-separate-files.patch (+2603/-2550)
Updated and rebased patches attached.
On Fri, May 17, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
0001 moves the needs_toast_table() calculation below the table AM
layer. That allows a table AM to decide for itself whether it wants a
TOAST table. The most obvious way in which a table AM might want to
be different from what core expects is to decide that the answer is
always "no," which it can do if it has some other method of storing
large values or doesn't wish to support them. Another possibility is
that it wants logic that is basically similar to the heap, but with a
different size threshold because its tuple format is different. There
are probably other possibilities.
This was committed as 1171d7d58545f26a402f76a05936d572bf29d53b per
discussion on another thread.
0002 breaks tuptoaster.c into three separate files. It just does code
movement; no functional changes. The three pieces are detoast.c,
which handles detoasting of toast values and inspection of the sizes
of toasted datums; heaptoast.c, which keeps all the functions that are
intrinsically heap-specific; and toast_internals.c, which is intended
to have a very limited audience. A nice fringe benefit of this stuff
is that a lot of other files that currently have to include tuptoaster.h
and thus htup_details.h no longer do.
Now 0001. No changes.
0003 creates a new file toast_helper.c which is intended to help table
AMs implement insertion and deletion of toast table rows. Most of the
AM-independent logic from the functions remaining in heaptoast.c is
moved to this file. This leaves ~600 of the original ~2400
lines from tuptoaster.c as heap-specific logic, but a new heap AM
actually wouldn't need all of that stuff, because some of the logic
here is in support of stuff like record types, which use HeapTuple
internally and will continue to do so even if those record types are
stored in some other kind of table.
Now 0002. No changes.
0004 allows TOAST tables to be implemented using a table AM other than
heap. In a certain sense this is the opposite of 0003. 0003 is
intended to help people who are implementing a new kind of main table,
whereas 0004 is intended to help people implementing a new kind of
TOAST table. It teaches the code that inserts, deletes, and retrieves
TOAST rows to use slots, and it makes some efficiency improvements in
the hopes of offsetting any performance loss from so doing. See
commit message and/or patch for full details.
Now 0003. Some brain fade repaired.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v2-0003-Allow-TOAST-tables-to-be-implemented-using-table-.patch (+283/-148)
v2-0002-Create-an-API-for-inserting-and-deleting-rows-in-.patch (+490/-358)
v2-0001-Split-tuptoaster.c-into-three-separate-files.patch (+2603/-2550)
On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
Updated and rebased patches attached.
And again.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v3-0003-Allow-TOAST-tables-to-be-implemented-using-table-.patch (+282/-147)
v3-0001-Split-tuptoaster.c-into-three-separate-files.patch (+2602/-2550)
v3-0002-Create-an-API-for-inserting-and-deleting-rows-in-.patch (+490/-358)
On Tue, Jun 11, 2019 at 9:47 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
Updated and rebased patches attached.
And again.
Hi Robert,
I have tested the TOAST patches (v3) with different storage options
(MAIN, EXTERNAL, EXTENDED, etc.) and combinations of compression and
out-of-line storage options.
I used a few dummy tables with various tuple counts, say 10k, 20k, 40k,
etc., with different column lengths.
I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size =
10GB) before each test to avoid performance fluctuations, and calculated
the results as the median of a few consecutive test executions.
Please find attached the SQL script which I used to perform the
observations.
Below are the test scenarios and how I checked the behavior and
performance of the TOAST patches against PG master.
1. where a single column is compressed (SCC)
2. where multiple columns are compressed (MCC)
    -- ALTER the table column(s) for storage as "MAIN" to make sure that
the column values are COMPRESSED.
3. where a single column is pushed to the TOAST table but not
compressed (SCTNC)
4. where multiple columns are pushed to the TOAST table but not
compressed (MCTNC)
    -- ALTER the table column(s) for storage as "EXTERNAL" to make sure
that the column values are pushed to the TOAST table but not COMPRESSED.
5. where a single column is pushed to the TOAST table and also
compressed (SCTC)
6. where multiple columns are pushed to the TOAST table and also
compressed (MCTC)
    -- ALTER the table column(s) for storage as "EXTENDED" to make sure
that the column values are pushed to the TOAST table and also COMPRESSED.
7. updating the tuples with similar data shouldn't affect the behavior of
the storage options.
Please find my observation as below:
System Used: (VCPUs: 8, RAM: 16GB, Size: 640GB)
All times in milliseconds ("w/o" = without patch, "w/" = with patch):

                   10000 tuples              20000 tuples              40000 tuples               80000 tuples
Test               w/o         w/            w/o         w/            w/o          w/            w/o          w/
1. SCC INSERT      125921.737  125992.563    234263.295  235952.336    497290.442   502820.139    948470.603   941778.952
1. SCC UPDATE      263017.814  270893.910    488393.748  507937.377    1078862.613  1053029.428   2037119.576  2023633.862
2. MCC INSERT      35415.089   35910.552     70899.737   70800.964     142185.996   142241.913    -            -
2. MCC UPDATE      72043.757   73848.732     137717.696  137577.606    276358.752   276520.727    -            -
3. SCTNC INSERT    26377.274   25600.189     45702.630   45163.510     99903.299    100013.004    -            -
3. SCTNC UPDATE    78385.225   76680.325     151823.250  153503.971    308197.734   308474.937    -            -
4. MCTNC INSERT    26214.069   25383.522     50826.522   50221.669     106034.338   106122.827    -            -
4. MCTNC UPDATE    78423.817   75154.593     158885.787  156530.964    319721.266   322385.709    -            -
5. SCTC INSERT     38451.022   38652.520     71590.748   71048.975     143327.913   142593.207    -            -
5. SCTC UPDATE     82069.311   81678.131     138763.508  138625.473    277534.080   277091.611    -            -
6. MCTC INSERT     36325.730   35803.368     73285.204   72728.371     142324.859   144368.335    -            -
6. MCTC UPDATE     73740.729   73002.511     141309.859  139676.173    278906.647   279522.408    -            -

(80000-tuple runs were performed only for Test 1.)
All the observations look good to me,
except that for SCC UPDATE with tuple counts 10K/20K, and for SCC
INSERT with tuple count 40K, there was a slight increase in the time taken
in the "with patch" results. For a better observation, I also ran
the same Test 1 with a higher tuple count (80K), and that also looks fine.
I have also performed the below tests with TOAST table objects.
8. pg_dump/restore, pg_upgrade with these
9. Streaming Replication setup
10. Concurrent Transactions
While testing a few concurrent transactions, I have the below query:
-- Do concurrent transactions acquire a lock for a TOAST option (ALTER
TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED etc.)?
-- Session 1:
CREATE TABLE a (a_id text PRIMARY KEY);
CREATE TABLE b (b_id text);
INSERT INTO a VALUES ('a'), ('b');
INSERT INTO b VALUES ('a'), ('b'), ('b');
BEGIN;
ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id);
-- Not Acquiring any lock
-- Session 2:
SELECT * FROM b WHERE b_id = 'a'; -- Shows result
-- Session 1:
ALTER TABLE b ALTER COLUMN b_id SET STORAGE EXTERNAL; -- Acquires a lock
-- Session 2:
SELECT * FROM b WHERE b_id = 'a'; -- Hangs, waiting for the lock held by
session 1
Is this an expected behavior?
--
With Regards,
Prabhat Kumar Sahu
On Wed, Jun 12, 2019 at 4:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
Updated and rebased patches attached.
And again.
Hi Robert,
Thus spake GCC:
detoast.c: In function ‘toast_fetch_datum’:
detoast.c:308:12: error: variable ‘toasttupDesc’ set but not used
[-Werror=unused-but-set-variable]
TupleDesc toasttupDesc;
^
--
Thomas Munro
https://enterprisedb.com
On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu
<prabhat.sahu@enterprisedb.com> wrote:
I have tested the TOAST patches(v3) with different storage options like(MAIN, EXTERNAL, EXTENDED, etc.), and
combinations of compression and out-of-line storage options.
I have used a few dummy tables with various tuple count say 10k, 20k, 40k, etc. with different column lengths.
Used manual CHECKPOINT option with (checkpoint_timeout = 1d, max_wal_size = 10GB) before the test to avoid performance fluctuations,
and calculated the results as a median value of a few consecutive test executions.
Thanks for testing.
All the observations look good to me,
except that for SCC UPDATE with tuple counts 10K/20K, and for SCC INSERT with tuple count 40K, there was a slight increase in the time taken
in the "with patch" results. For a better observation, I also ran the same Test 1 with a higher tuple count (80K), and that also looks fine.
Did you run each test just once? How stable are the results?
While testing few concurrent transactions I have below query:
-- Concurrent transactions acquire a lock for a TOAST option (ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED etc.)
-- Session 1:
CREATE TABLE a (a_id text PRIMARY KEY);
CREATE TABLE b (b_id text);
INSERT INTO a VALUES ('a'), ('b');
INSERT INTO b VALUES ('a'), ('b'), ('b');
BEGIN;
ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id); -- Not Acquiring any lock
For me, this acquires AccessShareLock and ShareRowExclusiveLock on the
target table.
rhaas=# select locktype, database, relation, pid, mode, granted from
pg_locks where relation = 'b'::regclass;
locktype | database | relation | pid | mode | granted
----------+----------+----------+-------+-----------------------+---------
relation | 16384 | 16872 | 93197 | AccessShareLock | t
relation | 16384 | 16872 | 93197 | ShareRowExclusiveLock | t
(2 rows)
I don't see what that has to do with the topic at hand, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jul 7, 2019 at 11:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Thus spake GCC:
detoast.c: In function ‘toast_fetch_datum’:
detoast.c:308:12: error: variable ‘toasttupDesc’ set but not used
[-Werror=unused-but-set-variable]
TupleDesc toasttupDesc;
^
Hmm, fixed, I hope.
Here's an updated patch set. In addition to the above fix, I fixed
things up for the new pgindent rules and added a fourth patch that
renames the detoasting functions to something that doesn't include
'heap.'
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0004-Rename-attribute-detoasting-functions.patch (+37/-38)
0003-Allow-TOAST-tables-to-be-implemented-using-table-AMs.patch (+284/-149)
0002-Create-an-API-for-inserting-and-deleting-rows-in-TOA.patch (+493/-358)
0001-Split-tuptoaster.c-into-three-separate-files.patch (+2602/-2550)
On Mon, Jul 8, 2019 at 9:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu
<prabhat.sahu@enterprisedb.com> wrote:
I have tested the TOAST patches (v3) with different storage options
like(MAIN, EXTERNAL, EXTENDED, etc.), and
combinations of compression and out-of-line storage options.
I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
Used manual CHECKPOINT option with (checkpoint_timeout = 1d,
max_wal_size = 10GB) before the test to avoid performance fluctuations,
and calculated the results as a median value of a few consecutive test
executions.
Thanks for testing.
All the observations look good to me,
except that for SCC UPDATE with tuple counts 10K/20K, and for SCC
INSERT with tuple count 40K, there was a slight increase in the time taken
in the "with patch" results. For a better observation, I also ran
the same Test 1 with a higher tuple count (80K), and that also looks fine.
Did you run each test just once? How stable are the results?
No, I executed each test multiple times (7 times each) and calculated the
result as the median among those,
and the results look stable (with the v3 patches).
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
On Tue, Jul 9, 2019 at 12:40 AM Prabhat Sahu
<prabhat.sahu@enterprisedb.com> wrote:
Did you run each test just once? How stable are the results?
No, I executed each test multiple times (7 times each) and calculated the result as the median among those,
and the results look stable (with the v3 patches).
I spent some time looking at your SCC test today. I think this isn't
really testing the code that actually got changed in the patch: a
quick CPU profile shows that your SCC test is bottlenecked on
pg_lzcompress, which spends a huge amount of time compressing the
gigantic string of 'a's you've constructed, and that code is exactly
the same with the patch as it is in master. So, I think that any
fluctuations between the patched and unpatched results are just random
variation. There's no reason the patch should be slower with one row
count and faster with a different row count, anyway.
I tried to come up with a better test case that uses a more modest
amount of data, and ended up with this:
-- Setup.
CREATE OR REPLACE FUNCTION randomish_string(integer) RETURNS text AS $$
SELECT string_agg(random()::text, '') FROM generate_series(1, $1);
$$ LANGUAGE sql;
CREATE TABLE source_compressed (a int, b text);
INSERT INTO source_compressed
SELECT g, repeat('a', 2000) FROM generate_series(1, 10000) g;
CREATE TABLE sink_compressed (LIKE source_compressed);
CREATE TABLE source_external (a int, b text);
INSERT INTO source_external
SELECT g, randomish_string(400) FROM generate_series(1, 10000) g;
CREATE TABLE sink_external (LIKE source_external);
CREATE TABLE source_external_uncompressed (a int, b text);
ALTER TABLE source_external_uncompressed ALTER COLUMN b SET STORAGE EXTERNAL;
INSERT INTO source_external_uncompressed
SELECT g, randomish_string(400) FROM generate_series(1, 10000) g;
CREATE TABLE sink_external_uncompressed (LIKE source_external_uncompressed);
ALTER TABLE sink_external_uncompressed ALTER COLUMN b SET STORAGE EXTERNAL;
-- Test.
\timing
TRUNCATE sink_compressed, sink_external, sink_external_uncompressed;
CHECKPOINT;
INSERT INTO sink_compressed SELECT * FROM source_compressed;
INSERT INTO sink_external SELECT * FROM source_external;
INSERT INTO sink_external_uncompressed SELECT * FROM
source_external_uncompressed;
Roughly, on both master and with the patches, the first one takes
about 4.2 seconds, the second 7.5, and the third 1.2. The third one
is the fastest because it doesn't do any compression. Since it does
less irrelevant work than the other two cases, it has the best chance
of showing up any performance regression that the patch has caused --
if any regression existed, I suppose that it would be an increased
per-toast-fetch or per-toast-chunk overhead. However, I can't
reproduce any such regression. My first attempt at testing that case
showed the patch about 1% slower, but that wasn't reliably
reproducible when I did it a bunch more times. So as far as I can
figure, this patch does not regress performance in any
easily-measurable way.
Barring objections, I plan to commit the whole series of patches here
(latest rebase attached). They are not perfect and could likely be
improved in various ways, but I think they are an improvement over
what we have now, and it's not like it's set in stone once it's
committed. We can change it more if we come up with a better idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0004-Rename-attribute-detoasting-functions.patch (+37/-38)
0002-Create-an-API-for-inserting-and-deleting-rows-in-TOA.patch (+493/-358)
0003-Allow-TOAST-tables-to-be-implemented-using-table-AMs.patch (+284/-149)
0001-Split-tuptoaster.c-into-three-separate-files.patch (+2612/-2560)
Hi,
On 2019-08-01 12:23:42 -0400, Robert Haas wrote:
Barring objections, I plan to commit the whole series of patches here
(latest rebase attached). They are not perfect and could likely be
improved in various ways, but I think they are an improvement over
what we have now, and it's not like it's set in stone once it's
committed. We can change it more if we come up with a better idea.
Could you wait until I either had a chance to look again, or until, say,
Monday if I don't get around to it? I'll try to get to it earlier than
that.
Greetings,
Andres Freund
On Thu, Aug 1, 2019 at 1:53 PM Andres Freund <andres@anarazel.de> wrote:
Could you wait until I either had a chance to look again, or until, say,
Monday if I don't get around to it? I'll try to get to it earlier than
that.
Sure, no problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2019-08-01 12:23:42 -0400, Robert Haas wrote:
Roughly, on both master and with the patches, the first one takes
about 4.2 seconds, the second 7.5, and the third 1.2. The third one
is the fastest because it doesn't do any compression. Since it does
less irrelevant work than the other two cases, it has the best chance
of showing up any performance regression that the patch has caused --
if any regression existed, I suppose that it would be an increased
per-toast-fetch or per-toast-chunk overhead. However, I can't
reproduce any such regression. My first attempt at testing that case
showed the patch about 1% slower, but that wasn't reliably
reproducible when I did it a bunch more times. So as far as I can
figure, this patch does not regress performance in any
easily-measurable way.
Hm, those all include writing, right? And for read-only we don't expect
any additional overhead, correct? The write overhead is probably too
large to show a bit of function call overhead - but if so, it'd probably be
on unlogged tables? And with COPY, because that utilizes multi_insert,
which means more toasting in a shorter amount of time?
.oO(why does everyone attach attachments out of order? Is that
a gmail thing?)
From a4c858c75793f0f8aff7914c572a6615ea5babf8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 8 Jul 2019 11:58:05 -0400
Subject: [PATCH 1/4] Split tuptoaster.c into three separate files.

detoast.c/h contain functions required to detoast a datum, partially
or completely, plus a few other utility functions for examining the
size of toasted datums.

toast_internals.c/h contain functions that are used internally to the
TOAST subsystem but which (mostly) do not need to be accessed from
outside.

heaptoast.c/h contains code that is intrinsically specific to the
heap AM, either because it operates on HeapTuples or is based on the
layout of a heap page.

detoast.c and toast_internals.c are placed in
src/backend/access/common rather than src/backend/access/heap. At
present, both files still have dependencies on the heap, but that will
be improved in a future commit.
I wonder if toasting.c should be moved too?
If anybody doesn't know git's --color-moved, it makes patches like this
a lot easier to review.
index 00000000000..582af147ea1
--- /dev/null
+++ b/src/include/access/detoast.h
@@ -0,0 +1,92 @@
+/*-------------------------------------------------------------------------
+ *
+ * detoast.h
+ *	  Access to compressed and external varlena values.

Hm. Does it matter that that also includes stuff like expanded datums?

+ * Copyright (c) 2000-2019, PostgreSQL Global Development Group
+ *
+ * src/include/access/detoast.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DETOAST_H
+#define DETOAST_H

trailing whitespace after "#ifndef DETOAST_H ".
commit 60d51e6510c66f79c51e43fe22730fe050d87854
Author: Robert Haas <rhaas@postgresql.org>
Date: 2019-07-08 12:02:16 -0400

Create an API for inserting and deleting rows in TOAST tables.

This moves much of the non-heap-specific logic from toast_delete and
toast_insert_or_update into helper functions accessible via a new
header, toast_helper.h. Using the functions in this module, a table
AM can implement creation and deletion of TOAST table rows with
much less code duplication than was possible heretofore. Some
table AMs won't want to use the TOAST logic at all, but for those
that do this will make that easier.

Discussion: /messages/by-id/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
Hm. This leaves toast_insert_or_update() as a name. That makes it sound
like it's generic toast code, rather than heap specific?
It's definitely nice how a lot of repetitive code has been deduplicated.
Also makes it easier to see how algorithmically expensive
toast_insert_or_update() is :(.
/*
* Second we look for attributes of attstorage 'x' or 'e' that are still
* inline, and make them external. But skip this if there's no toast
* table to push them to.
*/
while (heap_compute_data_size(tupleDesc,
toast_values, toast_isnull) > maxDataLen &&
rel->rd_rel->reltoastrelid != InvalidOid)
Shouldn't this condition be the other way round?
if (for_compression)
    skip_colflags |= TOASTCOL_INCOMPRESSIBLE;

for (i = 0; i < numAttrs; i++)
{
    Form_pg_attribute att = TupleDescAttr(tupleDesc, i);

    if ((ttc->ttc_attr[i].tai_colflags & skip_colflags) != 0)
        continue;
    if (VARATT_IS_EXTERNAL(DatumGetPointer(ttc->ttc_values[i])))
        continue;       /* can't happen, toast_action would be 'p' */
    if (for_compression &&
        VARATT_IS_COMPRESSED(DatumGetPointer(ttc->ttc_values[i])))
        continue;
    if (check_main && att->attstorage != 'm')
        continue;
    if (!check_main && att->attstorage != 'x' && att->attstorage != 'e')
        continue;

    if (ttc->ttc_attr[i].tai_size > biggest_size)
    {
        biggest_attno = i;
        biggest_size = ttc->ttc_attr[i].tai_size;
    }
}
Couldn't most of these be handled via colflags, instead of having
numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
size check ought to boil down to a single mask test?
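That suggestion can be sketched in standalone form. The flag names and the qualifying function below are hypothetical (prefixed DEMO_), following the review's proposed naming rather than anything in the patches; the point is that a skip mask plus a required mask replaces the chain of per-column if-tests:

```c
#include <assert.h>

/* Per-column state flags, as an enum (easier to see in a debugger). */
typedef enum DemoToastColFlags
{
	DEMO_TOASTCOL_PROCESSED = 0x01,
	DEMO_TOASTCOL_INCOMPRESSIBLE = 0x02,
	DEMO_TOASTCOL_COMPRESSED = 0x04,
	DEMO_TOASTCOL_MAIN = 0x08,	/* attstorage == 'm' */
	DEMO_TOASTCOL_MOVABLE = 0x10	/* attstorage 'x' or 'e' */
} DemoToastColFlags;

/*
 * A column qualifies when it carries none of the disqualifying flags
 * and all of the required ones.  Two mask comparisons replace the
 * chain of individual checks, leaving only the size comparison.
 */
static int
demo_column_qualifies(int colflags, int skip_mask, int need_mask)
{
	return (colflags & skip_mask) == 0 &&
		(colflags & need_mask) == need_mask;
}
```

For a "compress main columns" pass, for instance, skip_mask would be PROCESSED|INCOMPRESSIBLE|COMPRESSED and need_mask would be MAIN.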
extern void toast_tuple_init(ToastTupleContext *ttc);
extern int toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
bool for_compression,
bool check_main);
extern void toast_tuple_try_compression(ToastTupleContext *ttc, int attribute);
extern void toast_tuple_externalize(ToastTupleContext *ttc, int attribute,
int options, int max_chunk_size);
extern void toast_tuple_cleanup(ToastTupleContext *ttc, bool cleanup_toastrel);
I wonder if a better prefix wouldn't be toasting_...
+/*
+ * Information about one column of a tuple being toasted.
+ *
+ * NOTE: toast_action[i] can have these values:
+ *		' '		default handling
+ *		'p'		already processed --- don't touch it
+ *		'x'		incompressible, but OK to move off
+ *
+ * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes
+ * with toast_action[i] different from 'p'.
+ */
+typedef struct
+{
+	struct varlena *tai_oldexternal;
+	int32		tai_size;
+	uint8		tai_colflags;
+} ToastAttrInfo;

I think that comment is outdated?
+/*
+ * Flags indicating the overall state of a TOAST operation.
+ *
+ * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need
+ * to be deleted.
+ *
+ * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be freed.
+ *
+ * TOAST_HAS_NULLS indicates that nulls were found in the tuple being toasted.
+ *
+ * TOAST_NEEDS_CHANGE indicates that a new tuple needs to be built; in other
+ * words, the toaster did something.
+ */
+#define TOAST_NEEDS_DELETE_OLD	0x0001
+#define TOAST_NEEDS_FREE		0x0002
+#define TOAST_HAS_NULLS			0x0004
+#define TOAST_NEEDS_CHANGE		0x0008

I'd make these enums. They're more often accessible in a debugger...
commit 9e4bd383a00e6bb96088666e57673b343049345c
Author: Robert Haas <rhaas@postgresql.org>
Date: 2019-08-01 10:37:02 -0400

Allow TOAST tables to be implemented using table AMs other than heap.

toast_fetch_datum, toast_save_datum, and toast_delete_datum are
adjusted to use tableam rather than heap-specific functions. This
might have some performance impact, but this patch attempts to
mitigate that by restructuring things so that we don't open and close
the toast table and indexes multiple times per tuple.

tableam now exposes an integer value (not a callback) for the
maximum TOAST chunk size, and has a new callback allowing table
AMs to specify the AM that should be used to implement the TOAST
table. Previously, the toast AM was always the same as the table AM.

Patch by me, tested by Prabhat Sabu.

Discussion: /messages/by-id/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
I'm quite unconvinced that making the chunk size specified by the AM is
a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
pg_control etc. It seems a bit dangerous to let AMs provide the size,
without being very clear that any change of the value will make data
inaccessible. It'd be different if the max were only used during
toasting.
I think the *size* checks should be weakened so we check:
1) After each chunk, whether the already assembled chunks exceed the
expected size.
2) After all chunks have been collected, check that the size is exactly
what we expect.
I don't think that removes a meaningful amount of error
checking. Missing tuples etc get detected by the chunk_ids not being
consecutive. The overall size is still verified.
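A minimal sketch of those two checks might look like this (the chunk struct and helper are hypothetical stand-ins; real code would work with toast chunk tuples):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one fetched toast chunk. */
typedef struct ToastChunk
{
	size_t		len;
} ToastChunk;

/*
 * Returns 1 if the chunk sizes pass the weakened checks described above:
 * the running total never exceeds the expected size (check 1), and the
 * final total matches it exactly (check 2).
 */
static int
check_chunk_sizes(const ToastChunk *chunks, int nchunks, size_t expected)
{
	size_t		assembled = 0;

	for (int i = 0; i < nchunks; i++)
	{
		assembled += chunks[i].len;
		if (assembled > expected)
			return 0;			/* check 1 failed: too much data */
	}
	return assembled == expected;	/* check 2: exact total */
}
```

Note that neither check assumes any particular per-chunk size, which is what frees the AM to choose its own.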
The obvious problem with this is the slice fetching logic. For slices
with an offset of 0, it's obviously trivial to implement. For the higher
slice logic, I'd assume we'd have to fetch the first slice by estimating
where the start chunk is based on the currently suggested chunk size, and
restarting the scan earlier/later if not accurate. In most cases it'll
be accurate, so we'd not lose efficiency.
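The estimation step could look roughly like this (a sketch; "suggested_chunk_size" is the hypothetical per-AM value, and a caller would rewind or skip ahead if the guess turns out wrong):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Guess which chunk number a slice starting at 'offset' begins in, given
 * the currently suggested chunk size.  With a truly fixed chunk size this
 * is exact; otherwise it is only a starting point for the scan.
 */
static int
estimate_start_chunk(size_t offset, size_t suggested_chunk_size)
{
	return (int) (offset / suggested_chunk_size);
}

/* Offset of the slice within that estimated chunk. */
static size_t
estimate_chunk_offset(size_t offset, size_t suggested_chunk_size)
{
	return offset % suggested_chunk_size;
}
```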
Greetings,
Andres Freund
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
Hm, those all include writing, right? And for read-only we don't expect
any additional overhead, correct? The write overhead is probably too
large show a bit of function call overhead - but if so, it'd probably be
on unlogged tables? And with COPY, because that utilizes multi_insert,
which means more toasting in a shorter amount of time?
Yes and yes. I guess we could test the unlogged case and with COPY,
but in any realistic case you're still looking for a tiny CPU overhead
in a sea of I/O costs. Even if an extremely short COPY on an unlogged
table regresses slightly, we do not normally reject patches that
improve code quality on the grounds that they add function call
overhead in a few places. Code like this is hard to optimize and
maintain; as you remarked yourself, there are multiple opportunities
to do this stuff better that are hard to see in the current structure.
.oO(why does everyone attach attachments out of order? Is that
a gmail thing?)
Must be.
I wonder if toasting.c should be moved too?
I mean, we could, but I don't really see a reason. It'd just be
moving it from one fairly-generic place to another, and I'd rather
minimize churn.
trailing whitespace after "#ifndef DETOAST_H ".
Will fix.
Hm. This leaves toast_insert_or_update() as a name. That makes it sound
like it's generic toast code, rather than heap specific?
I'll rename it to heap_toast_insert_or_update(). But I think I'll put
that in 0004 with the other renames.
It's definitely nice how a lot of repetitive code has been deduplicated.
Also makes it easier to see how algorithmically expensive
toast_insert_or_update() is :(.
Yep.
Shouldn't this condition be the other way round?
I had to fight pretty hard to stop myself from tinkering with the
algorithm -- this can clearly be done better, but I wanted to make it
match the existing structure as far as possible. It also only needs to
be tested once, not on every loop iteration, so if we're going to
start changing things, we should go further than just swapping the
order of the tests. For now I prefer to draw a line in the sand and
change nothing.
Couldn't most of these be handled via colflags, instead of having
numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
size check ought to boil down to a single mask test?
I'm not really seeing how more flags would significantly simplify this
logic, but I might be missing something.
I wonder if a better prefix wouldn't be toasting_...
I'm open to other votes, but I think it's toast_tuple is more specific
than toasting_ and thus better.
+/*
+ * Information about one column of a tuple being toasted.
+ *
+ * NOTE: toast_action[i] can have these values:
+ *		' '		default handling
+ *		'p'		already processed --- don't touch it
+ *		'x'		incompressible, but OK to move off
+ *
+ * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes with
+ * toast_action[i] different from 'p'.
+ */
+typedef struct
+{
+	struct varlena *tai_oldexternal;
+	int32		tai_size;
+	uint8		tai_colflags;
+} ToastAttrInfo;

I think that comment is outdated?
Oops. Will fix.
+/*
+ * Flags indicating the overall state of a TOAST operation.
+ *
+ * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need
+ * to be deleted.
+ *
+ * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be freed.
+ *
+ * TOAST_HAS_NULLS indicates that nulls were found in the tuple being toasted.
+ *
+ * TOAST_NEEDS_CHANGE indicates that a new tuple needs to be built; in other
+ * words, the toaster did something.
+ */
+#define TOAST_NEEDS_DELETE_OLD	0x0001
+#define TOAST_NEEDS_FREE		0x0002
+#define TOAST_HAS_NULLS			0x0004
+#define TOAST_NEEDS_CHANGE		0x0008

I'd make these enums. They're more often accessible in a debugger...
Ugh, I hate that style. Abusing enums to make flag bits makes my skin
crawl. I always wondered what the appeal was -- I guess now I know.
Blech.
I'm quite unconvinced that making the chunk size specified by the AM is
a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
pg_control etc. It seems a bit dangerous to let AMs provide the size,
without being very clear that any change of the value will make data
inaccessible. It'd be different if the max were only used during
toasting.
I was actually thinking about proposing that we rip
TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made
here to make this something that users could configure, and I don't
know of a good reason to configure it. It also seems pretty out of
place in a world where there are multiple AMs floating around -- why
should heap, and only heap, be checked there? Granted it does have
some pride of place, but still.
I think the *size* checks should be weakened so we check:
1) After each chunk, whether the already assembled chunks exceed the
expected size.
2) After all chunks have been collected, check that the size is exactly
what we expect.

I don't think that removes a meaningful amount of error
checking. Missing tuples etc get detected by the chunk_ids not being
consecutive. The overall size is still verified.

The obvious problem with this is the slice fetching logic. For slices
with an offset of 0, it's obviously trivial to implement. For the higher
slice logic, I'd assume we'd have to fetch the first slice by estimating
where the start chunk is based on the currently suggested chunk size, and
restarting the scan earlier/later if not accurate. In most cases it'll
be accurate, so we'd not lose efficiency.
I don't feel entirely convinced that there's any rush to do all of
this right now, and the more I change the harder it is to make sure
that I haven't broken anything. How strongly do you feel about this
stuff?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I don't feel entirely convinced that there's any rush to do all of
this right now, and the more I change the harder it is to make sure
that I haven't broken anything. How strongly do you feel about this
stuff?
FWIW, I agree with your comment further up that this patch ought to
just refactor, not change any algorithms. The latter can be done
in separate patches.
regards, tom lane
Hi,
On 2019-08-05 15:36:59 -0400, Robert Haas wrote:
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
Hm. This leaves toast_insert_or_update() as a name. That makes it sound
like it's generic toast code, rather than heap specific?

I'll rename it to heap_toast_insert_or_update(). But I think I'll put
that in 0004 with the other renames.
Makes sense.
It's definitely nice how a lot of repetitive code has been deduplicated.
Also makes it easier to see how algorithmically expensive
toast_insert_or_update() is :(.

Yep.
Shouldn't this condition be the other way round?
I had to fight pretty hard to stop myself from tinkering with the
algorithm -- this can clearly be done better, but I wanted to make it
match the existing structure as far as possible. It also only needs to
be tested once, not on every loop iteration, so if we're going to
start changing things, we should go further than just swapping the
order of the tests. For now I prefer to draw a line in the sand and
change nothing.
Makes sense.
Couldn't most of these be handled via colflags, instead of having
numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
size check ought to boil down to a single mask test?

I'm not really seeing how more flags would significantly simplify this
logic, but I might be missing something.
Well, right now you have a number of ifs for each attribute. If you
encoded all the parameters into flags, you could change that to a single
flag test - as far as I can tell, all the parameters could be
represented as that, if you moved MAIN etc into flags. A single if for
flags (and then the size check) is cheaper than several separate checks.
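For illustration, the single-mask version being described might look like this (the flag names are taken from the message above; their exact semantics are assumed):

```c
#include <assert.h>

/* Hypothetical per-column flag bits, per the suggestion above. */
enum
{
	TOASTCOL_COMPRESSED = 1 << 0,
	TOASTCOL_IGNORED = 1 << 1,
	TOASTCOL_MAIN = 1 << 2,
	TOASTCOL_EXTERNAL = 1 << 3
};

/*
 * Several separate "if" tests collapse into one mask comparison: a column
 * is a candidate only if none of the disqualifying bits are set.  The
 * size check would still be performed separately.
 */
static int
column_is_candidate(unsigned colflags)
{
	return (colflags & (TOASTCOL_IGNORED | TOASTCOL_MAIN |
						TOASTCOL_EXTERNAL)) == 0;
}
```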
I'm quite unconvinced that making the chunk size specified by the AM is
a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
pg_control etc. It seems a bit dangerous to let AMs provide the size,
without being very clear that any change of the value will make data
inaccessible. It'd be different if the max were only used during
toasting.

I was actually thinking about proposing that we rip
TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made
here to make this something that users could configure, and I don't
know of a good reason to configure it. It also seems pretty out of
place in a world where there are multiple AMs floating around -- why
should heap, and only heap, be checked there? Granted it does have
some pride of place, but still.
I think the *size* checks should be weakened so we check:
1) After each chunk, whether the already assembled chunks exceed the
expected size.
2) After all chunks have been collected, check that the size is exactly
what we expect.

I don't think that removes a meaningful amount of error
checking. Missing tuples etc get detected by the chunk_ids not being
consecutive. The overall size is still verified.

The obvious problem with this is the slice fetching logic. For slices
with an offset of 0, it's obviously trivial to implement. For the higher
slice logic, I'd assume we'd have to fetch the first slice by estimating
where the start chunk is based on the currently suggested chunk size, and
restarting the scan earlier/later if not accurate. In most cases it'll
be accurate, so we'd not lose efficiency.

I don't feel entirely convinced that there's any rush to do all of
this right now, and the more I change the harder it is to make sure
that I haven't broken anything. How strongly do you feel about this
stuff?
I think we either should leave the hardcoded limit in place, or make it
actually not fixed. Ripping-it-out-but-not-actually just seems like a
trap for the unwary, without much point.
Greetings,
Andres Freund
On 2019-Aug-05, Robert Haas wrote:
Shouldn't this condition be the other way round?
I had to fight pretty hard to stop myself from tinkering with the
algorithm -- this can clearly be done better, but I wanted to make it
match the existing structure as far as possible. It also only needs to
be tested once, not on every loop iteration, so if we're going to
start changing things, we should go further than just swapping the
order of the tests. For now I prefer to draw a line in the sand and
change nothing.
I agree, and can we move forward with this 0001? The idea here is to
change no code (as also suggested by Tom elsewhere), and it's the
largest patch in this series by a mile. I checked --color-moved=zebra
and I think the patch looks fine, and also it compiles fine. I ran
src/tools/pginclude/headerscheck on it and found no complaints.
So here's a rebased version, where the DETOAST_H whitespace has been
removed. No other changes from your original. Will you please push
soon?
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Split-tuptoaster.c-into-three-separate-files.patch (text/x-diff, +2612/-2558)
On Thu, Sep 5, 2019 at 10:52 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I agree, and can we move forward with this 0001? The idea here is to
change no code (as also suggested by Tom elsewhere), and it's the
largest patch in this series by a mile. I checked --color-moved=zebra
and I think the patch looks fine, and also it compiles fine. I ran
src/tools/pginclude/headerscheck on it and found no complaints.

So here's a rebased version, where the DETOAST_H whitespace has been
removed. No other changes from your original. Will you please push
soon?
Done, thanks. Here's the rest again with the additional rename added
to 0003 (formerly 0004). I think it's probably OK to go ahead with
that stuff, too, but I'll wait a bit to see if anyone wants to raise
more objections.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v5-0002-Allow-TOAST-tables-to-be-implemented-using-table-.patch (+284/-149)
v5-0001-Create-an-API-for-inserting-and-deleting-rows-in-.patch (+493/-358)
v5-0003-Rename-some-toasting-functions-based-on-whether-t.patch (+50/-51)
Hi,
On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
Done, thanks. Here's the rest again with the additional rename added
to 0003 (formerly 0004). I think it's probably OK to go ahead with
that stuff, too, but I'll wait a bit to see if anyone wants to raise
more objections.
Well, I still dislike making the toast chunk size configurable in a
halfhearted manner.
Greetings,
Andres Freund
On Thu, Sep 5, 2019 at 3:10 PM Andres Freund <andres@anarazel.de> wrote:
On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
Done, thanks. Here's the rest again with the additional rename added
to 0003 (formerly 0004). I think it's probably OK to go ahead with
that stuff, too, but I'll wait a bit to see if anyone wants to raise
more objections.

Well, I still dislike making the toast chunk size configurable in a
halfhearted manner.
So, I'd be willing to look into that some more. But how about if I
commit the next patch in the series first? I think this comment is
really about the second patch in the series, "Allow TOAST tables to be
implemented using table AMs other than heap," and it's fair to point
out that, since that patch extends table AM, we're somewhat committed
to it once we put it in. But "Create an API for inserting and
deleting rows in TOAST tables." is just refactoring, and I don't see
what we gain from waiting to commit that part.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes:
Well, I still dislike making the toast chunk size configurable in a
halfhearted manner.
It's hard to make it fully configurable without breaking our on-disk
storage, because of the lack of any explicit representation of the chunk
size in TOAST data. You have to "just know" how big the chunks are
supposed to be.
However, it's reasonable to ask why we should treat it as an AM property,
especially a fixed AM property as this has it. If somebody does
reimplement toast logic in some other AM, they might well decide it's
worth the storage cost to be more flexible about the chunk size ... but
too bad, this design won't let them do it.
I don't entirely understand why relation_toast_am is a callback
while toast_max_chunk_size isn't, either. Why would they not both
be callbacks? That would at least let an AM set a per-relation
max chunk size, if it wanted.
It seems like this design throws away most of the benefit of a fixed
chunk size (mostly, being able to do relevant modulo arithmetic with
shifts and masks rather than full-fledged integer division) without
getting much of anything in return.
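To make that cost concrete, here is a sketch of the difference (the constant 1996 is roughly the default TOAST_MAX_CHUNK_SIZE with 8kB pages; the point holds for any compile-time constant, power of two or not, since the compiler strength-reduces constant division either way):

```c
#include <assert.h>
#include <stddef.h>

#define FIXED_CHUNK_SIZE 1996	/* compile-time constant, as today */

/*
 * With a constant divisor, the compiler replaces the division with
 * multiply-and-shift tricks; no real divide instruction is emitted.
 */
static int
chunk_number_fixed(size_t offset)
{
	return (int) (offset / FIXED_CHUNK_SIZE);
}

/*
 * With a per-relation divisor obtained from a callback, the divisor is
 * known only at run time, so a genuine integer division is required.
 */
static int
chunk_number_dynamic(size_t offset, size_t chunk_size)
{
	return (int) (offset / chunk_size);
}
```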
regards, tom lane