Terrible performance on wide selects
I have a table which is rather wide (~800 columns) and consists of a few
columns of identifying data (run time, channel and such) and up to several
hundred columns of collected data (no, normalization does not suggest putting
collected data in another table - collected item 1 always corresponds to
collected item 1 but is completely different from item 3).
My test table is very short (62 rows) but in production would grow by several
thousand rows per day. Unfortunately if my test data is correct, performance
on wide selects is so bad that it will render the system unusable.
Here's the test. I have created two versions of the table - one stores the
collected data in an array of text and the other stores the data in
individual columns, no joins, no indexes. Times are averages of many runs -
the times varied very little and the data is small enough that I'm sure it
was served from RAM. Postgres CPU utilization observed on the longer runs was
98-99%. Changing the output format didn't seem to change things significantly.
Times for selecting all the columns in the table:

  select * from columnversion;        8,000 ms
  select * from arrayversion;           110 ms
  select * from arraytocolumnview;   10,000 ms
    (data stored in the array version but converted to columns in the view)

Times to select a single column from the table:

  select runstarttime from columnversion;   32 ms
  select runstarttime from arrayversion;     6 ms
So the question is, does it seem reasonable that a query on fundamentally
identical data should take 70-90 times as long when displayed as individual
columns vs. when output as a raw array and, more importantly, what can I do to
get acceptable performance on this query?
Cheers,
Steve
Steve Crawford <scrawford@pinpointresearch.com> writes:
So the question is, does it seem reasonable that a query on fundamentally
identical data should take 70-90 times as long when displayed as individual
columns vs. when output as a raw array and, more importantly, what can I do to
get acceptable performance on this query?
There are undoubtedly some places that are O(N^2) in the number of
targetlist items. Feel free to do some profiling to identify them.
It probably won't be real hard to fix 'em once identified.
regards, tom lane
Steve Crawford sent me some profiling results for queries involving wide
tuples (hundreds of columns).
Done, results attached. nocachegetattr seems to be the likely suspect.
Yipes, you can say that again.
  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
93.38     26.81      26.81   885688     0.03     0.03  nocachegetattr

               0.00    0.00       1/885688     heapgettup [159]
               0.00    0.00       1/885688     CatalogCacheComputeTupleHashValue [248]
               0.00    0.00       5/885688     SearchCatCache [22]
              13.40    0.00  442840/885688     ExecEvalVar [20]
              13.40    0.00  442841/885688     printtup [12]
[11]    93.4  26.81    0.00  885688             nocachegetattr [11]
Half of the calls are coming from printtup(), which seems relatively easy
to fix.
/*
 * send the attributes of this tuple
 */
for (i = 0; i < natts; ++i)
{
    ...
    origattr = heap_getattr(tuple, i + 1, typeinfo, &isnull);
    ...
}
The trouble here is that in the presence of variable-width fields,
heap_getattr requires a linear scan over the tuple --- and so the total
time spent in it is O(N^2) in the number of fields.
What we could do is reinstitute heap_deformtuple() as the inverse of
heap_formtuple() --- but make it extract Datums for all the columns in
a single pass over the tuple. This would reduce the time in printtup()
from O(N^2) to O(N), which would pretty much wipe out that part of the
problem.
The other half of the calls are coming from ExecEvalVar, which is a
harder problem to solve, since those calls are scattered all over the
place. It's harder to see how to get them to share work. Any ideas
out there?
regards, tom lane
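To make the O(N^2)-versus-O(N) point above concrete, here is a minimal,
self-contained sketch. It is not PostgreSQL source: the length-prefixed tuple
layout and every name in it are invented for illustration. It contrasts a
per-attribute getter that rescans the tuple from the start (the behaviour
nocachegetattr falls back to when offsets cannot be cached) with a
heap_deformtuple-style single pass that locates all attributes at once.

/*
 * Toy model (not PostgreSQL source): each attribute is stored as a 4-byte
 * length followed by its bytes, so attribute N can only be located by
 * walking over attributes 0..N-1, just as with variable-width heap columns.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* O(attno) per call: walk from the start of the tuple every time. */
static const char *
getattr_slow(const char *tup, int attno)
{
    const char *p = tup;

    for (int i = 0; i < attno; i++)
    {
        uint32_t len;

        memcpy(&len, p, sizeof(len));
        p += sizeof(len) + len;        /* skip one attribute */
    }
    return p;                          /* points at attno's length word */
}

/* O(natts) total: locate every attribute in a single pass. */
static void
deform_all(const char *tup, int natts, const char **attrs)
{
    const char *p = tup;

    for (int i = 0; i < natts; i++)
    {
        uint32_t len;

        attrs[i] = p;
        memcpy(&len, p, sizeof(len));
        p += sizeof(len) + len;
    }
}

int
main(void)
{
    enum { NATTS = 800, WIDTH = 8 };
    char *tup = malloc(NATTS * (sizeof(uint32_t) + WIDTH));
    const char **attrs = malloc(NATTS * sizeof(*attrs));
    char *p = tup;

    for (int i = 0; i < NATTS; i++)    /* build a fake 800-column tuple */
    {
        uint32_t len = WIDTH;

        memcpy(p, &len, sizeof(len));
        memset(p + sizeof(len), 'x', len);
        p += sizeof(len) + len;
    }

    /* printtup-style loop using the per-attribute getter: O(NATTS^2) work. */
    for (int i = 0; i < NATTS; i++)
        (void) getattr_slow(tup, i);

    /* The same information extracted in one pass: O(NATTS) work. */
    deform_all(tup, NATTS, attrs);

    printf("deformed %d attributes\n", NATTS);
    free(attrs);
    free(tup);
    return 0;
}

For an 800-column tuple the per-attribute loop performs roughly 320,000
attribute skips, while the single pass performs 800; that difference is
essentially the printtup fix being proposed.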
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, January 22, 2003 3:15 PM
To: Steve Crawford
Cc: pgsql-performance@postgreSQL.org; pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Terrible performance on wide selects

The other half of the calls are coming from ExecEvalVar, which is a
harder problem to solve, since those calls are scattered all over the
place. It's harder to see how to get them to share work. Any ideas
out there?
Is it possible that the needed information could be retrieved by
querying the system metadata to collect the column information?
Once the required tuple attributes are described, it could form a
binding list that allocates a buffer of sufficient size with pointers to
the required column start points.
Maybe I don't really understand the problem, but it seems simple enough
to do it once for the whole query.
If this is utter stupidity, please disregard and have a hearty laugh at
my expense.
;-)
"Dann Corbit" <DCorbit@connx.com> writes:
Maybe I don't really understand the problem, but it seems simple enough
to do it once for the whole query.
We already do cache column offsets when they are fixed. The code that's
the problem executes when there's a variable-width column in the table
--- which means that all columns to its right are not at fixed offsets,
and have to be scanned for separately in each tuple, AFAICS.
regards, tom lane
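As a rough illustration of that rule, here is a hedged sketch of how the
cached-offset fast path degrades (not the real fastgetattr/nocachegetattr
code; attlen and attcacheoff merely echo the pg_attribute field names, and
alignment padding is ignored). Offsets stay cacheable only up to and including
the first variable-width column; everything after it has to be located by
scanning each individual tuple.

/*
 * Simplified sketch, not PostgreSQL source.  For a descriptor like
 * (int4, int4, text, int4) the first three columns get cached offsets,
 * but the trailing int4 sits after a text column and so must be found
 * by walking each tuple.
 */
typedef struct
{
    int attlen;       /* > 0 for fixed-width types, -1 for variable width */
    int attcacheoff;  /* cached byte offset, or -1 if it cannot be cached */
} AttrSketch;

static void
compute_cacheable_offsets(AttrSketch *atts, int natts)
{
    int off = 0;      /* running offset; -1 once it is no longer knowable */

    for (int i = 0; i < natts; i++)
    {
        atts[i].attcacheoff = off;      /* -1 means "scan every tuple" */

        if (off >= 0 && atts[i].attlen > 0)
            off += atts[i].attlen;      /* next column is still at a fixed spot */
        else
            off = -1;                   /* variable width: later columns float */
    }
}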
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, January 22, 2003 4:04 PM
To: Dann Corbit
Cc: Steve Crawford; pgsql-performance@postgreSQL.org; pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Terrible performance on wide selects

We already do cache column offsets when they are fixed. The code that's
the problem executes when there's a variable-width column in the table
--- which means that all columns to its right are not at fixed offsets,
and have to be scanned for separately in each tuple, AFAICS.
Why not waste a bit of memory and make the row buffer the maximum
possible length?
E.g. for varchar(2000) allocate 2000 characters + size element and point
to the start of that thing.
If we have 64K rows, even at that it is a pittance. If someone designs
10,000 row tables, then it will allocate an annoyingly large block of
memory, but bad designs are always going to cause a fuss.
"Dann Corbit" <DCorbit@connx.com> writes:
Why not waste a bit of memory and make the row buffer the maximum
possible length?
E.g. for varchar(2000) allocate 2000 characters + size element and point
to the start of that thing.
Surely you're not proposing that we store data on disk that way.
The real issue here is avoiding overhead while extracting columns out of
a stored tuple. We could perhaps use a different, less space-efficient
format for temporary tuples in memory than we do on disk, but I don't
think that will help a lot. The nature of O(N^2) bottlenecks is you
have to kill them all --- for example, if we fix printtup and don't do
anything with ExecEvalVar, we can't do more than double the speed of
Steve's example, so it'll still be slow. So we must have a solution for
the case where we are disassembling a stored tuple, anyway.
I have been sitting here toying with a related idea, which is to use the
heap_deformtuple code I suggested before to form an array of pointers to
Datums in a specific tuple (we could probably use the TupleTableSlot
mechanisms to manage the memory for these). Then subsequent accesses to
individual columns would just need an array-index operation, not a
nocachegetattr call. The trick with that would be that if only a few
columns are needed out of a row, it might be a net loss to compute the
Datum values for all columns. How could we avoid slowing that case down
while making the wide-tuple case faster?
regards, tom lane
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, January 22, 2003 4:18 PM
To: Dann Corbit
Cc: Steve Crawford; pgsql-performance@postgreSQL.org; pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Terrible performance on wide selects

Surely you're not proposing that we store data on disk that way.
The real issue here is avoiding overhead while extracting columns out of
a stored tuple. ... So we must have a solution for the case where we are
disassembling a stored tuple, anyway.
For the disk case, why not have the start of the record contain an array
of offsets to the start of the data for each column? It would only be
necessary to have a list for variable fields.
So (for instance) if you have 12 variable fields, you would store 12
integers at the start of the record.
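Concretely, the record layout being proposed might look something like this
sketch (purely illustrative; the struct and field names are invented here and
this is not PostgreSQL's actual on-disk format): an offset array at the front
of each record, one entry per variable-width field, so any column can be found
without walking the columns before it.

#include <stdint.h>

/*
 * Hypothetical layout for a record with 12 variable-width fields
 * (illustration of the proposal only).  Offsets are measured from the
 * start of the record.
 */
typedef struct
{
    uint16_t nvaroffs;     /* number of variable-width fields, here 12 */
    uint16_t varoff[12];   /* start offset of each variable-width field */
    /* fixed-width column data follows, then the variable-width data */
} WideRecordSketch;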
[snip]
So (for instance) if you have 12 variable fields, you would
store 12 integers at the start of the record.
Additionally, you could implicitly size the integers from the properties
of the column. A varchar(255) would only need an unsigned char to store
the offset, but a varchar(80000) would require an unsigned int.
"Dann Corbit" <DCorbit@connx.com> writes:
For the disk case, why not have the start of the record contain an array
of offsets to the start of the data for each column? It would only be
necessary to have a list for variable fields.
No, you'd need an entry for *every* column (or at least, every one to
the right of the first variable-width column or NULL). That's a lot of
overhead, especially in comparison to datatypes like bool or int4 ...
regards, tom lane
[snip]
For the disk case, why not have the start of the record contain an array
of offsets to the start of the data for each column? It would only be
necessary to have a list for variable fields. So (for instance) if you
have 12 variable fields, you would store 12 integers at the start of the
record.
You have to store this information anyway (for variable length objects).
By storing it at the front of the record you would lose nothing (except
the logical coupling of an object with its length). But I would think
that it would not consume any additional storage.
Tom Lane kirjutas N, 23.01.2003 kell 02:18:
I have been sitting here toying with a related idea, which is to use the
heap_deformtuple code I suggested before to form an array of pointers to
Datums in a specific tuple (we could probably use the TupleTableSlot
mechanisms to manage the memory for these). Then subsequent accesses to
individual columns would just need an array-index operation, not a
nocachegetattr call. The trick with that would be that if only a few
columns are needed out of a row, it might be a net loss to compute the
Datum values for all columns. How could we avoid slowing that case down
while making the wide-tuple case faster?
make the pointer array incrementally for O(N) performance:
i.e. for a tuple with 100 cols, allocate an array of 100 pointers, plus
keep count of how many are actually valid,
so the first call to get col[5] will fill the first 5 positions in the
array, save the number 5, and then access tuple[ptrarray[5]]
the next call to get col[75] will start from col[5] and fill up to col[75]
the next call to col[76] will start from col[75] and fill up to col[76]
the next call to col[60] will just get tuple[ptrarray[60]]
the above description assumes 1-based non-C arrays ;)
--
Hannu Krosing <hannu@tm.ee>
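A rough, self-contained sketch of that incremental scheme follows (an
illustration only: it reuses the toy length-prefixed layout from the earlier
sketch rather than the real heap tuple format, uses 0-based attribute numbers,
and all names are invented). The offset array is extended only as far as the
highest column requested so far, so touching a few leading columns stays cheap
while access to many columns costs O(N) in total rather than O(N^2).

#include <stdint.h>
#include <string.h>

#define MAX_ATTS 800

/* Toy per-tuple state: offsets[] is filled lazily, left to right. */
typedef struct
{
    const char *tup;                /* start of the (toy) tuple data */
    int         natts;              /* attributes in the tuple */
    int         nvalid;             /* how many leading offsets are filled in */
    uint32_t    offsets[MAX_ATTS];  /* byte offset of each attribute's length word */
} DeformState;

static void
deform_init(DeformState *st, const char *tup, int natts)
{
    st->tup = tup;
    st->natts = natts;
    st->nvalid = 0;
}

/* Return a pointer to attribute attno (0-based), extending the cache as needed. */
static const char *
deform_getattr(DeformState *st, int attno)
{
    if (attno >= st->nvalid)
    {
        uint32_t off = 0;

        /* Resume just past the last attribute already located, not from zero. */
        if (st->nvalid > 0)
        {
            uint32_t len;

            off = st->offsets[st->nvalid - 1];
            memcpy(&len, st->tup + off, sizeof(len));
            off += sizeof(len) + len;
        }

        while (st->nvalid <= attno)
        {
            uint32_t len;

            st->offsets[st->nvalid++] = off;
            memcpy(&len, st->tup + off, sizeof(len));
            off += sizeof(len) + len;
        }
    }
    /* Already-located attributes are now a plain array lookup. */
    return st->tup + st->offsets[attno];
}

Asking for col 5, then col 75, then col 60 does one short pass up to 5, one
incremental pass from 6 to 75, and then a plain array lookup, matching the
behaviour described above.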
Dann Corbit kirjutas N, 23.01.2003 kell 02:39:
You have to store this information anyway (for variable length objects).
By storing it at the front of the record you would lose nothing (except
the logical coupling of an object with its length). But I would think
that it would not consume any additional storage.
I don't think it will win much either (except for possible cache
locality with really huge page sizes), as the problem is _not_ scanning
over big strings finding their end marker, but instead is chasing long
chains of pointers.
There could be some merit in the idea of storing at the beginning of the
tuple all pointers starting with the first varlen field (a 16-bit int
should be enough), so people can minimize the overhead by moving fixlen
fields to the beginning. Once we have this setup, we no longer need the
varlen lengths /stored/ together with the field data.
This adds the complexity of converting from (len,data) to (ptr,...,data)
when constructing the tuple: a tuple (int,int,int,varchar,varchar),
which is currently stored as
(intdata1, intdata2, intdata3, (len4, vardata4), (len5, vardata5))
would be rewritten on storage to
(ptr4, ptr5), (intdata1, intdata2, intdata3, vardata4, vardata5)
but it seems to solve the O(N) problem quite nicely (and forces no
storage growth for tuples with fixlen fields at the beginning of the
tuple). And we must also account for NULL fields in the calculations.
--
Hannu Krosing <hannu@tm.ee>
Tom Lane kirjutas N, 23.01.2003 kell 02:04:
"Dann Corbit" <DCorbit@connx.com> writes:
Maybe I don't really understand the problem, but it seems simple enough
to do it once for the whole query.We already do cache column offsets when they are fixed. The code that's the problem executes when there's a variable-width column in the table --- which means that all columns to its right are not at fixed offsets, and have to be scanned for separately in each tuple, AFAICS.
Not only varlen columns, but also NULL columns forbid knowing the
offsets beforehand.
--
Hannu Krosing <hannu@tm.ee>
Hannu Krosing kirjutas N, 23.01.2003 kell 12:11:
make the pointer array incrementally for O(N) performance:
i.e. for a tuple with 100 cols, allocate an array of 100 pointers, plus
keep count of how many are actually valid,
Additionally, this should also make repeated determination of NULL fields
faster - just put a NULL pointer in and voila - no more bit-shifting and
AND-ing to find out if the field is null.
One has to check the NULL bitmap on the first pass anyway.
--
Hannu Krosing <hannu@tm.ee>
Hannu Krosing said:
Not only varlen columns, but also NULL columns forbid knowing the
offsets beforehand.
Does this mean that constructing tables where fixed length fields are
'before' variable length fields and 'possibly null' fields might increase
performance?
Daniel
Hannu Krosing <hannu@tm.ee> writes:
i.e. for a tuple with 100 cols, allocate an array of 100 pointers, plus
keep count of how many are actually valid,
Additionally, this should also make repeated determination of NULL fields
faster - just put a NULL pointer in and voila - no more bit-shifting and
AND-ing to find out if the field is null.
Right, the output of the operation would be a pair of arrays: Datum
values and is-null flags. (NULL pointers don't work for pass-by-value
datatypes.)
I like the idea of keeping track of a last-known-column position and
incrementally extending that as needed.
I think the way to manage this is to add the overhead data (the output
arrays and last-column state) to TupleTableSlots. Then we'd have
a routine similar to heap_getattr except that it takes a TupleTableSlot
and makes use of the extra state data. The infrastructure to manage
the state data is already in place: for example, ExecStoreTuple would
reset the last-known-column to 0, ExecSetSlotDescriptor would be
responsible for allocating the output arrays using the natts value from
the provided tupdesc, etc.
This wouldn't help for accesses that are not in the context of a slot,
but certainly all the ones from ExecEvalVar are. The executor always
works with tuples stored in slots, so I think we could fix all the
high-traffic cases this way.
regards, tom lane
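In outline, that slot-based design might look something like the following
sketch (a hedged illustration, not PostgreSQL source: the struct and routine
names only echo the real executor ones, Datum is reduced to a placeholder
typedef, and slot_deform_upto stands in for an incremental deformer along the
lines sketched earlier in the thread).

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;                      /* placeholder for the real Datum */
typedef struct HeapTupleData HeapTupleData;   /* treated as opaque here */

/* Sketch of the extra per-slot state described above. */
typedef struct
{
    HeapTupleData *tuple;   /* the stored tuple */
    int            natts;   /* from the slot's tuple descriptor */
    Datum         *values;  /* deformed attribute values, natts entries */
    bool          *isnull;  /* per-attribute null flags, natts entries */
    int            nvalid;  /* leading attributes already deformed */
} SlotSketch;

/* Incremental single-pass deformer, e.g. as sketched earlier in the thread. */
extern void slot_deform_upto(SlotSketch *slot, int attnum);

/* ExecStoreTuple-style: storing a new tuple invalidates the deformed state. */
static inline void
slot_store_tuple(SlotSketch *slot, HeapTupleData *tuple)
{
    slot->tuple = tuple;
    slot->nvalid = 0;
}

/* heap_getattr-style access that uses and extends the per-slot cache. */
static inline Datum
slot_getattr_sketch(SlotSketch *slot, int attnum, bool *isnull)
{
    if (attnum > slot->nvalid)
        slot_deform_upto(slot, attnum);   /* deform only the newly needed columns */

    *isnull = slot->isnull[attnum - 1];   /* attnum is 1-based, like heap_getattr */
    return slot->values[attnum - 1];
}

ExecSetSlotDescriptor would size values[] and isnull[] from the descriptor's
natts, and ExecEvalVar-style accesses would call the slot routine instead of
heap_getattr, so the linear scans are shared across all columns fetched from
the same tuple.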
Hannu Krosing <hannu@tm.ee> writes:
as tuple (int,int,int,varchar,varchar)
which is currently stored as
(intdata1, intdata2, intdata3, (len4, vardata4), (len5,vardata5))
should be rewritten on storage to
(ptr4,ptr5),(intdata1, intdata2, intdata3, vardata4,vardata5)
I do not see that this buys anything at all. heap_getattr still has to
make essentially the same calculation as before to determine column
locations, namely adding up column widths. All you've done is move the
data that it has to fetch to make the calculation. If anything, this
will be slower not faster, because now heap_getattr has to keep track
of two positions not one --- not just the next column offset, but also
the index of the next "ptr" to use. In the existing method it only
needs the column offset, because that's exactly where it can pick up
the next length from.
But the really serious objection is that the datatype functions that
access the data would now also need to be passed two pointers, since
after all they would like to know the length too. That breaks APIs
far and wide :-(
regards, tom lane
Daniel Kalchev <daniel@digsys.bg> writes:
Does this mean that constructing tables where fixed length fields are
'before' variable length fields and 'possibly null' fields might increase
performance?
There'd have to be no nulls, period, to get any useful performance
difference --- but yes, in theory putting fixed-length columns before
variable-length ones is a win.
I wouldn't bother going out to rearrange your schemas though ... at
least not before you do some tests to prove that it's worthwhile.
regards, tom lane
On Thu, 23 Jan 2003, Daniel Kalchev wrote:
Does this mean that constructing tables where fixed length fields are
'before' variable length fields and 'possibly null' fields might increase
performance?
This, I believe, is why DB2 always puts (in physical storage) all of the
fixed-length fields before the variable-length fields.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Dann Corbit kirjutas N, 23.01.2003 kell 02:22:
[snip]
Additionally, you could implicitly size the integers from the properties
of the column. A varchar(255) would only need an unsigned char to store
the offset, but a varchar(80000) would require an unsigned int.
I guess that the pointer could always be 16-bit, as the offset inside a
tuple will never be more (other issues constrain max page size to 32K).
varchar(80000) will use TOAST (another file) anyway, but this will be
hidden inside the field storage in the page.
--
Hannu Krosing <hannu@tm.ee>
Added to TODO:
* Cache last known per-tuple offsets to speed long tuple access
---------------------------------------------------------------------------
Tom Lane wrote:
I like the idea of keeping track of a last-known-column position and
incrementally extending that as needed.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073