Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

Started by Peter Geogheganover 10 years ago10 messageshackers
Jump to latest

We lack SortSupport for many character-like-type cases. In full, the
cases within the core system are:

* char(n) opfamily (bpchar_ops).

* text_pattern_ops opfamily (includes text and varchar "pattern"
opclasses, which are generally recommended for accelerating LIKE
operator queries).

* bpchar_pattern_ops -- the operator family/class for char(n), used
where "pattern" style indexing is required for the char(n) type.

* bytea default opclass. This is a type that, like the others, shares
its representation with text (a varlena header and some data bytes --
a string, essentially). Its comparator already behaves identically to
that of the text comparator when the "C" collation is used. I've
actually seen a specific request for this [1]https://lwn.net/Articles/653721/ -- Peter Geoghegan.

These cases do matter. For one thing, even if they're less important
than the default text/varchar opclasses, having such large
inconsistencies in character type sort performance is a fairly major
POLA violation; opclasses like text_pattern_ops are *supposed* to be
faster though less capable alternatives to the defaults. For another,
char(n) is in the SQL standard, which may be why all TPC benchmarks
use char(n) for columns that are sorted on or used for grouping.
char(n) sorting can be made about 8x faster with
SortSupport/abbreviation, making it the best candidate for
abbreviation optimization that I've ever seen. It would be regrettable
if we accidentally lost a benchmark due to not having char(n)
SortSupport.

Attached patch adds SortSupport for all of the cases listed above.

I did not bother doing anything with contrib/citext, because I think
it's not worth it. I think that we should definitely invest in case
insensitive collations, and retire contrib/citext. The fact that the
*default* collation (and not the input collation) is passed by
citextcmp()'s call to str_tolower() suggests to me that the only way
to make citext do the right thing is to basically introduce case
insensitive collations to Postgres. Of course, doing so would make
contrib/citext obsolete.

I also didn't bother extending the name type's SortSupport (something
that has existed since 9.2) to perform abbreviation, although that
wouldn't be very hard. I saw no point.

I think I might also get around to adding abbreviated key support for
the network address types during the 9.6 cycle. That seems like
something a less experienced contributor could easily pick up, though
-- if anyone wants to take it off my hands, please do so.

[1]: https://lwn.net/Articles/653721/ -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

0001-Generalize-SortSupport-for-text.patchtext/x-patch; charset=US-ASCII; name=0001-Generalize-SortSupport-for-text.patchDownload+303-133
In reply to: Peter Geoghegan (#1)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

On Thu, Nov 12, 2015 at 4:38 PM, Peter Geoghegan <pg@heroku.com> wrote:

* bytea default opclass. This is a type that, like the others, shares
its representation with text (a varlena header and some data bytes --
a string, essentially). Its comparator already behaves identically to
that of the text comparator when the "C" collation is used. I've
actually seen a specific request for this [1].

I probably should have given an explanation for why it's okay that NUL
bytes can exist in strings from the point of view of the generalized
SortSupport for text worker function. The next revision will have
comments along those lines.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#2)
Re: Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

Peter Geoghegan wrote:

On Thu, Nov 12, 2015 at 4:38 PM, Peter Geoghegan <pg@heroku.com> wrote:

* bytea default opclass. This is a type that, like the others, shares
its representation with text (a varlena header and some data bytes --
a string, essentially). Its comparator already behaves identically to
that of the text comparator when the "C" collation is used. I've
actually seen a specific request for this [1].

I probably should have given an explanation for why it's okay that NUL
bytes can exist in strings from the point of view of the generalized
SortSupport for text worker function. The next revision will have
comments along those lines.

Did you ever post "the next revision"?

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#1)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

Peter Geoghegan wrote:

We lack SortSupport for many character-like-type cases. In full, the
cases within the core system are:

You're stealthily introducing a new abstraction called "string",
including a typedef and DatumGetString support macros. Is that really
necessary? Shouldn't it be discussed specifically? I don't necessarily
oppose it as is, mainly because it's limited to within varlena.c for
now, but I'm not sure it'd okay to make it more public.

Also, there's a lot of churn in this patch to just remove tss to sss.
Can't we just keep the original variable name?

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Alvaro Herrera (#4)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

On Thu, Jan 7, 2016 at 7:41 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

You're stealthily introducing a new abstraction called "string",
including a typedef and DatumGetString support macros. Is that really
necessary? Shouldn't it be discussed specifically? I don't necessarily
oppose it as is, mainly because it's limited to within varlena.c for
now, but I'm not sure it'd okay to make it more public.

Note that a similar abstraction for the "unknown" type also exists
only within varlena.c. So, DatumGetStringP() and so on appear right
alongside DatumGetUnknownP() and so on.

The idea of the "string" abstraction is that is advertises that
certain functions could equally well apply to a variety of "varlena
header + some bytes" types. I thought about just using the varlena
type instead, but preferred the "string" abstraction.

Also, there's a lot of churn in this patch to just remove tss to sss.
Can't we just keep the original variable name?

I think that minimizing lines changed in a mechanical way by a commit
is overrated as a goal for Postgres patches, but I don't feel too
strongly about holding on to the "churn" in this patch. I attach a new
revision, which has the changes I outlined to code comments. I haven't
minimized the differences in the way you suggest just yet.

--
Peter Geoghegan

Attachments:

0001-Generalize-SortSupport-for-text.patchtext/x-patch; charset=US-ASCII; name=0001-Generalize-SortSupport-for-text.patchDownload+334-135
#6Andreas Karlsson
andreas.karlsson@percona.com
In reply to: Peter Geoghegan (#5)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

Hi,

I have reviewed this now and I think this is a useful addition even
though these indexes are less common. Consistent behavior is worth a lot
in my mind and this patch is reasonably small.

The patch no longer applies due to 1) oid collisions and 2) a trivial
conflict. When I fixed those two the test suite passed.

I tested this patch by building indexes with all the typess and got nice
measurable speedups.

Logically I think the patch makes sense and the code seems to be
correct, but I have some comments on it.

- You use two names a lot "string" vs "varstr". What is the difference
between those? Is there any reason for not using varstr consistently?

- You have a lot of renaming as has been mentioned previously in this
thread. I do not care strongly for it either way, but it did make it
harder to spot what the patch changed. If it was my own project I would
have considered splitting the patch into two parts, one renaming
everything and another adding the new feature, but the PostgreSQL seem
to often prefer having one commit per feature. Do as you wish here.

- I think the comment about NUL bytes in varstr_abbrev_convert makes it
seem like the handling is much more complicated than it actually is. I
did not understand it after a couple of readings and had to read the
code understand what it was talking about.

Nice work, I like your sorting patches.

Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Andreas Karlsson (#6)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

On Sun, Jan 31, 2016 at 10:59 PM, Andreas Karlsson <andreas@proxel.se> wrote:

I have reviewed this now and I think this is a useful addition even though
these indexes are less common. Consistent behavior is worth a lot in my mind
and this patch is reasonably small.

The patch no longer applies due to 1) oid collisions and 2) a trivial
conflict. When I fixed those two the test suite passed.

I tested this patch by building indexes with all the typess and got nice
measurable speedups.

Logically I think the patch makes sense and the code seems to be correct,
but I have some comments on it.

- You use two names a lot "string" vs "varstr". What is the difference
between those? Is there any reason for not using varstr consistently?

- You have a lot of renaming as has been mentioned previously in this
thread. I do not care strongly for it either way, but it did make it harder
to spot what the patch changed. If it was my own project I would have
considered splitting the patch into two parts, one renaming everything and
another adding the new feature, but the PostgreSQL seem to often prefer
having one commit per feature. Do as you wish here.

- I think the comment about NUL bytes in varstr_abbrev_convert makes it seem
like the handling is much more complicated than it actually is. I did not
understand it after a couple of readings and had to read the code understand
what it was talking about.

Nice work, I like your sorting patches.

Thanks for the review. I fixed the OID conflict, tweaked a few
comments, and committed this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Andreas Karlsson (#6)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

On Sun, Jan 31, 2016 at 7:59 PM, Andreas Karlsson <andreas@proxel.se> wrote:

Nice work, I like your sorting patches.

Thanks. I like your reviews of my sorting patches. :-)

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Robert Haas (#7)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review. I fixed the OID conflict, tweaked a few
comments, and committed this.

Thanks. I noticed a tiny, preexisting typo in the string abbreviated
key code. The attached patch fixes it.

--
Peter Geoghegan

Attachments:

tiny-abbrev-typo.patchtext/x-patch; charset=US-ASCII; name=tiny-abbrev-typo.patchDownload+1-1
#10Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#9)
Re: Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

On Fri, Feb 5, 2016 at 6:14 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review. I fixed the OID conflict, tweaked a few
comments, and committed this.

Thanks. I noticed a tiny, preexisting typo in the string abbreviated
key code. The attached patch fixes it.

Gosh, I must have looked at that line 10 times without seeing that
mistake. Thanks, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers