Varying results when using merge joins over postgres_fdw vs hash joins
We are having an issue with a query that will return no results when the
query does a merge join with a foreign table, but (correctly) returns
results when using a hash join.
Here is the situation on the "remote" database (9.5):
# \d+ table_with_en_us_utf8_encoding
                     Table "public.table_with_en_us_utf8_encoding"
 Column |          Type          | Modifiers | Storage  | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
 id     | bigint                 |           | plain    |              |
 str1   | character varying(255) |           | extended |              |
 str2   | character varying(255) |           | extended |              |
 str3   | character varying(255) |           | extended |              |
 str4   | character varying(3)   |           | extended |              |

analytics=# select encoding, datcollate, datctype from pg_database where datname = current_database();
 encoding | datcollate  |  datctype
----------+-------------+-------------
        6 | en_US.UTF-8 | en_US.UTF-8
And here's what we do on the local side (9.6):
# select encoding, datcollate, datctype from pg_database where datname = current_database();
 encoding | datcollate | datctype
----------+------------+----------
        6 | C          | C
# import foreign schema public limit to (table_with_en_us_utf8_encoding) from server primary_replica into public;
# \d+ table_with_en_us_utf8_encoding
                            Foreign table "public.table_with_en_us_utf8_encoding"
 Column |          Type          | Collation | Nullable | Default |     FDW options      | Storage  | Stats target | Description
--------+------------------------+-----------+----------+---------+----------------------+----------+--------------+-------------
 id     | bigint                 |           |          |         | (column_name 'id')   | plain    |              |
 str1   | character varying(255) |           |          |         | (column_name 'str1') | extended |              |
 str2   | character varying(255) |           |          |         | (column_name 'str2') | extended |              |
 str3   | character varying(255) |           |          |         | (column_name 'str3') | extended |              |
 str4   | character varying(3)   |           |          |         | (column_name 'str4') | extended |              |
Server: primary_replica
FDW options: (schema_name 'public', table_name 'table_with_en_us_utf8_encoding')
# create temporary table tmp_on_c_collated_foreign_server (str2 text);
# insert into tmp_on_c_collated_foreign_server (str2) values ('576228972');
# insert into tmp_on_c_collated_foreign_server (str2) values ('576228972');
# insert into tmp_on_c_collated_foreign_server (str2) values ('576228972');
--
-- query with merge join, returns zero rows
--
# explain (analyze, verbose)
  select e.str1, e.str2, e.str3
  from tmp_on_c_collated_foreign_server c
  left join table_with_en_us_utf8_encoding e on c.str2 = e.str2
  where e.str4 = '2';
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=18041.88..22322.92 rows=229221 width=1548) (actual time=102.849..102.849 rows=0 loops=1)
   Output: e.str1, e.str2, e.str3
   Merge Cond: ((e.str2)::text = c.str2)
   ->  Foreign Scan on public.table_with_en_us_utf8_encoding e  (cost=17947.50..18705.95 rows=33709 width=93) (actual time=102.815..102.815 rows=1 loops=1)
         Output: e.id, e.str1, e.str2, e.str3, e.str4
         Remote SQL: SELECT str1, str2, str3 FROM public.table_with_en_us_utf8_encoding WHERE ((str4 = '2'::text)) ORDER BY str2 ASC NULLS LAST
   ->  Sort  (cost=94.38..97.78 rows=1360 width=32) (actual time=0.028..0.029 rows=7 loops=1)
         Output: c.str2
         Sort Key: c.str2
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on pg_temp_3.tmp_on_c_collated_foreign_server c  (cost=0.00..23.60 rows=1360 width=32) (actual time=0.010..0.011 rows=7 loops=1)
               Output: c.str2
 Planning time: 4.285 ms
 Execution time: 104.458 ms
(14 rows)
--
-- query with hash join, returns rows
--
-- the default for the foreign server is to use remote estimates, so we turn that off...
# alter foreign table table_with_en_us_utf8_encoding OPTIONS (ADD use_remote_estimate 'false');
ALTER FOREIGN TABLE
-- and then run the same query again
# explain (analyze, verbose)
  select e.str1, e.str2, e.str3
  from tmp_on_c_collated_foreign_server c
  left join table_with_en_us_utf8_encoding e on c.str2 = e.str2
  where e.str4 = '2';
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=110.68..139.45 rows=7 width=1548) (actual time=154.280..154.286 rows=7 loops=1)
   Output: e.str1, e.str2, e.str3
   Hash Cond: (c.str2 = (e.str2)::text)
   ->  Seq Scan on pg_temp_3.tmp_on_c_collated_foreign_server c  (cost=0.00..23.60 rows=1360 width=32) (actual time=0.006..0.008 rows=7 loops=1)
         Output: c.str2
   ->  Hash  (cost=110.67..110.67 rows=1 width=1548) (actual time=154.264..154.264 rows=33418 loops=1)
         Output: e.str1, e.str2, e.str3
         Buckets: 65536 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 4003kB
         ->  Foreign Scan on public.table_with_en_us_utf8_encoding e  (cost=100.00..110.67 rows=1 width=1548) (actual time=8.289..144.210 rows=33418 loops=1)
               Output: e.str1, e.str2, e.str3
               Remote SQL: SELECT str1, str2, str3 FROM public.table_with_en_us_utf8_encoding WHERE ((str4 = '2'::text))
 Planning time: 0.153 ms
 Execution time: 156.557 ms
(13 rows)
So we get different answers based on whether the planner decides to do a
merge join (pushing down an ORDER BY clause) vs. a hash join (no ORDER BY).
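To make the collation difference concrete: the same two values sort in
opposite orders under the two defaults (assuming an "en_US.utf8" collation
is available locally; expected results noted in the comments):

# select s from (values ('a'), ('B')) v(s) order by s collate "C";           -- B, a
# select s from (values ('a'), ('B')) v(s) order by s collate "en_US.utf8";  -- a, B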
On Wed, Sep 20, 2017 at 5:07 AM, Corey Huinker <corey.huinker@gmail.com> wrote:
We are having an issue with a query that will return no results when the
query does a merge join with a foreign table, but (correctly) returns
results when using a hash join.

# \d+ table_with_en_us_utf8_encoding
                            Foreign table "public.table_with_en_us_utf8_encoding"
 Column |          Type          | Collation | Nullable | Default |     FDW options      | Storage  | Stats target | Description
--------+------------------------+-----------+----------+---------+----------------------+----------+--------------+-------------
 id     | bigint                 |           |          |         | (column_name 'id')   | plain    |              |
 str1   | character varying(255) |           |          |         | (column_name 'str1') | extended |              |
 str2   | character varying(255) |           |          |         | (column_name 'str2') | extended |              |
 str3   | character varying(255) |           |          |         | (column_name 'str3') | extended |              |
 str4   | character varying(3)   |           |          |         | (column_name 'str4') | extended |              |
Server: primary_replica
FDW options: (schema_name 'public', table_name 'table_with_en_us_utf8_encoding')
The Collation column is empty here, which means that the collation for the
str* columns is the default collation, i.e. C. This isn't true, since the
default collation on the foreign server is different from the default
collation of the local database. AFAIU, import foreign schema should have
set the appropriate collation on the foreign table.
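FWIW, the collation can be pinned on the imported columns after the fact,
which should keep the planner from assuming the remote data sorts like C.
A sketch, assuming a collation matching the remote default (e.g.
"en_US.utf8") exists on the local side:

# alter foreign table table_with_en_us_utf8_encoding
    alter column str2 type character varying(255) collate "en_US.utf8";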
--
-- query with merge join, returns zero rows
--
 Merge Join  (cost=18041.88..22322.92 rows=229221 width=1548) (actual time=102.849..102.849 rows=0 loops=1)
 ...
         Remote SQL: SELECT str1, str2, str3 FROM public.table_with_en_us_utf8_encoding WHERE ((str4 = '2'::text)) ORDER BY str2 ASC NULLS LAST
Since the results returned by the foreign server are ordered according to
the collation of the foreign server, the order doesn't match the order
expected by the local server, and so the merge join reports different rows.
--
-- query with hash join, returns rows
--
 Hash Join  (cost=110.68..139.45 rows=7 width=1548) (actual time=154.280..154.286 rows=7 loops=1)
 ...
               Remote SQL: SELECT str1, str2, str3 FROM public.table_with_en_us_utf8_encoding WHERE ((str4 = '2'::text))
In this case, both tables use the same collation while comparing the rows,
so the result is different from the merge join result. A hash join executed
on the local server and the same join executed on the foreign server (by
importing the local table into the foreign server) would also differ.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
The Collation column is empty here, which means that the collation for the
str* columns is the default collation, i.e. C. This isn't true, since the
default collation on the foreign server is different from the default
collation of the local database. AFAIU, import foreign schema should have
set the appropriate collation on the foreign table.
That's what we saw. The query inside IMPORT FOREIGN SCHEMA assumes a NULL
collation means default, without asking the server what that default is. I
was thinking that we could change the select inside
postgresImportForeignSchema and replace

    collname,

with something borrowed from information_schema.sql like

    coalesce(collname, (SELECT encoding FROM pg_catalog.pg_database
                        WHERE datname = pg_catalog.current_database()))

which itself isn't right, because encoding names don't match up perfectly
with collation names.
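For example, resolving the encoding number from the pg_database output
earlier in the thread:

# select pg_catalog.pg_encoding_to_char(6);
 pg_encoding_to_char
---------------------
 UTF8

and "UTF8" is not the name of any collation, so the coalesce above would
produce something bogus.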
On Wed, Sep 20, 2017 at 2:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
In this case, both tables use the same collation while comparing the rows,
so the result is different from the merge join result. A hash join executed
on the local server and the same join executed on the foreign server (by
importing the local table into the foreign server) would also differ.
Not really, because collatable types like text have the same equality
behavior, regardless of collation. (I would prefer it if they didn't
in at least some cases, but we don't have case insensitive collations
yet.)
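A quick check (assuming an "en_US.utf8" collation exists locally); texteq
boils down to a memcmp, so the collation label never changes the answer:

# select 'abc' = 'abc' collate "C" as c_eq,
         'abc' = 'abc' collate "en_US.utf8" as en_eq;   -- t under both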
I think that Corey describes a user-hostile behavior. I feel that we
should try to do better here.
--
Peter Geoghegan
Corey Huinker <corey.huinker@gmail.com> writes:
That's what we saw. The query inside IMPORT FOREIGN SCHEMA assumes a NULL
collation means default, without asking the server what that default is.
No, it's not NULL, it's pg_catalog.default. The problem is exactly that
that means something else on the remote server than it does locally.
I'm not sure whether there's a way to fix this that doesn't break other
cases. We could retrieve the pg_database.datcollate string from the
remote, but that doesn't necessarily match up with any collation name
we know about locally. One pretty common failure mode would be that
the datcollate string isn't a canonical spelling (eg, "en_US.UTF-8"
where the name we know about is "en_US.utf8"). In general, datcollate
is handled through other code paths than collation names, so it might
easily be that it doesn't match anything in the remote's pg_collation
catalog either :-(.
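A quick way to see whether a given server's datcollate matches anything in
its own pg_collation catalog (just a diagnostic sketch, relying on the fact
that collcollate stores the LC_COLLATE string for libc collations):

# select d.datcollate,
         exists (select 1 from pg_catalog.pg_collation c
                 where c.collcollate = d.datcollate) as has_matching_collation
  from pg_catalog.pg_database d
  where d.datname = pg_catalog.current_database();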
Another point is that when the servers' default collations do match, users
would likely not thank us for replacing "default" with something else.
Even if we picked a functionally equivalent collation, it would impede
query optimization because the planner wouldn't know it was equivalent.
Perhaps, rather than trying to fix this automatically, we should
leave it to the user. We could invent another import option that
says what to translate "default" to, with the default being,
uh, "default".
regards, tom lane
Peter Geoghegan <pg@bowt.ie> writes:
I think that Corey describes a user-hostile behavior. I feel that we
should try to do better here.
It is that. I'm tempted to propose that we invent some kind of "unknown"
collation, which the planner would have to be taught to not equate to any
other column collation (not even other instances of "unknown"), and that
postgres_fdw's IMPORT ought to label remote columns with that collation
unless specifically told to do otherwise. Then it's on the user's head
if he tells us to do the wrong thing; but we won't produce incorrect
plans by default.
This is, of course, not at all a back-patchable fix.
regards, tom lane
I'm not sure whether there's a way to fix this that doesn't break other
cases. We could retrieve the pg_database.datcollate string from the
remote, but that doesn't necessarily match up with any collation name
we know about locally. One pretty common failure mode would be that
the datcollate string isn't a canonical spelling (eg, "en_US.UTF-8"
where the name we know about is "en_US.utf8"). In general, datcollate
is handled through other code paths than collation names, so it might
easily be that it doesn't match anything in the remote's pg_collation
catalog either :-(.
This is where we got stuck as well (+David, who did a lot of digging on this
issue). Hence submitting the discovery without our half-baked patch.
We had difficulty finding the place in the code where LC_COLLATE gets
recombobulated into a recognized collation.
On Wed, Sep 20, 2017 at 9:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It is that. I'm tempted to propose that we invent some kind of "unknown"
collation, which the planner would have to be taught to not equate to any
other column collation (not even other instances of "unknown"), and that
postgres_fdw's IMPORT ought to label remote columns with that collation
unless specifically told to do otherwise. Then it's on the user's head
if he tells us to do the wrong thing; but we won't produce incorrect
plans by default.

This is, of course, not at all a back-patchable fix.
I would like postgres_fdw to be taught about collation versioning, so
that postgres_fdw's IMPORT could automatically do the right thing when
ICU is in use. Maybe it's too early to discuss that, because we don't
even support alternative collation provider collations as the
database/cluster default collation just yet. FWIW, postgres_fdw +
collations was one of the issues that made me believe in the long term
strategic importance of ICU.
Anyway, I'm pretty sure that we also need to do something about this
in the short term. Maybe a prominent warning about server collation in
the postgres_fdw docs, or a user-visible NOTICE when incompatible
server collations are observed by postgres_fdw's IMPORT?
--
Peter Geoghegan
Corey Huinker <corey.huinker@gmail.com> writes:
We had difficulty finding the place in the code where LC_COLLATE gets
recombobulated into a recognized collation.
That's because it isn't. The DB's default collation boils down to
"call strcoll(), having set LC_COLLATE to whatever pg_database says".
Non-default collations go through strcoll_l(), which might not even
exist on a given platform. So they're entirely separate code paths.
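The SQL-level shape of the two paths, for anyone following along (assuming
"en_US.utf8" exists on the platform):

# select 'a' < 'B' as default_path,                     -- strcoll() path
         'a' < 'B' collate "en_US.utf8" as named_path;  -- strcoll_l() path
-- on the C-collated local database this should return f, t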
regards, tom lane
Peter Geoghegan <pg@bowt.ie> writes:
I would like postgres_fdw to be taught about collation versioning, so
that postgres_fdw's IMPORT could automatically do the right thing when
ICU is in use. Maybe it's too early to discuss that, because we don't
even support alternative collation provider collations as the
database/cluster default collation just yet. FWIW, postgres_fdw +
collations was one of the issues that made me believe in the long term
strategic importance of ICU.
TBH, the more I learn about ICU, the less faith I have in the proposition
that it's going to fix anything at all for us in this area. It seems to
be just about as squishy as glibc in terms of locale identification,
if not worse.
regards, tom lane
On Wed, Sep 20, 2017 at 9:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH, the more I learn about ICU, the less faith I have in the proposition
that it's going to fix anything at all for us in this area. It seems to
be just about as squishy as glibc in terms of locale identification,
if not worse.
That may be our fault, to a significant degree. Did you read my mail
from yesterday, over on the "CREATE COLLATION does not sanitize ICU's
BCP 47 language tags" thread? I think that we've been incorrectly
specifying the locale name that is passed to ucol_open() this whole
time. At least, for ICU versions prior to ICU 54. This will need to be
addressed very soon.
--
Peter Geoghegan
Yeah. Definitely went down a fun rabbit hole on that separate code paths
issue.
Perhaps, rather than trying to fix this automatically, we should
leave it to the user. We could invent another import option that
says what to translate "default" to, with the default being,
uh, "default".I like this idea. One suggestion might be to do a check for the remote
LC_COLLATE and the local LC_COLLATE at the beginning of an IMPORT FOREIGN
SCHEMA call and at least raise a warning if the default collations do not
match. That wouldn't break anything, but at least would notify the user
that something bad could be happening and pointing them to that variable.
Actually, instead of an import option, this might make more sense as an
option on the foreign server: add a default_collation_mapping option for
the foreign server and raise a warning either at foreign server creation
time or at import foreign schema time (probably the latter, as I don't
think we actually connect to the remote when we create the foreign server).
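Something like the following, where default_collation_mapping is of course
hypothetical:

# alter server primary_replica
    options (add default_collation_mapping 'en_US.utf8');  -- hypothetical option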
D
--
David Kohn | Data Engineer | MOAT
63 Madison Ave, 15th Floor, NYC
On 9/20/17 12:06, Tom Lane wrote:
I'm tempted to propose that we invent some kind of "unknown"
collation, which the planner would have to be taught to not equate to any
other column collation (not even other instances of "unknown"), and that
postgres_fdw's IMPORT ought to label remote columns with that collation
unless specifically told to do otherwise. Then it's on the user's head
if he tells us to do the wrong thing; but we won't produce incorrect
plans by default.
OID 0 might already work that way, depending on the details.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
OID 0 might already work that way, depending on the details.
No, OID 0 means "column is not collatable". I'm pretty sure there are
some asserts that will trip if we use that collation OID for a column of a
collatable data type --- and even if there are not, I think conflating the
two cases would be a bad idea.
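The existing convention is visible in the catalog: attcollation is already 0
for the non-collatable bigint column, versus the OID of "default" for the
varchar columns. A quick look, using the example table from upthread:

# select attname, atttypid::regtype, attcollation
    from pg_catalog.pg_attribute
    where attrelid = 'table_with_en_us_utf8_encoding'::regclass
      and attnum > 0;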
regards, tom lane