PG9.2 and FDW query planning.

Started by Ronan Dunklauover 13 years ago3 messages
#1Ronan Dunklau
rdunklau@gmail.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello.

I'm in the process of porting our multicorn extension to pg9.2, and
I'd like to take advantage of the GetForeignPaths hook.

The multicorn extension allows the creation of a FDW in python using a
simple API, and I'd like to be able to provide FDW implementors a way
to influence the planner.

Let's say I have an IMAP foreign data wrapper, and I write a query
joining the table on itself using the In-Reply-To and Message-ID
headers, is there anything I can do to avoid fetching all the mails
from the remote server ?

If I could somehow inform the planner that it can look up rows by
message-id, thus avoiding the need to fetch everything from the remote
server. Perhaps "persuading" the planner to use a nested-loop ?

If I understand correctly, I should build one (or more) list of
pathkeys, and add more foreign paths using those pathkeys.

There was a discussion about index on foreign tables back in march.
- From what I understand from this discussion, someone proposed to
locally store information about indexes on the foreign tables, but I
did not find anything on how to build a path "from scratch".

Thank you.

- --
Ronan Dunklau

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)

iQEcBAEBAgAGBQJP9bsIAAoJECTYLCgFy323KtsH/2/8AVfBRm75oWFMlU0l0oBC
ujxkt338PnpVi1V5gtE5GSRSwybPWytXAkgzIQ5/DEP/RmeW8pliV+0V7zvqlEWG
zgvfA7stRBWtIIIv6mdlTM0eBBgsFnoJLiJDTUDst5vAaj8vg8b+pX/ip7nSF5sw
dV2i3ir6JSsG4nJOcQ/kP6xg4Joan65pOwFDfwnx9pFnerT0YN9f87DRuohcj12e
fgWSqZkGU5nx9yCLWa294YzIFFY7lIjLowzEfg2eP2dVIM09GquKsSXyilJiMy4J
3QL64mUDF/pNZeH0LnpyGJfCfPlQsX4c554rZbO03tVeSEZMyVpCISLqutTnR9I=
=HwMc
-----END PGP SIGNATURE-----

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ronan Dunklau (#1)
Re: PG9.2 and FDW query planning.

Ronan Dunklau <rdunklau@gmail.com> writes:

Let's say I have an IMAP foreign data wrapper, and I write a query
joining the table on itself using the In-Reply-To and Message-ID
headers, is there anything I can do to avoid fetching all the mails
from the remote server ?

If I could somehow inform the planner that it can look up rows by
message-id, thus avoiding the need to fetch everything from the remote
server. Perhaps "persuading" the planner to use a nested-loop ?

OK, so what you're saying is that the imap server can effectively
provide an index on message_id. What you'd do is create a parameterized
path that uses the tbl.message_id = other_tbl.in_reply_to join clause.
If that's enough cheaper than a full scan, the planner would select it.

FWIW, I'm not sure that it's sane to try to expose this stuff to python
yet. It's complicated and still something of a moving target. Once
we've got a few more C-coded FDWs that can do this type of optimization,
things might be stable enough that it'd be useful to try to provide a
high-level API.

regards, tom lane

#3Ronan Dunklau
rdunklau@gmail.com
In reply to: Tom Lane (#2)
Re: PG9.2 and FDW query planning.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/07/2012 18:30, Tom Lane wrote:

Ronan Dunklau <rdunklau@gmail.com> writes:

Let's say I have an IMAP foreign data wrapper, and I write a
query joining the table on itself using the In-Reply-To and
Message-ID headers, is there anything I can do to avoid fetching
all the mails from the remote server ?

If I could somehow inform the planner that it can look up rows
by message-id, thus avoiding the need to fetch everything from
the remote server. Perhaps "persuading" the planner to use a
nested-loop ?

OK, so what you're saying is that the imap server can effectively
provide an index on message_id. What you'd do is create a
parameterized path that uses the tbl.message_id =
other_tbl.in_reply_to join clause. If that's enough cheaper than a
full scan, the planner would select it.

Thank you, I was able to build such paths from your indication.

The python FDW implementor can optionally give a list of tuples
consisting of (path key, expected_number_of_row). So, in the imap
example that could be [('Message-ID', 1), ('From', 1000)] for example.

- From this information, if there is an equivalence class which
restrictinfo uses one of those keys, we build a parameterized path,
with an associated cost of base_width * expected_number_of_row, in
addition to the generic, unparameterized path.

The planner can then select this path, and build plans looking like this:

postgres=# explain select m1."From",
m1."To",
m2."From",
m2."To"
from mails m1 inner join mails m2 on m2."Message-ID" = m1."In-Reply-To"
where m1."From" = '%test@example.com%';

QUERY PLAN
- --------------------------------------------------------------------
Nested Loop (cost=10.00..60001000.00 rows=500000000 width=128)
-> Foreign Scan on mails m1 (cost=0.00..30000000.00 rows=100000
width=300)
Filter: (("From")::text = '%test@example.com%'::text)
-> Foreign Scan on mails m2 (cost=10.00..300.00 rows=1 width=300)
Filter: (("Message-ID")::text = (m1."In-Reply-To")::text)

If I understand it correctly, after returning a ForeignScan (from
GetForeignPlan), the planner decides to use a nestloop, and in the
process of creating the nestloop plan, replaces Var nodes coming from
the outerel (here, m1."In-Reply-To") by params nodes.

My current implementation already looks for (var = param) expressions
that it may handle during the plan phase and stores the association
between the var and the param_id.
At execution time, the needed parameters values are fetched (from the
ParamExecData array found in es_param_exec_vals) and passed to the
python foreign data wrapper.

The problem I have: how can I retrieve the generated params and keep
the association between the var and the param ?

Should I replace the (var = outervar) clauses by (var = param) myself
and store them in the fdw_exprs field of the foreign scan ?

FWIW, I'm not sure that it's sane to try to expose this stuff to
python yet. It's complicated and still something of a moving
target. Once we've got a few more C-coded FDWs that can do this
type of optimization, things might be stable enough that it'd be
useful to try to provide a high-level API.

The current API (as mentioned above) would be more declarative than
anything, only offering a way to (maybe) build parameterized paths
without guaranteeing anything. Even if the internals change, I fail to
see how it can hurt to offer such a feature.

Regards,

- --
Ronan Dunklau

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)

iQEcBAEBAgAGBQJP/s2dAAoJECTYLCgFy3239KkIAIiKJo/F1r4Yp49wLpmThjQI
ICo910ZajqlUKVsl9ye8m2l6p+lyGEmZMWUAWP6ae2pqFR+aC0zThypjF1faZ9tN
HfqMbEKx/trkDf05U28tJlvOeu21tiEOEs4n02fmfdHu9SvemuLdyhU3dOLxoBVK
ZZ8ra9q/+zHCPpc3zt0Mow80Q1X1M3DtirsHPoeIdOK69wD4nD2ZfhQule5HaoV1
dG3FlrKGAGzRpohLBCuWzyGPcWCS584lXGWfhsz/waLaSDIjcjvaaMke54eaa8Ci
7KxXkMM12CKFQyheSR5VVwFJrobnME2HDiJCoAOkRc0dW+Y2aASJnKG/FwL8C7s=
=4RjB
-----END PGP SIGNATURE-----