WIP Patch: Use sortedness of CSV foreign tables for query planning

Started by Etsuro Fujitaover 13 years ago10 messageshackers
Jump to latest
#1Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp

The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:

/*
* If data file was sorted, and we knew it somehow, we could insert
* appropriate pathkeys into the ForeignPath node to tell the planner
* that.
*/

To do this, I would like to propose new generic options for a file_fdw foreign
table to specify the sortedness of a data file. While it is best to allow to
specify the sortedness on multiple columns, the current interface for the
generic options dose not seems to be suitable for doing it. As a compromise, I
would like to propose single-column sortedness options and insert appropriate
pathkeys into the ForeignPath node based on these information:

sorted: Boolean option to specify whether data file is sorted by a column.
key: Specifies the name of a column by which data file is sorted. Required
when the above option is set to true.
direction: Specifies the sort order: asc or desc. The default is asc.
nulls: Specifies that nulls sort before or after non-nulls: first or last.
first is the default when direction option is set desc. When direction option
is not set desc, last is the default.

Attached is a WIP patch implementing this feature. I would like to demonstrate
the usefulness of the patch. Experimental results are shown below. Here, data
in /home/pgsql/relation.csv is sorted by aid in ascending order.

postgres=# CREATE EXTENSION file_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER fs FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
postgres=# CREATE FOREIGN TABLE ft (aid INTEGER, bid INTEGER) SERVER fs OPTIONS
(filename '/home/pgsql/relation.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE ft;
ANALYZE
postgres=# SELECT count(*) FROM ft;
count
----------
50000000
(1 row)

postgres=# SELECT count(DISTINCT aid) FROM ft;
count
---------
1000000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid;
QUERY PLAN

--------------------------------------------------------------------------------
----------------------------------------------------
Unique (cost=14187375.19..14437375.19 rows=1092929 width=4) (actual time=48952
.602..62788.934 rows=1000000 loops=1)
-> Sort (cost=14187375.19..14312375.19 rows=50000000 width=4) (actual time=
48952.601..56635.448 rows=50000000 loops=1)
Sort Key: aid
Sort Method: external sort Disk: 684272kB
-> Foreign Scan on ft (cost=0.00..5059137.00 rows=50000000 width=4) (
actual time=0.073..18324.062 rows=50000000 loops=1)
Foreign File: /home/pgsql/relation.csv
Foreign File Size: 484444500
Total runtime: 63019.868 ms
(8 rows)

postgres=# ALTER FOREIGN TABLE ft OPTIONS ( ADD sorted 'true', key 'aid' );
ALTER FOREIGN TABLE
postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid;
QUERY PLAN

--------------------------------------------------------------------------------
----------------------------------------------
Unique (cost=0.00..5184137.00 rows=1092929 width=4) (actual time=0.074..23124.
195 rows=1000000 loops=1)
-> Foreign Scan on ft (cost=0.00..5059137.00 rows=50000000 width=4) (actual
time=0.070..17633.821 rows=50000000 loops=1)
Foreign File: /home/pgsql/relation.csv
Foreign File Size: 484444500
Total runtime: 23213.909 ms
(5 rows)

Any comments and suggestions are welcomed.

Thanks,

Best regards,
Etsuro Fujita

Attachments:

file-fdw-pathkeys-0802.patchapplication/octet-stream; name=file-fdw-pathkeys-0802.patchDownload+336-17
#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Etsuro Fujita (#1)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> wrote:

I would like to propose single-column sortedness options and
insert appropriate pathkeys into the ForeignPath node based on
these information:

sorted: Boolean option to specify whether data file is sorted by
a column.
key: Specifies the name of a column by which data file is sorted.
Required when the above option is set to true.
direction: Specifies the sort order: asc or desc. The default is
asc.
nulls: Specifies that nulls sort before or after non-nulls: first
or last. first is the default when direction option is set
desc. When direction option is not set desc, last is the
default.

For character-based columns, don't we need to know the collation?

-Kevin

#3Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Kevin Grittner (#2)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

Hi Kevin,

I would like to propose single-column sortedness options and
insert appropriate pathkeys into the ForeignPath node based on
these information:

For character-based columns, don't we need to know the collation?

I forgot to add the collate option. I'll add it at the next version of the
patch.

Thanks,

Best regards,
Etsuro Fujita

#4Robert Haas
robertmhaas@gmail.com
In reply to: Etsuro Fujita (#1)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:

/*
* If data file was sorted, and we knew it somehow, we could insert
* appropriate pathkeys into the ForeignPath node to tell the planner
* that.
*/

To do this, I would like to propose new generic options for a file_fdw foreign
table to specify the sortedness of a data file. While it is best to allow to
specify the sortedness on multiple columns, the current interface for the
generic options dose not seems to be suitable for doing it. As a compromise, I
would like to propose single-column sortedness options and insert appropriate
pathkeys into the ForeignPath node based on these information:

I am not sure it is a good idea to complicate file_fdw with frammishes
of marginal utility. I guess I tend to view things like file_fdw as a
mechanism for getting the data into the database, not necessarily
something that you actually want to keep your data in permanently and
run complex queries against. It seems like that's the direction we're
headed in here - statistics, indexing, etc. I am all in favor of
having some kind of pluggable storage engine as an alternative to our
heap, but I'm not sure a flat-file is a good choice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Robert Haas (#4)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

Hi Robert,

From: Robert Haas [mailto:robertmhaas@gmail.com]

On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:

/*
* If data file was sorted, and we knew it somehow, we could insert
* appropriate pathkeys into the ForeignPath node to tell the planner
* that.
*/

To do this, I would like to propose new generic options for a file_fdw

foreign

table to specify the sortedness of a data file. While it is best to allow

to

specify the sortedness on multiple columns, the current interface for the
generic options dose not seems to be suitable for doing it. As a

compromise,

I

would like to propose single-column sortedness options and insert

appropriate

pathkeys into the ForeignPath node based on these information:

I am not sure it is a good idea to complicate file_fdw with frammishes
of marginal utility. I guess I tend to view things like file_fdw as a
mechanism for getting the data into the database, not necessarily
something that you actually want to keep your data in permanently and
run complex queries against.

I think file_fdw is useful for managing log files such as PG CSV logs. Since
often, such files are sorted by timestamp, I think the patch can improve the
performance of log analysis, though I have to admit my demonstration was not
realistic.

It seems like that's the direction we're
headed in here - statistics, indexing, etc. I am all in favor of
having some kind of pluggable storage engine as an alternative to our
heap, but I'm not sure a flat-file is a good choice.

As you pointed out, I would like to allow indexing to be done for CSV foreign
tables, but that is another problem. The submitted patch or the above comment
is not something toward indexing, so to say, an optimization of the current
file_fdw module.

Thanks,

Best regards,
Etsuro Fujita

#6Robert Haas
robertmhaas@gmail.com
In reply to: Etsuro Fujita (#5)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I think file_fdw is useful for managing log files such as PG CSV logs. Since
often, such files are sorted by timestamp, I think the patch can improve the
performance of log analysis, though I have to admit my demonstration was not
realistic.

Hmm, I guess I could buy that as a plausible use case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I think file_fdw is useful for managing log files such as PG CSV logs. Since
often, such files are sorted by timestamp, I think the patch can improve the
performance of log analysis, though I have to admit my demonstration was not
realistic.

Hmm, I guess I could buy that as a plausible use case.

In the particular case of PG log files, I'd bet good money against them
being *exactly* sorted by timestamp. Clock skew between backends, or
varying amounts of time to construct and send messages, will result in
small inconsistencies. This would generally not matter, until the
planner relied on the claim of sortedness for something like a mergejoin
... and then it would matter a lot.

In general I'm quite suspicious of the idea of believing that externally
supplied data is sorted in exactly the way that PG thinks it should
sort. If we implement this you can bet that people will screw up, for
instance by using the wrong locale/collation to sort text data.

regards, tom lane

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I think file_fdw is useful for managing log files such as PG CSV logs. Since
often, such files are sorted by timestamp, I think the patch can improve the
performance of log analysis, though I have to admit my demonstration was not
realistic.

Hmm, I guess I could buy that as a plausible use case.

In the particular case of PG log files, I'd bet good money against them
being *exactly* sorted by timestamp. Clock skew between backends, or
varying amounts of time to construct and send messages, will result in
small inconsistencies. This would generally not matter, until the
planner relied on the claim of sortedness for something like a mergejoin
... and then it would matter a lot.

Hmm, true.

In general I'm quite suspicious of the idea of believing that externally
supplied data is sorted in exactly the way that PG thinks it should
sort. If we implement this you can bet that people will screw up, for
instance by using the wrong locale/collation to sort text data.

I think that optimizations like this are going to be essential for
things like pgsql_fdw (or other_rdms_fdw). Despite the thorny
semantic issues, we're just not going to be able to get around it.
There will even be people who want SELECT * FROM ft ORDER BY 1 to
order by the remote side's notion of ordering rather than ours,
despite the fact that the remote side has some insane-by-PG-standards
definition of ordering. People are going to find ways to do that kind
of thing whether we condone it or not, so we might as well start
thinking now about how we're going to live with it. But that doesn't
answer the question of whether or not we ought to support it for
file_fdw in particular, which seems like a more arguable point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Robert Haas (#8)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From: Robert Haas [mailto:robertmhaas@gmail.com]

On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I think file_fdw is useful for managing log files such as PG CSV logs.

Since

often, such files are sorted by timestamp, I think the patch can improve

the

performance of log analysis, though I have to admit my demonstration was

not

realistic.

Hmm, I guess I could buy that as a plausible use case.

In the particular case of PG log files, I'd bet good money against them
being *exactly* sorted by timestamp. Clock skew between backends, or
varying amounts of time to construct and send messages, will result in
small inconsistencies. This would generally not matter, until the
planner relied on the claim of sortedness for something like a mergejoin
... and then it would matter a lot.

Hmm, true.

In general I'm quite suspicious of the idea of believing that externally
supplied data is sorted in exactly the way that PG thinks it should
sort. If we implement this you can bet that people will screw up, for
instance by using the wrong locale/collation to sort text data.

I think that optimizations like this are going to be essential for
things like pgsql_fdw (or other_rdms_fdw). Despite the thorny
semantic issues, we're just not going to be able to get around it.
There will even be people who want SELECT * FROM ft ORDER BY 1 to
order by the remote side's notion of ordering rather than ours,
despite the fact that the remote side has some insane-by-PG-standards
definition of ordering. People are going to find ways to do that kind
of thing whether we condone it or not, so we might as well start
thinking now about how we're going to live with it. But that doesn't
answer the question of whether or not we ought to support it for
file_fdw in particular, which seems like a more arguable point.

For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key
column is sorted in the specified way at the execution phase ie, at the (first)
scan of a data file, only when pathkeys are set, and (2) to abort the
transaction if it detects the data file is not sorted.

Thanks,

Best regards,
Etsuro Fujita

#10Robert Haas
robertmhaas@gmail.com
In reply to: Etsuro Fujita (#9)
Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

On Tue, Aug 7, 2012 at 2:02 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I think that optimizations like this are going to be essential for
things like pgsql_fdw (or other_rdms_fdw). Despite the thorny
semantic issues, we're just not going to be able to get around it.
There will even be people who want SELECT * FROM ft ORDER BY 1 to
order by the remote side's notion of ordering rather than ours,
despite the fact that the remote side has some insane-by-PG-standards
definition of ordering. People are going to find ways to do that kind
of thing whether we condone it or not, so we might as well start
thinking now about how we're going to live with it. But that doesn't
answer the question of whether or not we ought to support it for
file_fdw in particular, which seems like a more arguable point.

For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key
column is sorted in the specified way at the execution phase ie, at the (first)
scan of a data file, only when pathkeys are set, and (2) to abort the
transaction if it detects the data file is not sorted.

That seems like an even worse idea. People who want to access data
repeatedly should load it into tables.

Mind you, if you want to publish a version of file_fdw on PGXN that
does this, that's fine with me. But I don't think it belongs in core,
at least not without a lot more evidence that there is a real demand
for this than we have so far.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company