WIP: Collecting statistics on CSV file data

Started by Etsuro Fujitaover 14 years ago35 messageshackers

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

Hi there,

To enable file_fdw to estimate costs of scanning a CSV file more
accurately, I would like to propose a new FDW callback routine,
AnalyzeForeignTable, which allows to ANALYZE command to collect
statistics on a foreign table, and a corresponding file_fdw function,
fileAnalyzeForeignTable. Attached is my WIP patch.

Here's a summary of the implementation:

void AnalyzeForeignTable (Relation relation, VacuumStmt *vacstmt, int
elevel);
This is a new FDW callback routine to collect statistics on a foreign
table and store the results in the pg_class and pg_statistic system
catalogs. This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

static void fileAnalyzeForeignTable(Relation relation, VacuumStmt
*vacstmt, int elevel);
This new file_fdw function collects and stores the same statistics on
CSV file data as collected on a local table except for index related
statistics by executing the sequential scan on the CSV file and
acquiring sample rows using Vitter's algorithm. (It is time-consuming
for a large file.)

estimate_costs() (more precisely, clauselist_selectivity() in
estimate_costs()) estimates baserel->rows using the statistics stored in
the pg_statistic system catalog. If there are no statistics,
estimate_costs() estimates it using the default statistics as in
PostgreSQL 9.1.

I am able to demonstrate the effectiveness of this patch. The following
run is performed on a single core of a 3.00GHz Intel Xeon CPU with 8GB
of RAM. Configuration settings are default except for work_mem = 256MB.
We can see from this result that the optimiser selects a good plan when
the foreign tables have been analyzed.

I appreciate your comments and suggestions.

[sample csv file data]
postgres=# COPY (SELECT s.a, repeat('a', 100) FROM generate_series(1,
5000000) AS s(a)) TO '/home/pgsql/sample_csv_data1.csv' (FORMAT csv,
DELIMITER ',');
COPY 5000000
postgres=# COPY (SELECT (random()*10000)::int, repeat('b', 100) FROM
generate_series(1, 5000000)) TO '/home/pgsql/sample_csv_data2.csv'
(FORMAT csv, DELIMITER ',');
COPY 5000000

[Unpatched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# SELECT count(*) FROM tab1;
count
---------
5000000
(1 row)

postgres=# SELECT count(*) FROM tab2;
count
---------
5000000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------

---
Aggregate (cost=128859182.29..128859182.30 rows=1 width=0) (actual
time=27321.304..27321.304 rows=1 loops=1)
-> Merge Join (cost=5787102.68..111283426.33 rows=7030302383
width=0) (actual time=22181.428..26736.194 rows=4999745 loops=1)
Merge Cond: (tab1.aid = tab2.aid)
-> Sort (cost=1857986.37..1858198.83 rows=84983 width=4)
(actual time=5964.282..5965.958 rows=10000 loops=1)
Sort Key: tab1.aid
Sort Method: quicksort Memory: 853kB
-> Foreign Scan on tab1 (cost=0.00..1851028.44
rows=84983 width=4) (actual time=0.071..5962.382 rows=10000 loops=1)
Filter: ((aid >= 0) AND (aid <= 10000))
Foreign File: /home/pgsql/sample_csv_data1.csv
Foreign File Size: 543888896
-> Materialize (cost=3929116.30..4011842.29 rows=16545197
width=4) (actual time=16216.953..19550.846 rows=5000000 loops=1)
-> Sort (cost=3929116.30..3970479.30 rows=16545197
width=4) (actual time=16216.947..18418.684 rows=5000000 loops=1)
Sort Key: tab2.aid
Sort Method: external merge Disk: 68424kB
-> Foreign Scan on tab2 (cost=0.00..1719149.70
rows=16545197 width=4) (actual time=0.081..6059.630 rows=5000000 loops=1)
Foreign File: /home/pgsql/sample_csv_data2.csv
Foreign File Size: 529446313
Total runtime: 27350.673 ms
(18 rows)

[Patched]
postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER
file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format
'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE VERBOSE tab1;
INFO: analyzing "public.tab1"
INFO: "tab1": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# ANALYZE VERBOSE tab2;
INFO: analyzing "public.tab2"
INFO: "tab2": scanned, containing 5000000 rows; 30000 rows in sample
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE
tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1282725.25..1282725.26 rows=1 width=0) (actual
time=15114.325..15114.325 rows=1 loops=1)
-> Hash Join (cost=591508.50..1271157.90 rows=4626940 width=0)
(actual time=5964.449..14526.822 rows=4999745 loops=1)
Hash Cond: (tab2.aid = tab1.aid)
-> Foreign Scan on tab2 (cost=0.00..564630.00 rows=5000000
width=4) (actual time=0.070..6253.257 rows=5000000 loops=1)
Foreign File: /home/pgsql/sample_csv_data2.csv
Foreign File Size: 529446313
-> Hash (cost=591393.00..591393.00 rows=9240 width=4) (actual
time=5964.346..5964.346 rows=10000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 352kB
-> Foreign Scan on tab1 (cost=0.00..591393.00 rows=9240
width=4) (actual time=0.066..5962.222 rows=10000 loops=1)
Filter: ((aid >= 0) AND (aid <= 10000))
Foreign File: /home/pgsql/sample_csv_data1.csv
Foreign File Size: 543888896
Total runtime: 15114.480 ms
(13 rows)

Best regards,
Etsuro Fujita

Shigeru Hanada

shigeru.hanada@gmail.com

over 14 years ago

In reply to: Etsuro Fujita (#1)

Re: WIP: Collecting statistics on CSV file data

Hi Fujita-san,

(2011/09/12 19:40), Etsuro Fujita wrote:

Hi there,

To enable file_fdw to estimate costs of scanning a CSV file more
accurately, I would like to propose a new FDW callback routine,
AnalyzeForeignTable, which allows to ANALYZE command to collect
statistics on a foreign table, and a corresponding file_fdw function,
fileAnalyzeForeignTable. Attached is my WIP patch.

<snip>

I think this is a very nice feature so that planner would be able to
create smarter plan for a query which uses foreign tables.

I took a look at the patch, and found that it couldn't be applied
cleanly against HEAD. Please rebase your patch against current HEAD of
master branch, rather than 9.1beta1.

The wiki pages below would be helpful for you.
http://wiki.postgresql.org/wiki/Submitting_a_Patch
http://wiki.postgresql.org/wiki/Creating_Clean_Patches
http://wiki.postgresql.org/wiki/Reviewing_a_Patch

And it would be easy to use git to follow changes made by other
developers in master branch.
http://wiki.postgresql.org/wiki/Working_with_Git

Regards,
--
Shigeru Hanada

Marti Raudsepp

marti@juffo.org

over 14 years ago

In reply to: Etsuro Fujita (#1)

Re: WIP: Collecting statistics on CSV file data

2011/9/12 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>:

This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

This is a good idea.

However, if adding these statistics requires an explicit ANALYZE
command, then we should also have a command for resetting the
collected statistics -- to get it back into the un-analyzed state.

Currently it looks like the only way to reset statistics is to tamper
with catalogs directly, or recreate the foreign table.

Regards,
Marti

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Marti Raudsepp (#3)

Re: WIP: Collecting statistics on CSV file data

Marti Raudsepp <marti@juffo.org> writes:

2011/9/12 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>:

This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

This is a good idea.

However, if adding these statistics requires an explicit ANALYZE
command, then we should also have a command for resetting the
collected statistics -- to get it back into the un-analyzed state.

Uh, why? There is no UNANALYZE operation for ordinary tables, and
I've never heard anyone ask for one.

If you're desperate you could manually delete the relevant rows in
pg_statistic, a solution that would presumably work for foreign tables
too.

Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.

(Note: I'm unconvinced that there's a use-case for this in the case of
"real" foreign tables on a remote server --- it seems likely that the
wrapper ought to ask the remote server for its current stats, instead.
But it's clearly useful for non-server-backed sources such as file_fdw.)

regards, tom lane

Euler Taveira de Oliveira

euler@timbira.com

over 14 years ago

In reply to: Marti Raudsepp (#3)

Re: WIP: Collecting statistics on CSV file data

On 20-09-2011 11:12, Marti Raudsepp wrote:

2011/9/12 Etsuro Fujita<fujita.etsuro@lab.ntt.co.jp>:

This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

This is a good idea.

However, if adding these statistics requires an explicit ANALYZE
command, then we should also have a command for resetting the
collected statistics -- to get it back into the un-analyzed state.

Why would you want this? If the stats aren't up to date, run ANALYZE
periodically. Remember that it is part of the DBA maintenance tasks [1]http://www.postgresql.org/docs/current/static/maintenance.html.

[1]: http://www.postgresql.org/docs/current/static/maintenance.html

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

David Fetter

david@fetter.org

over 14 years ago

In reply to: Tom Lane (#4)

Re: WIP: Collecting statistics on CSV file data

On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:

Marti Raudsepp <marti@juffo.org> writes:

2011/9/12 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>:

This is called when ANALYZE command is executed. (ANALYZE
command should be executed because autovacuum does not analyze foreign
tables.)

This is a good idea.

However, if adding these statistics requires an explicit ANALYZE
command, then we should also have a command for resetting the
collected statistics -- to get it back into the un-analyzed state.

Uh, why? There is no UNANALYZE operation for ordinary tables, and
I've never heard anyone ask for one.

If you're desperate you could manually delete the relevant rows in
pg_statistic, a solution that would presumably work for foreign tables
too.

Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.

How about a per-table setting that tells autovacuum whether to do
this? Come to think of it, all of per-FDW, per-remote and per-table
settings would be handy, so people could express things like, "all CSV
files except these three, all PostgreSQL connections on the
10.1.0.0/16 network, and these two tables in Oracle."

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Alvaro Herrera

alvherre@2ndquadrant.com

over 14 years ago

In reply to: David Fetter (#6)

Re: WIP: Collecting statistics on CSV file data

Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011:

On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:

Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.

How about a per-table setting that tells autovacuum whether to do
this?

Seems reasonable. Have autovacuum assume that foreign tables are not to
be analyzed, unless some reloption is set.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Shigeru Hanada (#2)

Re: WIP: Collecting statistics on CSV file data

Hi Hanada-san,

I'm very sorry for late reply.

(2011/09/20 18:49), Shigeru Hanada wrote:

I took a look at the patch, and found that it couldn't be applied
cleanly against HEAD. Please rebase your patch against current HEAD of
master branch, rather than 9.1beta1.

The wiki pages below would be helpful for you.
http://wiki.postgresql.org/wiki/Submitting_a_Patch
http://wiki.postgresql.org/wiki/Creating_Clean_Patches
http://wiki.postgresql.org/wiki/Reviewing_a_Patch

And it would be easy to use git to follow changes made by other
developers in master branch.
http://wiki.postgresql.org/wiki/Working_with_Git

Thank you for the review and the helpful information.
I rebased. Please find attached a patch. I'll add the patch to the next CF.

Changes:

* cleanups and fixes
* addition of the following to ALTER FOREIGN TABLE
ALTER [COLUMN] column SET STATISTICS integer
ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only)
ALTER [COLUMN] column RESET ( n_distinct )
* reflection of the force_not_null info in acquiring sample rows
* documentation

Best regards,
Etsuro Fujita

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Alvaro Herrera (#7)

Re: WIP: Collecting statistics on CSV file data

Hi,

I'm very sorry for the late reply.

(2011/09/21 10:00), Alvaro Herrera wrote:

Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011:

On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:

Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.

How about a per-table setting that tells autovacuum whether to do
this?

Seems reasonable. Have autovacuum assume that foreign tables are not to
be analyzed, unless some reloption is set.

Thank you for the comments. I'd like to leave that feature for future work.

(But this is BTW. I'm interested in developing CREATE FOREIGN INDEX.
I've examined whether there are discussions about the design and
implementation of it in the archive, but could not find information. If
you know anything, please tell me.)

Best regards,
Etsuro Fujita

#10

David Fetter

david@fetter.org

over 14 years ago

In reply to: Etsuro Fujita (#9)

Re: WIP: Collecting statistics on CSV file data

On Fri, Oct 07, 2011 at 08:09:44PM +0900, Etsuro Fujita wrote:

Hi,

I'm very sorry for the late reply.

(2011/09/21 10:00), Alvaro Herrera wrote:

Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011:

On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote:

Probably a more interesting question is why we wouldn't change
autovacuum so that it calls this automatically for foreign tables.

How about a per-table setting that tells autovacuum whether to do
this?

Seems reasonable. Have autovacuum assume that foreign tables are not to
be analyzed, unless some reloption is set.

Thank you for the comments. I'd like to leave that feature for future work.

(But this is BTW. I'm interested in developing CREATE FOREIGN INDEX.
I've examined whether there are discussions about the design and
implementation of it in the archive, but could not find information.
If you know anything, please tell me.)

Look into the "virtual index interface" from Informix. We might want
to start a wiki page on this.

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#11

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: David Fetter (#10)

Re: WIP: Collecting statistics on CSV file data

(2011/10/07 21:56), David Fetter wrote:

(But this is BTW. I'm interested in developing CREATE FOREIGN INDEX.
I've examined whether there are discussions about the design and
implementation of it in the archive, but could not find information.
If you know anything, please tell me.)

Look into the "virtual index interface" from Informix.

Thank you for the information.

We might want to start a wiki page on this.

Yeah, I think it might be better to add information to the SQL/MED wiki
page:

http://wiki.postgresql.org/wiki/SQL/MED

Best regards,
Etsuro Fujita

#12

Shigeru Hanada

shigeru.hanada@gmail.com

over 14 years ago

In reply to: Etsuro Fujita (#8)

Re: WIP: Collecting statistics on CSV file data

(2011/10/07 18:09), Etsuro Fujita wrote:

Thank you for the review and the helpful information.
I rebased. Please find attached a patch. I'll add the patch to the next CF.

Changes:

* cleanups and fixes
* addition of the following to ALTER FOREIGN TABLE
ALTER [COLUMN] column SET STATISTICS integer
ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only)
ALTER [COLUMN] column RESET ( n_distinct )
* reflection of the force_not_null info in acquiring sample rows
* documentation

The new patch could be applied with some shifts. Regression tests of
core and file_fdw have passed cleanly. Though I've tested only simple
tests, ANALYZE works for foreign tables for file_fdw, and estimation of
costs and selectivity seem appropriate.

New API AnalyzeForeignTable
===========================
In your design, new handler function is called instead of
do_analylze_rel. IMO this hook point would be good for FDWs which can
provide statistics in optimized way. For instance, pgsql_fdw can simply
copy statistics from remote PostgreSQL server if they are compatible.
Possible another idea is to replace acquire_sample_rows so that other
FDWs can reuse most part of fileAnalyzeForeignTable and
file_fdw_do_analyze_rel.

And I think that AnalyzeForeignTable should be optional, and it would be
very useful if a default handler is provided. Probably a default
handler can use basic FDW APIs to acquire sample rows from the result of
"SELECT * FROM foreign_table" with skipping periodically. It won't be
efficient but I think it's not so unreasonable.

Other issues
============
There are some other comments about non-critical issues.
- When there is no analyzable column, vac_update_relstats is not called.
Is this behavior intentional?
- psql can't complete foreign table name after ANALYZE.
- A new parameter has been added to vac_update_relstats in a recent
commit. Perhaps 0 is OK for that parameter.
- ANALYZE without relation name ignores foreign tables because
get_rel_oids doesn't list foreign tables.
- IMO logging "analyzing foo.bar" should not be done in
AnalyzeForeignTable handler of each FDW because some FDW might forget to
do it. Maybe it should be pulled up to analyze_rel or somewhere in core.
- It should be mentioned in a document that foreign tables are not
analyzed automatically because they are read-only.

Regards,
--
Shigeru Hanada

#13

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Shigeru Hanada (#12)

Re: WIP: Collecting statistics on CSV file data

(2011/10/18 2:27), Shigeru Hanada wrote:

The new patch could be applied with some shifts. Regression tests of
core and file_fdw have passed cleanly. Though I've tested only simple
tests, ANALYZE works for foreign tables for file_fdw, and estimation of
costs and selectivity seem appropriate.

Thank you for your testing.

New API AnalyzeForeignTable
===========================

And I think that AnalyzeForeignTable should be optional, and it would be
very useful if a default handler is provided. Probably a default
handler can use basic FDW APIs to acquire sample rows from the result of
"SELECT * FROM foreign_table" with skipping periodically. It won't be
efficient but I think it's not so unreasonable.

I agree with you. However, I think that it is difficult to support such
a default handler in a unified way because there exist external data
sources for which we cannot execute "SELECT * FROM foreign_table", e.g.,
web-accessible DBs limiting full access to the contents.

Other issues
============
There are some other comments about non-critical issues.
- When there is no analyzable column, vac_update_relstats is not called.
Is this behavior intentional?
- psql can't complete foreign table name after ANALYZE.
- A new parameter has been added to vac_update_relstats in a recent
commit. Perhaps 0 is OK for that parameter.

I'll check.

- ANALYZE without relation name ignores foreign tables because
get_rel_oids doesn't list foreign tables.

I think that it might be better to ignore foreign tables by default
because analyzing such tables may take long depending on FDW.

- IMO logging "analyzing foo.bar" should not be done in
AnalyzeForeignTable handler of each FDW because some FDW might forget to
do it. Maybe it should be pulled up to analyze_rel or somewhere in core.
- It should be mentioned in a document that foreign tables are not
analyzed automatically because they are read-only.

OK. I'll revise.

Regards,

Best regards,
Etsuro Fujita

#14

Leonardo Francalanci

m_lists@yahoo.it

over 14 years ago

In reply to: Shigeru Hanada (#12)

Re: WIP: Collecting statistics on CSV file data

New API AnalyzeForeignTable

I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-realtime KPI calculated by postgresql. Of course, pipes can be
read just once, so I wouldn't want an "automatic analyze" of foreign tables...

#15

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Leonardo Francalanci (#14)

Re: WIP: Collecting statistics on CSV file data

Hi,

(2011/10/18 16:32), Leonardo Francalanci wrote:

New API AnalyzeForeignTable

I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-realtime KPI calculated by postgresql. Of course, pipes can be
read just once, so I wouldn't want an "automatic analyze" of foreign tables...

The patch does not analyze on foreign tables automatically. (The issue
of auto-analyze on foreign tables has been discussed. Please refer to [1]http://archives.postgresql.org/pgsql-hackers/2011-09/msg00992.php.)

[1]: http://archives.postgresql.org/pgsql-hackers/2011-09/msg00992.php

Best regards,
Etsuro Fujita

#16

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Shigeru Hanada (#12)

Re: WIP: Collecting statistics on CSV file data

Hi,

I revised the patch according to Hanada-san's comments. Attached is the
updated version of the patch.

Changes:

* pull up of logging "analyzing foo.bar"
* new vac_update_relstats always called
* tab-completion in psql
* add "foreign tables are not analyzed automatically..." to 23.1.3
Updating Planner Statistics
* some other modifications

Best regards,
Etsuro Fujita

#17

Shigeru Hanada

shigeru.hanada@gmail.com

over 14 years ago

In reply to: Etsuro Fujita (#16)

Re: WIP: Collecting statistics on CSV file data

(2011/10/20 18:56), Etsuro Fujita wrote:

I revised the patch according to Hanada-san's comments. Attached is the
updated version of the patch.

Changes:

* pull up of logging "analyzing foo.bar"
* new vac_update_relstats always called
* tab-completion in psql
* add "foreign tables are not analyzed automatically..." to 23.1.3
Updating Planner Statistics
* some other modifications

Submission review
=================

- Patch can be applied, and all regression tests passed. :)

Random comments
===============

- Some headers are not necessary for file_fdw.c

#include "access/htup.h"
#include "commands/dbcommands.h"
#include "optimizer/plancat.h"
#include "utils/elog.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

- It might be better to mention in document that users need to
explicitly specify foreign table name to ANALYZE command.

- I think backend should be aware the case which a handler is NULL. For
the case of ANALYZE of foreign table, it would be worth telling user
that request was not accomplished.

- file_fdw_do_analyze_rel is almost copy of do_analyze_rel. IIUC,
difference against do_analyze_rel are:
* don't logging analyze target
* don't switch userid to the owner of target table
* don't measure elapsed time for autoanalyze deamon
* don't handle index
* some comments are removed.
* sample rows are acquired by file_fdw's routine

I don't see any problem here, but would you confirm that all of them are
intentional?

Besides, keeping format (mainly indent and folding) of this function
similar to do_analyze_rel would help to follow changes in do_analyze_rel.

- IMHO exporting CopyState should be avoided. One possible idea is
adding new COPY API which allows to extract records from the file with
skipping specified number or rate.

- In your design, each FDW have to copy most of do_analyze_rel to their
own source. It means that FDW authors must know much details of ANALYZE
to implement ANALYZE handler. Actually, your patch exports some static
functions from analyze.c. Have you considered hooking
acquire_sample_rows()? Such handler should be more simple, and
FDW-specific. As you say, such design requires FDWs to skip some
records, but it would be difficult for some FDW (e.g. twitter_fdw) which
can't pick sample data up easily. IMHO such problem *must* be solved by
FDW itself.

--
Shigeru Hanada

#18

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Shigeru Hanada (#17)

Re: WIP: Collecting statistics on CSV file data

(2011/11/07 20:26), Shigeru Hanada wrote:

(2011/10/20 18:56), Etsuro Fujita wrote:

I revised the patch according to Hanada-san's comments. Attached is the
updated version of the patch.

Changes:

* pull up of logging "analyzing foo.bar"
* new vac_update_relstats always called
* tab-completion in psql
* add "foreign tables are not analyzed automatically..." to 23.1.3
Updating Planner Statistics
* some other modifications

Submission review
=================

- Patch can be applied, and all regression tests passed. :)

Thank you for your testing. I updated the patch according to your
comments. Attached is the updated version of the patch.

- file_fdw_do_analyze_rel is almost copy of do_analyze_rel. IIUC,
difference against do_analyze_rel are:
* don't logging analyze target
* don't switch userid to the owner of target table
* don't measure elapsed time for autoanalyze deamon
* don't handle index
* some comments are removed.
* sample rows are acquired by file_fdw's routine

I don't see any problem here, but would you confirm that all of them are
intentional?

Yes. But in the updated version, I've refactored analyze.c a little bit
to allow FDW authors to simply call do_analyze_rel().

- In your design, each FDW have to copy most of do_analyze_rel to their
own source. It means that FDW authors must know much details of ANALYZE
to implement ANALYZE handler. Actually, your patch exports some static
functions from analyze.c. Have you considered hooking
acquire_sample_rows()? Such handler should be more simple, and
FDW-specific. As you say, such design requires FDWs to skip some
records, but it would be difficult for some FDW (e.g. twitter_fdw) which
can't pick sample data up easily. IMHO such problem *must* be solved by
FDW itself.

The updated version enables FDW authors to just write their own
acquire_sample_rows(). On the other hand, by retaining to hook
AnalyzeForeignTable routine at analyze_rel(), higher level than
acquire_sample_rows() as before, it allows FDW authors to write
AnalyzeForeignTable routine for foreign tables on a remote server to ask
the server for its current stats instead, as pointed out earlier by Tom
Lane.

Best regards,
Etsuro Fujita

#19

Shigeru Hanada

shigeru.hanada@gmail.com

over 14 years ago

In reply to: Etsuro Fujita (#18)

Re: WIP: Collecting statistics on CSV file data

(2011/11/18 16:25), Etsuro Fujita wrote:

Thank you for your testing. I updated the patch according to your
comments. Attached is the updated version of the patch.

I'd like to share result of my review even though it's not fully
finished. So far I looked from viewpoint of API design, code
formatting, and documentation. I'll examine effectiveness of the patch
and details of implementation next week, and hopefully try writing
ANALYZE handler for pgsql_fdw :)

New patch has correct format, and it could be applied to HEAD of master
branch cleanly. All regression tests including those of contrib modules
have passed. It contains changes of codes and regression tests related
to the issue, and they have enough comments. IMO the document in this
patch is not enough to show how to write analyze handler to FDW authors,
but it can be enhanced afterward. With this patch, FDW author can
provide optional ANALYZE handler which updates statistics of foreign
tables. Planner would be able to generate better plan by using statistics.

Yes. But in the updated version, I've refactored analyze.c a little bit
to allow FDW authors to simply call do_analyze_rel().

<snip>

The updated version enables FDW authors to just write their own
acquire_sample_rows(). On the other hand, by retaining to hook
AnalyzeForeignTable routine at analyze_rel(), higher level than
acquire_sample_rows() as before, it allows FDW authors to write
AnalyzeForeignTable routine for foreign tables on a remote server to ask
the server for its current stats instead, as pointed out earlier by Tom
Lane.

IIUC, this patch offers three options to FDWs: a) set
AnalyzeForeignTable to NULL to indicate lack of capability, b) provide
AnalyzeForeignTable which calls do_analyze_rel with custom
sample_row_acquirer, and c) create statistics data from scratch by FDW
itself by doing similar things to do_analyze_rel with given argument or
copying statistics from remote PostgreSQL server.

ISTM that this design is well-balanced between simplicity and
flexibility. Maybe these three options would suit web-based wrappers,
file-based or RDBMS wrappers, and pgsql_fdw respectively. I think that
adding more details of FdwRoutine, such as purpose of new callback
function and difference from required ones, would help FDW authors,
including me :)

I have some random comments:

- I think separated typedef of sample_acquire_rows would make codes more
readable. In addition, parameters of the function should be described
explicitly.
- I couldn't see the reason why file_fdw sets ctid of sample tuples,
though I guess it's for Vitter's random sampling algorithm. If every
FDW must set valid ctid to sample tuples, it should be mentioned in
document of AnalyzeForeignTable. Exporting some functions from
analyze.c relates this issue?
- Why file_fdw skips sample tuples which have NULL value? AFAIS
original acquire_sample_rows doesn't do so.
- Some comment lines go past 80 columns.
- Some headers included in file_fdw.c seems unnecessary.

Regards,
--
Shigeru Hanada

#20

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Shigeru Hanada (#19)

Re: WIP: Collecting statistics on CSV file data

2011/11/18 Shigeru Hanada <shigeru.hanada@gmail.com>:

- I couldn't see the reason why file_fdw sets ctid of sample tuples,
though I guess it's for Vitter's random sampling algorithm. If every
FDW must set valid ctid to sample tuples, it should be mentioned in
document of AnalyzeForeignTable. Exporting some functions from
analyze.c relates this issue?

If every FDW must set valid ctid to sample tuples, it should be fixed
so that they don't have to, I would think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Shigeru Hanada (#19)

#22

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 14 years ago

In reply to: Robert Haas (#20)