break table into portions for writing to separate files

Started by Seb · almost 12 years ago · 15 messages · general
#1 Seb
spluque@gmail.com

Hello,

I've been looking for a way to write a table into multiple files, and am
wondering if there are some clever suggestions. Say we have a table
that is too large (several Gb) to write to a file that can be used for
further analyses in other languages. The table consists of a timestamp
field and several numeric fields, with records every 10th of a second.
It could be meaningfully broken down into subsets of say 20 minutes
worth of records. One option is to write a shell script that loops
through the timestamp, selects the corresponding subset of the table,
and writes it as a unique file. However, this would be extremely slow
because each select takes several hours, and there can be hundreds of
subsets. Is there a better way?

Cheers,

--
Seb

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 dinesh kumar
dineshkumar02@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

Hi,

Would an approach like the one below work for you? I haven't tested
this, but it should give the idea: create a plpgsql function which
takes "from date", "to date", and "interval" parameters, then loops
over the range, COPYing one slice per file (column, table, and path
names are placeholders):

CREATE OR REPLACE FUNCTION dump_chunks(from_date timestamp,
                                       to_date timestamp,
                                       step interval)
RETURNS void AS $func$
DECLARE
    chunk_start timestamp := from_date;
BEGIN
    WHILE chunk_start + step <= to_date LOOP
        EXECUTE format(
            $$COPY (SELECT <column list> FROM <table name>
                    WHERE timestamp_column >= %L
                      AND timestamp_column < %L)
              TO %L WITH CSV$$,
            chunk_start::text,
            (chunk_start + step)::text,
            '/path/chunk_' || extract(epoch FROM chunk_start)::bigint || '.csv');
        chunk_start := chunk_start + step;
    END LOOP;
END;
$func$ LANGUAGE plpgsql;

Thanks,
Dinesh
manojadinesh.blogspot.com


#3 Francisco Olarte
folarte@peoplecall.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

Hi:

On Thu, May 1, 2014 at 7:50 PM, Seb <spluque@gmail.com> wrote:

I've been looking for a way to write a table into multiple files, and am
wondering if there are some clever suggestions. Say we have a table
that is too large (several Gb) to write to a file that can be used for
further analyses in other languages. The table consists of a timestamp
field and several numeric fields, with records every 10th of a second.
It could be meaningfully broken down into subsets of say 20 minutes
worth of records. One option is to write a shell script that loops
through the timestamp, selects the corresponding subset of the table,
and writes it as a unique file. However, this would be extremely slow
because each select takes several hours, and there can be hundreds of
subsets. Is there a better way?

As you mention looping and a shell, I suppose you are on something
Unix-like, with pipes et al. You can pipe COPY (either with the PROGRAM
option for COPY, or by piping a psql command, or whichever you prefer)
through a script which spits each data chunk into its corresponding
file. If your data is somehow clustered in the table (by chance or by
design) you don't even need to sort the data, just use an open file
pool. I did that once with call files, chunked them into day-sized
files, and it worked like a charm (and if you need the files sorted,
you can then run sort on each of them, which is normally quite fast).

For your description of the data, with a pipe, you could read a line,
extract a key for the record (the timestamp rounded down to 20 minutes
would be a good one), get an output file opened for append (using a
small caching layer), and write the line to it.

Depending on how many files you expect, how much RAM you have, and how
many files your OS allows you to open, other solutions exist. And if
you do not have enough RAM / open files / clustering for any of them,
there is a multitude of tricks (if, say, you have 3 years' worth, no
correlation, and can only open/buffer about 1000 files, you could split
from the db into day-sized chunks and then split each of those into
20-minute ones).
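The read-a-line / bucket-key / append stage described above can be
sketched with awk (a minimal sketch; the CSV layout and file names are
illustrative, and in practice the rows would come from
psql -c "COPY (...) TO STDOUT WITH CSV" rather than printf):

```shell
# Append each CSV row (first field: "YYYY-MM-DD HH:MM:SS.s" timestamp)
# to a file named after its 20-minute bucket.
printf '%s\n' \
  '2013-07-28 00:05:00.1,1.2' \
  '2013-07-28 00:25:00.2,3.4' |
awk -F',' '{
  min = substr($1, 15, 2) + 0                    # minute of the hour
  key = substr($1, 1, 13) sprintf(":%02d", int(min / 20) * 20)
  gsub(/[ :]/, "_", key)                         # filename-safe key
  print >> ("chunk_" key ".csv")                 # e.g. chunk_2013-07-28_00_00.csv
}'
```

Because ">>" appends, rows belonging to the same bucket can arrive in
any order without clobbering earlier ones.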

Regards.

Francisco Olarte.


#4 Szymon Guz
mabewlun@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

Hi,
several Gb is about 1 GB, which is not too much. In case you meant
'several GB', that shouldn't be a problem either.

The first thing I'd do would be creating an index on the column used for
dividing the data. Then I'd just use the command COPY with a proper select
to save the data to a file.

If each select lasts for several hours, make the select faster. Good index
usually helps. You can also post here the query which lasts for too long,
and attach its plan as well.

regards,
Szymon


#5 Seb
spluque@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On Thu, 1 May 2014 20:20:23 +0200,
Francisco Olarte <folarte@peoplecall.com> wrote:

[...]

As you mention looping and a shell, I suppose you are on something
Unix-like, with pipes et al. You can pipe COPY (either with the PROGRAM
option for COPY, or by piping a psql command, or whichever you prefer)
through a script which spits each data chunk into its corresponding
file. If your data is somehow clustered in the table (by chance or by
design) you don't even need to sort the data, just use an open file
pool. I did that once with call files, chunked them into day-sized
files, and it worked like a charm (and if you need the files sorted,
you can then run sort on each of them, which is normally quite fast).

For your description of the data, with a pipe, you could read a line,
extract a key for the record (the timestamp rounded down to 20 minutes
would be a good one), get an output file opened for append (using a
small caching layer), and write the line to it.

Depending on how many files you expect, how much RAM you have, and how
many files your OS allows you to open, other solutions exist. And if
you do not have enough RAM / open files / clustering for any of them,
there is a multitude of tricks (if, say, you have 3 years' worth, no
correlation, and can only open/buffer about 1000 files, you could split
from the db into day-sized chunks and then split each of those into
20-minute ones).

Thanks, I'm glad to hear you've used this approach successfully. It
seems the best solution is to do a single SELECT to get the data out of
the server (it is a view with a very complex query plan joining several
other similar views), and then pipe the output through, say, awk to
break it into chunks for writing the files, as you describe.

Cheers,

--
Seb


#6 Seb
spluque@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On Thu, 1 May 2014 20:22:26 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

Hi, several Gb is about 1 GB, which is not too much. In case you meant
'several GB', that shouldn't be a problem either.

Sorry, I meant several GB. Although that may not be a problem for
PostgreSQL, it is for post-processing the output file with other tools.

The first thing I'd do would be creating an index on the column used
for dividing the data. Then I'd just use the command COPY with a
proper select to save the data to a file.

I should have mentioned that this is quite a complex view (not a table),
which joins several other views of similar complexity. I'm not sure
whether indexes are useful/feasible in this case. I'll investigate.

Thanks,

--
Seb


#7 Torsten Förtsch
torsten.foertsch@gmx.net
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On 01/05/14 19:50, Seb wrote:

Hello,

I've been looking for a way to write a table into multiple files, and am
wondering if there are some clever suggestions. Say we have a table
that is too large (several Gb) to write to a file that can be used for
further analyses in other languages. The table consists of a timestamp
field and several numeric fields, with records every 10th of a second.
It could be meaningfully broken down into subsets of say 20 minutes
worth of records. One option is to write a shell script that loops
through the timestamp, selects the corresponding subset of the table,
and writes it as a unique file. However, this would be extremely slow
because each select takes several hours, and there can be hundreds of
subsets. Is there a better way?

# copy (select * from generate_series(1,1000)) to program 'split -l 100 - /tmp/xxx';
COPY 1000
# \q

$ ls -l /tmp/xxxa*
-rw------- 1 postgres postgres 292 May 1 19:08 /tmp/xxxaa
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxab
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxac
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxad
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxae
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxaf
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxag
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxah
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxai
-rw------- 1 postgres postgres 401 May 1 19:08 /tmp/xxxaj

Each of those contains 100 lines.
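The same split(1) behaviour can be tried locally without a database
(note that COPY ... TO PROGRAM runs the command on the server, as the
database service user, so the chunk files land server-side):

```shell
# Split 1000 generated lines into 100-line chunks, mirroring what
# COPY ... TO PROGRAM 'split -l 100 - /tmp/xxx' does on the server.
dir=$(mktemp -d)
seq 1000 | split -l 100 - "$dir/xxx"
ls "$dir" | wc -l        # 10 chunk files: xxxaa .. xxxaj
wc -l < "$dir/xxxaa"     # 100 lines in each chunk
```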

Torsten


#8 Seb
spluque@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On Thu, 01 May 2014 21:12:46 +0200,
Torsten Förtsch <torsten.foertsch@gmx.net> wrote:

[...]

# copy (select * from generate_series(1,1000)) to program 'split -l 100 - /tmp/xxx';
COPY 1000
# \q

$ ls -l /tmp/xxxa*
-rw------- 1 postgres postgres 292 May 1 19:08 /tmp/xxxaa
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxab
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxac
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxad
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxae
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxaf
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxag
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxah
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxai
-rw------- 1 postgres postgres 401 May 1 19:08 /tmp/xxxaj

Each of those contains 100 lines.

Wonderful! I didn't know COPY has a PROGRAM parameter nowadays.
Although the SELECT is slow, the split will happen very quickly this
way, so this should be acceptable.

Thanks,

--
Seb


#9 Szymon Guz
mabewlun@gmail.com
In reply to: Seb (#6)
Re: break table into portions for writing to separate files

On 1 May 2014 21:01, Seb <spluque@gmail.com> wrote:

[...]

I should have mentioned that this is quite a complex view (not a table),
which joins several other views of similar complexity. I'm not sure
whether indexes are useful/feasible in this case. I'll investigate.

Yes, indexes can be used to speed up the view as well. Such a view is
nothing more than just a query.

regards,
Szymon

#10 Seb
spluque@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On Thu, 1 May 2014 22:17:24 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

[...]

Yes, indexes can be used to speed up the view as well. Such a view is
nothing more than just a query.

Help for CREATE INDEX says that its target should be a table or
materialized view, so I'm guessing you mean indexes on the relevant
fields of the underlying tables. In that case, I already have indexes
on those, especially the timestamp fields, which are the ones used for
the heavy query work.

Thanks,

--
Seb


#11 Szymon Guz
mabewlun@gmail.com
In reply to: Seb (#10)
Re: break table into portions for writing to separate files

On 1 May 2014 22:24, Seb <spluque@gmail.com> wrote:

[...]

Help for CREATE INDEX says that its target should be a table or
materialized view, so I'm guessing you mean indexes on the relevant
fields of the underlying tables. In that case, I already have indexes
on those, especially the timestamp fields which are the ones that are
used for the heavy query work.

Can you show us the query plan for the queries you are using, the view
definition, and how you query that view?

Szymon

#12 Seb
spluque@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On Thu, 1 May 2014 22:31:46 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

[...]

Can you show us the query plan for the queries you are using, the view
definition, and how you query that view?

Thanks for your help with this. Here's the view definition (eliding
similar column references):

---<--------------------cut here---------------start------------------->---
CREATE OR REPLACE VIEW public.daily_flux_2013 AS
WITH pre AS (
SELECT mot.time_study, ...,
wind.wind_speed_u, ...,
op.op_analyzer_status, ...,
count(wind.wind_speed_u) OVER w AS nwind
FROM daily_motion_2013 mot
JOIN daily_wind3d_analog_2013 wind USING (time_study)
JOIN daily_opath_2013 op USING (time_study)
JOIN ( SELECT generate_series('2013-07-28 00:00:00'::timestamp without time zone,
                              '2013-09-13 00:00:00'::timestamp without time zone,
                              '00:20:00'::interval) AS time_20min) ts_20min
  ON mot.time_study >= ts_20min.time_20min
 AND mot.time_study < (ts_20min.time_20min + '00:20:00'::interval)
WINDOW w AS (PARTITION BY ts_20min.time_20min ORDER BY ts_20min.time_20min)
)
SELECT pre.time_study, ...,
FROM pre
WHERE pre.nwind = 12000
ORDER BY pre.time_study;
---<--------------------cut here---------------end--------------------->---

Here, mot, wind, and op are views that are similarly constructed
(i.e. they use generate_series() and join a few tables). The WHERE
clause is used to output only 20-minute periods where every 0.1-second
record is available (as determined by the wind_speed_u column).

I'm SELECTing this view simply as 'SELECT * FROM daily_flux_2013', and
that shows this query plan (lengthy output from pgAdmin):

---<--------------------cut here---------------start------------------->---
"Sort (cost=29182411.29..29182411.57 rows=111 width=976)"
" Sort Key: pre.time_study"
" CTE pre"
" -> WindowAgg (cost=29181518.64..29181907.52 rows=22222 width=434)"
" -> Sort (cost=29181518.64..29181574.19 rows=22222 width=434)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:20:00'::interval))"
" -> Nested Loop (cost=22171519.20..29179914.24 rows=22222 width=434)"
" Join Filter: (((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) >= (generate_series('2013-07-28 00:00:00'::timestamp without time zone, ' (...)"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=22171519.20..29175899.74 rows=200 width=426)"
" -> Merge Join (cost=22171519.20..29175898.74 rows=200 width=426)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = (generate_series('2013-07-28 00:00:00'::timestamp without tim (...)"
" -> Merge Join (cost=9360527.55..12865370.87 rows=200 width=123)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = (generate_series('2013-07-28 00:00:00'::timestamp witho (...)"
" -> Unique (cost=8625.16..8626.84 rows=200 width=50)"
" -> Sort (cost=8625.16..8626.00 rows=337 width=50)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Nested Loop (cost=0.57..8611.01 rows=337 width=50)"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Index Scan using motion_series_time_idx on motion_series mot (cost=0.57..8.59 rows=1 width=50)"
" Index Cond: ("time" = (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)))"
" Filter: (logging_group_id = 33)"
" -> Materialize (cost=9351902.39..12856739.03 rows=200 width=73)"
" -> Unique (cost=9351902.39..12856736.53 rows=200 width=73)"
" -> Merge Join (cost=9351902.39..12369954.70 rows=194712730 width=73)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = w."time")"
" -> Sort (cost=64.84..67.34 rows=1000 width=8)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=9351837.55..9546550.28 rows=38942546 width=73)"
" -> Sort (cost=9351837.55..9449193.92 rows=38942546 width=73)"
" Sort Key: w."time""
" -> Append (cost=0.00..2711828.47 rows=38942546 width=73)"
" -> Seq Scan on wind3d_series w (cost=0.00..0.00 rows=1 width=236)"
" Filter: (((stream_type)::text = 'analog'::text) AND (logging_group_id = 33))"
" -> Bitmap Heap Scan on wind3d_series_analog w_1 (cost=728917.29..2711828.47 rows=38942545 width=73)"
" Recheck Cond: (logging_group_id = 33)"
" Filter: ((stream_type)::text = 'analog'::text)"
" -> Bitmap Index Scan on fki_wind3d_series_analog_logging_group_id_fkey (cost=0.00..719181.65 rows=38942545 width=0)"
" Index Cond: (logging_group_id = 33)"
" -> Materialize (cost=12810991.66..16310524.87 rows=200 width=319)"
" -> Unique (cost=12810991.66..16310522.37 rows=200 width=319)"
" -> Merge Join (cost=12810991.66..15824477.13 rows=194418095 width=319)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = op."time")"
" -> Sort (cost=64.84..67.34 rows=1000 width=8)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=12810926.82..13005344.91 rows=38883619 width=319)"
" -> Sort (cost=12810926.82..12908135.87 rows=38883619 width=319)"
" Sort Key: op."time""
" -> Append (cost=0.00..2194298.15 rows=38883619 width=319)"
" -> Seq Scan on open_path_series op (cost=0.00..0.00 rows=1 width=556)"
" Filter: ((NOT is_shroud) AND (logging_group_id = 33))"
" -> Index Scan using fki_open_path_series_noshroud_logging_group_id_fkey on open_path_series_noshroud op_1 (cost=0.57..2194298.15 rows=38883618 width=319)"
" Index Cond: (logging_group_id = 33)"
" Filter: (NOT is_shroud)"
" -> CTE Scan on pre (cost=0.00..500.00 rows=111 width=976)"
" Filter: (nwind = 12000)"
---<--------------------cut here---------------end--------------------->---

--
Seb


#13 Szymon Guz
mabewlun@gmail.com
In reply to: Seb (#12)
Re: break table into portions for writing to separate files

On 1 May 2014 22:50, Seb <spluque@gmail.com> wrote:

[...]

In this form it is quite unreadable. Could you paste the plan at
http://explain.depesz.com/ and post the URL of the page here?

thanks,
Szymon

#14 Seb
spluque@gmail.com
In reply to: Seb (#1)
Re: break table into portions for writing to separate files

On Thu, 1 May 2014 23:41:04 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

[...]

In this form it is quite unreadable. Could you paste the plan at
http://explain.depesz.com/ and post the URL of the page here?

Nice.

http://explain.depesz.com/s/iMJi

--
Seb


#15 Francisco Olarte
folarte@peoplecall.com
In reply to: Seb (#5)
Re: break table into portions for writing to separate files

Hi Seb:

On Thu, May 1, 2014 at 8:50 PM, Seb <spluque@gmail.com> wrote:

Thanks, I'm glad to hear you've used this approach successfully.

Well, this is always successful if you are able to develop a
moderately complex script.

It seems the best solution is to do a single SELECT to get the data out
of the server (it is a view with a very complex query plan joining
several other similar views), and then pipe the output through, say,
awk to break it into chunks for writing the files, as you describe.

If you've got enough disk on the target machine for two copies of the
data (several GB, even a hundred, shouldn't be a problem with today's,
or even yesterday's, machines), I'd recommend just doing a COPY of the
view to a file, then processing the file. This lets you use very simple
scripts to chunk it, and workstation disk is normally way cheaper than
server disk.

What I would normally do, from what you describe, is spool the whole
table, maybe gzipping it along the way if it's really big; in my
experience some current machines (fast multicores with not-so-fast
disks) are faster gzipping and reading/writing a third of the data
(easily achievable with gzip if your data are numbers and timestamps)
than writing the full set. Then I'd make a set of scripts which work on
the spool. I do not grok awk (I began using perl in the 4.019 era and
substituted it for awk/sed in all kinds of one-liners and the like),
but with perl, on a moderate machine (like the one I'm using, a core
duo with 2 GB, quite old by today's standards) you can filter the data
and extract it to several hundred files at nearly full disk speed.
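The spool-once-compressed idea can look like the sketch below (paths
and the data stream are stand-ins; the real producer would be a psql
COPY to STDOUT):

```shell
dir=$(mktemp -d)
# Stand-in for: psql -c "COPY (SELECT * FROM myview) TO STDOUT WITH CSV"
seq 100000 | gzip > "$dir/spool.csv.gz"
# Downstream chunking scripts then read the compressed spool:
gzip -dc "$dir/spool.csv.gz" | wc -l     # all 100000 rows come back
```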

As I commented before, if your query results exhibit some locality on
the key, you can open several files, keeping an LRU cache, and split it
all in one go. I wouldn't try to force ordering in the db; I've found
disk sorts are better for this kind of problem.
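The LRU open-file pool can be sketched in awk as well (MAXOPEN and the
key layout are illustrative): when the pool is full, the least recently
used handle is closed, and reopening with ">>" keeps appending.

```shell
printf '%s\n' 'a,1' 'b,2' 'c,3' 'a,4' |
awk -F',' -v MAXOPEN=2 '{
  f = "out_" $1 ".csv"
  if (!(f in pool)) {
    if (nopen >= MAXOPEN) {               # evict least recently used file
      lru = ""
      for (g in pool) if (lru == "" || pool[g] < pool[lru]) lru = g
      close(lru); delete pool[lru]; nopen--
    }
    nopen++
  }
  pool[f] = NR                            # record last use
  print >> f                              # ">>" reopens in append mode
}'
```

Here out_a.csv ends up with two lines even though it was evicted in
between, because close() followed by ">>" appends rather than truncates.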

Ah, another hint. I've had to do somewhat similar tasks (dumping a big
query which joins a lot) in the past. In some of these cases the result
came from sweeping and concatenating a slew of really big tables while
joining a lot of small ones, something like sweeping 20 GB while
joining it with 15 more tables totalling 512 MB among them, generating
a 100 GB denormalized result. In those cases I developed a program
which just slurped all the small tables into RAM and did the join
before writing, which greatly improved the speed (as, among other
things, Postgres only had to send me 20.5 GB over the net). Sometimes
you can use tricks like this to improve performance, as workstation RAM
is often plentiful and cheap, and you have all of it for a single task,
while db server RAM is scarcer.

Francisco Olarte.
