Design: table vs. view?
I have a table which has some "raw" data in it. By "raw", I mean it is
minimally processed from a log file. Every week, I update this table by
processing the weekly log using awk to create a "psql script" file which
looks similar to:
COPY rawdata FROM STDIN;
.... lines created by awk script
\.
The table schema is:
Table "jobrun.rawdata"
Column | Type |
----------+---------------+
lpar | character(4) |
yyddd | character(5) |
timedata | character(11) |
jobid | character(8) |
msgid | character(7) |
jobname | character(8) |
Now, this data is not really very useful in its raw form. So I "process" it
via a view:
View "jobrun.rundata"
Column | Type |
---------+--------------------------+
lpar | character(4) |
msgid | character(7) |
jobname | character(8) |
jobid | character(8) |
msgtime | timestamp with time zone |
View definition:
SELECT rawdata.lpar,
rawdata.msgid,
rawdata.jobname,
rawdata.jobid,
to_timestamp((rawdata.yyddd::text || ' '::text) ||
rawdata.timedata::text, 'YYDDD HH24:MI:SS.MS'::text) AS msgtime
FROM rawdata;
My question is this: I do a number of SELECTs on the "rundata" view.
Would it be worthwhile to make this a table in itself? The plus of a
view is that I don't need to worry about updates. And I still have the
"raw" data around. In reality, this is just the first VIEW. I create three
other views. Two views are to "subset" the data based on the contents of
the "msgid" value (there are only 2 possibilities at present: 'START' and
'END'). The final view, which holds my actual information, is a FULL OUTER
JOIN of the START and END subsets, based on lpar, jobname, and jobid:
View "jobrun.runinfo"
Column | Type |
----------+--------------------------+
lpar | character(4) |
jobname | character(8) |
jobid | character(8) |
runstart | timestamp with time zone |
runend | timestamp with time zone |
View definition:
SELECT COALESCE(a.lpar, b.lpar) AS lpar,
COALESCE(a.jobname, b.jobname) AS jobname,
COALESCE(a.jobid, b.jobid) AS jobid,
a.msgtime AS runstart,
b.msgtime AS runend
FROM runstart a
FULL JOIN runend b ON a.lpar = b.lpar AND a.jobname = b.jobname AND
a.jobid = b.jobid;
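For reference, the two subset views referenced in that join were not posted; reconstructed from the 'START'/'END' msgid values described above, they presumably look roughly like this sketch (not the exact definitions):

```sql
-- Sketch only: inferred definitions of the two subset views
-- (the actual bodies were not included in the message).
CREATE VIEW runstart AS
SELECT lpar, jobname, jobid, msgtime
FROM rundata
WHERE msgid = 'START';

CREATE VIEW runend AS
SELECT lpar, jobname, jobid, msgtime
FROM rundata
WHERE msgid = 'END';
```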
So the overhead may be quite high, because to SELECT from RUNINFO,
PostgreSQL must expand and evaluate all four views.
I appreciate your thoughts on whether this is OK, given that performance
is currently acceptable, mainly because this work is only done once a
week, on Sundays. And I don't run it myself; it is done via a scheduler
(not cron, but similar) which runs some scripts.
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan
Maranatha! <><
John McKown
John McKown wrote
I have a table which has some "raw" data in it. By "raw", I mean it is
minimally processed from a log file. <snip>
I appreciate your thoughts on if this is OK, given that performance is
currently acceptable.
I would likely make "jobrun.runinfo" into a table while leaving
"jobrun.rawdata" as-is. I would have a function that populates "runinfo"
from "rawdata" that I would call after performing the copy to "rawdata".
There would be no views - unless you desire a view interface over "runinfo"
for API or permission reasons.
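Such a populate function might be sketched like this (the function name and the TRUNCATE-then-INSERT strategy are assumptions; the SELECT reuses the runinfo query from the original message):

```sql
-- Hypothetical sketch of a function that rebuilds a plain
-- "runinfo" table from the raw data after each weekly COPY.
CREATE OR REPLACE FUNCTION jobrun.populate_runinfo() RETURNS void AS $$
BEGIN
    TRUNCATE jobrun.runinfo;
    INSERT INTO jobrun.runinfo
    SELECT COALESCE(a.lpar, b.lpar),
           COALESCE(a.jobname, b.jobname),
           COALESCE(a.jobid, b.jobid),
           a.msgtime,   -- runstart
           b.msgtime    -- runend
    FROM runstart a
    FULL JOIN runend b
      ON a.lpar = b.lpar
     AND a.jobname = b.jobname
     AND a.jobid = b.jobid;
END;
$$ LANGUAGE plpgsql;
```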
Since 9.3 you can make "runinfo" an explicit MATERIALIZED VIEW and run
the REFRESH MATERIALIZED VIEW command to accomplish the same thing -
though I am not particularly familiar with the mechanics of that feature.
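The materialized-view variant might be sketched as follows (the name is illustrative; MATERIALIZED VIEW is available since 9.3, and REFRESH ... CONCURRENTLY since 9.4):

```sql
-- Sketch: materialize the existing runinfo view, then refresh
-- it once after each weekly load.
CREATE MATERIALIZED VIEW jobrun.runinfo_mv AS
SELECT * FROM jobrun.runinfo;

-- After each weekly COPY into rawdata:
REFRESH MATERIALIZED VIEW jobrun.runinfo_mv;
```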
David J.
--
View this message in context: http://postgresql.1045698.n5.nabble.com/Design-table-vs-view-tp5811577p5811589.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Jul 15, 2014 at 8:46 AM, David G Johnston <
david.g.johnston@gmail.com> wrote:
<snip>
Being the indecisive nut that I am, I am going to do both <grin/>. I will
keep the current view. But when I update the rawdata, what I will then do
is:
drop table runinfo_table;
create table runinfo_table as select distinct * from runinfo;
I am fairly confident that there cannot be any duplicates in runinfo. But,
being paranoid as well, I will do the DISTINCT just to be sure. I may
change the VIEW to do that in the future, and remove it from the
preceding. Since the process which updates the rawdata table is automated
and runs on a Sunday, the time needed to recreate runinfo_table is not
relevant to me. So I get what I want, unless I update rawdata off schedule.
I cannot imagine why I would do that since the logs from which I create it
are generally only available after 17:00 local time on Sunday. Getting the
logs-to-date information for the time since the last dump is basically a
PITA and my current use is not critical. Actually, it is more a
"skunkworks" project of my own to produce a set of nice graphs, using R,
which _might_ turn out to be interesting to management, but the production
of which _will_ help me learn PostgreSQL and R better (hopefully).
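One small tweak to the rebuild step quoted above: adding IF EXISTS keeps the script from failing the very first time, when runinfo_table does not yet exist (a sketch, using the names from this thread):

```sql
-- Sketch of the weekly rebuild, made safe for a first run:
DROP TABLE IF EXISTS runinfo_table;
CREATE TABLE runinfo_table AS
SELECT DISTINCT * FROM runinfo;
```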
Many thanks.
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan
Maranatha! <><
John McKown
Hey,
I guess you know all about PL/R,
the R language extension for PostgreSQL.
It is very convenient, though be careful, as it has sometimes crashed my server.
Cheers,
Rémi-C
2014-07-16 3:42 GMT+02:00 John McKown <john.archie.mckown@gmail.com>:
<snip>