Need advise for database structure for non linear data.
Hi,
I need advise about a database structure. I need to capture data from the
web about one specific subject on few specific websites and insert that data
to a database. I have done this question here before, but I think I have not
explained very well.
The problem with this task is that the information is not linear, if I try
to design tables with fields for all possible data I will end up with many
row fields with NULL values. There are any problem with this(end up with
many row fields with NULL values)? Or should I user other kind of structure?
For example store the data in one field and that field containing an
associative array with data.
What I mean with non linear data is the following:
array(
'name' => 'Don',
'age' => '31'
);
array(
'name' => 'Peter',
'age' => '28',
'car' => 'ford',
'km' => '2000'
);
In a specific website search I will store only "name" and "age", and in
other website I will store "name", "age", "car" and "km".
I don't know If I explain weel my problem. My english is not very good.
Best Regards.
Hello,
Am 03.01.11 12:11, schrieb Andre Lopes:
Hi,
I need advise about a database structure. I need to capture data from the
web about one specific subject on few specific websites and insert that data
to a database. I have done this question here before, but I think I have not
explained very well.What I mean with non linear data is the following:
array(
'name' => 'Don',
'age' => '31'
);array(
'name' => 'Peter',
'age' => '28',
'car' => 'ford',
'km' => '2000'
);In a specific website search I will store only "name" and "age", and in
other website I will store "name", "age", "car" and "km".I don't know If I explain weel my problem. My english is not very good.
In theory, using a single table having three columns
(array-id,key,value) will suit your needs.
However, providing a simple key/value store is not the idea behind DBMS
like postgres ...
See:
http://en.wikipedia.org/wiki/NoSQL
http://en.wikipedia.org/wiki/Relational_database_management_system
Thomas
Andre Lopes wrote on 03.01.2011 12:11:
array(
'name' => 'Don',
'age' => '31'
);array(
'name' => 'Peter',
'age' => '28',
'car' => 'ford',
'km' => '2000'
);In a specific website search I will store only "name" and "age", and
in other website I will store "name", "age", "car" and "km".I don't know If I explain weel my problem. My english is not very
good.
That's exactly what the hstore data type supports:
http://www.postgresql.org/docs/current/static/hstore.html
Regards
Thomas
I can propose you something like this:
website(id int, url varchar);
attr_def (id int, name varchar);
attr_val (id int, def_id reference attr_def.id, website_id int
references website.id, value varchar);
If all of your attributes in website are single valued then you can
remove id from attr_val and use PK from website_id, def_id.
Depending on your needs one or many from following indexes:
attr_val(value) - search for attributes with value;
attr_val(def_id, value) - search for given attributes with values;
attr_val(website_id, def_id, value) - checks, if given site has
attribue, search by values for given site and attribute;
attr_val(def_id, website_id) - like above, without value searching;
attr_val(website_id, value) - search for attributes on given site with
value.
Probably you will use 2nd or 3rd index.
Example of search on website
select d.name, v.value from attre_def d join attr_val v on (v.def_id =
d.id) join website w on (v.website_id = w.id)
where d.name = 'xxxx' and w.url='http://somtehing'
This is common map structure.
Kind regards,
Radosław Smogura
On Mon, 03 Jan 2011 12:26:45 +0100, Thomas Schmidt
<postgres@stephan.homeunix.net> wrote:
Show quoted text
Hello,
Am 03.01.11 12:11, schrieb Andre Lopes:
Hi,
I need advise about a database structure. I need to capture data
from the
web about one specific subject on few specific websites and insert
that data
to a database. I have done this question here before, but I think I
have not
explained very well.What I mean with non linear data is the following:
array(
'name' => 'Don',
'age' => '31'
);array(
'name' => 'Peter',
'age' => '28',
'car' => 'ford',
'km' => '2000'
);In a specific website search I will store only "name" and "age", and
in
other website I will store "name", "age", "car" and "km".I don't know If I explain weel my problem. My english is not very
good.In theory, using a single table having three columns
(array-id,key,value) will suit your needs.
However, providing a simple key/value store is not the idea behind
DBMS like postgres ...
See:
http://en.wikipedia.org/wiki/NoSQL
http://en.wikipedia.org/wiki/Relational_database_management_systemThomas
Hello,
Am 03.01.11 12:46, schrieb Radosław Smogura:
I can propose you something like this:
website(id int, url varchar);
attr_def (id int, name varchar);
attr_val (id int, def_id reference attr_def.id, website_id int
references website.id, value varchar);
If all of your attributes in website are single valued then you can
remove id from attr_val and use PK from website_id, def_id.Depending on your needs one or many from following indexes:
attr_val(value) - search for attributes with value;
(...)
Probably you will use 2nd or 3rd index.
Example of search on website
select d.name, v.value from attre_def d join attr_val v on (v.def_id =
d.id) join website w on (v.website_id = w.id)
where d.name = 'xxxx' and w.url='http://somtehing'
Imho its hard - (if not impossible) to recommand a specific database
scheme (incl indexes) without knowing the applications taking plance
behind it.
Your schema is nice for specific querying, but might blow up if lots of
data is stored in the database (joins, index-building might be time
consuming).
On the other hand, google put some effort into their "BigTable"
http://en.wikipedia.org/wiki/BigTable for storing tons of data...
Thus - it all depends on the usage :-)
Thomas
Hi,
Thanks for the reply's. I was tempted to accept the Rodoslaw Smogura
proposal. There will be about 100 websites to capture data on daily basis.
Each website adds per day(average) 2 articles.
Thomas talked about the noSQL possibility. What do you think would be
better? I have no experience in noSQL and that could be a weakness.
Best Regards,
André
On Mon, Jan 3, 2011 at 11:58 AM, Thomas Schmidt <
postgres@stephan.homeunix.net> wrote:
Show quoted text
Hello,
Am 03.01.11 12:46, schrieb Radosław Smogura:
I can propose you something like this:
website(id int, url varchar);
attr_def (id int, name varchar);
attr_val (id int, def_id reference attr_def.id, website_id int references
website.id, value varchar);
If all of your attributes in website are single valued then you can remove
id from attr_val and use PK from website_id, def_id.Depending on your needs one or many from following indexes:
attr_val(value) - search for attributes with value;(...)
Probably you will use 2nd or 3rd index.
Example of search on website
select d.name, v.value from attre_def d join attr_val v on (v.def_id =
d.id) join website w on (v.website_id = w.id)
where d.name = 'xxxx' and w.url='http://somtehing'Imho its hard - (if not impossible) to recommand a specific database scheme
(incl indexes) without knowing the applications taking plance behind it.
Your schema is nice for specific querying, but might blow up if lots of
data is stored in the database (joins, index-building might be time
consuming).
On the other hand, google put some effort into their "BigTable"
http://en.wikipedia.org/wiki/BigTable for storing tons of data...Thus - it all depends on the usage :-)
Thomas
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello,
Am 03.01.11 14:14, schrieb Andre Lopes:
Hi,
Thanks for the reply's. I was tempted to accept the Rodoslaw Smogura
proposal. There will be about 100 websites to capture data on daily basis.
Each website adds per day(average) 2 articles.Thomas talked about the noSQL possibility. What do you think would be
better? I have no experience in noSQL and that could be a weakness.
Imho RDBMS do a very good job in managing data on a relational basis.
However - there are alternatives and use cases for 'em and there is no
holy grail...
Not having any experience is a good point for not using it in production
:-).
However, if you've time to spare, looking into database design
(plain-sql and not-only-sql) will help.
I don't think that you get in trouble with a few hundered rows per day,
but keep in mind, what queries are used.
Thomas
On 01/03/2011 12:11 PM, Andre Lopes wrote:
[snip]
The problem with this task is that the information is not linear, if I
try to design tables with fields for all possible data I will end up
with many row fields with NULL values. There are any problem with
this(end up with many row fields with NULL values)? Or should I user
other kind of structure? For example store the data in one field and
that field containing an associative array with data.
As far as I understand NULL values are not really stored and a column
with many NULLs is not a problem as such, but if it is part of an index
the index might not be very useful.
At least that's my understanding of how SQL databases work. If I got
this wrong I hope someone will correct me.
[snip]
/Fredric
Andre,
From a distant view of your problem I would like to vote for Thomas
Kellerer's proposal:
Maintain only the data you need (to enhance import/sync performance)
and use the hstore data type (as long as query performance is ok).
Yours, S.
2011/1/3 Fredric Fredricson <Fredric.Fredricson@bonetmail.com>:
Show quoted text
On 01/03/2011 12:11 PM, Andre Lopes wrote:
[snip]
The problem with this task is that the information is not linear, if I try
to design tables with fields for all possible data I will end up with many
row fields with NULL values. There are any problem with this(end up with
many row fields with NULL values)? Or should I user other kind of structure?
For example store the data in one field and that field containing an
associative array with data.As far as I understand NULL values are not really stored and a column with
many NULLs is not a problem as such, but if it is part of an index the index
might not be very useful.At least that's my understanding of how SQL databases work. If I got this
wrong I hope someone will correct me.[snip]
/Fredric
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general