Could use some advice on search architecture
Hello,
I'm contemplating what architecture I should use to make searching as
fast as possible given the information available and the search
requirements. Let me give some background first;
- The database contains products of can potentially have a lot of them
(up to about 3 to 5 million)
- Each product has about 30 different properties defined about them.
Things like what color they are etc. All these properties are enumerated
choices, so for instance for color there is a list of available static
never changing options of which one can be chosen for that product. This
is the same for all those 30 properties. Currently they are stored as
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have. For instance, for color,
the search could specify that it should return products that are either
red, blue or green.
- The products that match with the most properties should be in the top
of the search results
- If different products match with the same amount of properties, the
ordering should then be on the product that is most popular. There is
information in the database (and if need be also in the same table)
about how many times a product is sold.
- The results will be paginated per 15 products
The requirement is that these searches should be as fast as possible,
with a maximum of about 200 ms time taken for a search query.
What would be the best approach to this if I were to do this in the
database only? Should/can this be done with postgresql only or should I
look into other types of technology? (Lucene? Sphinx? others?)
Any advice on this would be greatly appreciated.
Thx in advance!
Ron
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
First, I would not restrict color to 30 colors, if anything like furniture
or clothing, etc. is involved. Colors are very important to consumers, and
exact colors are important. I would re-think my color selections.
Make sure you have indexes on all the appropriate columns, of course.
Susan
On Fri, Apr 18, 2014 at 6:59 AM, Ron Pasch <postgresql@ronpasch.nl> wrote:
Show quoted text
Hello,
I'm contemplating what architecture I should use to make searching as fast
as possible given the information available and the search requirements.
Let me give some background first;- The database contains products of can potentially have a lot of them (up
to about 3 to 5 million)
- Each product has about 30 different properties defined about them.
Things like what color they are etc. All these properties are enumerated
choices, so for instance for color there is a list of available static
never changing options of which one can be chosen for that product. This is
the same for all those 30 properties. Currently they are stored as
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1', 'option2',
etc..)
- It should be possible to search for products and provide properties that
the product SHOULD have, not must have. For instance, for color, the search
could specify that it should return products that are either red, blue or
green.
- The products that match with the most properties should be in the top of
the search results
- If different products match with the same amount of properties, the
ordering should then be on the product that is most popular. There is
information in the database (and if need be also in the same table) about
how many times a product is sold.
- The results will be paginated per 15 productsThe requirement is that these searches should be as fast as possible, with
a maximum of about 200 ms time taken for a search query.What would be the best approach to this if I were to do this in the
database only? Should/can this be done with postgresql only or should I
look into other types of technology? (Lucene? Sphinx? others?)Any advice on this would be greatly appreciated.
Thx in advance!
Ron
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 4/18/2014 8:59 AM, Ron Pasch wrote:
Hello,
I'm contemplating what architecture I should use to make searching as
fast as possible given the information available and the search
requirements. Let me give some background first;- The database contains products of can potentially have a lot of them
(up to about 3 to 5 million)
- Each product has about 30 different properties defined about them.
Things like what color they are etc. All these properties are enumerated
choices, so for instance for color there is a list of available static
never changing options of which one can be chosen for that product. This
is the same for all those 30 properties. Currently they are stored as
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have. For instance, for color,
the search could specify that it should return products that are either
red, blue or green.
- The products that match with the most properties should be in the top
of the search results
- If different products match with the same amount of properties, the
ordering should then be on the product that is most popular. There is
information in the database (and if need be also in the same table)
about how many times a product is sold.
- The results will be paginated per 15 productsThe requirement is that these searches should be as fast as possible,
with a maximum of about 200 ms time taken for a search query.What would be the best approach to this if I were to do this in the
database only? Should/can this be done with postgresql only or should I
look into other types of technology? (Lucene? Sphinx? others?)Any advice on this would be greatly appreciated.
Thx in advance!
Ron
As we are PG users, on a PG list, we are gonna recommend PG, obviously. :-)
Actually though, I recommend PG.
- The products that match with the most properties should be in the top
of the search results
That kinda query is going to be really difficult, I think, regardless of
what you use. To find that you'll have to measure every product (all 5
million) and then sort the results.
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)
So, you have something like:
CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');
create table product
(
id serial,
name text,
color colortype,
size sizetype,
...
);
I assume the problem is you don't want to index all 30 properties? That
makes sense.
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have.
I don't understand this. Say you have a sprocket in red and green. Do
you want to search for:
select * from product where name = 'sprocket' and (color = 'red' or
color = 'green')
Or do you want something else? Does the user say they'd "prefer" blue,
but will take whatever else you have?
Do you search for some properties exactly and some "preferred"?
Perhaps you could describe a little more how you want to query the
database? Or, maybe, what your user's are searching for?
-Andy
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 18/04/2014 21:24, Andy Colson wrote:
On 4/18/2014 8:59 AM, Ron Pasch wrote:
Hello,
I'm contemplating what architecture I should use to make searching as
fast as possible given the information available and the search
requirements. Let me give some background first;- The database contains products of can potentially have a lot of them
(up to about 3 to 5 million)
- Each product has about 30 different properties defined about them.
Things like what color they are etc. All these properties are enumerated
choices, so for instance for color there is a list of available static
never changing options of which one can be chosen for that product. This
is the same for all those 30 properties. Currently they are stored as
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have. For instance, for color,
the search could specify that it should return products that are either
red, blue or green.
- The products that match with the most properties should be in the top
of the search results
- If different products match with the same amount of properties, the
ordering should then be on the product that is most popular. There is
information in the database (and if need be also in the same table)
about how many times a product is sold.
- The results will be paginated per 15 productsThe requirement is that these searches should be as fast as possible,
with a maximum of about 200 ms time taken for a search query.What would be the best approach to this if I were to do this in the
database only? Should/can this be done with postgresql only or should I
look into other types of technology? (Lucene? Sphinx? others?)Any advice on this would be greatly appreciated.
Thx in advance!
Ron
As we are PG users, on a PG list, we are gonna recommend PG,
obviously. :-)Actually though, I recommend PG.
- The products that match with the most properties should be in the top
of the search resultsThat kinda query is going to be really difficult, I think, regardless
of what you use. To find that you'll have to measure every product
(all 5 million) and then sort the results.enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)So, you have something like:
CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');create table product
(
id serial,
name text,
color colortype,
size sizetype,
...
);I assume the problem is you don't want to index all 30 properties?
That makes sense.- It should be possible to search for products and provide properties
that the product SHOULD have, not must have.I don't understand this. Say you have a sprocket in red and green.
Do you want to search for:select * from product where name = 'sprocket' and (color = 'red' or
color = 'green')Or do you want something else? Does the user say they'd "prefer"
blue, but will take whatever else you have?Do you search for some properties exactly and some "preferred"?
Perhaps you could describe a little more how you want to query the
database? Or, maybe, what your user's are searching for?-Andy
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.
Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniques
So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)
This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001
Robin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi,
Seems that this blog post is worth reading in your case
http://hlinnaka.iki.fi/2014/03/28/gin-as-a-substitute-for-bitmap-indexes/
regards,
Marc Mamin
________________________________________
Von: pgsql-general-owner@postgresql.org [pgsql-general-owner@postgresql.org]" im Auftrag von "Robin [robinstc@live.co.uk]
Gesendet: Samstag, 19. April 2014 09:38
An: pgsql-general@postgresql.org
Betreff: Re: [GENERAL] Could use some advice on search architecture
On 18/04/2014 21:24, Andy Colson wrote:
On 4/18/2014 8:59 AM, Ron Pasch wrote:
Hello,
I'm contemplating what architecture I should use to make searching as
fast as possible given the information available and the search
requirements. Let me give some background first;- The database contains products of can potentially have a lot of them
(up to about 3 to 5 million)
- Each product has about 30 different properties defined about them.
Things like what color they are etc. All these properties are enumerated
choices, so for instance for color there is a list of available static
never changing options of which one can be chosen for that product. This
is the same for all those 30 properties. Currently they are stored as
enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have. For instance, for color,
the search could specify that it should return products that are either
red, blue or green.
- The products that match with the most properties should be in the top
of the search results
- If different products match with the same amount of properties, the
ordering should then be on the product that is most popular. There is
information in the database (and if need be also in the same table)
about how many times a product is sold.
- The results will be paginated per 15 productsThe requirement is that these searches should be as fast as possible,
with a maximum of about 200 ms time taken for a search query.What would be the best approach to this if I were to do this in the
database only? Should/can this be done with postgresql only or should I
look into other types of technology? (Lucene? Sphinx? others?)Any advice on this would be greatly appreciated.
Thx in advance!
Ron
As we are PG users, on a PG list, we are gonna recommend PG,
obviously. :-)Actually though, I recommend PG.
- The products that match with the most properties should be in the top
of the search resultsThat kinda query is going to be really difficult, I think, regardless
of what you use. To find that you'll have to measure every product
(all 5 million) and then sort the results.enumerated types (CREATE TYPE propertyvalue AS ENUM ('option1',
'option2', etc..)So, you have something like:
CREATE TYPE colortype AS ENUM ('red', 'green', 'blue');
CREATE TYPE sizetype AS ENUM ('small', 'medium', 'large');create table product
(
id serial,
name text,
color colortype,
size sizetype,
...
);I assume the problem is you don't want to index all 30 properties?
That makes sense.- It should be possible to search for products and provide properties
that the product SHOULD have, not must have.I don't understand this. Say you have a sprocket in red and green.
Do you want to search for:select * from product where name = 'sprocket' and (color = 'red' or
color = 'green')Or do you want something else? Does the user say they'd "prefer"
blue, but will take whatever else you have?Do you search for some properties exactly and some "preferred"?
Perhaps you could describe a little more how you want to query the
database? Or, maybe, what your user's are searching for?-Andy
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.
Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniques
So consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)
This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001
Robin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
bottom post
On 19/04/2014 12:46, R. Pasch wrote:
On 19-4-2014 9:38, Robin wrote:
Well, given that there are known limited attributes, this is the type
of application that really really suits a column oriented database,
such as Sybase IQ (now sold by SAP). Its a neat product that scales.
Great performance with drag'n'drop analytics.Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniquesSo consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001Robin
I stopped reading when I heard the word "sold by SAP" ;-) This project
is solely build with open-source and freely available software.I've been thinking about using a binary data representation but didn't
come to a solution to this specific problem quite yet. Per property of
a product, only one bit would be 1 and the rest would be 0. What would
a query look like to match all products that have a bit in the correct
position?Say for instance these are a couple records (and yes, property values
can be null as well)title, property1, property2, property3
================================
product1, 0000 0001, 0000 0010, NULL
product2, 0000 0100, 0100 0000, 0010 0000
product3, 0010 0000, 0010 0000, 0100 0000Say that I would like to retrieve the products that either have
property1 as 0010 0000, 1000 000 or 0000 0001. Combined that would be
0010 1001 and would have to match product1 and product3 as they both
have their individual bit matching one of the bits being asked for.
What would a where statement look like using this type of binary
representation?If that would be fairly simple to do and fast (most important factor)
then I could do an OR construction on all property columns and have
something count the amount of properties that actually matched. Is
that something you can do with a binary operator of some sort as well?
Count the amount of overlapping bits?Say for instance I have a binary value of 0110 0101 and another binary
value of 1100 0100, how could I found out how many bits matched? (in
this case the number of matching bits would be 2)
I understand the reluctance to pay SAP-style rates, as a longtime DB
user, I have learned some 'charm' techniques.
However, I poked around a bit for alternatives, as I do like the
column-oriented approach, and found something called - /MonetDB /
<http://www.monetdb.com/Home/Features>- it apparently has a column-store
db kernel, and is open source - I suggest you have a look, if it does
what it says on the label, then it looks like a find.
There is a discussion of bitmask-trickiness here also dealing with
colours
<http://postgresql.1045698.n5.nabble.com/Bitmask-trickiness-td1865258.html>
Robin
<http://postgresql.1045698.n5.nabble.com/Bitmask-trickiness-td1865258.html>
Import Notes
Reply to msg id not found: 53526207.3090005@paronix.nl
Postgresql has 2 column store, 1-in memory(cant remember the name) and
http://www.citusdata.com/blog/76-postgresql-columnar-store-for-analytics
On Sat, Apr 19, 2014 at 2:10 PM, Robin <robinstc@live.co.uk> wrote:
Show quoted text
bottom post
On 19/04/2014 12:46, R. Pasch wrote:On 19-4-2014 9:38, Robin wrote:
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniquesSo consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001Robin
I stopped reading when I heard the word "sold by SAP" ;-) This project is
solely build with open-source and freely available software.I've been thinking about using a binary data representation but didn't
come to a solution to this specific problem quite yet. Per property of a
product, only one bit would be 1 and the rest would be 0. What would a
query look like to match all products that have a bit in the correct
position?Say for instance these are a couple records (and yes, property values can
be null as well)title, property1, property2, property3
================================
product1, 0000 0001, 0000 0010, NULL
product2, 0000 0100, 0100 0000, 0010 0000
product3, 0010 0000, 0010 0000, 0100 0000Say that I would like to retrieve the products that either have property1
as 0010 0000, 1000 000 or 0000 0001. Combined that would be 0010 1001 and
would have to match product1 and product3 as they both have their
individual bit matching one of the bits being asked for. What would a where
statement look like using this type of binary representation?If that would be fairly simple to do and fast (most important factor) then
I could do an OR construction on all property columns and have something
count the amount of properties that actually matched. Is that something you
can do with a binary operator of some sort as well? Count the amount of
overlapping bits?Say for instance I have a binary value of 0110 0101 and another binary
value of 1100 0100, how could I found out how many bits matched? (in this
case the number of matching bits would be 2)I understand the reluctance to pay SAP-style rates, as a longtime DB user,
I have learned some 'charm' techniques.However, I poked around a bit for alternatives, as I do like the
column-oriented approach, and found something called - *MonetDB *<http://www.monetdb.com/Home/Features>-
it apparently has a column-store db kernel, and is open source - I suggest
you have a look, if it does what it says on the label, then it looks like a
find.There is a discussion of bitmask-trickiness here also dealing with colours<http://postgresql.1045698.n5.nabble.com/Bitmask-trickiness-td1865258.html>
Robin
<http://postgresql.1045698.n5.nabble.com/Bitmask-trickiness-td1865258.html>
On 04/19/2014 06:26 AM, Ron Pasch wrote:
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have.I don't understand this. Say you have a sprocket in red and green. Do you want to search for:
select * from product where name = 'sprocket' and (color = 'red' or color = 'green')
Or do you want something else? Does the user say they'd "prefer" blue, but will take whatever else you have?
Do you search for some properties exactly and some "preferred"?
Perhaps you could describe a little more how you want to query the database? Or, maybe, what your user's are searching for?
-Andy
Yes, the user can prefer certain properties and the products that match most of the properties should be in the top of the results, but if a product doesn't match all of them but just some of them, they should still be returned, but lower in the results.
I'm seriously wondering if doing this solely with postgres is even possible without having long execution times. I've done some tests with 5 million records and just doing the "or" construction you mentioned above, which resulted in 600 to 900 ms queries and returning only those records of which all properties match at least one selected value.
I was thinking that perhaps using a search engine like lucene or sphinx would be more appropriate, but then I wonder what I would exactly be indexing and how I would be querying that, but that's a question for a different mailing list ;-)
Please keep the list cc'd, so others can help as well.
Yeah, doing a bunch of or's is gonna have to test all 5 million products.
I wonder if there is a way we can treat this like a two step process.
1) cut down the number of products
2) sort them by #matches, popularity, etc
You've talked about #2, but how about #1. Is there any way to either include or exclude a product? Users don't just ask for red, they ask for "tires (maybe red)". Not all 5 million products are tires, right?
-Andy
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Import Notes
Reply to msg id not found: 53525D5F.8090900@ronpasch.nl
On 04/19/2014 06:26 AM, Ron Pasch wrote:
- It should be possible to search for products and provide properties
that the product SHOULD have, not must have.I don't understand this. Say you have a sprocket in red and green. Do you want to search for:
select * from product where name = 'sprocket' and (color = 'red' or color = 'green')
Or do you want something else? Does the user say they'd "prefer" blue, but will take whatever else you have?
Do you search for some properties exactly and some "preferred"?
Perhaps you could describe a little more how you want to query the database? Or, maybe, what your user's are searching for?
-Andy
Yes, the user can prefer certain properties and the products that match most of the properties should be in the top of the results, but if a product doesn't match all of them but just some of them, they should still be returned, but lower in the results.
I'm seriously wondering if doing this solely with postgres is even possible without having long execution times. I've done some tests with 5 million records and just doing the "or" construction you mentioned above, which resulted in 600 to 900 ms queries and returning only those records of which all properties match at least one selected value.
I don't think that OR clauses are the right way as the aim is to count how many attributes do match the search.
basically a standard approach would look like:
SELECT pID, sum(match) as matches FROM
(
selct pID, 1 as match from Products where color ='pink'
UNION ALL
selct pID, 1 as match from Products where size ='XXL'
...
)foo
GROUP BY pID order by matches DESC
How many distinct attributes are involved ? ( 15 colors + 9 sizes + ....)
Marc
I was thinking that perhaps using a search engine like lucene or sphinx would be more appropriate, but then I wonder what I would exactly be indexing and how I would be querying that, but that's a question for a different mailing list ;-)
Please keep the list cc'd, so others can help as well.
Yeah, doing a bunch of or's is gonna have to test all 5 million products.
I wonder if there is a way we can treat this like a two step process.
1) cut down the number of products
2) sort them by #matches, popularity, etc
You've talked about #2, but how about #1. Is there any way to either include or exclude a product? Users don't just ask for red, they ask for "tires (maybe red)". Not all 5 million products are tires, right?
-Andy
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Because this is a topic that interests me I have done some digging.
MonetDB <http://www.monetdb.com/Home>is a proper column-store DBMS. It
is used on some decent sized projects, several of which are EU-funded.
Digging a bit deeper I have discovered that a PostgreSQL Foreign Data
Wrapper for MonetDB has been created - monetdb_fdw.
The fdw is written up here
<http://pgsnaga.blogspot.co.uk/2013/07/monetdbfdw-postgresql-meets-column.html>
There is a video that shows what happens when a query is executed in
PostgreSQL and MonetDB
The video shows an analytical query being processed
1. Using PostgreSQL - 177 seconds
2. Using MonetDB - 8 seconds
3. Using a remote MonetDB server through MonetDB FDW - 1 second
I think its worth investigating further
Robin St.Clair
the other is named imcs.
imcs is mainly a memory database,it is very fast for olap,because of multi
threads parallel query plan and column storage.for billions of data,most
group by like querys return within 2s.
we use it for several weeks and happy for the performance.
jov
在 2014-4-19 下午8:27,"Dorian Hoxha" <dorian.hoxha@gmail.com>写道:
Show quoted text
Postgresql has 2 column store, 1-in memory(cant remember the name) and
http://www.citusdata.com/blog/76-postgresql-columnar-store-for-analyticsOn Sat, Apr 19, 2014 at 2:10 PM, Robin <robinstc@live.co.uk> wrote:
bottom post
On 19/04/2014 12:46, R. Pasch wrote:On 19-4-2014 9:38, Robin wrote:
Well, given that there are known limited attributes, this is the type of
application that really really suits a column oriented database, such as
Sybase IQ (now sold by SAP). Its a neat product that scales. Great
performance with drag'n'drop analytics.Unless you can charm IQ out of SAP (it has been known to happen), you
might have to look at some other techniquesSo consider some binary data representation
Red - 1 (0000 0001)
Orange - 2 (0000 0010)
Yellow - 4 (0000 0100)
Green - 8 (0000 1000)
Blue - 16 (0001 0000)
Indigo - 32 (0010 0000)
Violet - 64 (0100 0000)This way, you can encode several colours in 1 value
Red or Green or Indigo = 1 + 8 + 32 = 41 = 0010 1001Robin
I stopped reading when I heard the word "sold by SAP" ;-) This project
is solely build with open-source and freely available software.I've been thinking about using a binary data representation but didn't
come to a solution to this specific problem quite yet. Per property of a
product, only one bit would be 1 and the rest would be 0. What would a
query look like to match all products that have a bit in the correct
position?Say for instance these are a couple records (and yes, property values can
be null as well)title, property1, property2, property3
================================
product1, 0000 0001, 0000 0010, NULL
product2, 0000 0100, 0100 0000, 0010 0000
product3, 0010 0000, 0010 0000, 0100 0000Say that I would like to retrieve the products that either have property1
as 0010 0000, 1000 000 or 0000 0001. Combined that would be 0010 1001 and
would have to match product1 and product3 as they both have their
individual bit matching one of the bits being asked for. What would a where
statement look like using this type of binary representation?If that would be fairly simple to do and fast (most important factor)
then I could do an OR construction on all property columns and have
something count the amount of properties that actually matched. Is that
something you can do with a binary operator of some sort as well? Count the
amount of overlapping bits?Say for instance I have a binary value of 0110 0101 and another binary
value of 1100 0100, how could I found out how many bits matched? (in this
case the number of matching bits would be 2)I understand the reluctance to pay SAP-style rates, as a longtime DB
user, I have learned some 'charm' techniques.However, I poked around a bit for alternatives, as I do like the
column-oriented approach, and found something called - *MonetDB *<http://www.monetdb.com/Home/Features>-
it apparently has a column-store db kernel, and is open source - I suggest
you have a look, if it does what it says on the label, then it looks like a
find.There is a discussion of bitmask-trickiness here also dealing with
colours<http://postgresql.1045698.n5.nabble.com/Bitmask-trickiness-td1865258.html>Robin
<http://postgresql.1045698.n5.nabble.com/Bitmask-trickiness-td1865258.html>
I ended up running some tests using 5 million rows of products. I used
about 5 properties that a product should always be matched to, and then
I used the following in the select;
(CASE property1 in (option1, option2, option3, etc) WHEN TRUE THEN 1
ELSE 0 END)
+ (CASE property2 in (option1, option2, option3, etc) WHEN TRUE THEN 1
ELSE 0 END)
+ (CASE property3 in (option1, option2, option3, etc) WHEN TRUE THEN 1
ELSE 0 END)
...
AS numberOfMatchingProperties
That way I can use the number of matching properties in the order by
clause and have the properties that must always match filter out the
bulk of the 5 million records.
The tests that I've done return around 100.000 records in about 100 to
150 milliseconds using this technique, and using OFFSET and LIMIT to
paginate those by about 15 records each time is very very fast.
This I can live with :) Thx for letting me pick your brains on this a
little.
Cheers,
Ron
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Import Notes
Reply to msg id not found: BLU0-SMTP17E1B5E736D79767513B7FE25F0@phx.gbl