UNIQUE constraint on character sequences
Dear list,
I would be pleased if you could share some thoughts with me on the
following: say I wish to maintain a table with all distinct character
sequences (variable length) showing series with strong similarities.
Example:
"abbbabacccdef"
"abbbabaccdcdf"
"abbbabaccdcgf"
...
"qwtrhdffdd"
...
"qwtrhdffdds"
...
"qwtrhdffddsspp"
"qwtrhdffddsspf"
"qwtrhdffddssph"
"qwtrhdffddsspL"
etc.
Think of them as ordered values (array-like), as a set having many values in
common, in the same "elements" (that is: positions; in my application, each
position has some particular meaning -- the sequence represents a set of
particular settings)
CREATE TABLE textseqs(txtseq TEXT)
What would be an efficient approach in enforcing a UNIQUE constraint?
I was thinking of using hashbuckets in a b-tree:
CREATE UNIQUE INDEX ON textseqs USING BTREE( hashtext(txtseq), txtseq )
This index would "cache" hashes for each row. Upon inserting of a new row,
traversing the index involves the comparison of two single integers for each
node in the b-tree, until the actual hash value (if it exists) was reached.
Then, only within that bucket (the hashes won't be unique), the more
expensive string comparing is required; involving a sequencial comparison of
(potentially maaaany) characters. Yet, within that bucket, the character
series may be expected to show stronger differences than a plain sorted list
of all values in the table would have, indexed by a b-tree index. Wouldn't
traversing such a plain (non-composite, single column, on: "txtseq") b-tree
index involve a sequencial comparison of (potentially many) characters *at
EACH NODE* of the tree ? Or am I mistaken that each node is filled with
actual values from the txtseq column?
Thank you for your input!
Cheers,
Rob
Could you index the reverse of the string so the unique part appears first?
On May 14, 2011, at 11:20, InterRob <rob.marjot@gmail.com> wrote:
Show quoted text
Dear list,
I would be pleased if you could share some thoughts with me on the following: say I wish to maintain a table with all distinct character sequences (variable length) showing series with strong similarities. Example:
"abbbabacccdef"
"abbbabaccdcdf"
"abbbabaccdcgf"
...
"qwtrhdffdd"
...
"qwtrhdffdds"
...
"qwtrhdffddsspp"
"qwtrhdffddsspf"
"qwtrhdffddssph"
"qwtrhdffddsspL"
etc.Think of them as ordered values (array-like), as a set having many values in common, in the same "elements" (that is: positions; in my application, each position has some particular meaning -- the sequence represents a set of particular settings)
CREATE TABLE textseqs(txtseq TEXT)
What would be an efficient approach in enforcing a UNIQUE constraint?
I was thinking of using hashbuckets in a b-tree:
CREATE UNIQUE INDEX ON textseqs USING BTREE( hashtext(txtseq), txtseq )This index would "cache" hashes for each row. Upon inserting of a new row, traversing the index involves the comparison of two single integers for each node in the b-tree, until the actual hash value (if it exists) was reached. Then, only within that bucket (the hashes won't be unique), the more expensive string comparing is required; involving a sequencial comparison of (potentially maaaany) characters. Yet, within that bucket, the character series may be expected to show stronger differences than a plain sorted list of all values in the table would have, indexed by a b-tree index. Wouldn't traversing such a plain (non-composite, single column, on: "txtseq") b-tree index involve a sequencial comparison of (potentially many) characters *at EACH NODE* of the tree ? Or am I mistaken that each node is filled with actual values from the txtseq column?
Thank you for your input!
Cheers,
Rob