Proposal to add a new URL data type.
Hi everyone,
My name is Alexander Borisov, and I want to propose/discuss
adding a new URL type as an extension in PostgreSQL contrib.
I think everyone here knows what URLs/URIs are and their basics.
If someone is interested in the basics, you can read the original
RFC 1738 [1]https://datatracker.ietf.org/doc/html/rfc1738.
Why, and for whom?
Postgres users often store URLs in the database. For example, they
store links to their pages on the web, or analyze users' posts and extract
links for further storage and analysis. Naturally, there is a need to
compare, sort, build indexes, get statistics on individual parts of
the URL: hosts, protocols, path and so on.
Adding a new URL type will standardize the work with URLs in Postgres
and provide simple tools for manipulating the new type.
URLs in the world
A URL is one of the basic concepts of the web: it makes it possible to
refer to resources on the web (and beyond). At the moment
there are two main directions in specifications:
RFC: RFC 3986 [2]https://datatracker.ietf.org/doc/html/rfc3986 by IETF.
WHATWG: WHATWG URL [3]https://url.spec.whatwg.org/, maintained by the WHATWG community (Apple,
Google, Mozilla, Microsoft), the developers of modern browsers.
It's worth noting that the WHATWG URL is a living standard (like HTML).
It changes, improves, fixes bugs, but it doesn't break backwards
compatibility.
What is the main difference between WHATWG and RFC 3986?
There is a difference, but it's not that significant, except for host
parsing. To begin with, let's define that a full URL has the format:
Format: [scheme][divider][user:pass][host][port][path][query][fragment]
Example: https://root:qwerty@example.com:8080/path/to?abc=xyz#fragment
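As a rough illustration of this decomposition, Python's standard urllib (whose parsing is in the RFC 3986 tradition) splits the example above into the same components:

```python
from urllib.parse import urlsplit

# Decompose the example URL from above into its components.
u = urlsplit("https://root:qwerty@example.com:8080/path/to?abc=xyz#fragment")

print(u.scheme)    # https
print(u.username)  # root
print(u.password)  # qwerty
print(u.hostname)  # example.com
print(u.port)      # 8080
print(u.path)      # /path/to
print(u.query)     # abc=xyz
print(u.fragment)  # fragment
```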
Now we can look at SOME of the differences in parsing:
[user:password]
Source: https://root:qwerty@1234@example.com/
RFC 3986: user: root, password: qwerty.
Parses until it encounters the first '@' character.
WHATWG: user: root, password: qwerty@1234.
Parses until it encounters the last '@' character.
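The two behaviors amount to splitting the authority at the first vs. the last '@'. A minimal sketch (the helper names are hypothetical, and note that under the RFC grammar the leftover '@' in the host would then fail host validation):

```python
def split_userinfo_rfc(authority: str):
    # RFC 3986 style: userinfo ends at the first '@'.
    userinfo, _, host = authority.partition("@")
    user, _, password = userinfo.partition(":")
    return user, password, host

def split_userinfo_whatwg(authority: str):
    # WHATWG style: userinfo extends to the last '@'.
    userinfo, _, host = authority.rpartition("@")
    user, _, password = userinfo.partition(":")
    return user, password, host

print(split_userinfo_rfc("root:qwerty@1234@example.com"))
# ('root', 'qwerty', '1234@example.com')
print(split_userinfo_whatwg("root:qwerty@1234@example.com"))
# ('root', 'qwerty@1234', 'example.com')
```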
[host]
Source: https://exаmple.com/ (а — U+0430)
RFC 3986: https://exаmple.com/.
Considers that the host is already Punycode encoded.
WHATWG: https://xn--exmple-4nf.com/.
Uses Unicode TR46 (IDNA, NFC normalization, convert non-ASCII
characters into Punycode, validation and so on).
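The Punycode step at the heart of that conversion can be reproduced with Python's built-in punycode codec. This is only the raw encoding step, not full TR46 processing (no NFC normalization or validation):

```python
# "exаmple" with a Cyrillic а (U+0430) in place of the Latin "a".
label = "ex\u0430mple"

# The punycode codec yields the encoded label body; an IDNA host
# label prefixes it with "xn--".
encoded = label.encode("punycode").decode("ascii")
print("xn--" + encoded + ".com")  # xn--exmple-4nf.com
```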
[path]
Source: https://example.com/a/./b/../c
RFC 3986: https://example.com/a/./b/../c.
Cannot contain spaces (U+0020). U+005C Reverse solidus (\) is not
treated as U+002F Forward slash (/).
WHATWG: https://example.com/a/c.
Spaces are percent-encoded as %20. U+005C Reverse solidus (\) is
treated as U+002F Forward slash (/).
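The /a/./b/../c to /a/c change is standard dot-segment removal, which can be sketched with a segment stack. This is a simplification of the WHATWG path state machine, for absolute paths only, ignoring trailing-slash edge cases:

```python
def remove_dot_segments(path: str) -> str:
    # "." segments are dropped; ".." pops the previous segment.
    out = []
    for seg in path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            if out:
                out.pop()
            continue
        out.append(seg)
    return "/".join(out)

print(remove_dot_segments("/a/./b/../c"))  # /a/c
```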
We can see that there are differences, and the most significant ones
are related to host parsing and path normalization.
Let's take a look at what specifications popular programming languages
and applications rely on:
Node.js: WHATWG URL (standard module, node:url).
Before that, RFC 3986 was used; it is now a legacy API.
Python 3: RFC 3986 (standard library, urllib).
Java: RFC 3986 (standard class, java.net.URL).
PHP: a future version 8.5 plans to implement Uri\WhatWgUri and
Uri\Rfc3986Uri.
C++: WHATWG URL. The Ada URL parser is a popular one.
Rust: WHATWG URL (standard crate, url).
WebKit, Blink, Gecko: WHATWG URL.
curl: RFC 3986.
I would argue that modern and popular programming languages choose the
WHATWG specification. PHP decided to implement both approaches at once.
Browser engines understandably use the WHATWG specification, but they
write the specification themselves.
Proposal
I propose to add a new data type for PostgreSQL as an extension, in
contrib. Name the new type URL and use the WHATWG URL specification to
implement the new type. The choice of URL parsing specification is
justified by the following factors:
1. Living specification that adapts to modern realities.
2. The developers of modern browsers are behind the specification.
3. Increasing popularity.
Implementation
Attached to this email is a patch that contains:
1. URL type implementation with functions to get and modify individual
parts of the URL.
2. Code to implement URL parsing according to WHATWG specification.
Don't be alarmed: its style differs from that of Postgres code.
This is a TEST-THE-WATERS patch, an opportunity to visualize, try, and discuss.
I should note that the code of the URL parsing implementation is taken
from my project (written by me, open source), so it differs in style
from the one used in Postgres.
Initially the code has its own implementation of Unicode TR46, as well
as its own encode/decode implementation for encodings (40 encodings).
But in order not to scare the audience with the size of the patch
I decided to use ICU in the current version.
You can read more about the implemented functions in README and url.c.
Final implementation
I see the final implementation in one of the options:
Option one
Rewrite URL parsing for Postgres style and functions. Write an
implementation of Unicode TR46 (so as not to use ICU IDNA).
This is not a small task, but not rocket science.
Option two
Rewrite URL parsing for Postgres style and functions.
Use ICU Unicode TR46.
Option three
Use the code as it is, use ICU Unicode TR46.
The simplest task.
From myself
I am willing to take care of the implementation of the new data type
and its further support. If it's of interest to the community.
[1]: https://datatracker.ietf.org/doc/html/rfc1738
[2]: https://datatracker.ietf.org/doc/html/rfc3986
[3]: https://url.spec.whatwg.org/
--
Alexander Borisov
Attachments:
0001-Add-url-data-type-to-contrib.patch (text/plain; charset=UTF-8)
On 05.12.24 15:01, Alexander Borisov wrote:
Postgres users often store URLs in the database. As an example, they
provide links to their pages on the web, analyze users posts and get
links for further storage and analysis. Naturally, there is a need to
compare, sort, build indexes, get statistics on individual parts of
the URL: hosts, protocols, path and so on.
Adding a new URL type will standardize the work with URLs in Postgres
and provide simple tools for manipulating the new type.
Have you seen this: https://github.com/petere/pguri ?
The difference there is that it uses an external library for parsing the
URLs, which makes the code much smaller. Overall, the functionality
looks pretty similar to yours at first glance.
On 05/12/2024 15:59, Peter Eisentraut wrote:
On 05.12.24 15:01, Alexander Borisov wrote:
Postgres users often store URLs in the database. As an example, they
provide links to their pages on the web, analyze users posts and get
links for further storage and analysis. Naturally, there is a need to
compare, sort, build indexes, get statistics on individual parts of
the URL: hosts, protocols, path and so on.
Adding a new URL type will standardize the work with URLs in Postgres
and provide simple tools for manipulating the new type.
Have you seen this: https://github.com/petere/pguri ?
The difference there is that it uses an external library for parsing
the URLs, which makes the code much smaller. Overall, the
functionality looks pretty similar to yours at first glance.
+1 for adding this to contrib
On 05.12.2024 17:59, Peter Eisentraut wrote:
On 05.12.24 15:01, Alexander Borisov wrote:
Postgres users often store URLs in the database. As an example, they
provide links to their pages on the web, analyze users posts and get
links for further storage and analysis. Naturally, there is a need to
compare, sort, build indexes, get statistics on individual parts of
the URL: hosts, protocols, path and so on.
Adding a new URL type will standardize the work with URLs in Postgres
and provide simple tools for manipulating the new type.
Have you seen this: https://github.com/petere/pguri ?
The difference there is that it uses an external library for parsing the
URLs, which makes the code much smaller. Overall, the functionality
looks pretty similar to yours at first glance.
Hi Peter,
I looked at your implementation of the URI extension. You are using
a third-party library, uriparser, for URI/URL parsing. The uriparser
library is based on the RFC 3986 specification, which I cite in
comparison to WHATWG in my email. The names of the functions to get the
individual parts of the URL will naturally be the same; that's how
URLs/URIs work. But you have to look at the purpose and perspective here.
As I've written before, there is a difference between parsing URLs
according to the RFC 3986 specification and WHATWG URLs. This is
especially true for host. Here are a couple more examples.
The specifications describe character encoding differently; this
applies to userinfo (username/password), path, query, and fragment.
The RFC 3986 standard encodes characters in the same way everywhere.
The characters to encode are: < > " ` \r \n \t { } | \ ^ '.
The WHATWG URL standard takes a more selective and subtle approach.
For example, path may contain the character "|", but userinfo may
not (it is encoded).
The WHATWG specification also requires that tabs and newlines be removed
from URLs before parsing. The WHATWG character encoding gradation can
be found in the specification [1]https://url.spec.whatwg.org/#c0-control-percent-encode-set.
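As an illustration of that gradation, here is a sketch of component-specific percent-encode sets in the spirit of the WHATWG spec. The sets below are abbreviated, not the full definitions (the real ones also cover all C0 controls and non-ASCII handling):

```python
# Abbreviated percent-encode sets, modeled on the WHATWG URL spec:
# each component's set extends the previous one.
FRAGMENT_SET = set(' "<>`')
PATH_SET = FRAGMENT_SET | set("#?{}")
USERINFO_SET = PATH_SET | set("/:;=@[\\]^|")

def percent_encode(s: str, encode_set: set) -> str:
    # C0 controls (< U+0020) are always encoded.
    return "".join(
        f"%{ord(ch):02X}" if ch in encode_set or ord(ch) < 0x20 else ch
        for ch in s
    )

# '|' survives in a path but is encoded in userinfo.
print(percent_encode("a|b", PATH_SET))      # a|b
print(percent_encode("a|b", USERINFO_SET))  # a%7Cb
```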
In addition to functions to retrieve individual parts, the WHATWG URL
standard describes an API for changing them: scheme, userinfo, host,
hostname, port, path, query, fragment. It is not simply one value
being replaced by another; there is a certain logic, which is not always
obvious. For example, try to replace the scheme; let's take the URL
https://example.com/.
This URL contains a special scheme (any others are not special); there
are only six of them: ftp, file, http, https, ws, wss. And it is
impossible to replace a special scheme with a non-special one. More
precisely, the URL will be returned with the special scheme, i.e. without
changes. This is how you can check it with the patch I have provided:
Example: select url_scheme_set('https://example.com/'::url, 'wss');
Result: wss://example.com/
Example: select url_scheme_set('https://example.com/'::url, 'myown');
Result: https://example.com/
Example: select url_scheme_set('best://example.com/'::url, 'myown');
Result: myown://example.com/
Example: select url_scheme_set('best://example.com/'::url, 'https');
Result: best://example.com/
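The underlying rule is that a scheme change is only allowed within the same category: special to special, or non-special to non-special. A hypothetical sketch of just that check (the real WHATWG scheme setter has further conditions, e.g. around file URLs, ports, and credentials):

```python
SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}

def set_scheme(current: str, new: str) -> str:
    # Reject the change if it crosses the special/non-special boundary;
    # the URL then keeps its current scheme, as in the examples above.
    if (current in SPECIAL_SCHEMES) != (new in SPECIAL_SCHEMES):
        return current
    return new

print(set_scheme("https", "wss"))    # wss
print(set_scheme("https", "myown"))  # https
print(set_scheme("best", "myown"))   # myown
print(set_scheme("best", "https"))   # best
```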
In addition, WHATWG validates URLs during parsing and reports
non-critical errors [2]https://url.spec.whatwg.org/#writing. If such errors occur, parsing continues.
However, in my implementation I don't output these errors; I just
haven't figured out how to do it correctly in SQL (as a NOTICE?).
Without going further into the differences between the specifications,
I could say simply: RFC 3986 is obsolete. For example, node.js has
labeled the API based on it as Legacy (they use WHATWG).
Setting specifications aside, let's consider our implementation
approaches. You parse the URL anew for every request
(even to retrieve the fragment). In my implementation I proceed from
the assumption that reads of URLs will significantly outnumber
changes. Parsing is done once on input, the result is saved in
a special format, and later the necessary parts of the URL
(or the whole URL) are retrieved. Also please note that there are
no dependencies on third-party libraries (the ICU dependency can also
be dropped).
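The read-mostly design can be sketched as a type that parses once at construction and answers getters from the stored decomposition. The field names here are hypothetical illustrations, not the patch's actual on-disk format, and urlsplit stands in for the WHATWG parser:

```python
from urllib.parse import urlsplit

class StoredURL:
    """Parse once on input; serve reads from the saved decomposition."""

    def __init__(self, raw: str):
        u = urlsplit(raw)  # one-time parse at write time
        self.scheme = u.scheme
        self.host = u.hostname
        self.port = u.port
        self.path = u.path
        self.query = u.query
        self.fragment = u.fragment

    # Getters are plain field reads: no re-parsing per request.
    def get_host(self):
        return self.host

url = StoredURL("https://example.com:8080/path?a=1#top")
print(url.get_host())  # example.com
```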
All currently available functions and examples can be seen in
the README file, in the patch.
[1]: https://url.spec.whatwg.org/#c0-control-percent-encode-set
[2]: https://url.spec.whatwg.org/#writing
--
Alexander Borisov
On 6 Dec 2024, at 13:59, Alexander Borisov <lex.borisov@gmail.com> wrote:
As I've written before, there is a difference between parsing URLs
according to the RFC 3986 specification and WHATWG URLs. This is
especially true for host. Here are a couple more examples.
As someone who wears another open-source hat which is heavily involved in
parsing URLs I cannot stress enough how much I think postgres should avoid
this. The example url http://http://http://@http://http://?http://#http:// is
a valid url, but is rejected by a number of implementations and parsed
differently by most that accept it.
A URL datatype is a *good idea* but one which I personally believe is best
handled as an external extension.
--
Daniel Gustafsson
Hi Daniel,
On 06.12.2024 16:46, Daniel Gustafsson wrote:
On 6 Dec 2024, at 13:59, Alexander Borisov <lex.borisov@gmail.com> wrote:
As I've written before, there is a difference between parsing URLs
according to the RFC 3986 specification and WHATWG URLs. This is
especially true for host. Here are a couple more examples.
As someone who wears another open-source hat which is heavily involved in
parsing URLs I cannot stress enough how much I think postgres should avoid
this. The example url http://http://http://@http://http://?http://#http:// is
a valid url, but is rejected by a number of implementations and parsed
differently by most that accept it.
Your example is valid, yes; it looks scary and might catch someone off
guard. At the same time, your URL is correctly parsed by both RFC 3986
and WHATWG URL parsers.
There are many examples of “scary” URLs that you can't even understand
how they are parsed. You can write a URL with any intimidating host,
path, scheme, but that's not what I mean.
There are generally accepted standards for URL/URI parsing: RFC 3986 and
WHATWG URL. We are not talking about self-written implementations
(relying on no specification) or those who made a mistake
while implementing one of the standards.
I propose to implement support for one of these standards, the one that
looks promising. Then everything is quite clear: all we need to
do is point out that the URL data type in the extension follows the
WHATWG specification. I would even say that by creating a new type we
will contribute to the standardization of this zoo.
It's about creating a new URL data type according to the
specification WHATWG and including it in contrib as an extension.
--
Alexander Borisov
On Thu, 5 Dec 2024 at 15:02, Alexander Borisov <lex.borisov@gmail.com> wrote:
What is the main difference between WHATWG and RFC 3986?
[snip]
[host]
Source: https://exаmple.com/ (а — U+0430)
RFC 3986: https://exаmple.com/.
WHATWG: https://xn--exmple-4nf.com/.
[snip]
[path]
Source: https://example.com/a/./b/../c
RFC 3986: https://example.com/a/./b/../c.
WHATWG: https://example.com/a/c.
[snip]
Proposal
I propose to add a new data type for PostgreSQL as an extension, in
contrib. Name the new type URL and use the WHATWG URL specification to
implement the new type.
I'd be extremely annoyed if URLs I wrote into the database didn't
come back in identical form when fetched from the database. See also
how numeric has different representations of the same value: 2.0 and
2.00 are equivalent for sorting purposes, but they aren't the same and
you cannot just truncate those zeroes. Note that a path of "/%2e/" could
well be interpreted differently from "/./" or "/" by a server.
The choice of URL parsing specification is
justified by the following factors:
1. Living specification that adapts to modern realities.
I don't think choosing to defer to a living standard is a good idea
for contrib extensions, which are expected to be supported and stable
with the major PostgreSQL release they're bundled with. If (when) that
living standard gets updated, as tends to happen to such standards,
we'd suddenly lose compatibility with the standard we said we
supported, which isn't a nice outlook. Compare that to RFCs, which
AFAIK don't change in specification once released.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On 06.12.2024 21:04, Matthias van de Meent wrote:
On Thu, 5 Dec 2024 at 15:02, Alexander Borisov <lex.borisov@gmail.com> wrote:
[..]
I'd be extremely annoyed if URLs I wrote into the database didn't
return in identical manner when fetched from the database. See also
how numeric has different representations of the same value: 2.0 and
2.00 are equivalent for sorting purposes, they aren't the same and
cannot just truncate those zeroes. Note that a path of "/%2e/" could
well be interpreted differently from "/./" or "/" by a server.
That's why data types are invented. Most likely, you will not be able
to write an invalid UTF-8 byte sequence into a field of type text,
because the incoming data will not pass validation. The user chooses
the data type for his needs, knowing how it works.
I mean that the data in the database should be stored validated, and
having chosen the URL type to store URLs, one should not be surprised
that the incoming URL will be parsed and validated.
Also, no one is stopping you from storing the URL as text and
casting to the new type on the fly.
I don't think choosing to defer to a living standard is a good idea
for contrib extensions, which are expected to be supported and stable
with the major PostgreSQL release they're bundled with. If (when) that
living standard gets updated, as tends to happen to such standards,
we'd suddenly lose compatibility with the standard we said we
supported, which isn't a nice outlook. Compare that to RFCs, which
AFAIK don't change in specification once released.
WHATWG:
"The standard can generally not be changed in backwards-incompatible
ways without extreme care, and with implementer commitments leading
the way."
You can read more about what a Living Standard means at
https://whatwg.org/faq#living-standard.
--
Alexander Borisov
On Thu, 5 Dec 2024 at 17:02, Alexander Borisov <lex.borisov@gmail.com> wrote:
My name is Alexander Borisov, I want to propose/discuss
adding a new URL type as an extension in PostgreSQL contrib.
I think everyone here knows what URLs/URIs are and their basics.
If someone is interested in the basics, you can read the original
RFC 1738 [1]....
I am willing to take care of the implementation of the new data type
and its further support. If it's of interest to the community.
Hey, I had a look at this patch and found its functionality mature and
performant.
As Peter mentioned pguri, I used it to compare with the proposed extension.
This brought up
the following differences:
- pguri (uriparser 0.9.8) doesn't support Chinese symbols in the host part
of URI (uri_test1.sh):
ERROR: invalid input syntax for type uri at or near "事例.com#comments"
Therefore, I avoided Chinese or Cyrillic symbols in the pguri test script.
- There are no SET functions in pguri, so changing specific portions of
a URI is troublesome. I used replace() in the test, but this is an
error-prone approach.
- It's even more troublesome to set parts of the URI that are not
initially there. Probably, a full decomposition into parts followed by
reassembly is needed.
The suggested extension has no such limitations. Additionally, pguri
extracts userinfo as a whole, while the suggested extension can get/set
user and password individually.
Running tests (attached) I got the following numbers:
$ ./url_test.sh
NOTICE: extension "url" already exists, skipping
tps = 13068.287423 (without initial connection time)
tps = 12888.937747 (without initial connection time)
tps = 12830.642558 (without initial connection time)
tps = 12846.341411 (without initial connection time)
tps = 13187.955601 (without initial connection time)
$ ./uri_test2.sh
NOTICE: extension "uri" already exists, skipping
tps = 2441.934308 (without initial connection time)
tps = 2513.277660 (without initial connection time)
tps = 2484.641673 (without initial connection time)
tps = 2519.312395 (without initial connection time)
tps = 2512.364492 (without initial connection time)
So it's 12.9k vs 2.5k, or 6x faster for a case where we replace 5 parts of
the original URL.
Given its performance and functionality, I find the suggested URL extension
better than pguri.
Now to the review part.
1. Applying the patch causes indentation warnings; please bring spacing
in line with the project's policy:
$ git apply ~/0001-Add-url-data-type-to-contrib.patch
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:837: indent with
spaces.
return lexbor_calloc(1, sizeof(lexbor_array_t));
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:843: indent with
spaces.
if (array == NULL) {
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:844: indent with
spaces.
return LXB_STATUS_ERROR_OBJECT_IS_NULL;
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:845: indent with
spaces.
}
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:847: indent with
spaces.
if (size == 0) {
warning: squelched 6350 whitespace errors
warning: 6355 lines add whitespace errors.
2. There's a lexbor/ library that contains core and url parts. It feels
like some commentary about what's inside is required.
3. Do you think it's possible to adjust your code to use the existing
postgres infrastructure instead? I don't think having its own
infrastructure for a single extension is a good thing. Also, this might
be a source of performance improvements for the core.
4. There's no user visible documentation, please, add one.
I've created a commitfest entry for the patch:
https://commitfest.postgresql.org/51/5432/
I was not able to find you; please register a community account and set
yourself as an author of the patch.
--
Victor Yegorov
On 10.12.2024 13:59, Victor Yegorov wrote:
On Thu, 5 Dec 2024 at 17:02, Alexander Borisov <lex.borisov@gmail.com> wrote:
[..]
Hey, I had a look at this patch and found its functionality mature and
performant.
As Peter mentioned pguri, I used it to compare with the proposed
extension. This brought up the following differences:
- pguri (uriparser 0.9.8) doesn't support Chinese symbols in the host
part of URI (uri_test1.sh):
ERROR: invalid input syntax for type uri at or near "事例.com#comments"
Therefore, I avoided Chinese or Cyrillic symbols in the pguri test
script.
- There are no SET functions in the pguri, changing specific portions of
URI is troublesome. I used
replace() in the test, but this is an error prone approach.
- It's even more troublesome to set parts of the URI that are not
initially there. Probably, a full decomposition into parts followed by
reassembly is needed.
The suggested extension has no such limitations. Additionally, pguri
extracts userinfo as a whole, while the suggested extension can get/set
user and password individually.
Running tests (attached) I got the following numbers:
$ ./url_test.sh
NOTICE: extension "url" already exists, skipping
tps = 13068.287423 (without initial connection time)
tps = 12888.937747 (without initial connection time)
tps = 12830.642558 (without initial connection time)
tps = 12846.341411 (without initial connection time)
tps = 13187.955601 (without initial connection time)
$ ./uri_test2.sh
NOTICE: extension "uri" already exists, skipping
tps = 2441.934308 (without initial connection time)
tps = 2513.277660 (without initial connection time)
tps = 2484.641673 (without initial connection time)
tps = 2519.312395 (without initial connection time)
tps = 2512.364492 (without initial connection time)
So it's 12.9k vs 2.5k, or 6x faster for a case where we replace 5 parts
of the original URL.
Given its performance and functionality, I find the suggested URL
extension better than pguri.
Thanks for the constructive comments and the testing you have done.
Now to the review part.
1. Applying patch causes indentation warning, please, bring spacing to
the project's policy
$ git apply ~/0001-Add-url-data-type-to-contrib.patch
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:837: indent with
spaces.
return lexbor_calloc(1, sizeof(lexbor_array_t));
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:843: indent with
spaces.
if (array == NULL) {
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:844: indent with
spaces.
return LXB_STATUS_ERROR_OBJECT_IS_NULL;
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:845: indent with
spaces.
}
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:847: indent with
spaces.
if (size == 0) {
warning: squelched 6350 whitespace errors
warning: 6355 lines add whitespace errors.
This will be fixed when the main URL parser code is rewritten to fit
the Postgres style/infrastructure.
2. There's a lexbor/ library that contains core and url parts. Feels
like some commentary about what's
inside is required.
Yeah, that's a good point.
3. Do you think it's possible to adjust your code to use existing
postgres infrastructure instead? I don't
think having its own infrastructure for a single extension is a good
thing. Also, this might be a source
for performance improvements in the core.
To implement (rewrite/modify) a URL parser using exclusively Postgres
infrastructure is one of the main goals.
4. There's no user visible documentation, please, add one.
That's a good point.
I've created a commitfest entry for the patch:
https://commitfest.postgresql.org/51/5432/
I was not able to find you, please, register a community account and set
yourself as an author for the patch.
Done.
--
Victor Yegorov
--
Alexander Borisov
On Fri, Dec 6, 2024 at 8:46 AM Daniel Gustafsson <daniel@yesql.se> wrote:
A URL datatype is a *good idea* but one which I personally believe is best
handled as an external extension.
+1. I don't think that it's an intrinsically bad idea to have a URL
data type in the core distribution, but no compelling reason for
putting it there has been proposed. Alexander's effort can equally
well live in github or pgxn or whatever and people who want to use it
still can. Also, it's pretty clear from reading this thread that
there's more than one way to do this and not everybody agrees with or
likes the particular thing Alexander has done. For an out-of-core
extension, that's perfectly fine, and even good: different people want
different things, and that's OK. For something that's part of
PostgreSQL itself, it's a big problem. There's no reason that we
should privilege one implementation over others, and we certainly do
not want the core project to have to maintain and ship multiple
versions of things like this.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 11 Dec 2024 at 19:04, Alexander Borisov <lex.borisov@gmail.com> wrote:
I've created a commitfest entry for the patch:
https://commitfest.postgresql.org/51/5432/
I was not able to find you, please, register a community account and set
yourself as an author for the patch.
Done.
I've marked this patch as Rejected, per discussion.
Still, I find this functionality nice to have, I'd be happy if you could
create an extension on github (or similar platform).
--
Victor Yegorov
On Thu, Dec 19, 2024 at 7:52 AM Victor Yegorov <vyegorov@gmail.com> wrote:
On Wed, 11 Dec 2024 at 19:04, Alexander Borisov <lex.borisov@gmail.com> wrote:
I've created a commitfest entry for the patch:
https://commitfest.postgresql.org/51/5432/
I was not able to find you, please, register a community account and set
yourself as an author for the patch.
Done.
I've marked this patch as Rejected, per discussion.
+1
Still, I find this functionality nice to have, I'd be happy if you could create an extension on github (or similar platform).
+1
Robert Treat
https://xzilla.net