dollar quoting

Started by Andrew Dunstanabout 22 years ago31 messageshackers
Jump to latest
#1Andrew Dunstan
andrew@dunslane.net

What has become of the "dollar quoting" mechanism that we had so much
discussion about back in August/September?

IIRC, a consensus was reached on the actual format of the quote
delimiters (either $$ or $identifier$), and Tom had a proof of concept
patch to the parser to handle it, but work was needed on psql, plpgsql,
pg_dump (and pg_restore?) + docs.

Is anyone working on it?

cheers

andrew

#2Jon Jensen
jon@endpoint.com
In reply to: Andrew Dunstan (#1)
Re: dollar quoting

On Thu, 5 Feb 2004, Andrew Dunstan wrote:

What has become of the "dollar quoting" mechanism that we had so much
discussion about back in August/September?

IIRC, a consensus was reached on the actual format of the quote
delimiters (either $$ or $identifier$), and Tom had a proof of concept
patch to the parser to handle it, but work was needed on psql, plpgsql,
pg_dump (and pg_restore?) + docs.

Is anyone working on it?

I am, but not very quickly. I hope to have some time in the next month,
but if someone else beats me to it I'll just be happy it got done.

Jon

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

IIRC, a consensus was reached on the actual format of the quote
delimiters (either $$ or $identifier$), and Tom had a proof of concept
patch to the parser to handle it, but work was needed on psql, plpgsql,
pg_dump (and pg_restore?) + docs.

I think someone has to fix psql before we can consider applying the
backend patch. Fixing the other stuff can come after.

Is anyone working on it?

I kinda thought you had volunteered to work on the psql part...

regards, tom lane

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: dollar quoting

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

IIRC, a consensus was reached on the actual format of the quote
delimiters (either $$ or $identifier$), and Tom had a proof of concept
patch to the parser to handle it, but work was needed on psql, plpgsql,
pg_dump (and pg_restore?) + docs.

I think someone has to fix psql before we can consider applying the
backend patch. Fixing the other stuff can come after.

Makes sense.

Is anyone working on it?

I kinda thought you had volunteered to work on the psql part...

I don't recall being that specific, but you could be right. In any case,
I didn't want to trip over anyone else, which is why I asked.

I will try to coordinate with Jon.

cheers

andrew

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#4)
Re: dollar quoting

Andrew Dunstan wrote:

Tom Lane wrote:

I kinda thought you had volunteered to work on the psql part...

I don't recall being that specific, but you could be right. In any
case, I didn't want to trip over anyone else, which is why I asked.

I will try to coordinate with Jon.

After staring at the code for a long time, I think I see how to do this.
It's complicated a bit by the fact that $ is a valid identifier
character. So my current thinking is to say that if we see $ not in a
quote and not preceded by a valid identifier char then it is the start
of a $foo$ sequence. Or have I missed something? Can we validly see $ in
any other context?

BTW, Tom's proof of concept patch worked just fine for me. I changed the
allowed pattern to what I think was agreed:

dolqdelim \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$

and changed some names and comments to remove misleading references to
"here docs".

cheers

andrew

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

After staring at the code for a long time, I think I see how to do this.
It's complicated a bit by the fact that $ is a valid identifier
character. So my current thinking is to say that if we see $ not in a
quote and not preceded by a valid identifier char then it is the start
of a $foo$ sequence. Or have I missed something? Can we validly see $ in
any other context?

Right, a $ should be considered to start a quote marker only if it's not
part of an identifier. The backend lexer doesn't have a problem with
this because it's written in flex, but I can imagine that getting it
right in psql's ad-hoc parser might be tricky.

There was some discussion awhile back of converting psql to use flex
for interpreting its input, but I dunno how practical that really is.
I don't know how you get flex to do reasonable stuff with an incomplete
input string. Still, it might be worth looking into.

regards, tom lane

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#6)
Re: dollar quoting

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

After staring at the code for a long time, I think I see how to do this.
It's complicated a bit by the fact that $ is a valid identifier
character. So my current thinking is to say that if we see $ not in a
quote and not preceded by a valid identifier char then it is the start
of a $foo$ sequence. Or have I missed something? Can we validly see $ in
any other context?

Right, a $ should be considered to start a quote marker only if it's not
part of an identifier. The backend lexer doesn't have a problem with
this because it's written in flex, but I can imagine that getting it
right in psql's ad-hoc parser might be tricky.

I think it's doable, though. I seem to have a working patch, which I
will send out for review soon.

There was some discussion awhile back of converting psql to use flex
for interpreting its input, but I dunno how practical that really is.
I don't know how you get flex to do reasonable stuff with an incomplete
input string. Still, it might be worth looking into.

That's what made me not even think about it. If someone better versed in
this stuff than me wants to do it then more power to them.

cheers

andrew

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#7)
Re: [HACKERS] dollar quoting

Andrew Dunstan wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

After staring at the code for a long time, I think I see how to do
this. It's complicated a bit by the fact that $ is a valid
identifier character. So my current thinking is to say that if we
see $ not in a quote and not preceded by a valid identifier char
then it is the start of a $foo$ sequence. Or have I missed
something? Can we validly see $ in any other context?

I had missed one, though - numbered params in prepared statements. Fixed
in attached patch.

Right, a $ should be considered to start a quote marker only if it's not
part of an identifier. The backend lexer doesn't have a problem with
this because it's written in flex, but I can imagine that getting it
right in psql's ad-hoc parser might be tricky.

I think it's doable, though. I seem to have a working patch, which I
will send out for review soon.

Proof of Concept patch (i.e. not for application) attached for review.
The scanner changes are based on Tom's original, with some name/comment
changes and a more liberal pattern. The psql changes are all my own work
:-).

Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
in psql/mainloop.c - please pay close attention to that.

If I'm not wildly off course I will polish this up and start on docs.

cheers

andrew

Attachments:

dq-amd.patchtext/plain; name=dq-amd.patchDownload+120-9
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#8)
Re: [HACKERS] dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
in psql/mainloop.c - please pay close attention to that.

The i-1 stuff should generally be i-prevlen. Not sure if there are any
other pitfalls.

A bigger problem here:

+             else if (!dol_quote && line[i] == '$' && 
+                      !isdigit(line[i + thislen]) && 
+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
+                      (i == 0 || 
+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || 
+                          line[i-1] == '_')))
+             {

is that you aren't checking that what comes between the two dollar signs
looks like empty-or-an-identifier. The check for
next-char-isn't-a-digit is part of that but not the only part.

Also I'm not sure about the positioning of these tests relative to the
in_quote and in_xcomment tests. As you have it, $foo$ will be
recognized within an xcomment, which I think is at variance with the
proposed backend lexing behavior.

Also, the strdup should be pg_strdup.

regards, tom lane

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#9)
Re: [HACKERS] dollar quoting

Tom Lane wrote:

A bigger problem here:

+             else if (!dol_quote && line[i] == '$' && 
+                      !isdigit(line[i + thislen]) && 
+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
+                      (i == 0 || 
+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || 
+                          line[i-1] == '_')))
+             {

is that you aren't checking that what comes between the two dollar signs
looks like empty-or-an-identifier. The check for
next-char-isn't-a-digit is part of that but not the only part.

Well, I think the right way to do a full check would be with a regex,
which I had hoped to avoid. However, I will now try to get one working
and to address your other concerns.

Thanks for the comments.

cheers

andrew

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#9)
Re: [HACKERS] dollar quoting

I think the attached patch addresses Tom's comments.

I ended up not using a regex, which seemed to be a little heavy handed,
but just writing a small custom recognition function, that should (and I
think does) mimic the pattern recognition for these tokens used by the
backend lexer. This patch just puts that function in mainloop.c, but
perhaps it belongs elsewhere (string_utils.c maybe?). I don't have
strong opinions on that.

Enjoy

andrew

Tom Lane wrote:

Show quoted text

Andrew Dunstan <andrew@dunslane.net> writes:

Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
in psql/mainloop.c - please pay close attention to that.

The i-1 stuff should generally be i-prevlen. Not sure if there are any
other pitfalls.

A bigger problem here:

+             else if (!dol_quote && line[i] == '$' && 
+                      !isdigit(line[i + thislen]) && 
+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
+                      (i == 0 || 
+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || 
+                          line[i-1] == '_')))
+             {

is that you aren't checking that what comes between the two dollar signs
looks like empty-or-an-identifier. The check for
next-char-isn't-a-digit is part of that but not the only part.

Also I'm not sure about the positioning of these tests relative to the
in_quote and in_xcomment tests. As you have it, $foo$ will be
recognized within an xcomment, which I think is at variance with the
proposed backend lexing behavior.

Also, the strdup should be pg_strdup.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Attachments:

dq-amd2.patchtext/plain; name=dq-amd2.patchDownload+152-9
#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#11)
Re: [HACKERS] dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

I ended up not using a regex, which seemed to be a little heavy handed,
but just writing a small custom recognition function, that should (and I
think does) mimic the pattern recognition for these tokens used by the
backend lexer.

I looked at this and realized that it still doesn't do very well at
distinguishing $foo$ from other random uses of $. The problem is that
looking back at just the immediately preceding character isn't enough
context to tell whether a $ is part of an identifier. Consider the
input
a42$foo$
This is a legal identifier according to PG 7.4. But how about
42$foo$
This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.

There's no way to tell these apart with a single-character lookback,
or indeed any fixed number of characters of lookback.

I begin to think that we'll really have to bite the bullet and convert
psql's input parser to use flex. If we're not scanning with exactly the
same rules as the backend uses, we're going to get the wrong answers.

regards, tom lane

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#12)
Re: [HACKERS] dollar quoting

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I ended up not using a regex, which seemed to be a little heavy handed,
but just writing a small custom recognition function, that should (and I
think does) mimic the pattern recognition for these tokens used by the
backend lexer.

I looked at this and realized that it still doesn't do very well at
distinguishing $foo$ from other random uses of $. The problem is that
looking back at just the immediately preceding character isn't enough
context to tell whether a $ is part of an identifier. Consider the
input
a42$foo$
This is a legal identifier according to PG 7.4. But how about
42$foo$
This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.

The test in the patch I sent is this:

else if (!dol_quote && valid_dolquote(line+i) &&
(i == 0 ||
! ((line[i-prevlen] & 0x80) != 0 ||
isalnum(line[i-prevlen]) ||
line[i-prevlen] == '_' ||
line[i-prevlen] == '$' )))

The test should not succeed anywhere in the string '42$foo$'.

Note that psql does not change any '$foo$' at all - it just passes it to
the backend. The reason we need this at all in psql is that it has to
detect the end of a statement, and it has to prompt correctly, and to do
that it needs to know if we are in a quote (single, double, dollar) or a
comment.

psql does not detect many syntax errors, or even lexical errors - that
is the job of the backend - rightly so, I believe.

There's no way to tell these apart with a single-character lookback,
or indeed any fixed number of characters of lookback.

I'm still not convinced, although maybe there's something I'm not getting.

I begin to think that we'll really have to bite the bullet and convert
psql's input parser to use flex. If we're not scanning with exactly the
same rules as the backend uses, we're going to get the wrong answers.

Interacting with lexer states would probably be ... unpleasant. Matching
a stream oriented lexer with a line oriented CLI would be messy I suspect.

cheers

andrew

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#13)
Re: [HACKERS] dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

... But how about
42$foo$
This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.

The test should not succeed anywhere in the string '42$foo$'.

No, it won't. The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

Interacting with lexer states would probably be ... unpleasant. Matching
a stream oriented lexer with a line oriented CLI would be messy I suspect.

I think it would not be that bad. We'd have to run the lexer on the
command input buffer and see what state it terminates in.

regards, tom lane

#15Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#14)
Re: [HACKERS] dollar quoting

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

... But how about
42$foo$
This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.

The test should not succeed anywhere in the string '42$foo$'.

No, it won't. The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

Ok, I see what you are saying. This mismatch would only happen on
invalid input, though. I believe that what I did will work on all legal
input.

I think that this might be cured by having psql recognise a legal
identifier or keyword and eating it as a word, rather than treating it
as just another set of bytes in the stream. That would enable us to
avoid the lookback in the dollar-quote recognition test altogether. The
attached patch does it that way - the keyword/id test needs to come
right at the end of the loop to avoid clashing with backslash commands,
btw.

I *think* that this way psql will recognise the start of a dollar quote
iff the backend lexer would.

Interacting with lexer states would probably be ... unpleasant. Matching
a stream oriented lexer with a line oriented CLI would be messy I suspect.

I think it would not be that bad. We'd have to run the lexer on the
command input buffer and see what state it terminates in.

Yeah. I am not enough of a flex wizard to undertake the task, though. It
would take me lots of time. If we make a decision that we really need
this in order to do dollar quoting in psql I would need some substantial
help, at least.

cheers

andrew

Attachments:

dq-amd3.patchtext/plain; name=dq-amd3.patchDownload+166-7
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#15)
Re: [HACKERS] dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

No, it won't. The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

Ok, I see what you are saying. This mismatch would only happen on
invalid input, though. I believe that what I did will work on all legal
input.

I'm unconvinced. Even if there are not any current syntaxes in which a
numeric literal can be adjacent to a string literal (I'm not totally
sure about that), what of the future? We should solve the problem
rather than assuming it won't bite us.

I think that this might be cured by having psql recognise a legal
identifier or keyword and eating it as a word, rather than treating it
as just another set of bytes in the stream.

Hm, might work ... will think about it ...

regards, tom lane

#17Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#16)
Re: [PATCHES] dollar quoting

Another interesting thing abut psql that I noticed when using '$' in
identifiers is this effect:

andrew=# create table ab$cd$ef (ef$cd$ab text);
CREATE TABLE
andrew=# \d ab$cd$ef
Did not find any relation named "ab$cd$ef".
andrew=# \d ab\$cd\$ef
Table "public.ab$cd$ef"
Column | Type | Modifiers
----------+------+-----------
ef$cd$ab | text |

which is perhaps slightly less than intuitive.

cheers

andrew

Show quoted text
#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#17)
Re: [PATCHES] dollar quoting

Andrew Dunstan <andrew@dunslane.net> writes:

andrew=# create table ab$cd$ef (ef$cd$ab text);
CREATE TABLE
andrew=# \d ab$cd$ef
Did not find any relation named "ab$cd$ef".

Hmph. I always thought that "$" was only special at the end of a regex,
but that doesn't seem to be how our implementation treats it. Anyway
this is not a bug, it is a feature: the argument of \d is a regex.

regards, tom lane

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#18)
Re: [PATCHES] dollar quoting

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

andrew=# create table ab$cd$ef (ef$cd$ab text);
CREATE TABLE
andrew=# \d ab$cd$ef
Did not find any relation named "ab$cd$ef".

Hmph. I always thought that "$" was only special at the end of a regex,
but that doesn't seem to be how our implementation treats it. Anyway
this is not a bug, it is a feature: the argument of \d is a regex.

Arguably this at least is a singularly useless feature, since a
regex-meaning $ before the end of string is a nonsense, as you rightly
imply, and one at the end of the string is redundant, as it is implied -
psql turns 'abc' into '^abc$' when constructing the query.

I don't care that much - I don't use $ in my identifiers.

cheers

andrew

#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: [HACKERS] dollar quoting

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

No, it won't. The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

Ok, I see what you are saying. This mismatch would only happen on
invalid input, though. I believe that what I did will work on all legal
input.

I'm unconvinced. Even if there are not any current syntaxes in which a
numeric literal can be adjacent to a string literal (I'm not totally
sure about that), what of the future? We should solve the problem
rather than assuming it won't bite us.

I think that this might be cured by having psql recognise a legal
identifier or keyword and eating it as a word, rather than treating it
as just another set of bytes in the stream.

Hm, might work ... will think about it ...

I am a little concerned about adding the overhead of lex to psql. Right
now, some folks have reported that lex/yacc take a considerable amount
of processing time in the backend as part of a query, and adding that to
psql just to do $$ seems questionable. Of course, we can alway test and
see what the overhead shows.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
#22Andrew Dunstan
andrew@dunslane.net
In reply to: Bruce Momjian (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#22)
#24Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#24)
#26Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#26)
#28Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#21)
#29Bruce Momjian
bruce@momjian.us
In reply to: Christopher Kings-Lynne (#26)
#30Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#25)
#31Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Christopher Kings-Lynne (#30)