patch: General purpose utility functions used by the JSON data type

Started by Joseph Adamsover 15 years ago11 messages

joeyadams3.14159@gmail.com

over 15 years ago

1 attachment(s)

I factored out the general-purpose utility functions in the JSON data
type code into a patch against HEAD. I have made a few changes to
them since I posted about them earlier (
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00692.php ).

A summary of the utility functions along with some of my own thoughts
about them:

getEnumLabelOids
* Useful-ometer: ()-----------------------------------o
* Rationale: There is currently no streamlined way to return a custom
enum value from a PostgreSQL function written in C. This function
performs a batch lookup of enum OIDs, which can then be cached with
fn_extra. This should be reasonably efficient, and it's quite elegant
to use.

FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
* Useful-ometer: ()--------------------o
* Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
boilerplate. These macros help cut down the boilerplate, and the
comment explains what fn_extra is all about.

getTypeInfo
* Useful-ometer: ()---------------------------o
* Rationale: The get_type_io_data "six-fer" function is very
cumbersome to use, since one has to declare all the output variables.
The getTypeInfo puts the results in a structure. It also performs the
fmgr_info_cxt step, which is a step done after every usage of
get_type_io_data in the PostgreSQL code.
* Other thoughts: getTypeInfo also retrieves typcategory (and
typispreferred), which is rather ad-hoc. This benefits the JSON code
because to_json() uses the typcategory to figure out what type of JSON
value to convert something to (for instance, things in category 'A'
become JSON arrays). Other data types could care less about the
typcategory. Should getTypeInfo leave that step out?

pg_substring, pg_encoding_substring
* Useful-ometer: ()-------o
* Rationale: The JSONPath code uses it / will use it for extracting
substrings, which is probably not a very useful feature (but who am I
to say that). This function could probably benefit the
text_substring() function in varlena.c , but it would take a bit of
work to ensure it continues to comply with standards.

server_to_utf8, utf8_to_server, text_to_utf8_cstring,
utf8_cstring_to_text, utf8_cstring_to_text_with_len
* Useful-ometer: ()--------------o
* Rationale: The JSON data type operates in UTF-8 rather than the
server encoding because it needs to deal with Unicode escapes, but
individual Unicode characters can't be converted to/from the server
encoding simply and efficiently (as far as I know). These routines
made the conversions done by the JSON data type vastly simpler, and
they could simplify other data types in the future (XML does a lot of
server<->UTF-8 conversions too).

This patch doesn't include tests . How would I go about writing them?

I have made the JSON data type built-in, and I will post that patch
shortly (it depends on this one). The built-in JSON data type uses
all of these utility functions, and the tests for the JSON data type
pass.

Attachments:

json-util-01.patchapplication/octet-stream; name=json-util-01.patchDownload

commit 5e861eff888c9557b9e40d418fa6d0839d2d886a
Author: Joey Adams <joeyadams3.14159@gmail.com>
Date:   Fri Aug 13 04:20:06 2010 -0400

    Added several utility functions/macros to the backend
    
     * getEnumLabelOids:
          streamlined conversion of enum labels to OIDs
     * FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT:
          macros to cut down on boilerplate when working with
          fcinfo->flinfo->fn_mcxt
     * getTypeInfo:
          wrapper around
          get_type_io_data / fmgr_info_cxt / get_type_category_preferred
          that stores results in a structure called TypeInfo
     * pg_substring, pg_encoding_substring:
          slicing of multibyte-encoded strings
     * server_to_utf8, utf8_to_server:
          convenience routines for converting between the database encoding
          and UTF-8
     * text_to_utf8_cstring, utf8_cstring_to_text, utf8_cstring_to_text_with_len:
          variants of text_to_cstring and company that also convert
          into and out of UTF-8

diff --git a/src/backend/utils/adt/enum.c b/src/backend/utils/adt/enum.c
index 69562db..9259c99 100644
--- a/src/backend/utils/adt/enum.c
+++ b/src/backend/utils/adt/enum.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "fmgr.h"
 #include "utils/array.h"
@@ -25,6 +26,7 @@
 
 static ArrayType *enum_range_internal(Oid enumtypoid, Oid lower, Oid upper);
 static int	enum_elem_cmp(const void *left, const void *right);
+static int	enum_label_cmp(const void *left, const void *right);
 
 
 /* Basic I/O support */
@@ -412,6 +414,84 @@ enum_range_internal(Oid enumtypoid, Oid lower, Oid upper)
 	return result;
 }
 
+/*
+ * getEnumLabelOids
+ *	  Look up the OIDs of enum labels.	Enum label OIDs are needed to
+ *	  return values of a custom enum type from a C function.
+ *
+ *	  Callers should typically cache the OIDs produced by this function
+ *	  using FN_EXTRA, as retrieving enum label OIDs is somewhat expensive.
+ *
+ *	  Every labels[i].index must be between 0 and count, and oid_out
+ *	  must be allocated to hold count items.  Note that getEnumLabelOids
+ *	  sorts the labels[] array passed to it.
+ *
+ *	  Any labels not found in the enum will have their corresponding
+ *	  oid_out entries set to InvalidOid.
+ *
+ *	  Sample usage:
+ *
+ *	  -- SQL --
+ *	  CREATE TYPE colors AS ENUM ('red', 'green', 'blue');
+ *
+ *	  -- C --
+ *	  enum Colors {RED, GREEN, BLUE, COLOR_COUNT};
+ *
+ *	  static EnumLabel enum_labels[COLOR_COUNT] =
+ *	  {
+ *		  {RED,   "red"},
+ *		  {GREEN, "green"},
+ *		  {BLUE,  "blue"}
+ *	  };
+ *
+ *	  Oid *label_oids = palloc(COLOR_COUNT * sizeof(Oid));
+ *	  getEnumLabelOids("colors", enum_labels, label_oids, COLOR_COUNT);
+ *
+ *	  PG_RETURN_OID(label_oids[GREEN]);
+ */
+void
+getEnumLabelOids(const char *typname, EnumLabel labels[], Oid oid_out[], int count)
+{
+	CatCList   *list;
+	Oid			enumtypoid;
+	int			total;
+	int			i;
+	EnumLabel	key;
+	EnumLabel  *found;
+
+	enumtypoid = TypenameGetTypid(typname);
+	Assert(OidIsValid(enumtypoid));
+
+	qsort(labels, count, sizeof(EnumLabel), enum_label_cmp);
+
+	for (i = 0; i < count; i++)
+	{
+		/* Initialize oid_out items to InvalidOid. */
+		oid_out[i] = InvalidOid;
+
+		/* Make sure EnumLabel indices are in range. */
+		Assert(labels[i].index >= 0 && labels[i].index < count);
+	}
+
+	list = SearchSysCacheList1(ENUMTYPOIDNAME,
+							   ObjectIdGetDatum(enumtypoid));
+	total = list->n_members;
+
+	for (i = 0; i < total; i++)
+	{
+		HeapTuple	tup = &list->members[i]->tuple;
+		Oid			oid = HeapTupleGetOid(tup);
+		Form_pg_enum en = (Form_pg_enum) GETSTRUCT(tup);
+
+		key.label = NameStr(en->enumlabel);
+		found = bsearch(&key, labels, count, sizeof(EnumLabel), enum_label_cmp);
+		if (found != NULL)
+			oid_out[found->index] = oid;
+	}
+
+	ReleaseCatCacheList(list);
+}
+
 /* qsort comparison function for Datums that are OIDs */
 static int
 enum_elem_cmp(const void *left, const void *right)
@@ -425,3 +505,13 @@ enum_elem_cmp(const void *left, const void *right)
 		return 1;
 	return 0;
 }
+
+/* qsort comparison function for EnumLabel entries used by getEnumLabelOids */
+static int
+enum_label_cmp(const void *left, const void *right)
+{
+	const char *l = ((EnumLabel *) left)->label;
+	const char *r = ((EnumLabel *) right)->label;
+
+	return strcmp(l, r);
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 94766cd..c8f23d4 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -177,6 +177,75 @@ text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
 		pfree(srcunpacked);
 }
 
+/*
+ * text_to_utf8_cstring
+ *
+ * Just like text_to_cstring, but yields a C string
+ * encoded in UTF-8 instead of the server encoding.
+ */
+char *
+text_to_utf8_cstring(const text *t)
+{
+	/* must cast away the const, just like in text_to_cstring */
+	text	   *tunpacked = pg_detoast_datum_packed((struct varlena *) t);
+	const char *data = VARDATA_ANY(tunpacked);
+	int			len = VARSIZE_ANY_EXHDR(tunpacked);
+	char	   *result;
+
+	result = server_to_utf8(data, len);
+	if (result == data)
+		result = pnstrdup(data, len);
+
+	if (tunpacked != t)
+		pfree(tunpacked);
+
+	return result;
+}
+
+/*
+ * text_to_utf8_cstring
+ *
+ * Just like cstring_to_text, but takes a C string
+ * encoded in UTF-8 instead of the server encoding.
+ */
+text *
+utf8_cstring_to_text(const char *s)
+{
+	return utf8_cstring_to_text_with_len(s, strlen(s));
+}
+
+/*
+ * utf8_cstring_to_text_with_len
+ *
+ * Just like cstring_to_text_with_len, but takes a C string
+ * encoded in UTF-8 instead of the server encoding.
+ *
+ * The input string should not contain null characters.
+ */
+text *
+utf8_cstring_to_text_with_len(const char *s, int len)
+{
+	char	   *cstring;
+	int			cstring_len;
+	text	   *result;
+
+	cstring = utf8_to_server(s, len);
+	if (cstring == s)
+		cstring_len = len;
+	else
+		cstring_len = strlen(cstring);
+
+	result = (text *) palloc(len + VARHDRSZ);
+
+	SET_VARSIZE(result, len + VARHDRSZ);
+	memcpy(VARDATA(result), cstring, cstring_len);
+
+	if (cstring != s)
+		pfree(cstring);
+
+	return result;
+}
+
 
 /*****************************************************************************
  *	 USER I/O ROUTINES														 *
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 19a4a45..5b6b823 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -1844,6 +1844,37 @@ get_type_io_data(Oid typid,
 	ReleaseSysCache(typeTuple);
 }
 
+/*
+ * getTypeInfo
+ *	  Retrieve information about a type, along with either its
+ *	  input, output, binary receive, or binary send procedure.
+ *
+ *	  which_func should be one of:
+ *		  IOFunc_input
+ *		  IOFunc_output
+ *		  IOFunc_receive
+ *		  IOFunc_send
+ *
+ *	  mcxt is the memory context that the IO function will use to
+ *	  store subsidiary data.  It should live at least as long as
+ *	  the TypeInfo structure.
+ */
+void
+getTypeInfo(TypeInfo *d, Oid type, IOFuncSelector which_func, MemoryContext mcxt)
+{
+	d->type = type;
+	d->which_func = which_func;
+	d->mcxt = mcxt;
+
+	get_type_io_data(type, which_func,
+					 &d->typlen, &d->typbyval, &d->typalign,
+					 &d->typdelim, &d->typioparam, &d->typiofunc);
+	fmgr_info_cxt(d->typiofunc, &d->proc, d->mcxt);
+
+	get_type_category_preferred(type,
+								&d->typcategory, &d->typispreferred);
+}
+
 #ifdef NOT_USED
 char
 get_typalign(Oid typid)
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 0995a75..dfd4136 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -568,6 +568,36 @@ pg_server_to_client(const char *s, int len)
 }
 
 /*
+ * server_to_utf8, utf8_to_server
+ *	  Adaptations of pg_do_encoding_conversion for simplifying UTF-8 conversions.
+ *
+ *	  Sometimes, it makes more sense to operate primarily in UTF-8 rather than
+ *	  the server encoding.	For instance, the JSON data type operates in UTF-8
+ *	  because it needs to encode/decode individual characters when dealing with
+ *	  Unicode escapes, but there is no simple and efficient way to do that
+ *	  in the server encoding.
+ *
+ *	  Just like pg_do_encoding_conversion, if no conversion is done,
+ *	  the original pointer given is returned.
+ *
+ *	  These functions are no-ops when the server encoding is UTF-8.
+ */
+
+char *
+server_to_utf8(const char *s, int len)
+{
+	return (char *) pg_do_encoding_conversion(
+				   (unsigned char *) s, len, GetDatabaseEncoding(), PG_UTF8);
+}
+
+char *
+utf8_to_server(const char *s, int len)
+{
+	return (char *) pg_do_encoding_conversion(
+				   (unsigned char *) s, len, PG_UTF8, GetDatabaseEncoding());
+}
+
+/*
  *	Perform default encoding conversion using cached FmgrInfo. Since
  *	this function does not access database at all, it is safe to call
  *	outside transactions.  If the conversion has not been set up by
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 4b98c8b..8c56592 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -1390,18 +1390,24 @@ pg_mic_mblen(const unsigned char *mbstr)
 	return pg_mule_mblen(mbstr);
 }
 
+static mblen_converter
+encoding_mblen_converter(int encoding)
+{
+	Assert(PG_VALID_ENCODING(encoding));
+
+	return ((encoding >= 0 &&
+			 encoding < sizeof(pg_wchar_table) / sizeof(pg_wchar_tbl)) ?
+			pg_wchar_table[encoding].mblen :
+			pg_wchar_table[PG_SQL_ASCII].mblen);
+}
+
 /*
  * Returns the byte length of a multibyte character.
  */
 int
 pg_encoding_mblen(int encoding, const char *mbstr)
 {
-	Assert(PG_VALID_ENCODING(encoding));
-
-	return ((encoding >= 0 &&
-			 encoding < sizeof(pg_wchar_table) / sizeof(pg_wchar_tbl)) ?
-		((*pg_wchar_table[encoding].mblen) ((const unsigned char *) mbstr)) :
-	((*pg_wchar_table[PG_SQL_ASCII].mblen) ((const unsigned char *) mbstr)));
+	return (*encoding_mblen_converter(encoding)) ((const unsigned char *) mbstr);
 }
 
 /*
@@ -1641,4 +1647,120 @@ report_untranslatable_char(int src_encoding, int dest_encoding,
 			 pg_enc2name_tbl[dest_encoding].name)));
 }
 
+/*
+ * pg_substring
+ *	  Find substring bounds in a string of a given encoding.
+ *
+ *	  The requested start and length are clipped to fit the string.
+ *
+ *	  src and srcbytes:			input string slice
+ *	  start and length:			start and number of characters requested
+ *	  out_start and out_bytes:	substring slice
+ *	  out_length:				number of characters in substring
+ *
+ *	  Unlike the SQL substring function, the start argument
+ *	  of this function is zero-based.
+ *
+ *	  Example (assume UTF-8 all around):
+ *		 const char *in = "⁰ ¹ ² ³"; // "\342\201\260 \302\271 \302\262 \302\263"
+ *		 const char *out_start;
+ *		 int		 out_bytes;
+ *		 int		 out_chars;
+ *
+ *		 pg_encoding_substring(in, strlen(in),
+ *							   2, 100,
+ *							   &out_start, &out_bytes, &out_chars);
+ *
+ *	  out_start will point to the "¹", or "\302\271".
+ *	  out_bytes will be 8.
+ *	  out_chars will be 5.
+ */
+void
+pg_substring(const char *src, int srcbytes,
+			 int start, int length,
+			 const char **out_start, int *out_bytes, int *out_length)
+{
+	pg_encoding_substring(GetDatabaseEncoding(),
+						  src, srcbytes,
+						  start, length,
+						  out_start, out_bytes, out_length);
+}
+
+/*
+ * pg_encoding_substring
+ *	  Find substring bounds in a string of a given encoding.
+ */
+void
+pg_encoding_substring(int encoding,
+					  const char *src, int srcbytes,
+					  int start, int length,
+					  const char **out_start, int *out_bytes, int *out_length)
+{
+	const char *e = src + srcbytes;
+	const char *sub_start;
+	const char *sub_end;
+	int			sub_length;
+	mblen_converter mblen;
+	int			len;
+
+	if (start < 0)
+	{
+		length += start;
+		start = 0;
+	}
+	if (length < 0)
+		length = 0;
+
+	/* optimization for single-byte encoding */
+	if (pg_encoding_max_length(encoding) == 1)
+	{
+		*out_start = src + start;
+		*out_bytes = *out_length = Min(length, srcbytes - start);
+		return;
+	}
+
+	/*
+	 * Get the length callback once so it doesn't have to be looked up every
+	 * time we call it.
+	 */
+	mblen = encoding_mblen_converter(encoding);
+
+	/* Find the beginning of the substring. */
+	sub_start = src;
+	while (start > 0 && sub_start < e)
+	{
+		len = (*mblen) ((const unsigned char *) sub_start);
+
+		if (sub_start + len > e)
+		{
+			Assert(false);		/* Clipped multibyte character */
+			break;
+		}
+
+		sub_start += len;
+		start--;
+	}
+
+	/* Find the end and length of the substring. */
+	sub_end = sub_start;
+	sub_length = 0;
+	while (sub_length < length && sub_end < e)
+	{
+		len = (*mblen) ((const unsigned char *) sub_end);
+
+		if (sub_end + len > e)
+		{
+			Assert(false);		/* Clipped multibyte character */
+			break;
+		}
+
+		sub_end += len;
+		sub_length++;
+	}
+
+	*out_start = sub_start;
+	*out_bytes = sub_end - sub_start;
+	*out_length = sub_length;
+}
+
 #endif
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index c502b96..3d6d0c1 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -544,6 +544,40 @@ extern void **find_rendezvous_variable(const char *varName);
 extern int AggCheckCallContext(FunctionCallInfo fcinfo,
 					MemoryContext *aggcontext);
 
+/*
+ * FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
+ *	   Macros for manipulating context preserved across function calls.
+ *
+ *	   FN_EXTRA is typically used for caching lookups and other nontrivial
+ *	   operations across multiple calls of a user-defined function.
+ *
+ *	   Do not use FN_EXTRA in a set-returning function.  Use user_fctx instead.
+ *
+ *	   Typical usage looks like:
+ *
+ *	   my_extra = FN_EXTRA();
+ *	   if (my_extra == NULL)
+ *	   {
+ *		   my_extra = FN_EXTRA_ALLOC(sizeof(MyExtra));
+ *		   my_extra->type_name = NULL;
+ *	   }
+ *
+ *	   if (my_extra->type_name == NULL ||
+ *		   strcmp(my_extra->type_name, type_name) != 0)
+ *	   {
+ *		   my_extra->type_name = MemoryContextStrdup(FN_MCXT(), type_name);
+ *		   my_extra->type_id   = TypenameGetTypid(my_extra->type_name);
+ *	   }
+ */
+#define FN_EXTRA() (fcinfo->flinfo->fn_extra)
+#define FN_EXTRA_ALLOC(size) \
+	(fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt, size))
+
+/*
+ * Data allocated inside of FN_EXTRA() should be allocated into FN_MCXT()
+ * so it is preserved across calls
+ */
+#define FN_MCXT() (fcinfo->flinfo->fn_mcxt)
 
 /*
  * !!! OLD INTERFACE !!!
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 389be5c..2500b81 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -420,6 +420,9 @@ extern unsigned char *pg_do_encoding_conversion(unsigned char *src, int len,
 extern char *pg_client_to_server(const char *s, int len);
 extern char *pg_server_to_client(const char *s, int len);
 
+extern char *server_to_utf8(const char *s, int len);
+extern char *utf8_to_server(const char *s, int len);
+
 extern unsigned short BIG5toCNS(unsigned short big5, unsigned char *lc);
 extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
 
@@ -466,4 +469,12 @@ extern bool pg_utf8_islegal(const unsigned char *source, int length);
 extern WCHAR *pgwin32_toUTF16(const char *str, int len, int *utf16len);
 #endif
 
+void pg_substring(const char *src, int srcbytes,
+			 int start, int length,
+			 const char **out_start, int *out_bytes, int *out_length);
+void pg_encoding_substring(int encoding,
+					  const char *src, int srcbytes,
+					  int start, int length,
+					const char **out_start, int *out_bytes, int *out_length);
+
 #endif   /* PG_WCHAR_H */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index a4c6180..427fbb3 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,12 @@
 #include "fmgr.h"
 #include "nodes/parsenodes.h"
 
+typedef struct
+{
+	int			index;
+	const char *label;
+}	EnumLabel;
+
 /*
  *		Defined in adt/
  */
@@ -163,6 +169,7 @@ extern Datum enum_first(PG_FUNCTION_ARGS);
 extern Datum enum_last(PG_FUNCTION_ARGS);
 extern Datum enum_range_bounds(PG_FUNCTION_ARGS);
 extern Datum enum_range_all(PG_FUNCTION_ARGS);
+void		getEnumLabelOids(const char *typname, EnumLabel labels[], Oid oid_out[], int count);
 
 /* int.c */
 extern Datum int2in(PG_FUNCTION_ARGS);
@@ -675,6 +682,9 @@ extern text *cstring_to_text(const char *s);
 extern text *cstring_to_text_with_len(const char *s, int len);
 extern char *text_to_cstring(const text *t);
 extern void text_to_cstring_buffer(const text *src, char *dst, size_t dst_len);
+extern char *text_to_utf8_cstring(const text *t);
+extern text *utf8_cstring_to_text(const char *s);
+extern text *utf8_cstring_to_text_with_len(const char *s, int len);
 
 #define CStringGetTextDatum(s) PointerGetDatum(cstring_to_text(s))
 #define TextDatumGetCString(d) text_to_cstring((text *) DatumGetPointer(d))
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 066ad76..7ec3e35 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -15,6 +15,7 @@
 
 #include "access/attnum.h"
 #include "access/htup.h"
+#include "fmgr.h"
 #include "nodes/pg_list.h"
 
 /* I/O function selector for get_type_io_data */
@@ -26,6 +27,24 @@ typedef enum IOFuncSelector
 	IOFunc_send
 } IOFuncSelector;
 
+typedef struct TypeInfo
+{
+	Oid			type;
+	IOFuncSelector which_func;
+	MemoryContext mcxt;
+
+	int16		typlen;
+	bool		typbyval;
+	char		typalign;
+	char		typdelim;
+	Oid			typioparam;
+	Oid			typiofunc;
+	FmgrInfo	proc;
+
+	char		typcategory;
+	bool		typispreferred;
+} TypeInfo;
+
 /* Hook for plugins to get control in get_attavgwidth() */
 typedef int32 (*get_attavgwidth_hook_type) (Oid relid, AttrNumber attnum);
 extern PGDLLIMPORT get_attavgwidth_hook_type get_attavgwidth_hook;
@@ -106,6 +125,8 @@ extern void get_type_io_data(Oid typid,
 				 char *typdelim,
 				 Oid *typioparam,
 				 Oid *func);
+void getTypeInfo(TypeInfo *d, Oid type, IOFuncSelector which_func,
+			MemoryContext mcxt);
 extern char get_typstorage(Oid typid);
 extern Node *get_typdefault(Oid typid);
 extern char get_typtype(Oid typid);

Mike Fowler

mike@mlfowler.com

over 15 years ago

In reply to: Joseph Adams (#1)

Re: patch: General purpose utility functions used by the JSON data type

On 13/08/10 10:45, Joseph Adams wrote:

This patch doesn't include tests . How would I go about writing them?

I have made the JSON data type built-in, and I will post that patch
shortly (it depends on this one). The built-in JSON data type uses
all of these utility functions, and the tests for the JSON data type
pass.

Joseph,

Most existing data types have a regression SQL script in
src/test/regress/sql. Using one of them as an example to draw some
inspiration from, you should be able to write a 'json.sql' script that
exercises the data type and it's supporting functions. Full instructions
can be found at http://wiki.postgresql.org/wiki/Regression_test_authoring

Regards,

--
Mike Fowler
Registered Linux user: 379787

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Joseph Adams (#1)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 5:45 AM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

getEnumLabelOids
* Useful-ometer: ()-----------------------------------o
* Rationale: There is currently no streamlined way to return a custom
enum value from a PostgreSQL function written in C. This function
performs a batch lookup of enum OIDs, which can then be cached with
fn_extra. This should be reasonably efficient, and it's quite elegant
to use.

It's possible that there might be a contrib module out there someplace
that wants to do this, but it's clearly a waste of time for a core
datatype. The OIDs are fixed. Just get rid of the enum altogether
and use the OIDs directly wherever you would have used the enum. Then
you don't need this.

Incidentally, if we were going to accept this function, I think we'd
need to add some kind of a check to throw an error if any of the
labels can't be mapped onto an Oid. Otherwise, an error in this area
might lead to difficult-to-find misbehavior.

FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
* Useful-ometer: ()--------------------o
* Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
boilerplate. These macros help cut down the boilerplate, and the
comment explains what fn_extra is all about.

I think this is not a good idea. Macros that use local variables
under the covers make it hard for new contributors to read the code
and understand what it does. It's only worth doing if it saves a lot
of typing, and this doesn't. If we were going to do this, the right
thing for your patch to do would be to update every instance of this
coding pattern in our source base, so that people who copy those
examples in the future do it the new way. But I think there's no real
advantage in that: it would complicate back-patching future bug fixes
for no particularly compelling reason.

getTypeInfo
* Useful-ometer: ()---------------------------o
* Rationale: The get_type_io_data "six-fer" function is very
cumbersome to use, since one has to declare all the output variables.
The getTypeInfo puts the results in a structure. It also performs the
fmgr_info_cxt step, which is a step done after every usage of
get_type_io_data in the PostgreSQL code.
* Other thoughts: getTypeInfo also retrieves typcategory (and
typispreferred), which is rather ad-hoc. This benefits the JSON code
because to_json() uses the typcategory to figure out what type of JSON
value to convert something to (for instance, things in category 'A'
become JSON arrays). Other data types could care less about the
typcategory. Should getTypeInfo leave that step out?

Well, again, you have to decide whether this is a function that you're
adding just for the JSON code or whether it's really a general purpose
utility function. If you want it to be a general purpose utility
function, you need to change all the callers that could potentially
leverage it. If it's JSON specific, put it in the JSON code. It
might be simpler to just declare the variables.

pg_substring, pg_encoding_substring
* Useful-ometer: ()-------o
* Rationale: The JSONPath code uses it / will use it for extracting
substrings, which is probably not a very useful feature (but who am I
to say that). This function could probably benefit the
text_substring() function in varlena.c , but it would take a bit of
work to ensure it continues to comply with standards.

I'm more positive about this idea. If you can make this generally
useful, I'd encourage you to do that. On a random coding style note,
I see that you have two copies of the following code, which can fairly
obviously be written in a single line rather than five, assuming it's
actually safe.

+               if (sub_end + len > e)
+               {
+                       Assert(false);          /* Clipped multibyte
character */
+                       break;
+               }

server_to_utf8, utf8_to_server, text_to_utf8_cstring,
utf8_cstring_to_text, utf8_cstring_to_text_with_len
* Useful-ometer: ()--------------o
* Rationale: The JSON data type operates in UTF-8 rather than the
server encoding because it needs to deal with Unicode escapes, but
individual Unicode characters can't be converted to/from the server
encoding simply and efficiently (as far as I know). These routines
made the conversions done by the JSON data type vastly simpler, and
they could simplify other data types in the future (XML does a lot of
server<->UTF-8 conversions too).

Sounds interesting. But again, you would need to modify the XML code
to use these also, so that we can clearly see that this is reusable
code, and actually reuse it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Joseph Adams

joeyadams3.14159@gmail.com

over 15 years ago

In reply to: Robert Haas (#3)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 13, 2010 at 5:45 AM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

getEnumLabelOids
* Useful-ometer: ()-----------------------------------o
* Rationale: There is currently no streamlined way to return a custom
enum value from a PostgreSQL function written in C. This function
performs a batch lookup of enum OIDs, which can then be cached with
fn_extra. This should be reasonably efficient, and it's quite elegant
to use.

It's possible that there might be a contrib module out there someplace
that wants to do this, but it's clearly a waste of time for a core
datatype. The OIDs are fixed. Just get rid of the enum altogether
and use the OIDs directly wherever you would have used the enum. Then
you don't need this.

Indeed, a built-in JSON data type will certainly not need it.
However, users may want to return enums from procedures written in C,
and this function provides a way to do it.

Incidentally, if we were going to accept this function, I think we'd
need to add some kind of a check to throw an error if any of the
labels can't be mapped onto an Oid. Otherwise, an error in this area
might lead to difficult-to-find misbehavior.

I agree. Perhaps an ereport(ERROR, ...) would be better, since it
would not be hard for a user to cause an enum mapping error (even in a
production database) by updating a stored procedure in C but not
updating the corresponding enum (or vice versa, if enum labels are
renamed).

FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
* Useful-ometer: ()--------------------o
* Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
boilerplate. These macros help cut down the boilerplate, and the
comment explains what fn_extra is all about.

I think this is not a good idea. Macros that use local variables
under the covers make it hard for new contributors to read the code
and understand what it does. It's only worth doing if it saves a lot
of typing, and this doesn't. If we were going to do this, the right
thing for your patch to do would be to update every instance of this
coding pattern in our source base, so that people who copy those
examples in the future do it the new way. But I think there's no real
advantage in that: it would complicate back-patching future bug fixes
for no particularly compelling reason.

Fair enough. Perhaps the comment about FN_EXTRA (which explains
fn_extra in more detail) could be reworded (to talk about using
fcinfo->flinfo->fn_extra manually) and included in the documentation
at xfunc-c.html . fn_extra currently only gets a passing mention in
the discussion about set-returning functions.

pg_substring, pg_encoding_substring
* Useful-ometer: ()-------o
* Rationale: The JSONPath code uses it / will use it for extracting
substrings, which is probably not a very useful feature (but who am I
to say that). This function could probably benefit the
text_substring() function in varlena.c , but it would take a bit of
work to ensure it continues to comply with standards.

I'm more positive about this idea. If you can make this generally
useful, I'd encourage you to do that. On a random coding style note,
I see that you have two copies of the following code, which can fairly
obviously be written in a single line rather than five, assuming it's
actually safe.
+               if (sub_end + len > e)
+               {
+                       Assert(false);          /* Clipped multibyte
character */
+                       break;
+               }

If I simply say Assert(sub_end + len <= e), the function will yield a
range hanging off the edge of the input string (out of bounds). The
five lines include a safeguard against that when assertion checking is
off. This could happen if the input string has a clipped multibyte
character at the end. Perhaps I should just drop the assertions and
make it so if there's a clipped character at the end, it'll be
ignored, no big deal.

Joey Adams

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Joseph Adams (#4)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Indeed, a built-in JSON data type will certainly not need it.
However, users may want to return enums from procedures written in C,
and this function provides a way to do it.

Yeah, but I can't see accepting it on speculation. We'll add it if
and when it's clear that it will be generally useful.

Incidentally, if we were going to accept this function, I think we'd
need to add some kind of a check to throw an error if any of the
labels can't be mapped onto an Oid. Otherwise, an error in this area
might lead to difficult-to-find misbehavior.

I agree. Perhaps an ereport(ERROR, ...) would be better, since it
would not be hard for a user to cause an enum mapping error (even in a
production database) by updating a stored procedure in C but not
updating the corresponding enum (or vice versa, if enum labels are
renamed).

Right...

Fair enough. Perhaps the comment about FN_EXTRA (which explains
fn_extra in more detail) could be reworded (to talk about using
fcinfo->flinfo->fn_extra manually) and included in the documentation
at xfunc-c.html . fn_extra currently only gets a passing mention in
the discussion about set-returning functions.

Feel to propose a patch to that comment.

pg_substring, pg_encoding_substring
* Useful-ometer: ()-------o
* Rationale: The JSONPath code uses it / will use it for extracting
substrings, which is probably not a very useful feature (but who am I
to say that). This function could probably benefit the
text_substring() function in varlena.c , but it would take a bit of
work to ensure it continues to comply with standards.

I'm more positive about this idea. If you can make this generally
useful, I'd encourage you to do that. On a random coding style note,
I see that you have two copies of the following code, which can fairly
obviously be written in a single line rather than five, assuming it's
actually safe.
+               if (sub_end + len > e)
+               {
+                       Assert(false);          /* Clipped multibyte
character */
+                       break;
+               }
If I simply say Assert(sub_end + len <= e), the function will yield a
range hanging off the edge of the input string (out of bounds). The
five lines include a safeguard against that when assertion checking is
off. This could happen if the input string has a clipped multibyte
character at the end. Perhaps I should just drop the assertions and
make it so if there's a clipped character at the end, it'll be
ignored, no big deal.

I think maybe what you want is ereport(ERROR). It should never be
possible for user action to trip an Assert(); Assert() is only to
catch coding mistakes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Joseph Adams (#4)

Re: patch: General purpose utility functions used by the JSON data type

Joseph Adams <joeyadams3.14159@gmail.com> writes:

On Fri, Aug 13, 2010 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

+               if (sub_end + len > e)
+               {
+                       Assert(false);          /* Clipped multibyte character */
+                       break;
+               }

If I simply say Assert(sub_end + len <= e), the function will yield a
range hanging off the edge of the input string (out of bounds). The
five lines include a safeguard against that when assertion checking is
off.

If you think it is actually likely to happen in practice, then an Assert
is 100% inappropriate. Throw an actual error instead. Code that has
provisions for continuing after an Assert failure is wrong by definition.

regards, tom lane

David Fetter

david@fetter.org

over 15 years ago

In reply to: Robert Haas (#5)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote:

On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Indeed, a built-in JSON data type will certainly not need it.
However, users may want to return enums from procedures written in
C, and this function provides a way to do it.

Yeah, but I can't see accepting it on speculation. We'll add it if
and when it's clear that it will be generally useful.

Please pardon me for jumping in here, but one of the things people
really love about PostgreSQL is that when you reach for something,
it's frequently "just there."

As we're improving enums (allowing them to be altered easily after
creation, etc.), it seems reasonable to provide ways to return them
from all kinds of PLs, including making this easier in C.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: David Fetter (#7)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 1:05 PM, David Fetter <david@fetter.org> wrote:

On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote:

On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Indeed, a built-in JSON data type will certainly not need it.
However, users may want to return enums from procedures written in
C, and this function provides a way to do it.

Yeah, but I can't see accepting it on speculation. We'll add it if
and when it's clear that it will be generally useful.

Please pardon me for jumping in here, but one of the things people
really love about PostgreSQL is that when you reach for something,
it's frequently "just there."

As we're improving enums (allowing them to be altered easily after
creation, etc.), it seems reasonable to provide ways to return them
from all kinds of PLs, including making this easier in C.

Maybe so, but it's not clear the interface that Joseph implemented is
the one everyone wants...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

David Fetter

david@fetter.org

over 15 years ago

In reply to: Robert Haas (#8)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote:

On Fri, Aug 13, 2010 at 1:05 PM, David Fetter <david@fetter.org> wrote:

On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote:

On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:

Indeed, a built-in JSON data type will certainly not need it.
However, users may want to return enums from procedures written in
C, and this function provides a way to do it.

Yeah, but I can't see accepting it on speculation. ï¿½We'll add it if
and when it's clear that it will be generally useful.

Please pardon me for jumping in here, but one of the things people
really love about PostgreSQL is that when you reach for something,
it's frequently "just there."

As we're improving enums (allowing them to be altered easily after
creation, etc.), it seems reasonable to provide ways to return them
from all kinds of PLs, including making this easier in C.

Maybe so, but it's not clear the interface that Joseph implemented is
the one everyone wants...

Fair enough. What's the interface now in a nutshell? Lack of
nutshells might well mean the interface needs more work...

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#10

Joseph Adams

joeyadams3.14159@gmail.com

over 15 years ago

In reply to: David Fetter (#9)

Re: patch: General purpose utility functions used by the JSON data type

On Fri, Aug 13, 2010 at 2:02 PM, David Fetter <david@fetter.org> wrote:

On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote:

Maybe so, but it's not clear the interface that Joseph implemented is
the one everyone wants...

Fair enough. What's the interface now in a nutshell? Lack of
nutshells might well mean the interface needs more work...

Suppose we have:

-- SQL --
CREATE TYPE colors AS ENUM ('red', 'green', 'blue');

-- C --
enum Colors {RED, GREEN, BLUE, COLOR_COUNT};

In a nutshell:

* Define an enum mapping like so:

static EnumLabel enum_labels[COLOR_COUNT] =
{
{COLOR_RED, "red"},
{COLOR_GREEN, "green"},
{COLOR_BLUE, "blue"}
};

* Get the OIDs of the enum labels:

Oid oids[COLOR_COUNT];
getEnumLabelOids("colors", enum_labels, oids, COLOR_COUNT);

* Convert C enum values to OIDs using the table:

PG_RETURN_OID(oids[COLOR_BLUE]);

A caller would typically cache the Oid table with fn_extra.

Currently, getEnumLabelOids puts InvalidOid for labels that could not
successfully be looked up. This will probably be changed to
ereport(ERROR) instead.

Also, I'm thinking that getEnumLabelOids should be renamed to
something that's easier to remember. Maybe enum_oids or
get_enum_oids.

Joey Adams

#11

Andrew Dunstan

andrew@dunslane.net

over 15 years ago

In reply to: Joseph Adams (#10)

Re: patch: General purpose utility functions used by the JSON data type

On 08/13/2010 03:46 PM, Joseph Adams wrote:

On Fri, Aug 13, 2010 at 2:02 PM, David Fetter<david@fetter.org> wrote:

On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote:

Maybe so, but it's not clear the interface that Joseph implemented is
the one everyone wants...

Fair enough. What's the interface now in a nutshell? Lack of
nutshells might well mean the interface needs more work...

Suppose we have:

-- SQL --
CREATE TYPE colors AS ENUM ('red', 'green', 'blue');

-- C --
enum Colors {RED, GREEN, BLUE, COLOR_COUNT};

In a nutshell:

* Define an enum mapping like so:

static EnumLabel enum_labels[COLOR_COUNT] =
{
{COLOR_RED, "red"},
{COLOR_GREEN, "green"},
{COLOR_BLUE, "blue"}
};

* Get the OIDs of the enum labels:

Oid oids[COLOR_COUNT];
getEnumLabelOids("colors", enum_labels, oids, COLOR_COUNT);

* Convert C enum values to OIDs using the table:

PG_RETURN_OID(oids[COLOR_BLUE]);

A caller would typically cache the Oid table with fn_extra.

Currently, getEnumLabelOids puts InvalidOid for labels that could not
successfully be looked up. This will probably be changed to
ereport(ERROR) instead.

Also, I'm thinking that getEnumLabelOids should be renamed to
something that's easier to remember. Maybe enum_oids or
get_enum_oids.

I should point out that I'm hoping to present a patch in a few weeks for
extensible enums, along the lines previously discussed on this list. I
have only just noticed this thread or I would have jumped in earlier.

Maybe what I'm doing won't impact much on this - it does cache enum oids
and their associated sort orders in the function context, but lazily -
the theory being that a retail comparison should not have to look up the
whole of a large enum set just to get two sort order values.

cheers

andrew