profiling connection overhead
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:
56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)
Non-default configs:
max_connections = 100
shared_buffers = 480MB
work_mem = 4MB
maintenance_work_mem = 128MB
checkpoint_segments = 30
random_page_cost = 2.0
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 24, 2010 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection.
Oh, right: attachments.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 24.11.2010 07:07, Robert Haas wrote:
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:
56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)
Oh, that's quite surprising.
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to
12% on my laptop.
For bigger gains, I think you need to somehow make the PrivateRefCount
smaller. Perhaps only use one byte for each buffer instead of int32, and
use some sort of an overflow list for the rare case that a buffer is
pinned more than 255 times. Or make it a hash table instead of a simple
lookup array. But whatever you do, you have to be very careful to not
add overhead to PinBuffer/UnPinBuffer, those can already be quite high
in oprofile reports of real applications. It might be worth
experimenting a bit, at the moment PrivateRefCount takes up 5MB of
memory per 1GB of shared_buffers. Machines with a high shared_buffers
setting have no shortage of memory, but a large array like that might
waste a lot of precious CPU cache.
Now, the other question is whether this really matters. Even if we eliminate
that loop in AtProcExit_Buffers altogether, will connect/disconnect still
be so slow that you have to use a connection pooler if you do that a lot?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
optimize-AtProcExit_Buffers-1.patch (text/x-diff, +10/-1)
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to 12%
on my laptop.
For bigger gains,
The first optimization that occurred to me was "remove the loop
altogether". I could maybe see needing to do something like this if
we're recovering from an error, but why do we need to do this (except
perhaps to fail an assertion) if we're exiting cleanly? Even a
session-lifetime buffer-pin leak would be quite disastrous, one would
think.
Now, the other question is whether this really matters. Even if we eliminate
that loop in AtProcExit_Buffers altogether, will connect/disconnect still be
so slow that you have to use a connection pooler if you do that a lot?
Oh, I'm sure this isn't going to be nearly enough to fix that problem,
but every little bit helps; and if we never do the first optimization,
we'll never get to #30 or wherever it is that it really starts to move
the needle.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to 12%
on my laptop.
That "micro-optimization" looks to me like your compiler leaves
something to be desired.
The first optimization that occurred to me was "remove the loop
altogether".
Or make it execute only in assert-enabled mode, perhaps.
This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.
The counter-argument might be that failing to remove a buffer pin would
be disastrous; but I can't see that it'd be worse than failing to remove
an LWLock, and we have no belt-and-suspenders-too loop for those.
regards, tom lane
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
The first optimization that occurred to me was "remove the loop
altogether".Or make it execute only in assert-enabled mode, perhaps.
This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.
Yeah, that's what I was thinking - this code would have been a good
backstop when our cleanup mechanisms were not as robust as they seem
to be today. But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0). I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Or make it execute only in assert-enabled mode, perhaps.
But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).
Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.
I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
Assert is sensible anywhere.
regards, tom lane
On Wed, Nov 24, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Or make it execute only in assert-enabled mode, perhaps.
But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).
Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.
I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
Assert is sensible anywhere.
OK, patch attached. Here's what oprofile output looks like with this applied:
3505 10.4396 libc-2.11.2.so memset
2051 6.1089 libc-2.11.2.so memcpy
1686 5.0217 postgres AllocSetAlloc
1642 4.8907 postgres hash_search_with_hash_value
1247 3.7142 libc-2.11.2.so _int_malloc
1096 3.2644 libc-2.11.2.so fread
855 2.5466 ld-2.11.2.so do_lookup_x
723 2.1535 ld-2.11.2.so _dl_fixup
645 1.9211 ld-2.11.2.so strcmp
620 1.8467 postgres MemoryContextAllocZero
Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
AtProcExit_Buffers.patch (application/octet-stream, +16/-17)
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
1. A comment would help, something like "Assert we released all buffer pins".
2. AtProcExit_LocalBuffers should be redone the same way, for
consistency (it likely won't make any performance difference).
Note the comment for AtProcExit_LocalBuffers, too; that probably
needs to be changed along the lines of "If we missed any, and
assertions aren't enabled, we'll fail later in DropRelFileNodeBuffers
while trying to drop the temp rels".
regards, tom lane
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
Revised patch attached.
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:
6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]
Not that helpful. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
AtProcExit_Buffers-v2.patch (application/octet-stream, +24/-16)
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
Revised patch attached.
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:
6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]
Not that helpful. :-(
Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis
Robert Haas
Regards,
Gerhard Heift
On Wed, Nov 24, 2010 at 1:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
It looks like the trick is to compile with -fno-omit-frame-pointer.
New profiling run:
27563 10.3470 libc-2.11.2.so memset
15162 5.6917 libc-2.11.2.so memcpy
13471 5.0569 postgres hash_search_with_hash_value
13465 5.0547 postgres AllocSetAlloc
9513 3.5711 libc-2.11.2.so _int_malloc
8729 3.2768 libc-2.11.2.so fread
6336 2.3785 ld-2.11.2.so do_lookup_x
5788 2.1728 ld-2.11.2.so _dl_fixup
4995 1.8751 postgres MemoryContextAllocZero
4978 1.8687 ld-2.11.2.so strcmp
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 19:01:32 Robert Haas wrote:
Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.
I think to do that sensibly you need CFLAGS="-O2 -fno-omit-frame-pointer"...
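Putting the pieces of this subthread together, the workflow being described looks roughly like the following (a sketch only; the opcontrol/opreport flags are the ones mentioned in this thread, and paths are assumptions):

```shell
# Rebuild with frame pointers kept so oprofile can walk the stack.
CFLAGS="-O2 -fno-omit-frame-pointer" ./configure
make && make install

# Profile with call-graph collection enabled.
opcontrol --callgraph=10 --start
# ... run the connect/disconnect loop against the server here ...
opcontrol --shutdown

# -c asks opreport for the call graph rather than a flat profile.
opreport -c
```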
Gerhard Heift <ml-postgresql-20081012-3518@gheift.de> writes:
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis
The critical piece of information is there now, but it wasn't a minute
ago.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Revised patch attached.
The asserts in AtProcExit_LocalBuffers are a bit pointless since
you forgot to remove the code that forcibly zeroes LocalRefCount[]...
otherwise +1.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
regards, tom lane
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Andres
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
I think you're missing the real point, which is that the cost you're
measuring here probably isn't so much memset() as faulting in large
chunks of address space. Avoiding the explicit memset() likely will
save little in real runtime --- it'll just make sure the initial-touch
costs are more distributed and harder to measure. But in any case I
think this idea is a nonstarter because it gets in the way of making
those hashtables expansible, which we *do* need to do eventually.
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
regards, tom lane