profiling connection overhead
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:
56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)
Non-default configs:
max_connections = 100
shared_buffers = 480MB
work_mem = 4MB
maintenance_work_mem = 128MB
checkpoint_segments = 30
random_page_cost = 2.0
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
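For reference, the test program was nothing fancy - roughly the following
minimal sketch (the connection string and iteration count are placeholders,
not the exact hacked-up testlibpq.c):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    int     i;

    for (i = 0; i < 10000; i++)
    {
        /* connect ... */
        PGconn *conn = PQconnectdb("dbname=postgres");

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            exit(1);
        }
        /* ... and disconnect immediately, doing no useful work */
        PQfinish(conn);
    }
    return 0;
}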
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 24, 2010 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection.
Oh, right: attachments.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
reconnect-opreport.txt (text/plain; charset=US-ASCII)
CPU: AMD64 family10, speed 2200 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
10087 1.5069 postgres MemoryContextAllocZero
9143 1.3659 ld-2.11.2.so strcmp
8957 1.3381 postgres load_relcache_init_file
7132 1.0655 postgres fmgr_info_cxt_security
5630 0.8411 libc-2.11.2.so vfprintf
5029 0.7513 ld-2.11.2.so _dl_lookup_symbol_x
4997 0.7465 postgres _bt_getroot
3935 0.5879 libc-2.11.2.so memcmp
3874 0.5788 postgres hash_seq_search
3718 0.5554 postgres LWLockAcquire
3666 0.5477 postgres guc_name_compare
3457 0.5165 libc-2.11.2.so __strlen_sse2
3297 0.4926 postgres load_relmap_file
3175 0.4743 libc-2.11.2.so malloc
3170 0.4736 postgres LockAcquireExtended
3139 0.4689 postgres hash_any
3113 0.4651 postgres MemoryContextAlloc
2946 0.4401 postgres _bt_compare
2936 0.4386 postgres index_getnext
2885 0.4310 ld-2.11.2.so _dl_sort_fini
2873 0.4292 libc-2.11.2.so _int_free
2619 0.3913 postgres MemoryContextCreate
2579 0.3853 ld-2.11.2.so check_match.12146
2485 0.3712 postgres heap_page_prune_opt
2457 0.3671 postgres LWLockRelease
2438 0.3642 postgres CreateTemplateTupleDesc
2322 0.3469 ld-2.11.2.so _dl_fini
2301 0.3438 postgres set_config_option
2253 0.3366 postgres _bt_first
2225 0.3324 postgres PinBuffer
2140 0.3197 postgres BeginReportingGUCOptions
2091 0.3124 postgres _bt_preprocess_keys
2085 0.3115 libc-2.11.2.so _IO_vfscanf
2051 0.3064 postgres element_alloc
1962 0.2931 postgres ServerLoop
1936 0.2892 postgres CreateTupleDescCopyConstr
1884 0.2815 libc-2.11.2.so __strcpy_sse2
1846 0.2758 libkrb5.so.3.3 /lib64/libkrb5.so.3.3
1801 0.2691 postgres FunctionCall2
1797 0.2685 postgres hash_create
1782 0.2662 postgres PgstatCollectorMain
1761 0.2631 postgres _bt_checkpage
1728 0.2582 postgres AllocSetFree
1597 0.2386 libselinux.so.1 /lib64/libselinux.so.1
1579 0.2359 libc-2.11.2.so _IO_default_xsputn
1543 0.2305 libc-2.11.2.so free
1531 0.2287 postgres SearchCatCache
1528 0.2283 postgres BuildHardcodedDescriptor
1506 0.2250 libc-2.11.2.so strchrnul
1475 0.2204 postgres _bt_checkkeys
1457 0.2177 postgres ResourceOwnerForgetRelationRef
1451 0.2168 ld-2.11.2.so _dl_runtime_resolve
1443 0.2156 postgres InitCatCache
1443 0.2156 postgres hash_search
1382 0.2065 ld-2.11.2.so _dl_name_match_p
1360 0.2032 postgres PostgresMain
1347 0.2012 postgres pgstat_report_stat
1342 0.2005 libssl.so.1.0.0b /usr/lib64/libssl.so.1.0.0b
1340 0.2002 postgres systable_beginscan
1311 0.1959 libgssapi_krb5.so.2.2 /lib64/libgssapi_krb5.so.2.2
1254 0.1873 postgres errstart
1247 0.1863 libc-2.11.2.so __strncmp_sse2
1245 0.1860 postgres ReadBuffer_common
1244 0.1858 libxml2.so.2.7.6 /usr/lib64/libxml2.so.2.7.6
1199 0.1791 libc-2.11.2.so fork
1164 0.1739 libc-2.11.2.so _IO_sgetn
1164 0.1739 postgres pgstat_initstats
1163 0.1737 postgres AllocSetContextCreate
1158 0.1730 postgres ProcessStartupPacket
1153 0.1723 postgres LockRelease
1130 0.1688 libc-2.11.2.so mempcpy
1124 0.1679 postgres InitCatalogCache
1120 0.1673 libk5crypto.so.3.1 /lib64/libk5crypto.so.3.1
1117 0.1669 postgres HeapTupleSatisfiesNow
1110 0.1658 postgres oid_hash
1101 0.1645 postgres InitPostgres
1069 0.1597 postgres nocachegetattr
1061 0.1585 postgres RelationBuildDesc
1059 0.1582 postgres tag_hash
1057 0.1579 libc-2.11.2.so bsearch
1054 0.1575 postgres hash_uint32
1010 0.1509 postgres SIGetDataEntries
955 0.1427 postgres RelationCacheInitializePhase3
953 0.1424 postgres get_tabstat_entry
932 0.1392 postgres InitProcess
926 0.1383 libc-2.11.2.so malloc_consolidate
882 0.1318 libc-2.11.2.so __GI_____strtoll_l_internal
881 0.1316 libcrypto.so.1.0.0b /lib64/libcrypto.so.1.0.0b
864 0.1291 postgres RelationMapOidToFilenode
863 0.1289 postgres IsSharedRelation
845 0.1262 postgres RecoveryInProgress
840 0.1255 postgres RelationInitPhysicalAddr
829 0.1238 libc-2.11.2.so snprintf
828 0.1237 libc-2.11.2.so _wordcopy_fwd_aligned
826 0.1234 postgres ReadBufferExtended
797 0.1191 postgres reaper
786 0.1174 libc-2.11.2.so random_r
783 0.1170 postgres RelationInitLockInfo
764 0.1141 libc-2.11.2.so _IO_str_init_static_internal
761 0.1137 postgres LockReleaseAll
760 0.1135 libc-2.11.2.so _IO_setb
760 0.1135 postgres ReportGUCOption
753 0.1125 postgres LockBuffer
744 0.1111 libkrb5support.so.0.1 /lib64/libkrb5support.so.0.1
741 0.1107 libc-2.11.2.so sigprocmask
740 0.1106 postgres DLAddHead
733 0.1095 libc-2.11.2.so _getopt_internal_r
728 0.1088 postgres CatalogCacheInitializeCache
724 0.1082 postgres _bt_binsrch
708 0.1058 libc-2.11.2.so _IO_un_link
707 0.1056 libc-2.11.2.so close
707 0.1056 postgres btoidcmp
699 0.1044 libcom_err.so.2.1 /lib64/libcom_err.so.2.1
690 0.1031 libc-2.11.2.so recv
687 0.1026 libc-2.11.2.so vsnprintf
668 0.0998 libc-2.11.2.so __errno_location
660 0.0986 postgres pgstat_bestart
659 0.0985 postgres CleanUpLock
654 0.0977 libc-2.11.2.so _IO_file_underflow@@GLIBC_2.2.5
654 0.0977 postgres smgropen
651 0.0973 libz.so.1.2.3 /lib64/libz.so.1.2.3
643 0.0961 libc-2.11.2.so __strstr_sse2
639 0.0955 postgres internal_flush
630 0.0941 postgres StartTransactionCommand
627 0.0937 postgres fmgr_info_cxt
626 0.0935 postgres _bt_fix_scankey_strategy
626 0.0935 postgres btrescan
613 0.0916 postgres _bt_readpage
610 0.0911 postgres _bt_getbuf
608 0.0908 libc-2.11.2.so fclose@@GLIBC_2.2.5
608 0.0908 postgres strlcpy
597 0.0892 postgres LockRelationOid
595 0.0889 postgres hba_getauthmethod
587 0.0877 libc-2.11.2.so strdup
587 0.0877 postgres AfterTriggerBeginXact
587 0.0877 postgres _ShowOption
586 0.0875 libc-2.11.2.so _IO_file_fopen@@GLIBC_2.2.5
586 0.0875 postgres CommitTransaction
571 0.0853 postgres RelationIncrementReferenceCount
568 0.0849 postgres ResourceOwnerEnlargeRelationRefs
566 0.0846 libc-2.11.2.so strtok
566 0.0846 postgres secure_read
563 0.0841 libpthread-2.11.2.so _pthread_cleanup_push_defer
556 0.0831 postgres _bt_relandgetbuf
547 0.0817 postgres _bt_moveright
546 0.0816 postgres PathNameOpenFile
541 0.0808 libc-2.11.2.so _IO_no_init
540 0.0807 libc-2.11.2.so memmove
540 0.0807 libc-2.11.2.so time
539 0.0805 postgres index_getprocinfo
538 0.0804 postgres _bt_mark_scankey_required
531 0.0793 postgres pg_char_to_encname_struct
525 0.0784 libc-2.11.2.so __strcmp_sse2
525 0.0784 libc-2.11.2.so __unregister_atfork
517 0.0772 libc-2.11.2.so _IO_flush_all_lockp
505 0.0754 postgres ClosePostmasterPorts
505 0.0754 postgres process_postgres_switches
503 0.0751 postgres relation_open
493 0.0737 libc-2.11.2.so open
491 0.0734 postgres pg_perm_setlocale
484 0.0723 postgres RelationGetIndexScan
483 0.0722 libc-2.11.2.so _IO_link_in
482 0.0720 postgres _bt_search
478 0.0714 postgres fmgr_info_copy
475 0.0710 postgres CleanupTempFiles
472 0.0705 postgres IsAutoVacuumLauncherProcess
472 0.0705 postgres prepare_for_client_read
463 0.0692 libc-2.11.2.so getpid
462 0.0690 libc-2.11.2.so calloc
462 0.0690 postgres heap_open
460 0.0687 postgres RelationInitIndexAccessInfo
454 0.0678 postgres internal_putbytes
452 0.0675 libc-2.11.2.so setitimer
448 0.0669 postgres LookupOpclassInfo
444 0.0663 ld-2.11.2.so _wordcopy_bwd_aligned
443 0.0662 postgres DirectFunctionCall1
439 0.0656 libc-2.11.2.so munmap
437 0.0653 postgres BufTableLookup
434 0.0648 postgres AllocateFile
434 0.0648 postgres systable_getnext
430 0.0642 postgres ValidatePgVersion
429 0.0641 postgres enable_sig_alarm
428 0.0639 postgres oideq
422 0.0630 postgres ReceiveSharedInvalidMessages
420 0.0627 postgres AssignPostmasterChildSlot
420 0.0627 postgres ReleaseBuffer
417 0.0623 libkeyutils-1.2.so /lib64/libkeyutils-1.2.so
417 0.0623 postgres guc_var_compare
416 0.0621 libc-2.11.2.so fopen@@GLIBC_2.2.5
416 0.0621 postgres shmem_exit
413 0.0617 postgres DLInitElem
413 0.0617 postgres heapgettup
412 0.0616 libc-2.11.2.so __add_to_environ
411 0.0614 postgres RelationDecrementReferenceCount
409 0.0611 libc-2.11.2.so random
403 0.0602 postgres RelationIdGetRelation
401 0.0599 postgres elog_start
400 0.0598 libc-2.11.2.so __underflow
399 0.0596 libc-2.11.2.so __default_morecore
399 0.0596 libc-2.11.2.so setlocale
393 0.0587 libc-2.11.2.so __sigsetjmp
390 0.0583 postgres ClientAuthentication
387 0.0578 libc-2.11.2.so _IO_file_close_it@@GLIBC_2.2.5
386 0.0577 postgres ResourceOwnerForgetBuffer
386 0.0577 postgres index_open
385 0.0575 postgres pfree
384 0.0574 libc-2.11.2.so ptmalloc_unlock_all
378 0.0565 libc-2.11.2.so sbrk
371 0.0554 postgres ResourceOwnerEnlargeBuffers
371 0.0554 postgres hash_seq_init
367 0.0548 postgres ShowTransactionState
366 0.0547 postgres BufTableHashCode
364 0.0544 postgres MemoryContextStrdup
361 0.0539 libc-2.11.2.so sigemptyset
361 0.0539 postgres index_endscan
359 0.0536 postgres truncate_identifier
357 0.0533 postgres GetSnapshotData
355 0.0530 postgres index_beginscan_internal
354 0.0529 postgres UnpinBuffer
350 0.0523 libc-2.11.2.so _itoa_word
350 0.0523 libc-2.11.2.so srandom_r
347 0.0518 libc-2.11.2.so send
346 0.0517 postgres BasicOpenFile
345 0.0515 postgres lcons_oid
344 0.0514 libc-2.11.2.so __libc_sigaction
341 0.0509 postgres ReadBuffer
340 0.0508 libc-2.11.2.so poll
337 0.0503 postgres GetRedoRecPtr
336 0.0502 libc-2.11.2.so __libc_fini
335 0.0500 libc-2.11.2.so read
335 0.0500 postgres btgettuple
328 0.0490 postgres RelationCacheInitializePhase2
327 0.0489 postgres list_member_oid
326 0.0487 libc-2.11.2.so __read_nocancel
326 0.0487 postgres ReindexIsProcessingIndex
326 0.0487 postgres ReleaseAndReadBuffer
326 0.0487 postgres pgstat_initialize
324 0.0484 libc-2.11.2.so mmap
323 0.0483 postgres call_string_assign_hook
322 0.0481 postgres set_spins_per_delay
321 0.0480 libc-2.11.2.so fflush
321 0.0480 postgres MemoryContextDelete
317 0.0474 postgres ResourceOwnerReleaseInternal
316 0.0472 libc-2.11.2.so _IO_file_init@@GLIBC_2.2.5
316 0.0472 postgres btbeginscan
315 0.0471 postgres UnlockRelationId
313 0.0468 postgres SearchSysCache
313 0.0468 postgres pgstat_beshutdown_hook
312 0.0466 postgres SetConfigOption
311 0.0465 postgres SharedInvalBackendInit
306 0.0457 postgres initStringInfo
305 0.0456 postgres systable_endscan
301 0.0450 postgres ScanKeyEntryInitializeWithInfo
301 0.0450 postgres seg_alloc
300 0.0448 libc-2.11.2.so _IO_cleanup
300 0.0448 postgres assign_session_authorization
296 0.0442 libc-2.11.2.so __ctype_b_loc
296 0.0442 postgres DisableNotifyInterrupt
295 0.0441 libc-2.11.2.so __cxa_finalize
295 0.0441 libpthread-2.11.2.so __reclaim_stacks
294 0.0439 postgres set_ps_display
293 0.0438 libc-2.11.2.so ____strtoull_l_internal
292 0.0436 ld-2.11.2.so memmove
291 0.0435 postgres FunctionCall3
291 0.0435 postgres TransactionIdPrecedes
291 0.0435 postgres btendscan
290 0.0433 libnss_files-2.11.2.so __do_global_dtors_aux
289 0.0432 postgres pqsignal
288 0.0430 postgres initialize_acl
287 0.0429 postgres IsAutoVacuumWorkerProcess
286 0.0427 postgres SetClientEncoding
282 0.0421 postgres pq_putmessage
281 0.0420 libpthread-2.11.2.so pthread_mutex_lock
280 0.0418 libc-2.11.2.so _IO_list_unlock
279 0.0417 postgres LockTagHashCode
278 0.0415 postgres config_enum_lookup_by_value
278 0.0415 postgres pg_strncasecmp
276 0.0412 postgres mdnblocks
273 0.0408 postgres AtStart_GUC
273 0.0408 postgres ResourceOwnerRememberRelationRef
272 0.0406 libc-2.11.2.so __select_nocancel
272 0.0406 postgres BackendInitialize
272 0.0406 postgres PreCommit_Notify
271 0.0405 libc-2.11.2.so exit
271 0.0405 postgres GetCurrentSubTransactionId
269 0.0402 postgres FreeDesc
268 0.0400 libc-2.11.2.so _IO_file_doallocate
268 0.0400 postgres RelationCacheInitialize
268 0.0400 postgres ReleaseResources_hash
266 0.0397 postgres CommitTransactionCommand
265 0.0396 libc-2.11.2.so fscanf
265 0.0396 postgres index_rescan
263 0.0393 postgres ApplySetting
262 0.0391 libpthread-2.11.2.so pthread_rwlock_wrlock
258 0.0385 libc-2.11.2.so llseek
257 0.0384 libresolv-2.11.2.so __do_global_dtors_aux
256 0.0382 libm-2.11.2.so __do_global_dtors_aux
256 0.0382 postgres CatalogCacheComputeHashValue
255 0.0381 postgres get_role_oid
253 0.0378 postgres AtCommit_Notify
253 0.0378 postgres LockAcquire
253 0.0378 postgres pgstat_report_xact_timestamp
252 0.0376 postgres nameeq
250 0.0373 postgres GetTransactionSnapshot
248 0.0370 postgres ProcQueueInit
246 0.0368 postgres GetCurrentTimestamp
245 0.0366 libc-2.11.2.so _getopt_internal
243 0.0363 postgres CatalogCacheCreateEntry
242 0.0362 libc-2.11.2.so vsprintf
242 0.0362 postgres AtEOXact_ComboCid
242 0.0362 postgres SHMQueueElemInit
240 0.0359 postgres ReleaseCatCache
240 0.0359 postgres get_hash_value
238 0.0356 postgres CreateCacheMemoryContext
238 0.0356 postgres GetCurrentTransactionNestLevel
237 0.0354 libc-2.11.2.so semctl
236 0.0353 postgres PreCommit_on_commit_actions
235 0.0351 libc-2.11.2.so ptmalloc_unlock_all2
235 0.0351 postgres extractRelOptions
234 0.0350 postgres CleanupInvalidationState
234 0.0350 postgres RelationMapInitialize
232 0.0347 postgres smgrGetPendingDeletes
231 0.0345 postgres InitPlanCache
229 0.0342 libc-2.11.2.so brk
229 0.0342 postgres ProcKill
229 0.0342 postgres parse_bool_with_len
228 0.0341 libc-2.11.2.so _IO_old_init
228 0.0341 postgres EnablePortalManager
228 0.0341 postgres RemoveLocalLock
227 0.0339 libc-2.11.2.so select
227 0.0339 postgres SHMQueueInit
227 0.0339 postgres enlargeStringInfo
226 0.0338 postgres AfterTriggerEndXact
226 0.0338 postgres mdopen
224 0.0335 libc-2.11.2.so putenv
224 0.0335 postgres find_option
223 0.0333 libc-2.11.2.so getopt
220 0.0329 postgres elog_finish
220 0.0329 postgres pq_beginmessage
219 0.0327 postgres InitBufferPoolBackend
219 0.0327 postgres PGSemaphoreReset
219 0.0327 postgres pg_encoding_max_length
218 0.0326 libpthread-2.11.2.so sem_destroy@@GLIBC_2.2.5
218 0.0326 postgres SetDatabaseEncoding
217 0.0324 libpthread-2.11.2.so _fini
216 0.0323 libc-2.11.2.so getsockname
216 0.0323 libm-2.11.2.so _fini
215 0.0321 postgres init_ps_display
213 0.0318 libnss_files-2.11.2.so _fini
212 0.0317 postgres InitializeSearchPath
211 0.0315 postgres InitDeadLockChecking
211 0.0315 postgres btnamecmp
211 0.0315 postgres parse_bool
209 0.0312 libdl-2.11.2.so _fini
209 0.0312 libresolv-2.11.2.so _fini
209 0.0312 postgres ResourceOwnerEnlargeCatCacheRefs
207 0.0309 postgres AtEOXact_MultiXact
207 0.0309 postgres ProcSignalInit
207 0.0309 postgres disable_sig_alarm
207 0.0309 postgres heap_getnext
206 0.0308 postgres client_read_ended
206 0.0308 postgres heapgetpage
205 0.0306 postgres ResourceOwnerCreate
205 0.0306 postgres pq_getbytes
204 0.0305 postgres CountChildren
204 0.0305 postgres on_exit_reset
202 0.0302 postgres int2gt
201 0.0300 postgres CacheRegisterSyscacheCallback
201 0.0300 postgres EnableNotifyInterrupt
201 0.0300 postgres index_close
201 0.0300 postgres pg_get_timezone_name
200 0.0299 libc-2.11.2.so _int_realloc
200 0.0299 libc-2.11.2.so srandom
199 0.0297 postgres ProcArrayEndTransaction
197 0.0294 libc-2.11.2.so waitpid
197 0.0294 postgres InitBufferPoolAccess
196 0.0293 postgres _bt_steppage
194 0.0290 postgres SHMQueueInsertBefore
193 0.0288 postgres appendBinaryStringInfo
192 0.0287 libc-2.11.2.so vfscanf
191 0.0285 postgres AcceptInvalidationMessages
191 0.0285 postgres AfterTriggerFireDeferred
191 0.0285 postgres RelationMapInitializePhase2
190 0.0284 postgres proc_exit_prepare
189 0.0282 postgres VirtualXactLockTableInsert
187 0.0279 postgres pg_getnameinfo_all
186 0.0278 postgres pgstat_get_db_entry
184 0.0275 postgres PostmasterRandom
179 0.0267 postgres EnableCatchupInterrupt
179 0.0267 postgres InSecurityRestrictedOperation
178 0.0266 postgres ResourceOwnerForgetCatCacheRef
175 0.0261 postgres DebugFileOpen
175 0.0261 postgres ProcArrayAdd
174 0.0260 postgres LockCheckConflicts
173 0.0258 libc-2.11.2.so gettimeofday
173 0.0258 postgres pq_sendint
172 0.0257 postgres InitFileAccess
172 0.0257 postgres guc_strdup
172 0.0257 postgres heap_copytuple_with_tuple
171 0.0255 libc-2.11.2.so _IO_file_finish@@GLIBC_2.2.5
171 0.0255 libc-2.11.2.so realloc
170 0.0254 postgres index_beginscan
170 0.0254 postgres process_startup_options
168 0.0251 postgres AbortOutOfAnyTransaction
168 0.0251 postgres ShutdownPostgres
167 0.0249 libc-2.11.2.so _IO_file_close
167 0.0249 postgres LogChildExit
167 0.0249 postgres on_shmem_exit
165 0.0247 postgres pq_recvbuf
164 0.0245 postgres SHMQueueDelete
163 0.0244 postgres IndexScanEnd
163 0.0244 postgres relpathbackend
162 0.0242 postgres AbortBufferIO
161 0.0241 libc-2.11.2.so __fopen_internal
159 0.0238 postgres ReadyForQuery
159 0.0238 postgres StreamConnection
159 0.0238 postgres timestamptz_to_time_t
158 0.0236 postgres relation_close
156 0.0233 libc-2.11.2.so __close_nocancel
156 0.0233 postgres ResourceOwnerRememberBuffer
155 0.0232 postgres pg_split_opts
154 0.0230 libc-2.11.2.so _IO_doallocbuf
154 0.0230 postgres SHMQueueNext
154 0.0230 postgres smgrsetowner
150 0.0224 libc-2.11.2.so _exit
150 0.0224 postgres initscan
150 0.0224 postgres pg_server_to_client
150 0.0224 postgres pq_sendstring
150 0.0224 postgres set_string_field
149 0.0223 postgres smgrinit
142 0.0212 postgres _bt_next
141 0.0211 postgres FileSeek
141 0.0211 postgres IsTransactionOrTransactionBlock
139 0.0208 libc-2.11.2.so sigdelset
138 0.0206 postgres RelationParseRelOptions
138 0.0206 postgres heap_beginscan_internal
137 0.0205 libc-2.11.2.so _IO_file_stat
137 0.0205 libc-2.11.2.so __fxstat64
137 0.0205 libc-2.11.2.so access
137 0.0205 postgres GetDatabaseTuple
137 0.0205 postgres ProcArrayRemove
135 0.0202 postgres secure_write
134 0.0200 libpthread-2.11.2.so __do_global_dtors_aux
133 0.0199 libc-2.11.2.so _IO_file_read
129 0.0193 postgres ProcessCompletedNotifies
129 0.0193 postgres pg_encoding_mbcliplen
129 0.0193 postgres pqinitmask
128 0.0191 libc-2.11.2.so _IO_file_sync@@GLIBC_2.2.5
126 0.0188 libc-2.11.2.so sprintf
126 0.0188 postgres pq_init
123 0.0184 postgres pq_endmessage
121 0.0181 libc-2.11.2.so strtol
121 0.0181 postgres AtEOXact_LargeObject
121 0.0181 postgres __do_global_dtors_aux
120 0.0179 postgres ScanPgRelation
118 0.0176 postgres InitializeSessionUserId
117 0.0175 postgres heap_copytuple
116 0.0173 postgres AllocSetDelete
116 0.0173 postgres secure_close
115 0.0172 postgres DLRemove
115 0.0172 postgres GrantLock
113 0.0169 libc-2.11.2.so _IO_iter_begin
113 0.0169 libdl-2.11.2.so fini
112 0.0167 postgres PostmasterStateMachine
111 0.0166 libc-2.11.2.so _IO_unsave_markers
111 0.0166 postgres GetDatabaseEncodingName
109 0.0163 postgres AtEOXact_Inval
109 0.0163 postgres DynaHashAlloc
108 0.0161 postgres _bt_freestack
108 0.0161 postgres proc_exit
107 0.0160 postgres CleanupProcSignalState
106 0.0158 postgres assign_client_encoding
106 0.0158 postgres superuser_arg
105 0.0157 postgres fork_process
104 0.0155 postgres show_session_authorization
102 0.0152 postgres AtEarlyCommit_Snapshot
102 0.0152 postgres ConnFree
102 0.0152 postgres InitXLOGAccess
101 0.0151 postgres GetDatabasePath
101 0.0151 postgres LockWaitCancel
101 0.0151 postgres ResourceOwnerDelete
98 0.0146 libdl-2.11.2.so __do_global_dtors_aux
97 0.0145 postgres GetNextLocalTransactionId
95 0.0142 libc-2.11.2.so strnlen
95 0.0142 postgres MarkPostmasterChildActive
94 0.0140 postgres AtEOXact_RelationMap
94 0.0140 postgres AtEOXact_SPI
94 0.0140 postgres RelationMapInitializePhase3
92 0.0137 postgres GetSysCacheOid
91 0.0136 postgres CommitHoldablePortals
91 0.0136 postgres _fini
91 0.0136 postgres show_timezone
90 0.0134 libc-2.11.2.so sigaction
90 0.0134 postgres GetPgIndexDescriptor
89 0.0133 postgres ResourceOwnerRelease
89 0.0133 postgres pgstat_send_tabstat
89 0.0133 postgres pq_getmessage
87 0.0130 libc-2.11.2.so accept
85 0.0127 postgres AtEOXact_GUC
85 0.0127 postgres CacheRegisterRelcacheCallback
84 0.0125 postgres AtEOXact_Namespace
84 0.0125 postgres FreeFile
84 0.0125 postgres GetSessionUserId
84 0.0125 postgres TimestampDifferenceExceeds
84 0.0125 postgres pq_getbyte
83 0.0124 libc-2.11.2.so _IO_switch_to_get_mode
83 0.0124 postgres ScanKeyInit
83 0.0124 postgres TransactionBlockStatusCode
82 0.0123 postgres TransactionStartedDuringRecovery
81 0.0121 libc-2.11.2.so __fopen_maybe_mmap
81 0.0121 postgres AtEOXact_PgStat
81 0.0121 postgres AtStart_Inval
81 0.0121 postgres pq_sendbyte
81 0.0121 postgres process_local_preload_libraries
80 0.0120 postgres IsAbortedTransactionBlockState
79 0.0118 libpthread-2.11.2.so pthread_mutex_unlock
79 0.0118 postgres LockSharedObject
79 0.0118 postgres ReleasePostmasterChildSlot
78 0.0117 libc-2.11.2.so __uflow
77 0.0115 postgres InLocalUserIdChange
77 0.0115 postgres mdinit
75 0.0112 postgres afterTriggerMarkEvents
75 0.0112 postgres hashname
74 0.0111 postgres AtCommit_Portals
73 0.0109 postgres InitializeClientEncoding
73 0.0109 postgres TimestampDifference
73 0.0109 postgres hash_seq_term
71 0.0106 postgres FunctionCall1
71 0.0106 postgres pg_utf_mblen
71 0.0106 postgres resetStringInfo
70 0.0105 postgres AtEOXact_HashTables
70 0.0105 postgres RemoveProcFromArray
70 0.0105 postgres load_libraries
69 0.0103 postgres AtProcExit_LocalBuffers
69 0.0103 postgres SetSessionAuthorization
68 0.0102 postgres superuser
66 0.0099 postgres heap_getsysattr
65 0.0097 libc-2.11.2.so __open_nocancel
65 0.0097 libc-2.11.2.so __strchr_sse2
65 0.0097 postgres _mdnblocks
65 0.0097 postgres heap_beginscan
64 0.0096 postgres heap_endscan
63 0.0094 postgres ProcessInvalidationMessagesMulti
62 0.0093 postgres hashoid
62 0.0093 postgres pq_close
61 0.0091 postgres IsTransactionState
61 0.0091 postgres PageGetHeapFreeSpace
60 0.0090 libc-2.11.2.so ptmalloc_lock_all
60 0.0090 postgres AtEOXact_Snapshot
60 0.0090 postgres InitProcessPhase2
60 0.0090 postgres RelationGetNumberOfBlocks
58 0.0087 postgres AtEOXact_on_commit_actions
58 0.0087 postgres UnlockBuffers
58 0.0087 postgres new_list
58 0.0087 postgres pg_valid_client_encoding
58 0.0087 postgres pq_flush
57 0.0085 libpthread-2.11.2.so _pthread_cleanup_pop_restore
57 0.0085 postgres DLMoveToFront
57 0.0085 postgres pgstat_clear_snapshot
55 0.0082 libc-2.11.2.so setsid
55 0.0082 postgres AtEOXact_Files
55 0.0082 postgres sendAuthRequest
54 0.0081 libpthread-2.11.2.so pthread_rwlock_unlock
54 0.0081 postgres pg_char_to_encoding
53 0.0079 libc-2.11.2.so strtoul
53 0.0079 postgres GetCurrentTransactionStopTimestamp
53 0.0079 postgres on_proc_exit
52 0.0078 libc-2.11.2.so _IO_list_lock
52 0.0078 postgres pg_mbcliplen
51 0.0076 libc-2.11.2.so _IO_default_finish
51 0.0076 postgres IpcSemaphoreInitialize
50 0.0075 libc-2.11.2.so __restore_rt
48 0.0072 postgres pgstat_report_activity
44 0.0066 libc-2.11.2.so _IO_default_uflow
44 0.0066 postgres AtProcExit_Files
44 0.0066 postgres LWLockReleaseAll
43 0.0064 postgres ResourceOwnerNewParent
43 0.0064 postgres SetDatabasePath
42 0.0063 postgres BaseInit
42 0.0063 postgres smgrnblocks
40 0.0060 libc-2.11.2.so sigfillset
40 0.0060 postgres MarkPostmasterChildInactive
39 0.0058 postgres GetCurrentCommandId
39 0.0058 postgres atexit_callback
38 0.0057 libc-2.11.2.so _IO_sputbackc
38 0.0057 libc-2.11.2.so __sigjmp_save
38 0.0057 postgres BufferGetBlockNumber
35 0.0052 postgres pgstat_send
34 0.0051 postgres AtEOXact_RelationCache
34 0.0051 postgres StreamClose
33 0.0049 postgres fmgr_info
32 0.0048 postgres AllocSetInit
30 0.0045 postgres update_spins_per_delay
25 0.0037 postgres UnlockReleaseBuffer
24 0.0036 postgres ResourceOwnerRememberCatCacheRef
24 0.0036 postgres canAcceptConnections
22 0.0033 postgres DisableCatchupInterrupt
21 0.0031 postgres BgBufferSync
21 0.0031 postgres RelationClose
21 0.0031 postgres XLogBackgroundFlush
21 0.0031 postgres smgrDoPendingDeletes
20 0.0030 libc-2.11.2.so __lseek_nocancel
20 0.0030 postgres GetCurrentStatementStartTimestamp
20 0.0030 postgres MemoryContextDeleteChildren
18 0.0027 postgres MemoryContextResetAndDeleteChildren
17 0.0025 postgres pgstat_report_appname
16 0.0024 libc-2.11.2.so __accept_nocancel
16 0.0024 postgres AllocSetReset
15 0.0022 postgres pg_usleep
14 0.0021 postgres GetUserId
13 0.0019 postgres ProcReleaseLocks
13 0.0019 postgres SyncOneBuffer
12 0.0018 postgres WalWriterMain
12 0.0018 postgres btint2cmp
11 0.0016 libc-2.11.2.so _IO_iter_end
11 0.0016 libc-2.11.2.so _IO_list_resetlock
11 0.0016 libc-2.11.2.so __strtol_internal
11 0.0016 libc-2.11.2.so getppid
9 0.0013 libc-2.11.2.so _IO_iter_file
9 0.0013 postgres AtEOXact_Buffers
9 0.0013 postgres CheckArchiveTimeout
6 9.0e-04 postgres GetUserIdAndSecContext
5 7.5e-04 postgres AbsorbFsyncRequests
4 6.0e-04 postgres pgstat_send_bgwriter
3 4.5e-04 postgres BackgroundWriterMain
3 4.5e-04 postgres StrategySyncStart
3 4.5e-04 postgres _bt_relbuf
2 3.0e-04 libc-2.11.2.so _IO_iter_next
2 3.0e-04 postgres PostmasterIsAlive
1 1.5e-04 postgres AtEOXact_LocalBuffers
1 1.5e-04 postgres AutoVacLauncherMain
1 1.5e-04 postgres BgWriterNap
1 1.5e-04 postgres ReleaseSysCache
1 1.5e-04 postgres backend_read_statsfile
1 1.5e-04 postgres heap_freetuple
1 1.5e-04 postgres smgrshutdown
On 24.11.2010 07:07, Robert Haas wrote:
Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:
120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:
56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)
Oh, that's quite surprising.
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to
12% on my laptop.
For bigger gains, I think you need to somehow make the PrivateRefCount
smaller. Perhaps only use one byte for each buffer instead of int32, and
use some sort of an overflow list for the rare case that a buffer is
pinned more than 255 times. Or make it a hash table instead of a simple
lookup array. But whatever you do, you have to be very careful to not
add overhead to PinBuffer/UnPinBuffer, those can already be quite high
in oprofile reports of real applications. It might be worth
experimenting a bit, at the moment PrivateRefCount takes up 5MB of
memory per 1GB of shared_buffers. Machines with a high shared_buffers
setting have no shortage of memory, but a large array like that might
waste a lot of precious CPU cache.
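To make the one-byte-plus-overflow idea concrete, a rough self-contained
sketch could look like the following (the names and the ad hoc overflow list
are just for illustration, not a proposed patch; the count array is assumed
to be calloc'd to NBuffers elsewhere, and the overflow fields start out
NULL/zero):

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint8_t    *count;          /* one byte per buffer, indexed by buffer id */
    int         noverflow;      /* entries in use in the arrays below */
    int        *ovfl_buf;       /* buffer ids pinned more than 255 times */
    uint32_t   *ovfl_extra;     /* extra pins for those buffers */
} PinCounts;

static void
pin_buffer(PinCounts *pc, int buf)
{
    int         i;

    if (pc->count[buf] < UINT8_MAX)
    {
        pc->count[buf]++;       /* fast path: just bump the byte */
        return;
    }

    /* rare path: record the extra pin in the overflow list */
    for (i = 0; i < pc->noverflow; i++)
    {
        if (pc->ovfl_buf[i] == buf)
        {
            pc->ovfl_extra[i]++;
            return;
        }
    }
    pc->ovfl_buf = realloc(pc->ovfl_buf, (pc->noverflow + 1) * sizeof(int));
    pc->ovfl_extra = realloc(pc->ovfl_extra,
                             (pc->noverflow + 1) * sizeof(uint32_t));
    pc->ovfl_buf[pc->noverflow] = buf;
    pc->ovfl_extra[pc->noverflow] = 1;
    pc->noverflow++;
}

/* unpin_buffer would mirror this: while the byte is saturated at 255, drain
 * the overflow entry first, and only then start decrementing the byte. */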
Now, the other question is if this really matters. Even if we eliminate
that loop in AtProcExit_Buffers altogether, is connect/disconnect still
so slow that you have to use a connection pooler if you do that a lot?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
optimize-AtProcExit_Buffers-1.patch (text/x-diff)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 54c7109..03593fd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1665,11 +1665,20 @@ static void
AtProcExit_Buffers(int code, Datum arg)
{
int i;
+ int *ptr;
+ int *end;
AbortBufferIO();
UnlockBuffers();
- for (i = 0; i < NBuffers; i++)
+ /* Fast search for the first non-zero entry in PrivateRefCount */
+ end = (int *) &PrivateRefCount[NBuffers - 1];
+ ptr = (int *) PrivateRefCount;
+ while(ptr < end && *ptr == 0)
+ ptr++;
+ i = ((int32 *) ptr) - PrivateRefCount;
+
+ for (;i < NBuffers; i++)
{
if (PrivateRefCount[i] != 0)
{
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to 12%
on my laptop.
For bigger gains,
The first optimization that occurred to me was "remove the loop
altogether". I could maybe see needing to do something like this if
we're recovering from an error, but why do we need to do this (except
perhaps to fail an assertion) if we're exiting cleanly? Even a
session-lifetime buffer-pin leak would be quite disastrous, one would
think.
Now, the other question is if this really matters. Even if we eliminate that
loop in AtProcExit_Buffers altogether, is connect/disconnect still so
slow that you have to use a connection pooler if you do that a lot?
Oh, I'm sure this isn't going to be nearly enough to fix that problem,
but every little bit helps; and if we never do the first optimization,
we'll never get to #30 or wherever it is that it really starts to move
the needle.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to 12%
on my laptop.
That "micro-optimization" looks to me like your compiler leaves
something to be desired.
The first optimization that occurred to me was "remove the loop
altogether".
Or make it execute only in assert-enabled mode, perhaps.
This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.
The counter-argument might be that failing to remove a buffer pin would
be disastrous; but I can't see that it'd be worse than failing to remove
an LWLock, and we have no belt-and-suspenders-too loop for those.
regards, tom lane
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
The first optimization that occurred to me was "remove the loop
altogether".Or make it execute only in assert-enabled mode, perhaps.
This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.
Yeah, that's what I was thinking - this code would have been a good
backstop when our cleanup mechanisms were not as robust as they seem
to be today. But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0). I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Or make it execute only in assert-enabled mode, perhaps.
But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).
Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.
I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
Assert is sensible anywhere.
regards, tom lane
On Wed, Nov 24, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Or make it execute only in assert-enabled mode, perhaps.
But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0).
Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.
I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
Assert is sensible anywhere.
OK, patch attached. Here's what oprofile output looks like with this applied:
3505 10.4396 libc-2.11.2.so memset
2051 6.1089 libc-2.11.2.so memcpy
1686 5.0217 postgres AllocSetAlloc
1642 4.8907 postgres hash_search_with_hash_value
1247 3.7142 libc-2.11.2.so _int_malloc
1096 3.2644 libc-2.11.2.so fread
855 2.5466 ld-2.11.2.so do_lookup_x
723 2.1535 ld-2.11.2.so _dl_fixup
645 1.9211 ld-2.11.2.so strcmp
620 1.8467 postgres MemoryContextAllocZero
Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
AtProcExit_Buffers.patch (application/octet-stream)
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 1664,1689 **** InitBufferPoolBackend(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- int i;
-
AbortBufferIO();
UnlockBuffers();
! for (i = 0; i < NBuffers; i++)
{
! if (PrivateRefCount[i] != 0)
! {
! volatile BufferDesc *buf = &(BufferDescriptors[i]);
! /*
! * We don't worry about updating ResourceOwner; if we even got
! * here, it suggests that ResourceOwners are messed up.
! */
! PrivateRefCount[i] = 1; /* make sure we release shared pin */
! UnpinBuffer(buf, false);
Assert(PrivateRefCount[i] == 0);
}
}
/* localbuf.c needs a chance too */
AtProcExit_LocalBuffers();
--- 1664,1683 ----
static void
AtProcExit_Buffers(int code, Datum arg)
{
AbortBufferIO();
UnlockBuffers();
! #ifdef USE_ASSERT_CHECKING
! if (assert_enabled)
{
! int i;
! for (i = 0; i < NBuffers; i++)
! {
Assert(PrivateRefCount[i] == 0);
}
}
+ #endif
/* localbuf.c needs a chance too */
AtProcExit_LocalBuffers();
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
1. A comment would help, something like "Assert we released all buffer pins".
2. AtProcExit_LocalBuffers should be redone the same way, for
consistency (it likely won't make any performance difference).
Note the comment for AtProcExit_LocalBuffers, too; that probably
needs to be changed along the lines of "If we missed any, and
assertions aren't enabled, we'll fail later in DropRelFileNodeBuffers
while trying to drop the temp rels".
regards, tom lane
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
Revised patch attached.
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:
6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]
Not that helpful. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
AtProcExit_Buffers-v2.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 54c7109..edc4977 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1659,31 +1659,26 @@ InitBufferPoolBackend(void)
}
/*
- * Ensure we have released all shared-buffer locks and pins during backend exit
+ * During backend exit, ensure that we released all shared-buffer locks and
+ * assert that we have no remaining pins.
*/
static void
AtProcExit_Buffers(int code, Datum arg)
{
- int i;
-
AbortBufferIO();
UnlockBuffers();
- for (i = 0; i < NBuffers; i++)
+#ifdef USE_ASSERT_CHECKING
+ if (assert_enabled)
{
- if (PrivateRefCount[i] != 0)
- {
- volatile BufferDesc *buf = &(BufferDescriptors[i]);
+ int i;
- /*
- * We don't worry about updating ResourceOwner; if we even got
- * here, it suggests that ResourceOwners are messed up.
- */
- PrivateRefCount[i] = 1; /* make sure we release shared pin */
- UnpinBuffer(buf, false);
+ for (i = 0; i < NBuffers; i++)
+ {
Assert(PrivateRefCount[i] == 0);
}
}
+#endif
/* localbuf.c needs a chance too */
AtProcExit_LocalBuffers();
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 46fddde..efaeca4 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -468,9 +468,10 @@ AtEOXact_LocalBuffers(bool isCommit)
/*
* AtProcExit_LocalBuffers - ensure we have dropped pins during backend exit.
*
- * This is just like AtProcExit_Buffers, but for local buffers. We have
- * to drop pins to ensure that any attempt to drop temp files doesn't
- * fail in DropRelFileNodeBuffers.
+ * This is just like AtProcExit_Buffers, but for local buffers. We shouldn't
+ * be holding any remaining pins; if we are, and assertions aren't enabled,
+ * we'll fail later in DropRelFileNodeBuffers while trying to drop the temp
+ * rels.
*/
void
AtProcExit_LocalBuffers(void)
@@ -478,4 +479,16 @@ AtProcExit_LocalBuffers(void)
/* just zero the refcounts ... */
if (LocalRefCount)
MemSet(LocalRefCount, 0, NLocBuffer * sizeof(*LocalRefCount));
+
+#ifdef USE_ASSERT_CHECKING
+ if (assert_enabled && LocalRefCount)
+ {
+ int i;
+
+ for (i = 0; i < NLocBuffer; i++)
+ {
+ Assert(LocalRefCount[i] == 0);
+ }
+ }
+#endif
}
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
OK, patch attached.
Two comments:
Revised patch attached.
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:
6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]
Not that helpful. :-(
Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis
Regards,
Gerhard Heift
On Wed, Nov 24, 2010 at 1:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
It looks like the trick is to compile with -fno-omit-frame-pointer.
New profiling run:
27563 10.3470 libc-2.11.2.so memset
15162 5.6917 libc-2.11.2.so memcpy
13471 5.0569 postgres hash_search_with_hash_value
13465 5.0547 postgres AllocSetAlloc
9513 3.5711 libc-2.11.2.so _int_malloc
8729 3.2768 libc-2.11.2.so fread
6336 2.3785 ld-2.11.2.so do_lookup_x
5788 2.1728 ld-2.11.2.so _dl_fixup
4995 1.8751 postgres MemoryContextAllocZero
4978 1.8687 ld-2.11.2.so strcmp
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 19:01:32 Robert Haas wrote:
Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.
I think to do that sensibly you need CFLAGS="-O2 -fno-omit-frame-pointer"...
Gerhard Heift <ml-postgresql-20081012-3518@gheift.de> writes:
On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis
The critical piece of information is there now, but it wasn't a minute
ago.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Revised patch attached.
The asserts in AtProcExit_LocalBuffers are a bit pointless since
you forgot to remove the code that forcibly zeroes LocalRefCount[]...
otherwise +1.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
regards, tom lane
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
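Roughly like this, as a minimal sketch (the arena size and the 16-byte
alignment are made up for illustration):

#include <stddef.h>

#define NUM_ZERO_BYTES_WE_NEED  (1024 * 1024)

static char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
static size_t zero_bytes_used = 0;

static void *
alloc_zeroed_chunk(size_t size)
{
    char       *chunk;

    /* the array lives in .bss, so its pages arrive from the kernel already
     * zeroed and no explicit memset is needed */
    size = (size + 15) & ~(size_t) 15;
    if (zero_bytes_used + size > NUM_ZERO_BYTES_WE_NEED)
        return NULL;            /* arena exhausted; caller must cope */
    chunk = lotsa_zero_bytes + zero_bytes_used;
    zero_bytes_used += size;
    return chunk;
}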
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Andres
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
I think you're missing the real point, which is that the cost you're
measuring here probably isn't so much memset() as faulting in large
chunks of address space. Avoiding the explicit memset() likely will
save little in real runtime --- it'll just make sure the initial-touch
costs are more distributed and harder to measure. But in any case I
think this idea is a nonstarter because it gets in the way of making
those hashtables expansible, which we *do* need to do eventually.
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
regards, tom lane
On Wednesday 24 November 2010 21:54:53 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres@anarazel.de> wrote:
On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.
AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that
cheaper other than to cut the initial sizes of the hash tables (and
add support for expanding them later, which is lacking in catcache
ATM). Not convinced that that creates any net savings --- it might
just save some cycles at startup in exchange for more cycles later,
in typical backend usage.
Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.
The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block anyway because palloc can return a memory that's been used,
freed, and reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when
those pages are used the first time and then a second time when first
written to (to copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
Yes, but only once. Also scrubbing a page is faster than copying it... (and
there were patches floating around to do that in advance, not sure if they got
integrated into mainline linux)
Andres
On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
Won't this just cause loads of additional pagefaults after fork() when
those pages are used the first time and then a second time when first
written to (to copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
Yes, but only once. Also scrubbing a page is faster than copying it... (and
there were patches floating around to do that in advance, not sure if they got
integrated into mainline linux)
I'm not following - can you elaborate?
...Robert
Robert Haas <robertmhaas@gmail.com> writes:
On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
Yes, but only once. Also scrubbing a page is faster than copying it... (and
there were patches floating around to do that in advance, not sure if they got
integrated into mainline linux)
I'm not following - can you elaborate?
I think Andres is saying that bss space isn't optimized during a fork
operation: it'll be propagated to the child as copy-on-write pages.
Dunno if that's true or not, but if it is, it'd be a good reason to
avoid the scheme you're suggesting.
regards, tom lane
On Wednesday 24 November 2010 22:18:08 Robert Haas wrote:
On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
Won't this just cause loads of additional pagefaults after fork() when
those pages are used the first time and then a second time when first
written to (to copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
Yes, but only once. Also scrubbing a page is faster than copying it...
(and there were patches floating around to do that in advance, not sure
if they got integrated into mainline linux)
I'm not following - can you elaborate?
When forking, the memory mapping of the process is copied - the actual pages
are not. When a page is first accessed, the page fault handler will set up a
mapping to the "old" page and mark it as shared. When it is then written to,
it will fault again and copy the page.
In contrast, if you access a page for the first time after an sbrk (or mmap,
it doesn't matter which), a new page will get scrubbed and a mapping will get
set up.
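If you want to see that double fault directly, here's a tiny Linux-specific
demo (sizes are arbitrary): the child's minor-fault count from getrusage()
jumps again when it rewrites pages the parent has already touched.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define SZ (64 * 1024 * 1024)

static long
minor_faults(void)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int
main(void)
{
    char       *buf = malloc(SZ);

    /* parent touches (and dirties) every page once */
    memset(buf, 1, SZ);

    if (fork() == 0)
    {
        long        before = minor_faults();

        /* child writes to the inherited pages -> copy-on-write faults */
        memset(buf, 2, SZ);
        printf("child minor faults for rewrite: %ld\n",
               minor_faults() - before);
        _exit(0);
    }
    wait(NULL);
    return 0;
}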
Andres
On Wednesday 24 November 2010 22:25:45 Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Nov 24, 2010, at 4:05 PM, Andres Freund <andres@anarazel.de> wrote:
Yes, but only once. Also scrubbing a page is faster than copying it...
(and there were patches floating around to do that in advance, not sure
if they got integrated into mainline linux)
I'm not following - can you elaborate?
I think Andres is saying that bss space isn't optimized during a fork
operation: it'll be propagated to the child as copy-on-write pages.
Dunno if that's true or not, but if it is, it'd be a good reason to
avoid the scheme you're suggesting.
Afair nearly all pages are propagated with copy-on-write semantics.
Andres
On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
I don't see an event for that.
# opcontrol --list-events | grep STALL
INSTRUCTION_FETCH_STALL: (counter: all)
DISPATCH_STALLS: (counter: all)
DISPATCH_STALL_FOR_BRANCH_ABORT: (counter: all)
DISPATCH_STALL_FOR_SERIALIZATION: (counter: all)
DISPATCH_STALL_FOR_SEGMENT_LOAD: (counter: all)
DISPATCH_STALL_FOR_REORDER_BUFFER_FULL: (counter: all)
DISPATCH_STALL_FOR_RESERVATION_STATION_FULL: (counter: all)
DISPATCH_STALL_FOR_FPU_FULL: (counter: all)
DISPATCH_STALL_FOR_LS_FULL: (counter: all)
DISPATCH_STALL_WAITING_FOR_ALL_QUIET: (counter: all)
DISPATCH_STALL_FOR_FAR_TRANSFER_OR_RESYNC: (counter: all)
# opcontrol --list-events | grep MEMORY
MEMORY_REQUESTS: (counter: all)
MEMORY_CONTROLLER_PAGE_TABLE_OVERFLOWS: (counter: all)
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
MEMORY_CONTROLLER_TURNAROUNDS: (counter: all)
MEMORY_CONTROLLER_BYPASS_COUNTER_SATURATION: (counter: all)
CPU_IO_REQUESTS_TO_MEMORY_IO: (counter: all)
MEMORY_CONTROLLER_REQUESTS: (counter: all)
Ideas?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
I don't see an event for that.
You probably want something involving cache misses. The event names
vary depending on just which CPU you've got.
regards, tom lane
On Wednesday 24 November 2010 23:03:48 Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
I don't see an event for that.
You probably want something involving cache misses. The event names
vary depending on just which CPU you've got.
Or some BUS OUTSTANDING event.
Andres
On Wed, Nov 24, 2010 at 5:15 PM, Andres Freund <andres@anarazel.de> wrote:
On Wednesday 24 November 2010 23:03:48 Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)
I don't see an event for that.
You probably want something involving cache misses. The event names
vary depending on just which CPU you've got.
Or some BUS OUTSTANDING event.
I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
some options:
# opcontrol --list-events | grep CACHE
DATA_CACHE_ACCESSES: (counter: all)
DATA_CACHE_MISSES: (counter: all)
DATA_CACHE_REFILLS_FROM_L2_OR_NORTHBRIDGE: (counter: all)
DATA_CACHE_REFILLS_FROM_NORTHBRIDGE: (counter: all)
DATA_CACHE_LINES_EVICTED: (counter: all)
LOCKED_INSTRUCTIONS_DCACHE_MISSES: (counter: all)
L2_CACHE_MISS: (counter: all)
L2_CACHE_FILL_WRITEBACK: (counter: all)
INSTRUCTION_CACHE_FETCHES: (counter: all)
INSTRUCTION_CACHE_MISSES: (counter: all)
INSTRUCTION_CACHE_REFILLS_FROM_L2: (counter: all)
INSTRUCTION_CACHE_REFILLS_FROM_SYSTEM: (counter: all)
INSTRUCTION_CACHE_VICTIMS: (counter: all)
INSTRUCTION_CACHE_INVALIDATED: (counter: all)
CACHE_BLOCK_COMMANDS: (counter: all)
READ_REQUEST_L3_CACHE: (counter: all)
L3_CACHE_MISSES: (counter: all)
IBS_FETCH_ICACHE_MISSES: (ext: ibs_fetch)
IBS_FETCH_ICACHE_HITS: (ext: ibs_fetch)
IBS_OP_DATA_CACHE_MISS: (ext: ibs_op)
IBS_OP_NB_LOCAL_CACHE: (ext: ibs_op)
IBS_OP_NB_REMOTE_CACHE: (ext: ibs_op)
IBS_OP_NB_CACHE_MODIFIED: (ext: ibs_op)
IBS_OP_NB_CACHE_OWNED: (ext: ibs_op)
IBS_OP_NB_LOCAL_CACHE_LAT: (ext: ibs_op)
IBS_OP_NB_REMOTE_CACHE_LAT: (ext: ibs_op)
# opcontrol --list-events | grep MISS | grep -v CACHE
L1_DTLB_MISS_AND_L2_DTLB_HIT: (counter: all)
L1_DTLB_AND_L2_DTLB_MISS: (counter: all)
L1_ITLB_MISS_AND_L2_ITLB_HIT: (counter: all)
L1_ITLB_MISS_AND_L2_ITLB_MISS: (counter: all)
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
IBS_FETCH_L1_ITLB_MISSES_L2_ITLB_HITS: (ext: ibs_fetch)
IBS_FETCH_L1_ITLB_MISSES_L2_ITLB_MISSES: (ext: ibs_fetch)
IBS_OP_L1_DTLB_MISS_L2_DTLB_HIT: (ext: ibs_op)
IBS_OP_L1_L2_DTLB_MISS: (ext: ibs_op)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
some options:
DATA_CACHE_MISSES: (counter: all)
L3_CACHE_MISSES: (counter: all)
Those two look promising, though I can't claim to be an expert.
regards, tom lane
On Wed, Nov 24, 2010 at 5:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
some options:
DATA_CACHE_MISSES: (counter: all)
L3_CACHE_MISSES: (counter: all)
Those two look promising, though I can't claim to be an expert.
OK. Thanksgiving is about to interfere with my access to this
machine, but I'll pick this up next week.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:
char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?
Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().
Here is perhaps more detail than you wanted, but ...
Basically in a forked process, the text/program is fixed, and the
initialized data and stack are copy on write (COW). Allocating a big
block of zero memory in data is uninitialized data, and the behavior there
differs depending on whether the parent process faulted in those pages.
If it did, then they are COW, but if it did not, odds are the OS just
gives them to you clean and not shared. The pages have to be empty
because if it gave you anything else it could be giving you data from
another process. For details, see
http://docs.hp.com/en/5965-4641/ch01s11.html, Faulting In a Page of
Stack or Uninitialized Data.
As far as sbrk(), those pages are zero-filled also, again for security
reasons. You have to clear malloc()'ed memory (or call calloc()) not
because the OS gave you dirty pages but because you might be using
memory that you previously freed. If you have never freed memory (and
the postmaster/parent has not either), I bet all malloc'ed memory would
be zero-filled.
Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
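A hedged illustration of Bruce's point (not from the thread, and allocator-dependent): pages newly obtained from the OS arrive zeroed, but a block recycled by malloc after a free() can still hold old contents. With glibc and a request below the default mmap threshold, a toy program like this will typically report the first allocation as zeroed and the second as dirty:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int all_zero(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

int main(void)
{
    size_t n = 64 * 1024;               /* below glibc's default mmap threshold */
    unsigned char *a = malloc(n);

    printf("first allocation zeroed:  %s\n", all_zero(a, n) ? "yes" : "no");

    memset(a, 0xAA, n);                 /* dirty the block, then hand it back */
    free(a);

    unsigned char *b = malloc(n);       /* likely reuses the freed block */
    printf("second allocation zeroed: %s\n", all_zero(b, n) ? "yes" : "no");

    free(b);
    return 0;
}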
On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.
Well, we can determine the answers to these questions empirically. I
think some more scrutiny of the code with the points you and Andres
and Tom have raised is probably in order, and probably some more
benchmarking, too. I haven't had a chance to do that yet, however.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.
Well, we can determine the answers to these questions empirically.
Not really. Per Bruce's description, a page would become COW the moment
the postmaster touched (either write or read) any variable on it. Since
we have no control over how the loader lays out static variables, the
actual behavior of a particular build would be pretty random and subject
to unexpected changes caused by seemingly unrelated edits.
Also, the referenced URL only purports to describe the behavior of
HPUX, which is not exactly a mainstream OS. I think it requires a
considerable leap of faith to assume that all or even most platforms
work the way this suggests, and not in the dumber fashion Andres
suggested. Has anybody here actually looked at the relevant Linux
or BSD kernel code?
regards, tom lane
On Sun, Nov 28, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.
Well, we can determine the answers to these questions empirically.
Not really. Per Bruce's description, a page would become COW the moment
the postmaster touched (either write or read) any variable on it. Since
we have no control over how the loader lays out static variables, the
actual behavior of a particular build would be pretty random and subject
to unexpected changes caused by seemingly unrelated edits.
Well, one big character array pretty much has to be laid out
contiguously, and it would be pretty surprising (but not entirely
impossible) to find that the linker randomly sprinkles symbols from
other files in between consecutive definitions in the same source
file. I think the next question to answer is to try to allocate blame
for the memset/memcpy overhead between page faults and the zeroing
itself. That seems like something we can easily measure by writing a
test program that zeroes the same region twice and kicks out timing
numbers. If, as you and Andres are arguing, the actual zeroing is
minor, then we can forget this whole line of discussion and move on to
other possible optimizations. If that turns out not to be true then
we can worry about how best to avoid the zeroing. I have to believe
that's a solvable problem; the question is whether there's a benefit.
In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previously proposed and with all the problems it
entails, the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it. To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.
Well, we can determine the answers to these questions empirically. I
think some more scrutiny of the code with the points you and Andres
and Tom have raised is probably in order, and probably some more
benchmarking, too. I haven't had a chance to do that yet, however.
Basically, my bet is if you allocated a large zero-data variable in the
postmaster but never accessed it from the postmaster, at most you would
copy-on-write (COW) fault in two pages, one at the beginning that is
shared by accessed variables, and one at the end. The remaining pages
(4k default for x86) would be zero-filled and not COW shared.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Robert Haas <robertmhaas@gmail.com> writes:
The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails, the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that.
One comment to make here is that it would be a serious error to focus on
the costs of just starting and stopping a backend; you have to think
about cases where the backend does at least some useful work in between,
and that means actually *populating* those caches (to some extent) not
just initializing them. Maybe your wording above was chosen with that
in mind, but I think onlookers might easily overlook the point.
FWIW, today I've been looking into getting rid of the silliness in
build_index_pathkeys whereby it reconstructs pathkey opfamily OIDs
from sortops instead of just using the index opfamilies directly.
It turns out that once you fix that, there is no need at all for
relcache to cache per-index operator data (the rd_operator arrays)
because that's the only code that uses 'em. I don't see any particular
change in the runtime of the regression tests from ripping out that
part of the cached data, but it ought to have at least some beneficial
effect on real startup time.
regards, tom lane
On Sun, Nov 28, 2010 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails, the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that.
One comment to make here is that it would be a serious error to focus on
the costs of just starting and stopping a backend; you have to think
about cases where the backend does at least some useful work in between,
and that means actually *populating* those caches (to some extent) not
just initializing them. Maybe your wording above was chosen with that
in mind, but I think onlookers might easily overlook the point.
I did have that in mind, but I agree the point is worth mentioning.
So, for example, it wouldn't gain anything meaningful for us to
postpone catcache initialization until someone executes a query. It
would improve the synthetic benchmark, but that's it.
FWIW, today I've been looking into getting rid of the silliness in
build_index_pathkeys whereby it reconstructs pathkey opfamily OIDs
from sortops instead of just using the index opfamilies directly.
It turns out that once you fix that, there is no need at all for
relcache to cache per-index operator data (the rd_operator arrays)
because that's the only code that uses 'em. I don't see any particular
change in the runtime of the regression tests from ripping out that
part of the cached data, but it ought to have at least some beneficial
effect on real startup time.
Wow, that's great. The fact that it simplifies the code is probably
the main point, but obviously whatever cycles we can save during
startup (and ongoing operation) are all to the good.
One possible way to get a real speedup here would be to look for ways
to trim the number of catcaches. But I'm not too convinced there's
much water to squeeze out of that rock. After our recent conversation
about KNNGIST, it occurred to me to wonder whether there's really any
point in pretending that a user can usefully add an AM, both due to
hard-wired planner knowledge and due to lack of any sort of extensible
XLOG support. If not, we could potentially turn pg_am into a
hardcoded lookup table rather than a modifiable catalog, which would
also likely be faster; and perhaps reference AMs elsewhere with
characters rather than OIDs. But even if this were judged a sensible
thing to do I'm not very sure that even a purpose-built synthetic
benchmark would be able to measure the speedup.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
After our recent conversation
about KNNGIST, it occurred to me to wonder whether there's really any
point in pretending that a user can usefully add an AM, both due to
hard-wired planner knowledge and due to lack of any sort of extensible
XLOG support. If not, we could potentially turn pg_am into a
hardcoded lookup table rather than a modifiable catalog, which would
also likely be faster; and perhaps reference AMs elsewhere with
characters rather than OIDs. But even if this were judged a sensible
thing to do I'm not very sure that even a purpose-built synthetic
benchmark would be able to measure the speedup.
Well, the lack of extensible XLOG support is definitely a big handicap
to building a *production* index AM as an add-on. But it's not such a
handicap for development. And I don't believe that the planner is
hardwired in any way that doesn't allow new index types. GIST and GIN
have both been added successfully without kluging the planner. It does
know a lot more about btree than other index types, but that doesn't
mean you can't add a new index type that doesn't behave like btree;
that's more reflective of where development effort has been spent.
So I would consider the above idea a step backwards, and I doubt it
would save anything meaningful anyway.
regards, tom lane
On Sun, Nov 28, 2010 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
After our recent conversation
about KNNGIST, it occurred to me to wonder whether there's really any
point in pretending that a user can usefully add an AM, both due to
hard-wired planner knowledge and due to lack of any sort of extensible
XLOG support. If not, we could potentially turn pg_am into a
hardcoded lookup table rather than a modifiable catalog, which would
also likely be faster; and perhaps reference AMs elsewhere with
characters rather than OIDs. But even if this were judged a sensible
thing to do I'm not very sure that even a purpose-built synthetic
benchmark would be able to measure the speedup.
Well, the lack of extensible XLOG support is definitely a big handicap
to building a *production* index AM as an add-on. But it's not such a
handicap for development.
Realistically, it's hard for me to imagine that anyone would go to the
trouble of building it as a loadable module first and then converting
it to part of core later on. That'd hardly be less work.
And I don't believe that the planner is
hardwired in any way that doesn't allow new index types. GIST and GIN
have both been added successfully without kluging the planner.
We have 9 boolean flags to indicate the capabilities (or lack thereof)
of AMs, and we only have 4 AMs. It seems altogether plausible to
assume that the next AM we add could require flags 10 and 11. Heck, I
think KNNGIST is going to require another flag... which will likely
never be set for any AM other than GIST.
It does
know a lot more about btree than other index types, but that doesn't
mean you can't add a new index type that doesn't behave like btree;
that's more reflective of where development effort has been spent.
So I would consider the above idea a step backwards, and I doubt it
would save anything meaningful anyway.
That latter point, as far as I'm concerned, is the real nail in the coffin.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
One possible way to get a real speedup here would be to look for ways
to trim the number of catcaches.
BTW, it's not going to help to remove catcaches that have a small
initial size, as the pg_am cache certainly does. If the bucket zeroing
cost is really something to minimize, it's only the caches with the
largest nbuckets counts that are worth considering --- and we certainly
can't remove those without penalty.
regards, tom lane
BTW, this might be premature to mention pending some tests about mapping
versus zeroing overhead, but it strikes me that there's more than one
way to skin a cat. I still think the idea of statically allocated space
sucks. But what if we rearranged things so that palloc0 doesn't consist
of palloc-then-memset, but rather push the zeroing responsibility down
into the allocator? In particular, I'm imagining that palloc0 with a
sufficiently large space request --- more than a couple pages --- could
somehow arrange to get space that's guaranteed zero already. And if the
request isn't large, zeroing it isn't where our problem is anyhow.
The most portable way to do that would be to use calloc instead of malloc,
and hope that libc is smart enough to provide freshly-mapped space.
It would be good to look and see whether glibc actually does so,
of course. If not we might end up having to mess with sbrk for
ourselves, and I'm not sure how pleasantly that interacts with malloc.
Another question that would be worth asking here is whether the
hand-baked MemSet macro still outruns memset on modern architectures.
I think it's been quite a few years since that was last tested.
regards, tom lane
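A minimal sketch of that idea, purely for illustration (this is not the actual palloc0; the threshold and function name are invented): send large zeroed requests through calloc() in the hope that libc satisfies them with freshly mapped, already-zero pages, and only memset explicitly for small requests, where zeroing is cheap anyway.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LARGE_ZERO_THRESHOLD (8 * 1024)     /* "more than a couple pages" */

/* Hypothetical illustration only, not the real allocator. */
static void *alloc_zero(size_t size)
{
    if (size >= LARGE_ZERO_THRESHOLD)
        return calloc(1, size);             /* hope for freshly mapped, pre-zeroed pages */

    void *p = malloc(size);
    if (p != NULL)
        memset(p, 0, size);                 /* small requests: zeroing is cheap anyway */
    return p;
}

int main(void)
{
    char *big = alloc_zero(1024 * 1024);
    char *small = alloc_zero(64);

    if (big == NULL || small == NULL)
        return 1;
    printf("big[0] = %d, small[0] = %d\n", big[0], small[0]);
    free(big);
    free(small);
    return 0;
}

Whether this wins anything depends entirely on whether calloc really does hand back fresh mappings for large requests, which is exactly the open question above.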
On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The most portable way to do that would be to use calloc insted of malloc,
and hope that libc is smart enough to provide freshly-mapped space.
It would be good to look and see whether glibc actually does so,
of course. If not we might end up having to mess with sbrk for
ourselves, and I'm not sure how pleasantly that interacts with malloc.
It's *supposed* to interact fine. The only thing I wonder is that I
think malloc intentionally uses mmap for larger allocations but I'm
not clear what the advantages are. Is it because it's a cheaper way to
get zeroed bytes? Or just so that free has a hope of returning the
allocations to the OS?
Another question that would be worth asking here is whether the
hand-baked MemSet macro still outruns memset on modern architectures.
I think it's been quite a few years since that was last tested.
I know glibc has some sexy memset macros for cases where the size is a
constant. I'm not sure there's been much of an advance in the general
case though. This would tend to imply we should consider going the
other direction of having the caller of palloc0 do the zeroing
instead. Or making palloc0 a macro which expands to include calling
memset with the parameter inlined.
--
greg
Greg Stark <gsstark@mit.edu> writes:
On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another question that would be worth asking here is whether the
hand-baked MemSet macro still outruns memset on modern architectures.
I think it's been quite a few years since that was last tested.
I know glibc has some sexy memset macros for cases where the size is a
constant. I'm not sure there's been much of an advance in the general
case though. This would tend to imply we should consider going the
other direction of having the caller of palloc0 do the zeroing
instead. Or making palloc0 a macro which expands to include calling
memset with the parameter inlined.
Well, that was exactly the reason why we did it the way we do it.
However, I think it's probably only node allocations where the size
is likely to be constant and hence result in a win. Perhaps we should
implement makeNode() differently from the general case.
regards, tom lane
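For illustration, a hedged sketch of that approach (not the real makeNode; the node type, macro, and helper names are invented): if the zeroing is written so that the size is a compile-time constant at each call site, the compiler is free to expand the memset inline rather than calling into libc.

#include <stdlib.h>
#include <string.h>

/* Stand-in node type, invented for the example. */
typedef struct ExampleNode
{
    int   tag;
    int   funcid;
    void *args;
} ExampleNode;

/*
 * After inlining, size is a compile-time constant at every expansion
 * site, so the compiler can open-code the memset.
 */
static inline void *zero_alloc(size_t size)
{
    void *p = malloc(size);

    if (p != NULL)
        memset(p, 0, size);
    return p;
}

#define newZeroedNode(nodetype) ((nodetype *) zero_alloc(sizeof(nodetype)))

int main(void)
{
    ExampleNode *node = newZeroedNode(ExampleNode);

    if (node == NULL)
        return 1;
    node->tag = 1;
    free(node);
    return 0;
}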
On Sun, Nov 28, 2010 at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
One possible way to get a real speedup here would be to look for ways
to trim the number of catcaches.
BTW, it's not going to help to remove catcaches that have a small
initial size, as the pg_am cache certainly does. If the bucket zeroing
cost is really something to minimize, it's only the caches with the
largest nbuckets counts that are worth considering --- and we certainly
can't remove those without penalty.
Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever. I bet the
number of people for whom <(int4, int4) has any non-standard
properties is somewhere between slim and none; and it might well be
the case that formrdesc() is faster than reading the relcache init
file, if we didn't need to worry about deviation from canonical. This
is even more frustrating in the hypothetical situation where a backend
can switch databases, because we have to blow away all of these cache
entries that are 99.9% likely to be basically identical in the old and
new databases.
The relation descriptors for pg_class and pg_attribute are examples of
things it would be nice to hardwire and never need to update. We are
really pretty much screwed if there is any meaningful deviation from
what is expected, but relpages, reltuples, and relfrozenxid - and
maybe relacl or reloptions - can legitimately vary between databases.
Maybe we could speed things up a bit if we got rid of the pg_attribute
entries for the system attributes (except OID).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever.
True. I wonder if we could do something like the relcache init file
for the catcaches.
Maybe we could speed things up a bit if we got rid of the pg_attribute
entries for the system attributes (except OID).
I used to have high hopes for that idea, but the column privileges
patch broke it permanently.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Well, the lack of extensible XLOG support is definitely a big handicap
to building a *production* index AM as an add-on. But it's not such a
handicap for development.
Realistically, it's hard for me to imagine that anyone would go to the
trouble of building it as a loadable module first and then converting
it to part of core later on. That'd hardly be less work.
Well, it depends a lot on external factors, like for example being willing to
use the code before spending the necessary QA time needed for it
to land in core. Two particular examples come to mind, which are tsearch
and KNN GiST. The main problem with integrating into core, AFAIUI, is
related to code maintenance, not at all to code stability and quality
of the add-on itself.
It's just so much easier to develop an external module…
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever.
True. I wonder if we could do something like the relcache init file
for the catcaches.
Maybe. It's hard to know exactly what to pull in, though, nor is it
clear to me how much it would really save. You've got to keep the
thing up to date somehow, too.
I finally got around to doing some testing of
page-faults-versus-actually-memory-initialization, using the attached
test program, compiled with warnings, but without optimization.
Typical results on MacOS X:
first run: 297299
second run: 99653
And on Fedora 12 (2.6.32.23-170.fc12.x86_64):
first run: 509309
second run: 114721
I guess the word "run" is misleading (I wrote the program in 5
minutes); it's just zeroing the same chunk twice and measuring the
times. The difference is presumably the page fault overhead, which
implies that faulting is two-thirds of the overhead on MacOS X and
three-quarters of the overhead on Linux. This makes me pretty
pessimistic about the chances of a meaningful speedup here.
Maybe we could speed things up a bit if we got rid of the pg_attribute
entries for the system attributes (except OID).
I used to have high hopes for that idea, but the column privileges
patch broke it permanently.
http://archives.postgresql.org/pgsql-hackers/2010-07/msg00151.php
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
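The attachment itself is not reproduced here. A minimal sketch of the kind of test described above - zero the same static array twice and print both timings - might look like the following; this is a hypothetical reconstruction, not the original program:

#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define SZ (512 * 1024 * 1024)

static char buf[SZ];                    /* untouched bss; faulted in on first use */

static long long run_memset(void)
{
    struct timeval s, e;

    gettimeofday(&s, NULL);
    memset(buf, 0, sizeof(buf));
    gettimeofday(&e, NULL);
    return (e.tv_sec - s.tv_sec) * 1000000LL + (e.tv_usec - s.tv_usec);
}

int main(void)
{
    printf("first run: %lld\n", run_memset());    /* page faults plus zeroing */
    printf("second run: %lld\n", run_memset());   /* zeroing only */
    return 0;
}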
On Monday 29 November 2010 17:57:51 Robert Haas wrote:
On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever.
True. I wonder if we could do something like the relcache init file
for the catcaches.
Maybe. It's hard to know exactly what to pull in, though, nor is it
clear to me how much it would really save. You've got to keep the
thing up to date somehow, too.
I finally got around to doing some testing of
page-faults-versus-actually-memory-initialization, using the attached
test program, compiled with warnings, but without optimization.
Typical results on MacOS X:
first run: 297299
second run: 99653
And on Fedora 12 (2.6.32.23-170.fc12.x86_64):
first run: 509309
second run: 114721
Hm. A quick test shows that it's quite a bit faster if you allocate memory
with:
size_t s = 512*1024*1024;
char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
MAP_ANONYMOUS, -1, 0);
Andres
On Mon, Nov 29, 2010 at 12:24 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. A quick test shows that its quite a bit faster if you allocate memory
with:
size_t s = 512*1024*1024;
char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
MAP_ANONYMOUS, -1, 0);
Numbers?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday 29 November 2010 18:34:02 Robert Haas wrote:
On Mon, Nov 29, 2010 at 12:24 PM, Andres Freund <andres@anarazel.de> wrote:
Hm. A quick test shows that its quite a bit faster if you allocate memory
with:
size_t s = 512*1024*1024;
char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
MAP_ANONYMOUS, -1, 0);
Numbers?
malloc alloc: 43
malloc memset1: 438763
malloc memset2: 98764
total: 537570
mmap alloc: 296065
mmap memset1: 99203
mmap memset2: 100608
total: 495876
But you don't actually need the memset1 in the mmap case as MAP_ANONYMOUS
memory is already zeroed. We could actually use that knowledge even without
MAP_POPULATE if we somehow keep track of whether an allocated memory region is
still zeroed.
Taking that into account its:
malloc alloc: 47
malloc memset1: 437819
malloc memset2: 98317
total: 536183
mmap alloc: 292904
mmap memset1: 1
mmap memset2: 99284
total: 392189
I am somewhat reluctant to believe that's the way to go.
Andres
Attachments:
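Again, the attachment is not shown here. A hypothetical reconstruction of the comparison Andres describes - time the allocation plus two memset passes for plain malloc versus mmap with MAP_POPULATE - could look roughly like this (MAP_POPULATE is Linux-specific, error handling is minimal, and all names are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

#define SZ ((size_t) 512 * 1024 * 1024)

static long long now_usec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (long long) tv.tv_sec * 1000000LL + tv.tv_usec;
}

static void report(const char *label, char *p)
{
    long long t1 = now_usec();
    memset(p, 0, SZ);                   /* first pass: faults plus zeroing */
    long long t2 = now_usec();
    memset(p, 0, SZ);                   /* second pass: zeroing only */
    long long t3 = now_usec();

    printf("%s memset1: %lld\n", label, t2 - t1);
    printf("%s memset2: %lld\n", label, t3 - t2);
}

int main(void)
{
    long long t0 = now_usec();
    char *heap = malloc(SZ);

    if (heap == NULL)
        return 1;
    printf("malloc alloc: %lld\n", now_usec() - t0);
    report("malloc", heap);
    free(heap);

    t0 = now_usec();
    char *anon = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, -1, 0);

    if (anon == MAP_FAILED)
        return 1;
    printf("mmap alloc: %lld\n", now_usec() - t0);
    report("mmap", anon);
    munmap(anon, SZ);
    return 0;
}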
On Mon, Nov 29, 2010 at 9:24 AM, Andres Freund <andres@anarazel.de> wrote:
On Monday 29 November 2010 17:57:51 Robert Haas wrote:
On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever.
True. I wonder if we could do something like the relcache init file
for the catcaches.
Maybe. It's hard to know exactly what to pull in, though, nor is it
clear to me how much it would really save. You've got to keep the
thing up to date somehow, too.
I finally got around to doing some testing of
page-faults-versus-actually-memory-initialization, using the attached
test program, compiled with warnings, but without optimization.
Typical results on MacOS X:
first run: 297299
second run: 99653
And on Fedora 12 (2.6.32.23-170.fc12.x86_64):
first run: 509309
second run: 114721
Hm. A quick test shows that it's quite a bit faster if you allocate memory
with:
size_t s = 512*1024*1024;
char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
MAP_ANONYMOUS, -1, 0);
Could you post the program?
Are you sure you haven't just moved the page-fault time to a part of
the code where it still exists, but just isn't being captured and
reported?
Cheers,
Jeff
Robert Haas <robertmhaas@gmail.com> writes:
I guess the word "run" is misleading (I wrote the program in 5
minutes); it's just zeroing the same chunk twice and measuring the
times. The difference is presumably the page fault overhead, which
implies that faulting is two-thirds of the overhead on MacOS X and
three-quarters of the overhead on Linux.
Ah, cute solution to the measurement problem. I replicated the
experiment just as a cross-check:
Fedora 13 on x86_64 (recent Nehalem):
first run: 346767
second run: 103143
Darwin on x86_64 (not-so-recent Penryn):
first run: 341289
second run: 64535
HPUX on HPPA:
first run: 2191136
second run: 1199879
(On the last two machines I had to cut the array size to 256MB to avoid
swapping.) All builds with "gcc -O2".
This makes me pretty
pessimistic about the chances of a meaningful speedup here.
Yeah, this is confirmation that what you are seeing in the original test
is mostly about faulting pages in, not about the zeroing. I think it
would still be interesting to revisit the micro-optimization of
MemSet(), but it doesn't look like massive restructuring to avoid it
altogether is going to be worthwhile.
regards, tom lane
Jeff Janes <jeff.janes@gmail.com> writes:
Are you sure you haven't just moved the page-fault time to a part of
the code where it still exists, but just isn't being captured and
reported?
I'm a bit suspicious about that too. Another thing to keep in mind
is that Robert's original program doesn't guarantee that the char
array is maxaligned; though reasonable implementations of memset
should be able to use the same inner loop anyway for most of the
array.
I did some experimentation here and couldn't find any real difference in
runtime between the original program and substituting a malloc() call
for the static array allocation. Rolling in calloc in place of
malloc/memset made no particular difference either, which says that
Fedora 13's glibc does not have any optimization for that case as I'd
hoped.
regards, tom lane
On Mon, Nov 29, 2010 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(On the last two machines I had to cut the array size to 256MB to avoid
swapping.)
You weren't kidding about that "not so recent" part. :-)
This makes me pretty
pessimistic about the chances of a meaningful speedup here.
Yeah, this is confirmation that what you are seeing in the original test
is mostly about faulting pages in, not about the zeroing. I think it
would still be interesting to revisit the micro-optimization of
MemSet(), but it doesn't look like massive restructuring to avoid it
altogether is going to be worthwhile.
Yep. I think that what we've established here is that starting new
processes all the time is just plain expensive, and we're going to
have to start fewer of them if we want to make a meaningful
improvement.
My impression is that the process startup overhead is even higher on
Windows, although I am not now nor have I ever been a Windows
programmer.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.
Well, we can determine the answers to these questions empirically.
Not really. Per Bruce's description, a page would become COW the moment
the postmaster touched (either write or read) any variable on it. Since
we have no control over how the loader lays out static variables, the
actual behavior of a particular build would be pretty random and subject
to unexpected changes caused by seemingly unrelated edits.
I believe all linkers will put initialized data ("data" segment) before
uninitialized data ("bss" segment):
http://en.wikipedia.org/wiki/Data_segment
The only question is whether the linker has data and bss sharing the
same VM page (4k), or whether a new VM page is used when starting the
bss segment.
Also, the referenced URL only purports to describe the behavior of
HPUX, which is not exactly a mainstream OS. I think it requires a
considerable leap of faith to assume that all or even most platforms
work the way this suggests, and not in the dumber fashion Andres
suggested. Has anybody here actually looked at the relevant Linux
or BSD kernel code?
I have years ago, but not recently. You can see the sections on Linux
via objdump:
$ objdump --headers /bin/ls
/bin/ls: file format elf32-i386
Sections:
Idx Name Size VMA LMA File off Algn
...
24 .data 0000012c 080611a0 080611a0 000191a0 2**5
CONTENTS, ALLOC, LOAD, DATA
25 .bss 00000c40 080612e0 080612e0 000192cc 2**5
ALLOC
Based on this output, a new 4k page is not started for the 'bss'
segment. It basically uses 32-byte alignment.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Robert Haas wrote:
In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails, the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it. To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
Combining your last two sentences, I am not sure anyone with the
flexibility we have has solved the "cache populating" problem.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Tom Lane wrote:
BTW, this might be premature to mention pending some tests about mapping
versus zeroing overhead, but it strikes me that there's more than one
way to skin a cat. I still think the idea of statically allocated space
sucks. But what if we rearranged things so that palloc0 doesn't consist
of palloc-then-memset, but rather push the zeroing responsibility down
into the allocator? In particular, I'm imagining that palloc0 with a
sufficiently large space request --- more than a couple pages --- could
somehow arrange to get space that's guaranteed zero already. And if the
request isn't large, zeroing it isn't where our problem is anyhow.
The most portable way to do that would be to use calloc insted of malloc,
and hope that libc is smart enough to provide freshly-mapped space.
It would be good to look and see whether glibc actually does so,
of course. If not we might end up having to mess with sbrk for
ourselves, and I'm not sure how pleasantly that interacts with malloc.
Yes, I was going to suggest trying calloc(), either because we can get
already-zeroed sbrk() memory, or because libc uses assembly language for
zeroing memory, as some good libc's do. I know most kernels also use
assembly for zeroing memory.
Another question that would be worth asking here is whether the
hand-baked MemSet macro still outruns memset on modern architectures.
I think it's been quite a few years since that was last tested.
Yes, MemSet was found to be faster than calling a C function, but new
testing is certainly warranted.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Robert Haas wrote:
On Sun, Nov 28, 2010 at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
One possible way to get a real speedup here would be to look for ways
to trim the number of catcaches.
BTW, it's not going to help to remove catcaches that have a small
initial size, as the pg_am cache certainly does. If the bucket zeroing
cost is really something to minimize, it's only the caches with the
largest nbuckets counts that are worth considering --- and we certainly
can't remove those without penalty.
Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever. I bet the
number of people for whom <(int4, int4) has any non-standard
properties is somewhere between slim and none; and it might well be
the case that formrdesc() is faster than reading the relcache init
file, if we didn't need to worry about deviation from canonical. This
is even more frustrating in the hypothetical situation where a backend
can switch databases, because we have to blow away all of these cache
entries that are 99.9% likely to be basically identical in the old and
new databases.
It is very tempting to look at optimizations here, but I am worried we
might head down the flat-files solution that caused continual problems
in the past.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Greg Stark wrote:
On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The most portable way to do that would be to use calloc insted of malloc,
and hope that libc is smart enough to provide freshly-mapped space.
It would be good to look and see whether glibc actually does so,
of course. If not we might end up having to mess with sbrk for
ourselves, and I'm not sure how pleasantly that interacts with malloc.
It's *supposed* to interact fine. The only thing I wonder is that I
think malloc intentionally uses mmap for larger allocations but I'm
not clear what the advantages are. Is it because it's a cheaper way to
get zeroed bytes? Or just so that free has a hope of returning the
allocations to the OS?
Using mmap() so you can return large allocations to the OS is a neat
trick, certainly. I am not sure who implements that.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Monday 29 November 2010 19:10:07 Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are you sure you haven't just moved the page-fault time to a part of
the code where it still exists, but just isn't being captured and
reported?
I'm a bit suspicious about that too. Another thing to keep in mind
is that Robert's original program doesn't guarantee that the char
array is maxaligned; though reasonable implementations of memset
should be able to use the same inner loop anyway for most of the
array.
Yes, I measured the time including mmap itself. I don't find it surprising it's
taking measurably less time, as it can just set up the mappings without
explicitly faulting for each and every page. The benefit is too small to worry
about, though, so ...
The answer to Robert includes the timings + test program.
Andres
On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
Rolling in calloc in place of
malloc/memset made no particular difference either, which says that
Fedora 13's glibc does not have any optimization for that case as I'd
hoped.
glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
Peter Eisentraut <peter_e@gmx.net> writes:
On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
Rolling in calloc in place of
malloc/memset made no particular difference either, which says that
Fedora 13's glibc does not have any optimization for that case as I'd
hoped.
glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
Hmm. I would have expected to see a difference then. Do you know what
conditions are needed to cause the mmap to be used?
regards, tom lane
On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote:
In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails,
Is there a particular discussion of that matter you could point me to?
the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it.
Oracle's backend start up time seems to be way higher than PG's.
Their main solution is something that is fundamentally a built-in
connection pooler with some bells and whistles. I'm not
sure which "other people" you had in mind--Oracle is generally the one that
pops to my mind.
To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
What about "well OK, we have PGbouncer"? Are there fixable
short-comings that it has which could make the issue less of an issue?
Cheers,
Jeff
On tis, 2010-11-30 at 15:49 -0500, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
Rolling in calloc in place of
malloc/memset made no particular difference either, which says that
Fedora 13's glibc does not have any optimization for that case as I'd
hoped.
glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
Hmm. I would have expected to see a difference then. Do you know what
conditions are needed to cause the mmap to be used?
Check out the mallopt(3) man page. It contains a few tunable malloc
options that may be useful for your investigation.
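For what it's worth, a small glibc-specific illustration (not from the thread): mallopt(M_MMAP_THRESHOLD, ...) sets the size above which malloc and calloc fall back to mmap, which is the case where calloc can return freshly mapped, already-zero pages without an explicit memset.

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask glibc to serve anything of 64 kB or more via mmap. */
    if (mallopt(M_MMAP_THRESHOLD, 64 * 1024) == 0)
        fprintf(stderr, "mallopt failed\n");

    /* Large enough to exceed the threshold, so calloc can hand back
     * freshly mapped pages without having to memset them. */
    void *big = calloc(1, 8 * 1024 * 1024);

    printf("big allocation at %p\n", big);
    free(big);
    return 0;
}

Lowering the threshold is of course a trade-off: mmap'd chunks are handed back to the OS on free, so blocks of that size that are allocated and freed frequently become more expensive.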
On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote:
In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails,
Is there a particular discussion of that matter you could point me to?
the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it.
Oracle's backend start up time seems to be way higher than PG's.
Their main solution is something that is fundamentally a built in
connection pooler with some bells and whistles built in. I'm not
sure "other people" you had in mind--Oracle is generally the one that
pops to my mind.
Interesting. How about MySQL and SQL Server?
To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
What about "well OK, we have PGbouncer"? Are there fixable
short-comings that it has which could make the issue less of an issue?
We do have pgbouncer, and pgpool-II, and that's a good thing. But it
also requires proxying every interaction with the database through an
intermediate piece of software, which is not free. An in-core
solution ought to be able to arrange for each new connection to be
directly attached to an existing backend, using file-descriptor
passing. Tom has previously complained that this isn't portable, but
a little research suggests that it is supported on at least Linux, Mac
OS X, FreeBSD, OpenBSD, Solaris, and Windows, so in practice the
percentage of our user base who could benefit seems like it would
likely be very high.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
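As a rough sketch of the mechanism under discussion (this is not PostgreSQL code; the function and variable names are invented), handing an open descriptor to another process over a Unix-domain socket uses an SCM_RIGHTS control message with sendmsg():

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd_to_pass across the already-connected Unix-domain socket channel_fd. */
int send_fd(int channel_fd, int fd_to_pass)
{
    char byte = 0;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    union
    {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));

    iov.iov_base = &byte;               /* must send at least one data byte */
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;       /* the descriptor rides in the control data */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(channel_fd, &msg, 0) < 0 ? -1 : 0;
}

The receiving side does the mirror-image recvmsg() with a matching control buffer and pulls the new descriptor out of CMSG_DATA(); that descriptor then refers to the same open connection in the receiving process.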
On Wednesday 01 December 2010 15:20:32 Robert Haas wrote:
On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote:
To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
What about "well OK, we have PGbouncer"? Are there fixable
short-comings that it has which could make the issue less of an issue?
We do have pgbouncer, and pgpool-II, and that's a good thing. But it
also requires proxying every interaction with the database through an
intermediate piece of software, which is not free. An in-core
solution ought to be able to arrange for each new connection to be
directly attached to an existing backend, using file-descriptor
passing. Tom has previously complained that this isn't portable, but
a little research suggests that it is supported on at least Linux, Mac
OS X, FreeBSD, OpenBSD, Solaris, and Windows, so in practice the
percentage of our user base who could benefit seems like it would
likely be very high.
HPUX and AIX allow fd transfer as well. I still don't see what even remotely
relevant platform would be a problem.
Andres
Robert Haas <robertmhaas@gmail.com> wrote:
Jeff Janes <jeff.janes@gmail.com> wrote:
Oracle's backend start up time seems to be way higher than PG's.
Interesting. How about MySQL and SQL Server?
My recollection of Sybase ASE is that establishing a connection
doesn't start a backend or even a thread. It establishes a network
connection and associates network queues and a connection context
structure with it. "Engine" threads with CPU affinity (and a few
miscellaneous "worker" threads, too, if I remember right) do all the
work in a queue-based fashion.
Last I worked with MS SQL Server it was based on the Sybase code and
therefore worked the same way. I know they've made a lot of changes
in the last five years (including switching to MVCC and adding
snapshot isolation in addition to the already-existing serializable
isolation), so I don't know whether connection startup cost has
changed along the way.
-Kevin
On Wed, Dec 1, 2010 at 6:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote:
In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails,
Is there a particular discussion of that matter you could point me to?
the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it.
Oracle's backend start up time seems to be way higher than PG's.
Their main solution is something that is fundamentally a built in
connection pooler with some bells and whistles built in. I'm not
sure "other people" you had in mind--Oracle is generally the one that
pops to my mind.
Interesting. How about MySQL and SQL Server?
I don't have experience with MS SQL Server, and don't know how it
performs on that front. I haven't really considered MySQL to be a
"real" RDBMS, more of just an indexing system, although I guess it is
steadily becoming more featureful. It is indisputably faster at
making connections than PG, but still much slower than a connection
pooler.
To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
What about "well OK, we have PGbouncer"? Are there fixable
short-comings that it has which could make the issue less of an issue?
We do have pgbouncer, and pgpool-II, and that's a good thing. But it
also requires proxying every interaction with the database through an
intermediate piece of software, which is not free.
True, a simple in-memory benchmark with pgbench -S -c1 showed 10,000
tps connecting straight, and 7000 tps through pgbouncer. But if
people want to make and break 100s of connections per second, they
must not be doing very many queries per connection, so I don't know how
relevant that per-query slow-down is.
An in-core
solution ought to be able to arrange for each new connection to be
directly attached to an existing backend, using file-descriptor
passing.
But who would be doing the passing? For the postmaster to be doing
that would probably go against the minimalist design. It would have
to keep track of which backend is available, and which db and user it
is primed for. Perhaps a feature could be added to the backend to
allow it to get passed a FD from pgbouncer or pgpool-II and then hand
control back to the pooler upon "close" of the connection, as they
already have the infrastructure to keep pools around while the
postmaster does not. Are pgbouncer and pgpool close enough to "core"
to make such intimate collaboration with the backend OK?
Cheers,
Jeff
On 12/01/2010 05:32 AM, Jeff Janes wrote:
On 11/28/10, Robert Haas <robertmhaas@gmail.com> wrote:
In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previous proposed and with all the problems it
entails,
Is there a particular discussion of that matter you could point me to?
the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it.
Oracle's backend start up time seems to be way higher than PG's.
Their main solution is something that is fundamentally a built in
connection pooler with some bells and whistles built in. I'm not
sure "other people" you had in mind--Oracle is generally the one that
pops to my mind.
To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.
What about "well OK, we have PGbouncer"? Are there fixable
short-comings that it has which could make the issue less of an issue?
Well, I would very much like to see an integrated pooler in postgresql -
pgbouncer is a very nice piece of software (and might even be a base for
an integrated bouncer), but by not being closely tied to the backend you
are losing a lot.
One of the more obvious examples is that now that we have no flatfile
copy of pg_authid you have to use cruel hacks like:
http://www.depesz.com/index.php/2010/12/04/auto-refreshing-password-file-for-pgbouncer/
to get "automatic" management of roles. There are some other drawbacks
as well:
* no coordination of restarts/configuration changes between the cluster
and the pooler
* you have two pieces of config files to configure your pooling settings
(having all that available say in a catalog in pg would be awesome)
* you lose all of the advanced authentication features of pg (because
all connections need to go through the pooler) and also ip-based stuff
* no SSL support(in the case of pgbouncer)
* complexity in resetting backend state (we added some support for that
through explicit SQL level commands in recent releases but it still is a
hazard)
Stefan
* no coordination of restarts/configuration changes between the cluster
and the pooler
* you have two pieces of config files to configure your pooling settings
(having all that available say in a catalog in pg would be awesome)
* you lose all of the advanced authentication features of pg (because
all connections need to go through the pooler) and also ip-based stuff
* no SSL support(in the case of pgbouncer)
* complexity in resetting backend state (we added some support for that
through explicit SQL level commands in recent releases but it still is a
hazard)
More:
* pooler logs to a separate file, for which there are (currently) no
analysis tools
* pooling is incompatible with the use of ROLES for data security
The last is a major issue, and not one I think we can easily resolve.
MySQL has a pooling-friendly user system, because when you connect to
MySQL you basically always connect as the superuser and on connection it
switches you to your chosen login role. This, per Rob Wultsch, is one of
the things at the heart of allowing MySQL to support 100,000 low
frequency users per cheap hosting system.
As you might imagine, this behavior is also the source of a lot of
MySQL's security bugs. I don't see how we could imitate it without
getting the bugs as well.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Sun, Dec 5, 2010 at 11:59 AM, Josh Berkus <josh@agliodbs.com> wrote:
* no coordination of restarts/configuration changes between the cluster
and the pooler
* you have two pieces of config files to configure your pooling settings
(having all that available say in a catalog in pg would be awesome)
* you lose all of the advanced authentication features of pg (because
all connections need to go through the pooler) and also ip-based stuff
* no SSL support(in the case of pgbouncer)
* complexity in resetting backend state (we added some support for that
through explicit SQL level commands in recent releases but it still is a
hazard)
More:
* pooler logs to a separate file, for which there are (currently) no analysis
tools
* pooling is incompatible with the use of ROLES for data security
The last is a major issue, and not one I think we can easily resolve. MySQL
has a pooling-friendly user system, because when you connect to MySQL you
basically always connect as the superuser and on connection it switches you
to your chosen login role. This, per Rob Wultsch, is one of the things at
the heart of allowing MySQL to support 100,000 low frequency users per cheap
hosting system.
As you might imagine, this behavior is also the source of a lot of MySQL's
security bugs. I don't see how we could imitate it without getting the bugs
as well.
I think you have read a bit more into what I have said than is
correct. MySQL can deal with thousands of users and separate schemas
on commodity hardware. There are many design decisions (some
questionable) that have made MySQL much better in a shared hosting
environment than pg and I don't know where the grants system falls
into that.
MySQL does not have that many security problems because of how grants
are stored. Most MySQL security issues are DoS sort of stuff based on
an authenticated user being able to cause a crash. The decoupled
backend storage and a less than awesome parser shared most of the
blame for these issues.
One thing I would suggest that the PG community keeps in mind while
talking about built in connection process caching, is that it is very
nice feature for memory leaks caused by a connection to not exist for
and continue growing forever.
NOTE: 100k is not a number that I would put much stock in. I don't
recall ever mentioning that number and it is not a number that would
be truthful for me to throw out.
--
Rob Wultsch
wultsch@gmail.com
On Sun, Dec 5, 2010 at 12:45 PM, Rob Wultsch <wultsch@gmail.com> wrote:
One thing I would suggest that the PG community keeps in mind while
talking about built in connection process caching, is that it is very
nice feature for memory leaks caused by a connection to not exist for
and continue growing forever.
s/not exist for/not exist/
I have had issues with very slow leaks in MySQL building up over
months. It really sucks to have to go to management to ask for
downtime because of a slow memory leak.
--
Rob Wultsch
wultsch@gmail.com
On Sun, Dec 5, 2010 at 3:17 PM, Rob Wultsch <wultsch@gmail.com> wrote:
On Sun, Dec 5, 2010 at 12:45 PM, Rob Wultsch <wultsch@gmail.com> wrote:
One thing I would suggest that the PG community keeps in mind while
talking about built in connection process caching, is that it is very
nice feature for memory leaks caused by a connection to not exist for
and continue growing forever.
s/not exist for/not exist/
I have had issues with very slow leaks in MySQL building up over
months. It really sucks to have to go to management to ask for
downtime because of a slow memory leak.
Apache has a very simple and effective solution to this problem - they
have a configuration option controlling the number of connections a
child process handles before it dies and a new one is spawned. I've
found that setting this to 1000 works excellently. Process startup
overhead decreases by three orders of magnitude, and only egregiously
bad leaks add up to enough to matter.
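For illustration only, here is a minimal sketch of that pattern in C (the
names and the MAX_CONNS_PER_CHILD knob are made up, standing in for Apache's
MaxRequestsPerChild; this is not Apache or PostgreSQL source): each
pre-forked child serves a fixed quota of connections and then exits, so
whatever it slowly leaked goes back to the OS and the parent forks a
replacement.

/* Sketch of the Apache-style "exit after N connections" pattern. */
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CONNS_PER_CHILD 1000    /* plays the role of MaxRequestsPerChild */

static void
handle_connection(int fd)
{
    (void) fd;                      /* application work would go here */
}

static void
child_main(int listen_fd)
{
    int     served;

    for (served = 0; served < MAX_CONNS_PER_CHILD; served++)
    {
        int     conn = accept(listen_fd, NULL, NULL);

        if (conn < 0)
            continue;
        handle_connection(conn);
        close(conn);
    }

    /* Quota reached: exit voluntarily.  Any leaked memory disappears
     * with the process; the parent notices the exit and forks anew. */
    _exit(0);
}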
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch@gmail.com> wrote:
I think you have read a bit more into what I have said than is
correct. MySQL can deal with thousands of users and separate schemas
on commodity hardware. There are many design decisions (some
questionable) that have made MySQL much better in a shared hosting
environment than pg and I don't know where the grants system falls
into that.
Objection: Vague.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 4, 2010 at 8:04 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
But who would be doing the passing? For the postmaster to be doing
that would probably go against the minimalist design. It would have
to keep track of which backend is available, and which db and user it
is primed for. Perhaps a feature could be added to the backend to
allow it to get passed a FD from pgbouncer or pgpool-II and then hand
control back to the pooler upon "close" of the connection, as they
already have the infrastructure to keep pools around while the
postmaster does not. Are pgbouncer and pgpool close enough to "core"
to make such intimate collaboration with the backend OK?
I am not sure. I'm afraid that might be adding complexity without
really solving anything, but maybe I'm a pessimist.
One possible way to make an improvement in this area would be to
move the responsibility for accepting connections out of the
postmaster. Instead, you'd have a group of children that would all
call accept() on the socket, and the OS would arbitrarily pick one to
receive each new incoming connection. The postmaster would just be
responsible for making sure that there were enough children hanging
around. You could in fact make this change without doing anything
else, in which case it wouldn't save any work but would possibly
reduce connection latency a bit since more of the work could be done
before the connection actually arrived.
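For what it's worth, the shape of that is ordinary pre-forking server
boilerplate; a sketch follows (port, child count, and the comments are
made up for illustration, this is not a postmaster patch): the parent
binds and listens once, forks the children, and every child blocks in
accept() on the inherited socket, so the kernel picks which one gets
each new connection.

/* Pre-forked accept() sketch; hypothetical, not postmaster code. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_CHILDREN 8

static void
serve(int listen_fd)
{
    for (;;)
    {
        /* All children sleep here; the kernel wakes one per connection. */
        int     conn = accept(listen_fd, NULL, NULL);

        if (conn < 0)
            continue;
        /* ... authenticate, run the session, clean up ... */
        close(conn);
    }
}

int
main(void)
{
    int     i;
    int     listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5432);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
    listen(listen_fd, 64);

    /* Fork after listen(); each child inherits the listening socket. */
    for (i = 0; i < NUM_CHILDREN; i++)
        if (fork() == 0)
            serve(listen_fd);

    /* A real postmaster would also re-fork children as they exit. */
    for (;;)
        wait(NULL);
}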
From there, you could go two ways.
One option would be to have backends that would otherwise terminate
normally instead do the equivalent of DISCARD ALL and then go back
around and try to accept() another incoming connection. If they get a
guy who wants the database to which they previously connected, profit.
If not, laboriously flush every cache in sight and rebind to the new
database.
Another option would be to have backends that would otherwise
terminate normally instead do the equivalent of DISCARD ALL and then
mark themselves as able to accept a new connection to the same
database to which they are already connected (but not any other
database). Following authentication, a backend that accepted a new
incoming connection looks through the pool of such backends and, if it
finds one, hands off the connection using file-descriptor passing and
then loops back around to accept() again. Otherwise it handles the
connection itself. This wouldn't offer much of an advantage over the
first option for a cluster that basically has just one database, or
for a cluster that has 1000 actively used databases. But it would be
much better for a system with three databases.
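The descriptor handoff itself would presumably be the standard SCM_RIGHTS
dance over a Unix-domain socket between the two processes. A sketch of the
sending side (generic idiom, not existing PostgreSQL code; the function name
is made up):

/* Pass an open file descriptor to another process over a Unix-domain
 * socket using SCM_RIGHTS.  Sketch only. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int
send_fd(int channel_fd, int fd_to_pass)
{
    char            byte = 0;
    struct iovec    iov;
    struct msghdr   msg;
    struct cmsghdr *cmsg;
    union
    {
        struct cmsghdr  align;      /* ensures proper alignment */
        char            buf[CMSG_SPACE(sizeof(int))];
    }               cbuf;

    memset(&msg, 0, sizeof(msg));
    memset(&cbuf, 0, sizeof(cbuf));

    /* At least one byte of ordinary data must accompany the descriptor. */
    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf.buf;
    msg.msg_controllen = sizeof(cbuf.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    /* The receiver does the mirror-image recvmsg() and pulls its new
     * descriptor out of CMSG_DATA(), then uses it like any accepted fd. */
    return (sendmsg(channel_fd, &msg, 0) < 0) ? -1 : 0;
}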
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Dec 5, 2010 at 6:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch@gmail.com> wrote:
I think you have read a bit more into what I have said than is
correct. MySQL can deal with thousands of users and separate schemas
on commodity hardware. There are many design decisions (some
questionable) that have made MySQL much better in a shared hosting
environment than pg and I don't know where the grants system falls
into that.
Objection: Vague.
I retract the remark, your honor.
At some point Hackers should look at pg vs MySQL multi tenantry but it
is way tangential today.
--
Rob Wultsch
wultsch@gmail.com
On Sun, Dec 5, 2010 at 9:35 PM, Rob Wultsch <wultsch@gmail.com> wrote:
On Sun, Dec 5, 2010 at 6:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch@gmail.com> wrote:
I think you have read a bit more into what I have said than is
correct. MySQL can deal with thousands of users and separate schemas
on commodity hardware. There are many design decisions (some
questionable) that have made MySQL much better in a shared hosting
environment than pg and I don't know where the grants system falls
into that.
Objection: Vague.
I retract the remark, your honor.
Clarifying it would be fine, too... :-)
At some point Hackers should look at pg vs MySQL multi tenantry but it
is way tangential today.
My understanding is that our schemas work like MySQL databases; and
our databases are an even higher level of isolation. No?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
One possible way to make an improvement in this area would be to
move the responsibility for accepting connections out of the
postmaster. Instead, you'd have a group of children that would all
call accept() on the socket, and the OS would arbitrarily pick one to
receive each new incoming connection. The postmaster would just be
responsible for making sure that there were enough children hanging
around. You could in fact make this change without doing anything
else, in which case it wouldn't save any work but would possibly
reduce connection latency a bit since more of the work could be done
before the connection actually arrived.
This seems like potentially a good idea independent of anything else,
just to reduce connection latency: fork() (not to mention exec() on
Windows) now happens before, not after, receipt of the connection request.
However, I see a couple of stumbling blocks:
1. Does accept() work that way everywhere (Windows, I'm looking at you)?
2. What do you do when max_connections is exceeded, and you don't have
anybody at all listening on the socket? Right now we are at least able
to send back an error message explaining the problem.
Another issue that would require some thought is what algorithm the
postmaster uses for deciding to spawn new children. But that doesn't
sound like a potential showstopper.
regards, tom lane
On 12/06/2010 09:38 AM, Tom Lane wrote:
Another issue that would require some thought is what algorithm the
postmaster uses for deciding to spawn new children. But that doesn't
sound like a potential showstopper.
We'd probably want a couple of different ones, optimized for different
connection patterns. Realistically.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Dec 6, 2010 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
One possible way to make an improvement in this area would be to
move the responsibility for accepting connections out of the
postmaster. Instead, you'd have a group of children that would all
call accept() on the socket, and the OS would arbitrarily pick one to
receive each new incoming connection. The postmaster would just be
responsible for making sure that there were enough children hanging
around. You could in fact make this change without doing anything
else, in which case it wouldn't save any work but would possibly
reduce connection latency a bit since more of the work could be done
before the connection actually arrived.
This seems like potentially a good idea independent of anything else,
just to reduce connection latency: fork() (not to mention exec() on
Windows) now happens before, not after, receipt of the connection request.
However, I see a couple of stumbling blocks:
1. Does accept() work that way everywhere (Windows, I'm looking at you)?
Not sure. It might be useful to look at what Apache does, but I don't
have time to do that ATM.
2. What do you do when max_connections is exceeded, and you don't have
anybody at all listening on the socket? Right now we are at least able
to send back an error message explaining the problem.
Sending back an error message explaining the problem seems like a
non-negotiable requirement. I'm not quite sure how to dance around
this. Perhaps if max_connections is exhausted, the postmaster itself
joins the accept() queue and launches a dead-end backend for each new
connection. Or perhaps we reserve one extra backend slot for a
probably-dead-end backend that will just sit there and mail rejection
notices; except that if it sees that a regular backend slot has opened
up it grabs it and turns itself into a regular backend.
Another issue that would require some thought is what algorithm the
postmaster uses for deciding to spawn new children. But that doesn't
sound like a potential showstopper.
The obvious algorithm would be to try to keep N spare workers around.
Any time the number of unconnected backends drops below N the
postmaster starts spawning new ones until it gets back up to N. I
think the trick may not be the algorithm so much as finding a way to
make the signaling sufficiently robust and lightweight. For example,
I bet having each child that gets a new connection signal() the
postmaster is a bad plan.
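The policy half is trivial; something like this sketch (names hypothetical,
not postmaster code), with all the difficulty hidden in how spare_workers
gets updated cheaply and robustly:

/* Hypothetical "keep N spare workers" policy sketch. */
#define TARGET_SPARE_WORKERS 5

extern void fork_unconnected_backend(void); /* assumed helper */

static int  spare_workers;          /* maintained from children's reports */

static void
maintain_worker_pool(void)
{
    /* Whenever the count of idle, not-yet-connected backends drops
     * below the target, fork replacements until we're back at N. */
    while (spare_workers < TARGET_SPARE_WORKERS)
    {
        fork_unconnected_backend();
        spare_workers++;
    }
}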
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At some point Hackers should look at pg vs MySQL multi tenantry but it
is way tangential today.
My understanding is that our schemas work like MySQL databases; and
our databases are an even higher level of isolation. No?
That's correct. Drizzle is looking at implementing a feature like our
databases called "catalogs" (per the SQL spec).
Let me stress that not everyone is happy with the MySQL multi-tenantry
approach. But it does make multi-tenancy on a scale which you seldom
see with PG possible, even if it has problems. It's worth seeing
whether we can steal any of their optimization ideas without breaking PG.
I was specifically looking at the login model, which works around the
issue that we have: namely that different login ROLEs can't share a
connection pool. In MySQL, they can share the built-in connection
"pool" because role-switching effectively is a session variable.
AFAICT, anyway.
For that matter, if anyone knows any other DB which does multi-tenant
well/better, we should be looking at them too.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Dec 6, 2010 at 12:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
At some point Hackers should look at pg vs MySQL multi tenantry but it
is way tangential today.
My understanding is that our schemas work like MySQL databases; and
our databases are an even higher level of isolation. No?
That's correct. Drizzle is looking at implementing a feature like our
databases called "catalogs" (per the SQL spec).
Let me stress that not everyone is happy with the MySQL multi-tenantry
approach. But it does make multi-tenancy on a scale which you seldom see
with PG possible, even if it has problems. It's worth seeing whether we can
steal any of their optimization ideas without breaking PG.
Please make sure to articulate what you think is wrong with our existing model.
I was specifically looking at the login model, which works around the issue
that we have: namely that different login ROLEs can't share a connection
pool. In MySQL, they can share the built-in connection "pool" because
role-switching effectively is a session variable. AFAICT, anyway.
Please explain more precisely what is wrong with SET SESSION
AUTHORIZATION / SET ROLE.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Please explain more precisely what is wrong with SET SESSION
AUTHORIZATION / SET ROLE.
1) Session GUCS do not change with a SET ROLE (this is a TODO I haven't
had any time to work on)
2) Users can always issue their own SET ROLE and then "hack into" other
users' data.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh@agliodbs.com> wrote:
Please explain more precisely what is wrong with SET SESSION
AUTHORIZATION / SET ROLE.
1) Session GUCS do not change with a SET ROLE (this is a TODO I haven't
had any time to work on)
2) Users can always issue their own SET ROLE and then "hack into" other
users' data.
Makes sense. It would be nice to fix those issues, independent of
anything else.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of lun dic 06 23:09:56 -0300 2010:
On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh@agliodbs.com> wrote:
Please explain more precisely what is wrong with SET SESSION
AUTHORIZATION / SET ROLE.
1) Session GUCS do not change with a SET ROLE (this is a TODO I haven't
had any time to work on)
2) Users can always issue their own SET ROLE and then "hack into" other
users' data.
Makes sense. It would be nice to fix those issues, independent of
anything else.
It seems plausible to fix the first one, but how would you fix the
second one? You either allow SET ROLE (which you need, to support the
pooler changing authorization), or you don't. There doesn't seem to be
a usable middleground.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
It seems plausible to fix the first one, but how would you fix the
second one? You either allow SET ROLE (which you need, to support the
pooler changing authorization), or you don't. There doesn't seem to be
a usable middleground.
Well, this is why such a pooler would *have* to be built into the
backend. It would need to be able to SET ROLE even though SET ROLE
would not be accepted over the client connection. We'd also need
bookkeeping to track the ROLE (and other GUCs) of each client connection
and reset them whenever that client connection switches back.
Mind you, I'm not entirely convinced that the end result of this would
be performant. And they would certainly be complicated. I think that
we should start by dealing with the simplest situation, ignoring SET
ROLE and GUC issues for now.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Dec 6, 2010 at 9:37 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Robert Haas's message of lun dic 06 23:09:56 -0300 2010:
On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh@agliodbs.com> wrote:
Please explain more precisely what is wrong with SET SESSION
AUTHORIZATION / SET ROLE.
1) Session GUCS do not change with a SET ROLE (this is a TODO I haven't
had any time to work on)
2) Users can always issue their own SET ROLE and then "hack into" other
users' data.
Makes sense. It would be nice to fix those issues, independent of
anything else.
It seems plausible to fix the first one, but how would you fix the
second one? You either allow SET ROLE (which you need, to support the
pooler changing authorization), or you don't. There doesn't seem to be
a usable middleground.
You could add a protocol message that does a "permanent" role switch
in a way that can't be undone except by another such protocol message.
Then connection poolers could simply refuse to proxy that particular
message.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/12/10 10:48, Josh Berkus wrote:
It seems plausible to fix the first one, but how would you fix the
second one? You either allow SET ROLE (which you need, to support the
pooler changing authorization), or you don't. There doesn't seem to be
a usable middleground.
Well, this is why such a pooler would *have* to be built into the
backend. It would need to be able to SET ROLE even though SET ROLE
would not be accepted over the client connection.
There's actually another way to do that that could be retrofitted onto
an existing external pooler. It's not lovely, but if the approach above
proved too hard...
SET ROLE could accept a cookie / one-time password that had to be passed
to RESET ROLE in order for RESET ROLE to accept the command.
SET ROLE fred WITH COOKIE 'goqu8Mi6choht8ie';
-- hand to the user
-- blah blah user work blah
-- returned by the user
RESET ROLE WITH COOKIE 'goqu8Mi6choht8ie';
The tricky bit might be that the user should still be permitted to SET
ROLE, but only to roles that the role the pooler switched them to
("fred") has rights to SET ROLE to, not to roles that the pooler user
itself has rights to switch to.
We'd also need
bookkeeping to track the ROLE (and other GUCs) of each client connection
and reset them whenever that client connection switches back.
I'm really interested in this direction. Taken just a little further, it
could bring Pg to the point where query executors (backends) are
separated from connection state, so a given backend could pick up and
work on queries by several different connections in rapid succession.
The advantage there is that idle connections would become cheap,
low-overhead affairs.
As I (poorly) understand how Pg is designed, it'd only be possible for a
backend to work on queries that act on the same database; it couldn't
really switch databases. That'd still be a real bonus, especially for
newer users who don't realize they *need* a connection pool.
--
System & Network Administrator
POST Newspapers