1. 13 Mar 2016, 1 commit
  2. 11 Mar 2016, 1 commit
    • A
      Allow to trigger kernel writeback after a configurable number of writes. · 428b1d6b
      Committed by Andres Freund
      Currently writes to the main data files of postgres all go through the
      OS page cache. This means that some operating systems can end up
      collecting a large number of dirty buffers in their respective page
      caches.  When these dirty buffers are flushed to storage rapidly, be it
      because of fsync(), timeouts, or dirty ratios, latency for other reads
      and writes can increase massively.  This is the primary reason for
      regular massive stalls observed in real world scenarios and artificial
      benchmarks; on rotating disks stalls on the order of hundreds of seconds
      have been observed.
      
      On Linux it is possible to control this by reducing the global dirty
      limits significantly, mitigating the above problem. But global
      configuration is rather problematic because it'll affect other
      applications; also, PostgreSQL itself doesn't always want this
      behavior, e.g. for temporary files it's undesirable.
      
      Several operating systems allow some control over the kernel page
      cache. Linux has sync_file_range(2), several POSIX systems have msync(2)
      and posix_fadvise(2). sync_file_range(2) is preferable because it
      requires no special setup, whereas msync() requires the to-be-flushed
      range to be mmap'ed. For the purpose of flushing dirty data
      posix_fadvise(2) is the worst alternative, as flushing dirty data is
      just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
      from the page cache.  Thus the feature is enabled by default only on
      Linux, but can be enabled on all systems that have any of the above
      APIs.
      
      While desirable and likely possible this patch does not contain an
      implementation for windows.
      
      With the infrastructure added, writes made via checkpointer, bgwriter
      and normal user backends can be flushed after a configurable number of
      writes. Each of these sources of writes is controlled by a separate GUC:
      checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
      respectively. They're separate because the number of writes after which
      flushing is beneficial differs for each, and because the performance
      considerations of controlled flushing differ for each as well.
      
      A later patch will add checkpoint sorting - after that, flushes from the
      checkpoint will almost always be desirable. Bgwriter flushes are most of
      the time going to be random writes, which are slow on lots of storage
      hardware. Flushing in backends works well if the storage and bgwriter
      can keep up, but if not it can have negative consequences.  This patch
      is likely to have negative performance consequences without checkpoint
      sorting, but unfortunately so does sorting without flush control.
      
      Discussion: alpine.DEB.2.10.1506011320000.28433@sto
      Author: Fabien Coelho and Andres Freund
      428b1d6b
  3. 10 Mar 2016, 1 commit
    • A
      Introduce durable_rename() and durable_link_or_rename(). · 606e0f98
      Committed by Andres Freund
      Renaming a file using rename(2) is not guaranteed to be durable in the
      face of crashes, especially on filesystems like xfs and ext4 when
      mounted with data=writeback. To be certain that a rename() atomically
      replaces the previous file contents in the face of crashes and different
      filesystems, one has to fsync the old filename, rename the file, fsync
      the new filename, and fsync the containing directory.  This sequence is
      not generally adhered to currently, which exposes us to data-loss risks.
      To avoid having to repeat this arduous sequence, introduce
      durable_rename(), which wraps all that.
      
      Also add durable_link_or_rename(). Several places use link() (with a
      fallback to rename()) to rename a file, trying to avoid replacing the
      target file out of paranoia. Some of those rename sequences need to be
      durable as well. There seems little reason to extend several copies of
      the same logic, so centralize the link() callers.
      
      This commit does not yet make use of the new functions; they're used in
      a followup commit.
      
      Author: Michael Paquier, Andres Freund
      Discussion: 56583BDD.9060302@2ndquadrant.com
      Backpatch: All supported branches
      606e0f98
  4. 08 Mar 2016, 1 commit
  5. 03 Jan 2016, 1 commit
  6. 30 May 2015, 1 commit
    • T
      Remove special cases for ETXTBSY from new fsync'ing logic. · 57e1138b
      Committed by Tom Lane
      The argument that this is a sufficiently-expected case to be silently
      ignored seems pretty thin.  Andres had brought it up back when we were
      still considering that most fsync failures should be hard errors, and it
      probably would be legit not to fail hard for ETXTBSY --- but the same is
      true for EROFS and other cases, which is why we gave up on hard failures.
      ETXTBSY is surely not a normal case, so logging the failure seems fine
      from here.
      57e1138b
  7. 29 May 2015, 2 commits
    • T
      Fix fsync-at-startup code to not treat errors as fatal. · d8179b00
      Committed by Tom Lane
      Commit 2ce439f3 introduced a rather serious
      regression, namely that if its scan of the data directory came across any
      un-fsync-able files, it would fail and thereby prevent database startup.
      Worse yet, symlinks to such files also caused the problem, which meant that
      crash restart was guaranteed to fail on certain common installations such
      as older Debian.
      
      After discussion, we agreed that (1) failure to start is worse than any
      consequence of not fsync'ing is likely to be, therefore treat all errors
      in this code as nonfatal; (2) we should not chase symlinks other than
      those that are expected to exist, namely pg_xlog/ and tablespace links
      under pg_tblspc/.  The latter restriction avoids possibly fsync'ing a
      much larger part of the filesystem than intended, if the user has left
      random symlinks hanging about in the data directory.
      
      This commit takes care of that and also does some code beautification,
      mainly moving the relevant code into fd.c, which seems a much better place
      for it than xlog.c, and making sure that the conditional compilation for
      the pre_sync_fname pass has something to do with whether pg_flush_data
      works.
      
      I also relocated the call site in xlog.c down a few lines; it seems a
      bit silly to be doing this before ValidateXLOGDirectoryStructure().
      
      The similar logic in initdb.c ought to be made to match this, but that
      change is noncritical and will be dealt with separately.
      
      Back-patch to all active branches, like the prior commit.
      
      Abhijit Menon-Sen and Tom Lane
      d8179b00
    • T
      Fix assorted inconsistencies in our calls of readlink(). · 32f628be
      Committed by Tom Lane
      Ensure that we null-terminate the result string (one place in pg_rewind).
      Be paranoid about out-of-range results from readlink() (should not happen,
      but there is no good reason for some call sites to be careful about it and
      others not).  Consistently use the whole buffer, not sometimes one byte
      less.  Ensure we emit an appropriate errcode() in all cases.  Spell the
      error messages the same way.
      
      The only serious bug here is the missing null-termination in pg_rewind,
      which is new code, so no need for a back-patch.
      
      Abhijit Menon-Sen and Tom Lane
      32f628be
  8. 24 May 2015, 1 commit
  9. 19 May 2015, 1 commit
  10. 05 May 2015, 2 commits
    • R
      Fix some problems with patch to fsync the data directory. · 456ff086
      Committed by Robert Haas
      pg_win32_is_junction() was a typo for pgwin32_is_junction().  open()
      was used not only in a two-argument form, which breaks on Windows,
      but also where BasicOpenFile() should have been used.
      
      Per reports from Andrew Dunstan and David Rowley.
      456ff086
    • R
      Recursively fsync() the data directory after a crash. · 2ce439f3
      Committed by Robert Haas
      Otherwise, if there's another crash, some writes from after the first
      crash might make it to disk while writes from before the crash fail
      to make it to disk.  This could lead to data corruption.
      
      Back-patch to all supported versions.
      
      Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
      by me.
      2ce439f3
  11. 07 Jan 2015, 1 commit
  12. 06 Nov 2014, 1 commit
    • H
      Move the backup-block logic from XLogInsert to a new file, xloginsert.c. · 2076db2a
      Committed by Heikki Linnakangas
      xlog.c is huge; this makes it a little bit smaller, which is nice. The
      functions related to putting together WAL records are in xloginsert.c,
      and the lower-level machinery for managing WAL buffers and such stays
      in xlog.c.
      
      Also move the definition of XLogRecord to a separate header file. This
      causes churn in the #includes of all the files that write WAL records, and
      redo routines, but it avoids pulling in xlog.h into most places.
      
      Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
      2076db2a
  13. 07 May 2014, 1 commit
    • B
      pgindent run for 9.4 · 0a783200
      Committed by Bruce Momjian
      This includes removing tabs after periods in C comments, which was
      applied to back branches, so this change should not affect backpatching.
      0a783200
  14. 01 May 2014, 1 commit
    • T
      Rationalize common/relpath.[hc]. · 2d001904
      Committed by Tom Lane
      Commit a7301839 created rather a mess by
      putting dependencies on backend-only include files into include/common.
      We really shouldn't do that.  To clean it up:
      
      * Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
      catalog/catalog.h.  We won't consider this symbol part of the FE/BE API.
      
      * Push enum ForkNumber from relfilenode.h into relpath.h.  We'll consider
      relpath.h as the source of truth for fork numbers, since relpath.c was
      already partially serving that function, and anyway relfilenode.h was
      kind of a random place for that enum.
      
      * So, relfilenode.h now includes relpath.h rather than vice-versa.  This
      direction of dependency is fine.  (That allows most, but not quite all,
      of the existing explicit #includes of relpath.h to go away again.)
      
      * Push forkname_to_number from catalog.c to relpath.c, just to centralize
      fork number stuff a bit better.
      
      * Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
      that the previous commit didn't keep this together with relpath().
      
      * To avoid needing relfilenode.h in common/, redefine the underlying
      function (now called GetRelationPath) as taking separate OID arguments,
      and make the APIs using RelFileNode or RelFileNodeBackend into macro
      wrappers.  (The macros have a potential multiple-eval risk, but none of
      the existing call sites have an issue with that; one of them had such a
      risk already anyway.)
      
      * Fix failure to follow the directions when "init" fork type was added;
      specifically, the errhint in forkname_to_number wasn't updated, and neither
      was the SGML documentation for pg_relation_size().
      
      * Fix tablespace-path-too-long check in CreateTableSpace() to account for
      fork-name component of maximum-length pathnames.  This requires putting
      FORKNAMECHARS into a header file, but it was rather useless (and
      actually unreferenced) where it was.
      
      The last couple of items are potentially back-patchable bug fixes,
      if anyone is sufficiently excited about them; but personally I'm not.
      
      Per a gripe from Christoph Berg about how include/common wasn't
      self-contained.
      2d001904
  15. 22 Mar 2014, 2 commits
  16. 13 Mar 2014, 1 commit
  17. 24 Jan 2014, 1 commit
    • T
      Allow use of "z" flag in our printf calls, and use it where appropriate. · ac4ef637
      Committed by Tom Lane
      Since C99, it's been standard for printf and friends to accept a "z" size
      modifier, meaning "whatever size size_t has".  Up to now we've generally
      dealt with printing size_t values by explicitly casting them to unsigned
      long and using the "l" modifier; but this is really the wrong thing on
      platforms where pointers are wider than longs (such as Win64).  So let's
      start using "z" instead.  To ensure we can do that on all platforms, teach
      src/port/snprintf.c to understand "z", and add a configure test to force
      use of that implementation when the platform's version doesn't handle "z".
      
      Having done that, modify a bunch of places that were using the
      unsigned-long hack to use "z" instead.  This patch doesn't pretend to have
      gotten everyplace that could benefit, but it catches many of them.  I made
      an effort in particular to ensure that all uses of the same error message
      text were updated together, so as not to increase the number of
      translatable strings.
      
      It's possible that this change will result in format-string warnings from
      pre-C99 compilers.  We might have to reconsider if there are any popular
      compilers that will warn about this; but let's start by seeing what the
      buildfarm thinks.
      
      Andres Freund, with a little additional work by me
      ac4ef637
  18. 08 Jan 2014, 1 commit
  19. 04 Sep 2013, 1 commit
  20. 10 Jun 2013, 1 commit
    • T
      Remove fixed limit on the number of concurrent AllocateFile() requests. · 007556bf
      Committed by Tom Lane
      AllocateFile(), AllocateDir(), and some sister routines share a small array
      for remembering requests, so that the files can be closed on transaction
      failure.  Previously that array had a fixed size, MAX_ALLOCATED_DESCS (32).
      While historically that had seemed sufficient, Steve Toutant pointed out
      that this meant you couldn't scan more than 32 file_fdw foreign tables in
      one query, because file_fdw depends on the COPY code which uses
      AllocateFile().  There are probably other cases, or will be in the future,
      where this nonconfigurable limit impedes users.
      
      We can't completely remove any such limit, at least not without a lot of
      work, since each such request requires a kernel file descriptor and most
      platforms limit the number we can have.  (In principle we could
      "virtualize" these descriptors, as fd.c already does for the main VFD pool,
      but not without an additional layer of overhead and a lot of notational
      impact on the calling code.)  But we can at least let the array size be
      configurable.  Hence, change the code to allow up to max_safe_fds/2
      allocated file requests.  On modern platforms this should allow several
      hundred concurrent file_fdw scans, or more if one increases the value of
      max_files_per_process.  To go much further than that, we'd need to do some
      more work on the data structure, since the current code for closing
      requests has potentially O(N^2) runtime; but it should still be all right
      for request counts in this range.
      
      Back-patch to 9.1 where contrib/file_fdw was introduced.
      007556bf
  21. 17 May 2013, 1 commit
    • T
      Fix fd.c to preserve errno where needed. · 6563fb2b
      Committed by Tom Lane
      PathNameOpenFile failed to ensure that the correct value of errno was
      returned to its caller after a failure (because it incorrectly supposed
      that free() can never change errno).  In some cases this would result
      in a user-visible failure because an expected ENOENT errno was replaced
      with something else.  Bogus EINVAL failures have been observed on OS X,
      for example.
      
      There were also a couple of places that could mangle an important value
      of errno if FDDEBUG was defined.  While the usefulness of that debug
      support is highly debatable, we might as well make it safe to use,
      so add errno save/restore logic to the DO_DB macro.
      
      Per bug #8167 from Nelson Minar, diagnosed by RhodiumToad.
      Back-patch to all supported branches.
      6563fb2b
  22. 28 Feb 2013, 1 commit
    • H
      Add support for piping COPY to/from an external program. · 3d009e45
      Committed by Heikki Linnakangas
      This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding
      psql \copy syntax. Like with reading/writing files, the backend version is
      superuser-only, and in the psql version, the program is run in the client.
      
      In passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if
      stdin/stdout is quoted, it's now interpreted as a filename. For example,
      "\copy foo from 'stdin'" now reads from a file called 'stdin', not from
      standard input. Before this, there was no way to specify a filename
      called stdin, stdout, pstdin or pstdout.
      
      This creates a new function in pgport, wait_result_to_str(), which can
      be used to convert the exit status of a process, as returned by wait(3),
      to a human-readable string.
      
      Etsuro Fujita, reviewed by Amit Kapila.
      3d009e45
  23. 22 Feb 2013, 1 commit
    • A
      Move relpath() to libpgcommon · a7301839
      Committed by Alvaro Herrera
      This enables non-backend code, such as pg_xlogdump, to use it easily.
      The previous location, in src/backend/catalog/catalog.c, made that
      essentially impossible because that file depends on many backend-only
      facilities; so this needs to live separately.
      a7301839
  24. 20 Feb 2013, 1 commit
  25. 08 Feb 2013, 2 commits
  26. 02 Jan 2013, 1 commit
  27. 27 Nov 2012, 1 commit
    • H
      Add OpenTransientFile, with automatic cleanup at end-of-xact. · 1f67078e
      Committed by Heikki Linnakangas
      Files opened with BasicOpenFile or PathNameOpenFile are not automatically
      cleaned up on error. That puts unnecessary burden on callers that only
      want to keep the file open for a short time. There is AllocateFile, but
      that returns a buffered FILE * stream, which in many cases is not the
      nicest API to work with. So add a function called OpenTransientFile,
      which returns an unbuffered fd that's cleaned up like the FILE * returned
      by AllocateFile().
      
      This plugs a few rare fd leaks in error cases:
      
      1. copy_file() - fixed by using OpenTransientFile instead of BasicOpenFile
      2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
         use OpenTransientFile here because the fd is supposed to persist over
         transaction boundaries.
      3. lo_import/lo_export - fixed by using OpenTransientFile instead of
         PathNameOpenFile.
      
      In addition to plugging those leaks, this replaces many BasicOpenFile() calls
      with OpenTransientFile() that were not leaking, because the code meticulously
      closed the file on error. That wasn't strictly necessary, but IMHO it's good
      for robustness.
      
      The same leaks exist in older versions, but given the rarity of the issues,
      I'm not backpatching this. Not yet, anyway - it might be good to backpatch
      later, after this mechanism has had some more testing in master branch.
      1f67078e
  28. 18 Oct 2012, 1 commit
    • T
      Revert "Use "transient" files for blind writes, take 2". · 9bacf0e3
      Committed by Tom Lane
      This reverts commit fba105b1.
      That approach had problems with the smgr-level state not tracking what
      we really want to happen, and with the VFD-level state not tracking the
      smgr-level state very well either.  In consequence, it was still possible
      to hold kernel file descriptors open for long-gone tables (as in recent
      report from Tore Halset), and yet there were also cases of FDs being closed
      undesirably soon.  A replacement implementation will follow.
      9bacf0e3
  29. 29 Aug 2012, 1 commit
    • A
      Split resowner.h · 45326c5a
      Committed by Alvaro Herrera
      This lets files that are mere users of ResourceOwner not automatically
      include the headers for stuff that is managed by the resowner mechanism.
      45326c5a
  30. 22 Jul 2012, 1 commit
    • T
      Improve copydir() code for the case that fsync is off. · 2d46a57d
      Committed by Tom Lane
      We should avoid calling sync_file_range or posix_fadvise in this case,
      since (a) we don't really care if the data gets synced, and might as
      well save the kernel calls; (b) at least on Linux we know that the
      kernel might block us until it's scheduled the write.
      
      Also, avoid making a useless second traversal of the directory tree
      if we're not actually going to call fsync(2) after all.
      2d46a57d
  31. 14 Jul 2012, 1 commit
    • T
      Add fsync capability to initdb, and use sync_file_range() if available. · b966dd6c
      Committed by Tom Lane
      Historically we have not worried about fsync'ing anything during initdb
      (in fact, initdb intentionally passes -F to each backend launch to prevent
      it from fsync'ing).  But with filesystems getting more aggressive about
      caching data, that's not such a good plan anymore.  Make initdb do a pass
      over the finished data directory tree to fsync everything.  For testing
      purposes, the -N/--nosync flag can be used to restore the old behavior.
      
      Also, testing shows that on Linux, sync_file_range() is much faster than
      posix_fadvise() for hinting to the kernel that an fsync is coming,
      apparently because the latter blocks on a rather small request queue while
      the former doesn't.  So use this function if available in initdb, and also
      in the backend's pg_flush_data() (where it currently will affect only the
      speed of CREATE DATABASE's cloning step).
      
      We will later make pg_regress invoke initdb with the --nosync flag
      to avoid slowing down cases such as "make check" in contrib.  But
      let's not do so until we've shaken out any portability issues in this
      patch.
      
      Jeff Davis, reviewed by Andres Freund
      b966dd6c
  32. 11 Jun 2012, 1 commit
  33. 29 Mar 2012, 1 commit
    • H
      Inherit max_safe_fds to child processes in EXEC_BACKEND mode. · 5762a4d9
      Committed by Heikki Linnakangas
      Postmaster sets max_safe_fds by testing how many open file descriptors it
      can open, and that is normally inherited by all child processes at fork().
      Not so with EXEC_BACKEND, i.e. on Windows, however. Because of that, we
      effectively ignored max_files_per_process on Windows, and always assumed
      a conservative default of 32 simultaneous open files. That could have an
      impact on performance, if you need to access a lot of different files
      in a query. After this patch, the value is passed to child processes by
      save/restore_backend_variables() among many other global variables.
      
      It has been like this forever, but given the lack of complaints about it,
      I'm not backpatching this.
      5762a4d9
  34. 22 Mar 2012, 1 commit
  35. 28 Jan 2012, 1 commit
  36. 26 Jan 2012, 1 commit