1. 11 9月, 2013 10 次提交
    • G
      list_lru: dynamically adjust node arrays · 5ca302c8
      Glauber Costa 提交于
      We currently use a compile-time constant to size the node array for the
      list_lru structure.  Due to this, we don't need to allocate any memory at
      initialization time.  But as a consequence, the structures that contain
      embedded list_lru lists can become way too big (the superblock for
      instance contains two of them).
      
      This patch aims at ameliorating this situation by dynamically allocating
      the node arrays with the firmware provided nr_node_ids.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5ca302c8
    • D
      xfs: fix dquot isolation hang · 35163417
      Dave Chinner 提交于
      The new LRU list isolation code in xfs_qm_dquot_isolate() isn't
      completely up to date.  Firstly, it needs conversion to return enum
      lru_status values, not raw numbers. Secondly - most importantly - it
      fails to unlock the dquot and relock the LRU in the LRU_RETRY path.
      This leads to deadlocks in xfstests generic/232. Fix them.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      35163417
    • A
      xfs-convert-dquot-cache-lru-to-list_lru-fix · 2f5b56f8
      Andrew Morton 提交于
      fix warnings
      
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2f5b56f8
    • D
      xfs: convert dquot cache lru to list_lru · cd56a39a
      Dave Chinner 提交于
      Convert the XFS dquot lru to use the list_lru construct and convert the
      shrinker to being node aware.
      
      [glommer@openvz.org: edited for conflicts + warning fixes]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cd56a39a
    • D
      xfs: rework buffer dispose list tracking · a4082357
      Dave Chinner 提交于
      In converting the buffer lru lists to use the generic code, the locking
      for marking the buffers as on the dispose list was lost.  This results in
      confusion in LRU buffer tracking and acocunting, resulting in reference
      counts being mucked up and filesystem beig unmountable.
      
      To fix this, introduce an internal buffer spinlock to protect the state
      field that holds the dispose list information.  Because there is now
      locking needed around xfs_buf_lru_add/del, and they are used in exactly
      one place each two lines apart, get rid of the wrappers and code the logic
      directly in place.
      
      Further, the LRU emptying code used on unmount is less than optimal.
      Convert it to use a dispose list as per a normal shrinker walk, and repeat
      the walk that fills the dispose list until the LRU is empty.  Thi avoids
      needing to drop and regain the LRU lock for every item being freed, and
      allows the same logic as the shrinker isolate call to be used.  Simpler,
      easier to understand.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a4082357
    • A
      xfs-convert-buftarg-lru-to-generic-code-fix · addbda40
      Andrew Morton 提交于
      fix warnings
      
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      addbda40
    • D
      xfs: convert buftarg LRU to generic code · e80dfa19
      Dave Chinner 提交于
      Convert the buftarg LRU to use the new generic LRU list and take advantage
      of the functionality it supplies to make the buffer cache shrinker node
      aware.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e80dfa19
    • D
      fs: convert inode and dentry shrinking to be node aware · 9b17c623
      Dave Chinner 提交于
      Now that the shrinker is passing a node in the scan control structure, we
      can pass this to the the generic LRU list code to isolate reclaim to the
      lists on matching nodes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9b17c623
    • D
      shrinker: convert superblock shrinkers to new API · 0a234c6d
      Dave Chinner 提交于
      Convert superblock shrinker to use the new count/scan API, and propagate
      the API changes through to the filesystem callouts.  The filesystem
      callouts already use a count/scan API, so it's just changing counters to
      longs to match the VM API.
      
      This requires the dentry and inode shrinker callouts to be converted to
      the count/scan API.  This is mainly a mechanical change.
      
      [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0a234c6d
    • G
      super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Glauber Costa 提交于
      The sysctl knob sysctl_vfs_cache_pressure is used to determine which
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the aproximation errors will be negligible.  But if
      this is not the case, specially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0, if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      55f841ce
  2. 04 9月, 2013 3 次提交
  3. 31 8月, 2013 7 次提交
    • E
      Fix wrong flag ASSERT in xfs_attr_shortform_getvalue · 914ed44b
      Eric Sandeen 提交于
      This ASSERT is testing an if_flags flag value against
      a di_aformat enum value.  di_aformat is never assigned
      XFS_IFINLINE.
      
      This happens to work for now, because XFS_IFINLINE has
      the same value as XFS_DINODE_FMT_LOCAL, and that's tested
      just before we call this function.
      
      However, I think the intention is to assert that we have
      read in the data, i.e. XFS_IFINLINE on if_flags, before
      we use if_data.  This is done in other places through the
      code as well.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      914ed44b
    • D
      xfs: finish removing IOP_* macros. · 904c17e6
      Dave Chinner 提交于
      In optimising the CIL operations, some of the IOP_* macros for
      calling log item operations were removed. Remove the rest of them as
      Christoph requested.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NGeoffrey Wehrman <gwehrman@sgi.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      904c17e6
    • D
      xfs: inode log reservations are too small · 23956703
      Dave Chinner 提交于
      We've been seeing occasional problems with log space leaks and
      transaction underruns such as this for some time:
      
       XFS (dm-0): xlog_write: reservation summary:
         trans type  = FSYNC_TS (36)
         unit res    = 2740 bytes
         current res = -4 bytes
         total reg   = 0 bytes (o/flow = 0 bytes)
         ophdrs      = 0 (ophdr space = 0 bytes)
         ophdr + reg = 0 bytes
         num regions = 0
      
      Turns out that xfstests generic/311 is reliably reproducing this
      problem with the test it runs at sequence 16 of it execution. It is
      a 100% reliable reproducer with the mkfs configuration of "-b
      size=1024 -m crc=1" on a 10GB scratch device.
      
      The problem? Inode forks in btree format are logged in memory
      format, not disk format (i.e. bmbt format, not bmdr format). That
      means there is a btree block header being logged, when such a
      structure is never written to the inode fork in bmdr format. The
      bmdr header in the inode is only 4 bytes, while the bmbt header is
      24 bytes for v4 filesystems and 72 bytes for v5 filesystems.
      
      We currently reserve the inode size plus the rounded up overhead of
      a logging a buffer, which is 128 bytes. That means the reservation
      for a 512 byte inode is 640 bytes. What we can actually log is:
      
      	inode core, data and attr fork = 512 bytes
      	inode log format + log op header = 56 + 12 = 68 bytes
      	data fork bmbt hdr = 24/72 bytes
      	attr fork bmbt hdr = 24/72 bytes
      
      So, for a v2 inodes we can log at least 628 bytes, but if we split that
      inode over the end of the log across log buffers, we need to also
      another log op header, which takes us to 640 bytes. If there's
      another reservation taken out of this that I haven't taken into
      account (perhaps multiple iclog splits?) or I haven't corectly
      calculated the bmbt format space used (entirely possible), then
      we will overun it.
      
      For v3 inodes the maximum is actually 724 bytes, and even a
      single maximally sized btree format fork can blow it (652 bytes).
      And that's exactly what is happening with the FSYNC_TS transaction
      in the above output - it's consumed 644 bytes of space after the CIL
      context took the space reserved for it (2100 bytes).
      
      This problem has always been present in the XFS code - the btree
      format inode forks have always been logged in this manner. Hence
      there has always been the possibility of an overrun with such a
      transaction. The CRC code has just exposed it frequently enough to
      be able to debug and understand the root cause....
      
      So, let's fix all the inode log space reservations.
      
      [ I'm so glad we spent the effort to clean up the transaction
        reservation code. This is an easy fix now. ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      23956703
    • B
      xfs: check correct status variable for xfs_inobt_get_rec() call · b121099d
      Brian Foster 提交于
      The call to xfs_inobt_get_rec() in xfs_dialloc_ag() passes 'j' as
      the output status variable. The immediately following
      XFS_WANT_CORRUPTED_GOTO() checks the value of 'i,' which is from
      the previous lookup call and has already been checked. Fix the
      corruption check to use 'j.'
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      b121099d
    • D
      xfs: inode buffers may not be valid during recovery readahead · d8914002
      Dave Chinner 提交于
      CRC enabled filesystems fail log recovery with 100% reliability on
      xfstests xfs/085 with the following failure:
      
      XFS (vdb): Mounting Filesystem
      XFS (vdb): Starting recovery (logdev: internal)
      XFS (vdb): Corruption detected. Unmount and run xfs_repair
      XFS (vdb): bad inode magic/vsn daddr 144 #0 (magic=0)
      XFS: Assertion failed: 0, file: fs/xfs/xfs_inode_buf.c, line: 95
      
      The problem is that the inode buffer has not been recovered before
      the readahead on the inode buffer is issued. The checkpoint being
      recovered actually allocates the inode chunk we are doing readahead
      from, so what comes from disk during readahead is essentially
      random and the verifier barfs on it.
      
      This inode buffer readahead problem affects non-crc filesystems,
      too, but xfstests does not trigger it at all on such
      configurations....
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      d8914002
    • D
      xfs: check LSN ordering for v5 superblocks during recovery · 50d5c8d8
      Dave Chinner 提交于
      Log recovery has some strict ordering requirements which unordered
      or reordered metadata writeback can defeat. This can occur when an
      item is logged in a transaction, written back to disk, and then
      logged in a new transaction before the tail of the log is moved past
      the original modification.
      
      The result of this is that when we read an object off disk for
      recovery purposes, the buffer that we read may not contain the
      object type that recovery is expecting and hence at the end of the
      checkpoint being recovered we have an invalid object in memory.
      
      This isn't usually a problem, as recovery will then replay all the
      other checkpoints and that brings the object back to a valid and
      correct state, but the issue is that while the object is in the
      invalid state it can be flushed to disk. This results in the object
      verifier failing and triggering a corruption shutdown of log
      recover. This is correct behaviour for the verifiers - the problem
      is that we are not detecting that the object we've read off disk is
      newer than the transaction we are replaying.
      
      All metadata in v5 filesystems has the LSN of it's last modification
      stamped in it. This enabled log recover to read that field and
      determine the age of the object on disk correctly. If the LSN of the
      object on disk is older than the transaction being replayed, then we
      replay the modification. If the LSN of the object matches or is more
      recent than the transaction's LSN, then we should avoid overwriting
      the object as that is what leads to the transient corrupt state.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      50d5c8d8
    • D
      xfs: btree block LSN escaping to disk uninitialised · b58fa554
      Dave Chinner 提交于
      When testing LSN ordering code for v5 superblocks, it was discovered
      that the the LSN embedded in the generic btree blocks was
      occasionally uninitialised. These values didn't get written to disk
      by metadata writeback - they got written by previous transactions in
      log recovery.
      
      The issue is here that the when the block is first allocated and
      initialised, the LSN field was not initialised - it gets overwritten
      before IO is issued on the buffer - but the value that is logged by
      transactions that modify the header before it is written to disk
      (and initialised) contain garbage. Hence the first recovery of the
      buffer will stamp garbage into the LSN field, and that can cause
      subsequent transactions to not replay correctly.
      
      The fix is simply to initialise the bb_lsn field to zero when we
      initialise the block for the first time.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      b58fa554
  4. 30 8月, 2013 1 次提交
  5. 29 8月, 2013 2 次提交
    • D
      xfs: fix bad dquot buffer size in log recovery readahead · 0f0d3345
      Dave Chinner 提交于
      xfstests xfs/087 fails 100% reliably with this assert:
      
      XFS (vdb): Mounting Filesystem
      XFS (vdb): Starting recovery (logdev: internal)
      XFS: Assertion failed: bp->b_flags & XBF_STALE, file: fs/xfs/xfs_buf.c, line: 548
      
      while trying to read a dquot buffer in xlog_recover_dquot_ra_pass2().
      
      The issue is that the buffer length to read that is passed to
      xfs_buf_readahead is in units of filesystem blocks, not disk blocks.
      (i.e. FSB, not daddr). Fix it but putting the correct conversion in
      place.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      0f0d3345
    • D
      xfs: don't account buffer cancellation during log recovery readahead · 84a5b730
      Dave Chinner 提交于
      When doing readhaead in log recovery, we check to see if buffers are
      cancelled before doing readahead. If we find a cancelled buffer,
      however, we always decrement the reference count we have on it, and
      that means that readahead is causing a double decrement of the
      cancelled buffer reference count.
      
      This results in log recovery *replaying cancelled buffers* as the
      actual recovery pass does not find the cancelled buffer entry in the
      commit phase of the second pass across a transaction. On debug
      kernels, this results in an ASSERT failure like so:
      
      XFS: Assertion failed: !(flags & XFS_BLF_CANCEL), file: fs/xfs/xfs_log_recover.c, line: 1815
      
      xfstests generic/311 reproduces this ASSERT failure with 100%
      reproducability.
      
      Fix it by making readahead only peek at the buffer cancelled state
      rather than the full accounting that xlog_check_buffer_cancelled()
      does.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      84a5b730
  6. 27 8月, 2013 2 次提交
  7. 24 8月, 2013 2 次提交
  8. 23 8月, 2013 1 次提交
  9. 22 8月, 2013 3 次提交
    • M
      xfs: add xfs sb v4 support for dirent filetype field · 3e3c51ce
      Mark Tinguely 提交于
      Add XFS superblock v4 support for the file type field in the
      directory entry feature.
      
      This support adds a feature bit for version 4 superblocks and
      leaves the original superblock 5 incompatibility bit.
      Signed-off-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NGeoffrey Wehrman <gwehrman@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      3e3c51ce
    • D
      xfs: Add write support for dirent filetype field · 1c55cece
      Dave Chinner 提交于
      Add support to propagate and add filetype values into the on-disk
      directs. This involves passing the filetype into the xfs_da_args
      structure along with the name and namelength for direct operations,
      and encoding it into the dirent at the same time we write the inode
      number into the dirent.
      
      With write support, add the feature flag to the
      XFS_SB_FEAT_INCOMPAT_ALL mask so we can now mount filesystems with
      this feature set.
      
      Performance of directory recursion is now much improved. Parallel
      walk of ~50 million directory entries across hundreds of directories
      improves significantly. Unpatched, no CRCs:
      
      Walking via ls -R
      
      real    3m19.886s
      user    6m36.960s
      sys     28m19.087s
      
      THis is doing roughly 500 getdents() calls per second, and 250,000
      inode lookups per second to determine the inode type at roughly
      17,000 read IOPS. CPU usage is 90% kernel space.
      
      With dtype support patched in and the fileset recreated with CRCs
      enabled:
      
      Walking via ls -R
      
      real    0m31.316s
      user    6m32.975s
      sys     0m21.111s
      
      This is doing roughly 3500 getdents() calls per second at 16,000
      IOPS. There are no inode lookups at all. CPU usages is almost 100%
      userspace.
      
      This is a big win for recursive directory walks that only need to
      find file names and file types.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1c55cece
    • D
      xfs: Add read-only support for dirent filetype field · 0cb97766
      Dave Chinner 提交于
      Add support for the file type field in directory entries so that
      readdir can return the type of the inode the dirent points to to
      userspace without first having to read the inode off disk.
      
      The encoding of the type field is a single byte that is added to the
      end of the directory entry name length. For all intents and
      purposes, it appends a "hidden" byte to the name field which
      contains the type information. As the directory entry is already of
      dynamic size, helpers are already required to access and decode the
      direct entry structures.
      
      Hence the relevent extraction and iteration helpers are updated to
      understand the hidden byte.  Helpers for reading and writing the
      filetype field from the directory entries are also added. Only the
      read helpers are used by this patch.  It also adds all the code
      necessary to read the type information out of the dirents on disk.
      
      Further we add the superblock feature bit and helpers to indicate
      that we understand the on-disk format change. This is not a
      compatible change - existing kernels cannot read the new format
      successfully - so an incompatible feature flag is added. We don't
      yet allow filesystems to mount with this flag yet - that will be
      added once write support is added.
      
      Finally, the code to take the type from the VFS, convert it to an
      XFS on-disk type and put it into the xfs_name structures passed
      around is added, but the directory code does not use this field yet.
      That will be in the next patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      0cb97766
  10. 21 8月, 2013 9 次提交