1. 01 October 2014, 1 commit
  2. 30 September 2014, 2 commits
  3. 27 September 2014, 6 commits
  4. 18 September 2014, 11 commits
    • nfsd4: clarify how grace period ends · 70b28235
      J. Bruce Fields committed
      The grace period is ended in two steps--first userland is notified that
      the grace period is now long enough that any clients who have not yet
      reclaimed can be safely forgotten, then we flip the switch that forbids
      reclaims and allows new opens.  I had to think a bit to convince myself
      that the ordering was right here.  Document it.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
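
      A minimal sketch of that ordering, assuming helper names modeled on the
      nfsd recovery code (nfsd4_record_grace_done() and locks_end_grace() are
      the two operations the text describes; the wrapper itself is
      illustrative):

        /* End-of-grace must notify userland *before* flipping the
         * switch, so reclaims are never refused while userland is
         * still free to forget unreclaimed clients. */
        static void end_grace_sketch(struct nfsd_net *nn)
        {
                /* Step 1: tell userland that clients which have not
                 * yet reclaimed can now be safely forgotten. */
                nfsd4_record_grace_done(nn);

                /* Step 2: only now forbid reclaims and allow new
                 * opens. */
                locks_end_grace(&nn->nfsd4_manager);
        }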
    • nfsd4: stop grace_time update at end of grace period · bea57fe4
      J. Bruce Fields committed
      The attempt to automatically set a new grace period time at the end of
      the grace period isn't really helpful.  We'll probably shut down and
      reboot before we actually make use of the new grace period time anyway.
      So we may as well leave it up to the init system to get this right.
      
      This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime
      change from what they set it to.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients · 65decb65
      Jeff Layton committed
      In the case of v4.0 clients, we may call into the "create" client
      tracking operation multiple times (once for each openowner). However,
      upcalling for each one of those is wasteful and slow. We can skip doing
      further "create" operations after the first one if we know that one has
      already been done.
      
      v4.1+ clients generally only call into this function once (on
      RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
      STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
      lift the grace period early since the timestamp has a different meaning
      in the case where the client is expected to issue a RECLAIM_COMPLETE.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
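
      A hedged sketch of the skip, using the NFSD4_CLIENT_STABLE bit from the
      companion patch below; do_cltrack_upcall() is a hypothetical stand-in
      for the usermode-helper call:

        static void nfsd4_umh_cltrack_create(struct nfs4_client *clp)
        {
                /* v4.0 clients upcall once per openowner; after the
                 * first successful "create" the STABLE bit lets us
                 * skip the rest. */
                if (clp->cl_minorversion == 0 &&
                    test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
                        return;

                /* v4.1+ clients arrive here once, on RECLAIM_COMPLETE,
                 * and must always upcall so the stored timestamp keeps
                 * its RECLAIM_COMPLETE meaning for nfsdcltrack. */
                if (do_cltrack_upcall("create", clp) == 0)
                        set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
        }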
    • nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls · 788a7914
      Jeff Layton committed
      The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
      which basically results in an upcall every time we call into the client
      tracking ops.
      
      Change it to set this bit on a successful "check" or "create" request,
      and clear it on a "remove" request.  Also, check to see if that bit is
      set before upcalling on a "check" or "remove" request, and skip
      upcalling appropriately, depending on its state.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
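
      A sketch of the "check" path under these semantics, with
      do_cltrack_upcall() again a hypothetical stand-in; "remove" would do
      the inverse, clearing the bit and skipping the upcall when the bit was
      never set:

        static int nfsd4_umh_cltrack_check(struct nfs4_client *clp)
        {
                int ret;

                /* A prior successful "check" or "create" already
                 * proved the client record is on stable storage:
                 * skip the upcall entirely. */
                if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
                        return 0;

                ret = do_cltrack_upcall("check", clp);
                if (ret == 0)
                        set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
                return ret;
        }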
    • nfsd: serialize nfsdcltrack upcalls for a particular client · d682e750
      Jeff Layton committed
      In a later patch, we want to add a flag that will allow us to reduce the
      need for upcalls. In order to handle that correctly, we'll need to
      ensure that racing upcalls for the same client can't occur. In practice
      it should be rare for this to occur with a well-behaved client, but it
      is possible.
      
      Convert one of the bits in the cl_flags field to be an upcall bitlock,
      and use it to ensure that upcalls for the same client are serialized.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
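
      A sketch of the bit-lock, assuming a cl_flags bit named
      NFSD4_CLIENT_UPCALL_LOCK as described above:

        static void cltrack_upcall_lock(struct nfs4_client *clp)
        {
                /* sleep until the previous upcall for this client
                 * has finished and released the bit */
                wait_on_bit_lock(&clp->cl_flags,
                                 NFSD4_CLIENT_UPCALL_LOCK,
                                 TASK_UNINTERRUPTIBLE);
        }

        static void cltrack_upcall_unlock(struct nfs4_client *clp)
        {
                smp_mb__before_atomic();
                clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
                smp_mb__after_atomic();
                wake_up_bit(&clp->cl_flags, NFSD4_CLIENT_UPCALL_LOCK);
        }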
    • nfsd: pass extra info in env vars to upcalls to allow for early grace period end · d4318acd
      Jeff Layton committed
      In order to support lifting the grace period early, we must tell
      nfsdcltrack what sort of client the "create" upcall is for. We can't
      reliably tell if a v4.0 client has completed reclaiming, so we can only
      lift the grace period once all the v4.1+ clients have issued a
      RECLAIM_COMPLETE and if there are no v4.0 clients.
      
      Also, in order to lift the grace period, we have to tell userland when
      the grace period started so that it can tell whether a RECLAIM_COMPLETE
      has been issued for each client since then.
      
      Since this is all optional info, we pass it along in environment
      variables to the "init" and "create" upcalls. By doing this, we don't
      need to revise the upcall format. The UMH upcall can simply make use of
      this info if it happens to be present. If it's not then it can just
      avoid lifting the grace period early.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
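
      An illustrative sketch of how optional info can travel through the
      usermode-helper environment; the argv layout and the "NAME=value"
      strings are assumptions, not the exact upcall format:

        static int cltrack_upcall(char *cmd, char *arg,
                                  char *has_session, char *grace_start)
        {
                char *argv[] = { "/sbin/nfsdcltrack", cmd, arg, NULL };
                char *envp[3];
                int i = 0;

                /* Optional "NAME=value" strings; an older nfsdcltrack
                 * simply never looks at them, so the upcall format
                 * itself needs no revision. */
                if (has_session)
                        envp[i++] = has_session;
                if (grace_start)
                        envp[i++] = grace_start;
                envp[i] = NULL;

                return call_usermodehelper(argv[0], argv, envp,
                                           UMH_WAIT_PROC);
        }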
    • nfsd: add a v4_end_grace file to /proc/fs/nfsd · 7f5ef2e9
      Jeff Layton committed
      Allow a privileged userland process to end the v4 grace period early.
      Writing "Y", "y", or "1" to the file will cause the v4 grace period to
      be lifted.  The basic idea with this will be to allow the userland
      client tracking program to lift the grace period once it knows that no
      more clients will be reclaiming state.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
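
      A minimal sketch of the handler semantics, assuming an
      nfsd4_end_grace() entry point is callable from the nfsctl write path;
      the exact /proc plumbing is omitted:

        static ssize_t write_v4_end_grace(struct nfsd_net *nn,
                                          const char *buf, size_t size)
        {
                if (!size)
                        return -EINVAL;

                switch (buf[0]) {
                case 'Y':
                case 'y':
                case '1':
                        /* lift the v4 grace period immediately */
                        nfsd4_end_grace(nn);
                        return size;
                default:
                        return -EINVAL;
                }
        }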
    • lockd: add a /proc/fs/lockd/nlm_end_grace file · d68e3c4a
      Jeff Layton committed
      Add a new procfile that will allow a (privileged) userland process to
      end the NLM grace period early. The basic idea here will be to have
      sm-notify write to this file if it sent out no NOTIFY requests when it
      ran. In that situation, we can generally expect that there will be
      no reclaim requests so the grace period can be lifted early.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
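
      A hedged sketch of wiring up the procfile; lockd_end_grace_operations
      and its write handler are assumed to exist analogously to the nfsd
      file above:

        static int __init lockd_create_procfs(void)
        {
                struct proc_dir_entry *dir;

                dir = proc_mkdir("fs/lockd", NULL);
                if (!dir)
                        return -ENOMEM;
                /* a write of "Y"/"y"/"1" ends the NLM grace period */
                if (!proc_create("nlm_end_grace", S_IRUGO | S_IWUSR,
                                 dir, &lockd_end_grace_operations))
                        return -ENOMEM;
                return 0;
        }

      sm-notify would then write "Y" to /proc/fs/lockd/nlm_end_grace
      whenever it sent no NOTIFY requests.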
    • nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE · 3b3e7b72
      Jeff Layton committed
      As stated in RFC 5661, section 18.51.3:
      
          Once a RECLAIM_COMPLETE is done, there can be no further reclaim
          operations for locks whose scope is defined as having completed
          recovery.  Once the client sends RECLAIM_COMPLETE, the server will
          not allow the client to do subsequent reclaims of locking state for
          that scope and, if these are attempted, will return
          NFS4ERR_NO_GRACE.
      
      Ensure that we enforce that requirement.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
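
      A sketch of the enforcement, assuming the existing per-client
      RECLAIM_COMPLETE flag; the helper itself is illustrative:

        static __be32 check_reclaim_allowed(struct nfs4_client *clp)
        {
                /* RFC 5661, 18.51.3: once RECLAIM_COMPLETE is done,
                 * further reclaims for that scope get
                 * NFS4ERR_NO_GRACE. */
                if (test_bit(NFSD4_CLIENT_RECLAIM_COMPLETE,
                             &clp->cl_flags))
                        return nfserr_no_grace;
                return nfs_ok;
        }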
    • nfsd: remove redundant boot_time parm from grace_done client tracking op · 919b8049
      Jeff Layton committed
      Since it's stored in nfsd_net, we don't need to pass it in separately.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • lockd: move lockd's grace period handling into its own module · f7790029
      Jeff Layton committed
      Currently, all of the grace period handling is part of lockd. Eventually
      though we'd like to be able to build v4-only servers, at which point
      we'll need to put all of this elsewhere.
      
      Move the code itself into fs/nfs_common and have it build a grace.ko
      module. Then, rejigger the Kconfig options so that both nfsd and lockd
      enable it automatically.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
  5. 11 September 2014, 1 commit
  6. 08 September 2014, 1 commit
  7. 05 September 2014, 2 commits
  8. 04 September 2014, 6 commits
  9. 03 September 2014, 3 commits
  10. 02 September 2014, 7 commits
    • f2fs: reposition unlock_new_inode to prevent accessing invalid inode · b73e5282
      Chao Yu committed
      Due to a race condition on the inode cache, the following scenario can occur:
      [Thread a]				[Thread b]
      					->f2fs_mkdir
      					  ->f2fs_add_link
      					    ->__f2fs_add_link
      					      ->init_inode_metadata failed here
      ->gc_thread_func
        ->f2fs_gc
          ->do_garbage_collect
            ->gc_data_segment
              ->f2fs_iget
                ->iget_locked
                  ->wait_on_inode
      					  ->unlock_new_inode
              ->move_data_page
      					  ->make_bad_inode
      					  ->iput
      
      When we fail in create/symlink/mkdir/mknod/tmpfile, the newly allocated
      inode should be marked bad so that other threads cannot access it. But
      in the above scenario, f2fs can access the invalid inode before it has
      been marked bad. This patch fixes that potential problem; the issue was
      found by code review.
      
      change log from v1:
       o Add a condition check in gc_data_segment(), as suggested by Changman Lee.
       o Use iget_failed() to simplify the code.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
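
      A sketch of the v2 error path; the helper name is hypothetical, but
      iget_failed() is the VFS call that combines make_bad_inode(),
      unlock_new_inode(), and iput():

        static int f2fs_fail_new_inode(struct inode *inode, int err)
        {
                clear_nlink(inode);
                /* iget_failed() marks the inode bad, unlocks it, and
                 * drops it in one step, so a concurrent f2fs_iget()
                 * waiting in wait_on_inode() can only ever observe
                 * the inode after it is already bad. */
                iget_failed(inode);
                return err;
        }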
    • xfs: trim eofblocks before collapse range · 41b9d726
      Brian Foster committed
      xfs_collapse_file_space() currently writes back the entire file
      undergoing collapse range to settle things down for the extent shift
      algorithm. While this prevents changes to the extent list during the
      collapse operation, the writeback itself is not enough to prevent
      unnecessary collapse failures.
      
      The current shift algorithm uses the extent index to iterate the in-core
      extent list. If a post-eof delalloc extent persists after the writeback
      (e.g., a prior zero range op where the end of the range aligns with eof
      can separate the post-eof blocks such that they are not written back and
      converted), xfs_bmap_shift_extents() becomes confused over the encoded
      br_startblock value and fails the collapse.
      
      As with the full writeback, this is a temporary fix until the algorithm
      is improved to cope with a volatile extent list and avoid attempts to
      shift post-eof extents.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
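
      A sketch of the resulting preparation step; the helper name is
      illustrative, the two calls are standard XFS/VFS ones:

        static int xfs_collapse_prepare(struct xfs_mount *mp,
                                        struct xfs_inode *ip)
        {
                int error;

                /* Free post-eof delalloc blocks first, so none
                 * survive... */
                error = xfs_free_eofblocks(mp, ip, false);
                if (error)
                        return error;

                /* ...the writeback to later confuse
                 * xfs_bmap_shift_extents(). */
                return filemap_write_and_wait(VFS_I(ip)->i_mapping);
        }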
    • xfs: xfs_file_collapse_range is delalloc challenged · 1669a8ca
      Dave Chinner committed
      If we have delalloc extents on a file before we run a collapse range
      operation, we sync the range that we are going to collapse to
      convert delalloc extents in that region to real extents to simplify
      the shift operation.
      
      However, the shift operation then assumes that the extent list is
      not going to change as it iterates over the extent list moving
      things about. Unfortunately, this isn't true because we can't hold
      the ILOCK over all the operations. We can prevent new IO from
      modifying the extent list by holding the IOLOCK, but that doesn't
      prevent writeback from running....
      
      And when writeback runs, it can convert delalloc extents in the
      range of the file prior to the region being collapsed, and this
      changes the indexes of all the extents in the file. That causes the
      collapse range operation to Go Bad.
      
      The right fix is to rewrite the extent shift operation not to be
      dependent on the extent list not changing across the entire
      operation, but this is a fairly significant piece of work to do.
      Hence, as a short-term workaround for the problem, sync the entire
      file before starting a collapse operation to remove all delalloc
      ranges from the file and so avoid the problem of concurrent
      writeback changing the extent list.
      Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
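
      A sketch of the short-term workaround; the helper name is
      illustrative:

        static int xfs_flush_for_collapse(struct xfs_inode *ip)
        {
                /* A ranged flush would leave delalloc extents earlier
                 * in the file; writeback converting those mid-shift
                 * changes every in-core extent index the collapse is
                 * iterating over.  Flushing the whole file removes
                 * them all up front. */
                return filemap_write_and_wait(VFS_I(ip)->i_mapping);
        }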
    • xfs: don't log inode unless extent shift makes extent modifications · ca446d88
      Brian Foster committed
      The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
      all subsequent extents down into the specified, previously punched out,
      region. This function performs some validation, such as whether a
      sufficient hole exists in the target region of the collapse, then shifts
      the remaining extents downward.
      
      The exit path of the function currently logs the inode unconditionally.
      While we must log the inode (and abort) if an error occurs and the
      transaction is dirty, the initial validation paths can generate errors
      before the transaction has been dirtied. This creates an unnecessary
      filesystem shutdown scenario, as the caller will cancel a transaction
      that has been marked dirty.
      
      Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
      are made to the inode bmap. Only log the inode in the exit path if
      logflags has been set. This ensures we only have to cancel a dirty
      transaction if modifications have been made and prevents an unnecessary
      filesystem shutdown otherwise.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
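
      A sketch of the exit-path rule: callers OR flags such as
      XFS_ILOG_CORE and XFS_ILOG_DEXT into logflags as they modify the bmap,
      and the exit path logs only when something was actually set (helper
      name illustrative):

        static void xfs_shift_exit_log(struct xfs_trans *tp,
                                       struct xfs_inode *ip,
                                       int logflags)
        {
                /* Early validation failures leave logflags at 0, so
                 * the caller cancels a still-clean transaction instead
                 * of shutting down the filesystem over a dirty
                 * cancel. */
                if (logflags)
                        xfs_trans_log_inode(tp, ip, logflags);
        }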
    • xfs: use ranged writeback and invalidation for direct IO · 7d4ea3ce
      Dave Chinner committed
      Now that we are not doing silly things with dirtying buffers beyond
      EOF and are using invalidation correctly, we can finally reduce the
      ranges of writeback and invalidation used by direct IO to match those
      of the IO being issued.
      
      Bring the writeback and invalidation ranges back to match the
      generic direct IO code - this will greatly reduce the perturbation
      of cached data when direct IO and buffered IO are mixed, but still
      provide the same buffered vs direct IO coherency behaviour we
      currently have.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
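
      A sketch of a ranged flush-and-invalidate for a direct IO of count
      bytes at pos, assuming the 2014-era PAGE_CACHE_SHIFT constant; the
      helper is illustrative:

        static int dio_flush_and_invalidate(struct address_space *mapping,
                                            loff_t pos, size_t count)
        {
                int ret;

                /* Write back only the pages the IO overlaps... */
                ret = filemap_write_and_wait_range(mapping, pos,
                                                   pos + count - 1);
                if (ret)
                        return ret;

                /* ...and invalidate only those, leaving the rest of
                 * the cached data undisturbed when buffered and
                 * direct IO are mixed. */
                return invalidate_inode_pages2_range(mapping,
                                pos >> PAGE_CACHE_SHIFT,
                                (pos + count - 1) >> PAGE_CACHE_SHIFT);
        }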
    • xfs: don't zero partial page cache pages during O_DIRECT writes · 834ffca6
      Dave Chinner committed
      Similar to direct IO reads, direct IO writes are using 
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      cc: stable@vger.kernel.org
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't zero partial page cache pages during O_DIRECT writes · 85e584da
      Chris Mason committed
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems, which
      only invalidate pages during DIO writes.
      
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      
      Buffered reads will then find an up-to-date page full of zeros instead
      of the data actually on disk.
      
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      
      [dchinner: catch error and warn if it fails. Comment.]
      
      cc: stable@vger.kernel.org
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
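
      A sketch of the substitution both this patch and the write-side patch
      above make, including the warn-on-failure added in the bracketed
      dchinner note; the helper is illustrative:

        static void dio_invalidate_range(struct address_space *mapping,
                                         loff_t pos, loff_t end)
        {
                /* Unlike truncate_pagecache_range(), this drops whole
                 * pages and never zeroes partial ranges in the page
                 * cache, so a later buffered read cannot find zeros
                 * in place of the data actually on disk. */
                int ret = invalidate_inode_pages2_range(mapping,
                                pos >> PAGE_CACHE_SHIFT,
                                end >> PAGE_CACHE_SHIFT);
                WARN_ON_ONCE(ret);
        }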