1. 18 Sep 2014, 11 commits
    • nfsd4: clarify how grace period ends · 70b28235
      Committed by J. Bruce Fields
      The grace period is ended in two steps--first userland is notified that
      the grace period is now long enough that any clients who have not yet
      reclaimed can be safely forgotten, then we flip the switch that forbids
      reclaims and allows new opens.  I had to think a bit to convince myself
      that the ordering was right here.  Document it.
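      
      As a sketch of that ordering (abridged; the helper names follow
      nfsd, but treat the body as illustrative rather than the exact
      code):
      
          static void nfsd4_end_grace(struct nfsd_net *nn)
          {
                  /*
                   * 1) Tell userland that clients which never reclaimed
                   *    may now be safely forgotten.
                   */
                  nfsd4_record_grace_done(nn);
                  /*
                   * 2) Only then flip the switch that forbids reclaims
                   *    and allows new opens; in the other order a new
                   *    open could grab state that a not-yet-forgotten
                   *    client still needs to reclaim.
                   */
                  locks_end_grace(&nn->nfsd4_manager);
          }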
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd4: stop grace_time update at end of grace period · bea57fe4
      Committed by J. Bruce Fields
      The attempt to automatically set a new grace period time at the end of
      the grace period isn't really helpful.  We'll probably shut down and
      reboot before we actually make use of the new grace period time anyway.
      So may as well leave it up to the init system to get this right.
      
      This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime
      change from what they set it to.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients · 65decb65
      Committed by Jeff Layton
      In the case of v4.0 clients, we may call into the "create" client
      tracking operation multiple times (once for each openowner). Upcalling
      for each one of those is wasteful and slow however. We can skip doing
      further "create" operations after the first one if we know that one has
      already been done.
      
      v4.1+ clients generally only call into this function once (on
      RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
      STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
      lift the grace period early since the timestamp has a different meaning
      in the case where the client is expected to issue a RECLAIM_COMPLETE.
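      
      A minimal sketch of the skip, assuming the NFSD4_CLIENT_STABLE bit
      introduced alongside this change (the upcall helper name here is
      illustrative):
      
          static void client_record_create(struct nfs4_client *clp)
          {
                  /* v4.0 only: a previous "create" already recorded us */
                  if (!clp->cl_minorversion &&
                      test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
                          return;
                  do_cltrack_create_upcall(clp);  /* illustrative name */
                  set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
          }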
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls · 788a7914
      Committed by Jeff Layton
      The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
      which basically results in an upcall every time we call into the client
      tracking ops.
      
      Change it to set this bit on a successful "check" or "create" request,
      and clear it on a "remove" request.  Also, check to see if that bit is
      set before upcalling on a "check" or "remove" request, and skip
      upcalling appropriately, depending on its state.
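      
      For the "check" case this works out to roughly the following
      sketch (the upcall helper name is illustrative):
      
          static int client_tracking_check(struct nfs4_client *clp)
          {
                  int ret;
      
                  /* skip the upcall if stable storage already knows us */
                  if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
                          return 0;
                  ret = do_cltrack_check_upcall(clp);  /* illustrative */
                  if (ret == 0)
                          set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
                  return ret;
          }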
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • nfsd: serialize nfsdcltrack upcalls for a particular client · d682e750
      Committed by Jeff Layton
      In a later patch, we want to add a flag that will allow us to reduce the
      need for upcalls. In order to handle that correctly, we'll need to
      ensure that racing upcalls for the same client can't occur. In practice
      it should be rare for this to occur with a well-behaved client, but it
      is possible.
      
      Convert one of the bits in the cl_flags field to be an upcall bitlock,
      and use it to ensure that upcalls for the same client are serialized.
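      
      Sketched with the kernel's bit-wait API (the flag name is the one
      this patch converts; details are illustrative):
      
          static void cltrack_upcall_lock(struct nfs4_client *clp)
          {
                  wait_on_bit_lock(&clp->cl_flags,
                                   NFSD4_CLIENT_UPCALL_LOCK,
                                   TASK_UNINTERRUPTIBLE);
          }
      
          static void cltrack_upcall_unlock(struct nfs4_client *clp)
          {
                  smp_mb__before_atomic();
                  clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
                  smp_mb__after_atomic();
                  wake_up_bit(&clp->cl_flags, NFSD4_CLIENT_UPCALL_LOCK);
          }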
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • nfsd: pass extra info in env vars to upcalls to allow for early grace period end · d4318acd
      Committed by Jeff Layton
      In order to support lifting the grace period early, we must tell
      nfsdcltrack what sort of client the "create" upcall is for. We can't
      reliably tell if a v4.0 client has completed reclaiming, so we can only
      lift the grace period once all the v4.1+ clients have issued a
      RECLAIM_COMPLETE and if there are no v4.0 clients.
      
      Also, in order to lift the grace period, we have to tell userland when
      the grace period started so that it can tell whether a RECLAIM_COMPLETE
      has been issued for each client since then.
      
      Since this is all optional info, we pass it along in environment
      variables to the "init" and "create" upcalls. By doing this, we don't
      need to revise the upcall format. The UMH upcall can simply make use of
      this info if it happens to be present. If it's not then it can just
      avoid lifting the grace period early.
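      
      A sketch of the UMH side; the environment variable names shown in
      the comments are assumptions about this series' choices:
      
          static int umh_cltrack_upcall(char *cmd, char *arg,
                                        char *env0, char *env1)
          {
                  char *argv[] = { "/sbin/nfsdcltrack", cmd, arg, NULL };
                  char *envp[3];
                  int i = 0;
      
                  /* optional, e.g. "NFSDCLTRACK_CLIENT_HAS_SESSION=Y"
                   * or "NFSDCLTRACK_GRACE_START=<grace start time>" */
                  if (env0)
                          envp[i++] = env0;
                  if (env1)
                          envp[i++] = env1;
                  envp[i] = NULL;
      
                  return call_usermodehelper(argv[0], argv, envp,
                                             UMH_WAIT_PROC);
          }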
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • nfsd: add a v4_end_grace file to /proc/fs/nfsd · 7f5ef2e9
      Committed by Jeff Layton
      Allow a privileged userland process to end the v4 grace period early.
      Writing "Y", "y", or "1" to the file will cause the v4 grace period to
      be lifted.  The basic idea with this will be to allow the userland
      client tracking program to lift the grace period once it knows that no
      more clients will be reclaiming state.
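      
      The accepted values suggest a write handler along these lines (a
      sketch; nfsd4_end_grace() stands in for the helper that actually
      lifts the grace period):
      
          static ssize_t write_v4_end_grace(struct nfsd_net *nn,
                                            char *buf, size_t size)
          {
                  if (!size)
                          return -EINVAL;
                  switch (buf[0]) {
                  case 'Y':
                  case 'y':
                  case '1':
                          nfsd4_end_grace(nn);
                          return size;
                  default:
                          return -EINVAL;
                  }
          }
      
      From a shell, writing "Y" to /proc/fs/nfsd/v4_end_grace then lifts
      the grace period.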
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • lockd: add a /proc/fs/lockd/nlm_end_grace file · d68e3c4a
      Committed by Jeff Layton
      Add a new procfile that will allow a (privileged) userland process to
      end the NLM grace period early. The basic idea here will be to have
      sm-notify write to this file if it sends out no NOTIFY requests when
      it runs. In that situation, we can generally expect that there will be
      no reclaim requests so the grace period can be lifted early.
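      
      From sm-notify's side this could look like the following sketch
      (assuming the file accepts the same "Y" convention as nfsd's
      v4_end_grace; error handling omitted):
      
          #include <stdio.h>
      
          static void nlm_end_grace_early(void)
          {
                  FILE *f = fopen("/proc/fs/lockd/nlm_end_grace", "w");
      
                  if (f) {
                          fputs("Y", f);
                          fclose(f);
                  }
          }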
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE · 3b3e7b72
      Committed by Jeff Layton
      As stated in RFC 5661, section 18.51.3:
      
          Once a RECLAIM_COMPLETE is done, there can be no further reclaim
          operations for locks whose scope is defined as having completed
          recovery.  Once the client sends RECLAIM_COMPLETE, the server will
          not allow the client to do subsequent reclaims of locking state for
          that scope and, if these are attempted, will return
          NFS4ERR_NO_GRACE.
      
      Ensure that we enforce that requirement.
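      
      In nfsd terms the enforcement is roughly (a sketch built on the
      existing NFSD4_CLIENT_RECLAIM_COMPLETE flag and error code):
      
          static __be32 check_reclaim_allowed(struct nfs4_client *clp)
          {
                  if (test_bit(NFSD4_CLIENT_RECLAIM_COMPLETE,
                               &clp->cl_flags))
                          return nfserr_no_grace;
                  return nfs_ok;
          }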
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • nfsd: remove redundant boot_time parm from grace_done client tracking op · 919b8049
      Committed by Jeff Layton
      Since it's stored in nfsd_net, we don't need to pass it in separately.
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
    • lockd: move lockd's grace period handling into its own module · f7790029
      Committed by Jeff Layton
      Currently, all of the grace period handling is part of lockd. Eventually
      though we'd like to be able to build v4-only servers, at which point
      we'll need to put all of this elsewhere.
      
      Move the code itself into fs/nfs_common and have it build a grace.ko
      module. Then, rejigger the Kconfig options so that both nfsd and lockd
      enable it automatically.
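      
      The module keeps lockd's existing entry points; an abridged sketch
      of fs/nfs_common/grace.c:
      
          static int grace_net_id;
          static DEFINE_SPINLOCK(grace_lock);
      
          void locks_start_grace(struct net *net, struct lock_manager *lm)
          {
                  struct list_head *grace_list = net_generic(net,
                                                             grace_net_id);
      
                  spin_lock(&grace_lock);
                  list_add(&lm->list, grace_list);
                  spin_unlock(&grace_lock);
          }
          EXPORT_SYMBOL_GPL(locks_start_grace);
      
          int locks_in_grace(struct net *net)
          {
                  struct list_head *grace_list = net_generic(net,
                                                             grace_net_id);
      
                  return !list_empty(grace_list);
          }
          EXPORT_SYMBOL_GPL(locks_in_grace);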
      Signed-off-by: Jeff Layton <jlayton@primarydata.com>
  2. 11 Sep 2014, 1 commit
  3. 04 Sep 2014, 6 commits
  4. 03 Sep 2014, 2 commits
  5. 29 Aug 2014, 3 commits
  6. 19 Aug 2014, 1 commit
    • nfsd: allow turning off nfsv3 readdir_plus · 18c01ab3
      Committed by Rajesh Ghanekar
      One of our customers' applications needs only file names, not file
      attributes. With directories having 10K+ inodes (assuming the buffer
      cache holds the directory blocks with the file names, but the inode
      cache is limited and hence older cached inodes need eviction), older
      inodes are evicted periodically. So if they keep doing readdir(2) from
      the NFS client on multiple directories, some directories' files are
      periodically removed from the inode cache, and a new readdir(2) on the
      same directory requires disk access to bring the inodes back into the
      inode cache.
      
      Since a READDIRPLUS request also fetches attributes, doing a getattr
      on each file on the server, it causes unnecessary disk accesses. If
      READDIRPLUS on the NFS client is answered with -ENOTSUPP, the NFS
      client falls back to READDIR, which gets just the names of the files
      in a directory, not their attributes, hence avoiding disk accesses on
      the server.
      
      There's already a corresponding client-side mount option, but an export
      option reduces the need for configuration across multiple clients.
      
      This flag affects NFSv3 only.  If it turns out it's needed for NFSv4 as
      well then we may have to figure out how to extend the behavior to NFSv4,
      but it's not currently obvious how to do that.
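      
      Server-side the check is small; a sketch, assuming the new export
      option maps to an NFSEXP_NOREADDIRPLUS flag in ex_flags:
      
          static bool readdirplus_disabled(struct svc_rqst *rqstp)
          {
                  return rqstp->rq_export &&
                         (rqstp->rq_export->ex_flags &
                          NFSEXP_NOREADDIRPLUS);
          }
      
          /* early in nfsd3_proc_readdirplus():
           *      if (readdirplus_disabled(rqstp))
           *              return nfserr_notsupp;
           */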
      Signed-off-by: Rajesh Ghanekar <rajesh_ghanekar@symantec.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  7. 18 Aug 2014, 13 commits
  8. 15 Aug 2014, 3 commits
    • btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Committed by Chris Mason
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
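      
      The replacement flush is a single, non-blocking writeback kick
      (sketch; the wrapper name is illustrative):
      
          static int flush_old_file(struct inode *inode)
          {
                  /* start writeback on dirty pages, but don't wait for
                   * it and don't tie it to the transaction commit */
                  return filemap_flush(inode->i_mapping);
          }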
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Committed by Filipe Manana
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will return us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrectly using slot 0 makes us set next_offset to 39239680 and jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
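      
      The fix itself is small; an abridged sketch of the change in
      btrfs_csum_file_blocks():
      
          ret = btrfs_next_leaf(root, path);
          if (ret == 1)
                  found_next = 1;
          if (ret != 0)
                  goto insert;
          slot = path->slots[0];          /* was: slot = 0 */
          btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);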
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Committed by Takashi Iwai
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      
      ulist_add_merge() takes u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function overwrites
      64bit value on 32bit pointer.  This caused a NULL in the adjacent
      variable, in this case, prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of a u64.  There are still ugly void ** casts remaining
      in the callers because a void ** cannot be passed implicitly.  But
      it's safer than an explicit cast to u64, anyway.
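      
      A sketch of the wrapper, close to what the patch adds (assuming
      ulist_add_merge() takes a u64 aux and a u64 *old_aux):
      
          static inline int ulist_add_merge_ptr(struct ulist *ulist,
                                                u64 val, void *aux,
                                                void **old_aux,
                                                gfp_t gfp_mask)
          {
          #if BITS_PER_LONG == 32
                  /* bounce through a local u64 so the 64bit store can't
                   * clobber the variable next to a 32bit pointer */
                  u64 old64 = (uintptr_t)*old_aux;
                  int ret = ulist_add_merge(ulist, val, (uintptr_t)aux,
                                            &old64, gfp_mask);
                  *old_aux = (void *)(uintptr_t)old64;
                  return ret;
          #else
                  return ulist_add_merge(ulist, val, (u64)aux,
                                         (u64 *)old_aux, gfp_mask);
          #endif
          }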
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>