1. 11 3月, 2011 1 次提交
    • T
      SUNRPC: Close a race in __rpc_wait_for_completion_task() · bf294b41
      Trond Myklebust 提交于
      Although they run as rpciod background tasks, under normal operation
      (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
      and nfs4_do_close() want to be fully synchronous. This means that when we
      exit, we want all references to the rpc_task to be gone, and we want
      any dentry references etc. held by that task to be released.
      
      For this reason these functions call __rpc_wait_for_completion_task(),
      followed by rpc_put_task() in the expectation that the latter will be
      releasing the last reference to the rpc_task, and thus ensuring that the
      callback_ops->rpc_release() has been called synchronously.
      
      This patch fixes a race which exists due to the fact that
      rpciod calls rpc_complete_task() (in order to wake up the callers of
      __rpc_wait_for_completion_task()) and then subsequently calls
      rpc_put_task() without ensuring that these two steps are done atomically.
      
      In order to avoid adding new spin locks, the patch uses the existing
      waitqueue spin lock to order the rpc_task reference count releases between
      the waiting process and rpciod.
      The common case where nobody is waiting for completion is optimised for by
      checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
      reference count is 1: in those cases we drop trying to grab the spin lock,
      and immediately free up the rpc_task.
      
      Those few processes that need to put the rpc_task from inside an
      asynchronous context and that do not care about ordering are given a new
      helper: rpc_put_task_async().
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      bf294b41
  2. 05 3月, 2011 1 次提交
    • N
      nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) · e9e3d724
      Neil Horman 提交于
      The "bad_page()" page allocator sanity check was reported recently (call
      chain as follows):
      
        bad_page+0x69/0x91
        free_hot_cold_page+0x81/0x144
        skb_release_data+0x5f/0x98
        __kfree_skb+0x11/0x1a
        tcp_ack+0x6a3/0x1868
        tcp_rcv_established+0x7a6/0x8b9
        tcp_v4_do_rcv+0x2a/0x2fa
        tcp_v4_rcv+0x9a2/0x9f6
        do_timer+0x2df/0x52c
        ip_local_deliver+0x19d/0x263
        ip_rcv+0x539/0x57c
        netif_receive_skb+0x470/0x49f
        :virtio_net:virtnet_poll+0x46b/0x5c5
        net_rx_action+0xac/0x1b3
        __do_softirq+0x89/0x133
        call_softirq+0x1c/0x28
        do_softirq+0x2c/0x7d
        do_IRQ+0xec/0xf5
        default_idle+0x0/0x50
        ret_from_intr+0x0/0xa
        default_idle+0x29/0x50
        cpu_idle+0x95/0xb8
        start_kernel+0x220/0x225
        _sinittext+0x22f/0x236
      
      It occurs because an skb with a fraglist was freed from the tcp
      retransmit queue when it was acked, but a page on that fraglist had
      PG_Slab set (indicating it was allocated from the Slab allocator (which
      means the free path above can't safely free it via put_page.
      
      We tracked this back to an nfsv4 setacl operation, in which the nfs code
      attempted to fill convert the passed in buffer to an array of pages in
      __nfs4_proc_set_acl, which gets used by the skb->frags list in
      xs_sendpages.  __nfs4_proc_set_acl just converts each page in the buffer
      to a page struct via virt_to_page, but the vfs allocates the buffer via
      kmalloc, meaning the PG_slab bit is set.  We can't create a buffer with
      kmalloc and free it later in the tcp ack path with put_page, so we need
      to either:
      
      1) ensure that when we create the list of pages, no page struct has
         PG_Slab set
      
       or
      
      2) not use a page list to send this data
      
      Given that these buffers can be multiple pages and arbitrarily sized, I
      think (1) is the right way to go.  I've written the below patch to
      allocate a page from the buddy allocator directly and copy the data over
      to it.  This ensures that we have a put_page free-able page for every
      entry that winds up on an skb frag list, so it can be safely freed when
      the frame is acked.  We do a put page on each entry after the
      rpc_call_sync call so as to drop our own reference count to the page,
      leaving only the ref count taken by tcp_sendpages.  This way the data
      will be properly freed when the ack comes in
      
      Successfully tested by myself to solve the above oops.
      
      Note, as this is the result of a setacl operation that exceeded a page
      of data, I think this amounts to a local DOS triggerable by an
      uprivlidged user, so I'm CCing security on this as well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: security@kernel.org
      CC: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9e3d724
  3. 29 1月, 2011 3 次提交
    • C
      NFS: NFSv4 readdir loses entries · d1205f87
      Chuck Lever 提交于
      On recent 2.6.38-rc kernels, connectathon basic test 6 fails on
      NFSv4 mounts of OpenSolaris with something like:
      
      > ./test6: readdir
      > 	./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.12' dir entry, pass 0
      > 	./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.82' dir entry, pass 0
      > 	./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.164' dir entry, pass 0
      > 	./test6: (/mnt/klimt/matisse.test) Test failed with 3 errors
      > basic tests failed
      > Tests failed, leaving /mnt/klimt mounted
      > [cel@matisse cthon04]$
      
      I narrowed the problem down to nfs4_decode_dirent() reporting that the
      decode buffer had overflowed while decoding the entries for those
      missing files.
      
      verify_attr_len() assumes both it's pointer arguments reside on the
      same page.  When these arguments point to locations on two different
      pages, verify_attr_len() can report false errors.  This can happen now
      that a large NFSv4 readdir result can span pages.
      
      We have reasonably good checking in nfs4_decode_dirent() anyway, so
      it should be safe to simply remove the extra checking.
      
      At a guess, this was introduced by commit 6650239a, "NFS: Don't use
      vm_map_ram() in readdir".
      
      Cc: stable@kernel.org [2.6.37]
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      d1205f87
    • C
      NFS: Micro-optimize nfs4_decode_dirent() · c08e76d0
      Chuck Lever 提交于
      Make the decoding of NFSv4 directory entries slightly more efficient
      by:
      
        1.  Avoiding unnecessary byte swapping when checking XDR booleans,
            and
      
        2.  Not bumping "p" when its value will be immediately replaced by
            xdr_inline_decode()
      
      This commit makes nfs4_decode_dirent() consistent with similar logic
      in the other two decode_dirent() functions.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      c08e76d0
    • T
      NFS: Fix an NFS client lockdep issue · e00b8a24
      Trond Myklebust 提交于
      There is no reason to be freeing the delegation cred in the rcu callback,
      and doing so is resulting in a lockdep complaint that rpc_credcache_lock
      is being called from both softirq and non-softirq contexts.
      Reported-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@kernel.org
      e00b8a24
  4. 26 1月, 2011 9 次提交
    • A
      NFS construct consistent co_ownerid for v4.1 · c7a360b0
      Andy Adamson 提交于
      As stated in section 2.4 of RFC 5661, subsequent instances of the client need
      to present the same co_ownerid. Concatinate the client's IP dot address,
      host name, and the rpc_auth pseudoflavor to form the co_ownerid.
      Signed-off-by: NAndy Adamson <andros@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      c7a360b0
    • T
      NFS: nfs_wcc_update_inode() should set nfsi->attr_gencount · 27dc1cd3
      Trond Myklebust 提交于
      If the call to nfs_wcc_update_inode() results in an attribute update, we
      need to ensure that the inode's attr_gencount gets bumped too, otherwise
      we are not protected against races with other GETATTR calls.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      27dc1cd3
    • A
      NFS improve pnfs_put_deviceid_cache debug print · b2a2897d
      Andy Adamson 提交于
      What we really want to know is the ref count.
      Signed-off-by: NAndy Adamson <andros@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      b2a2897d
    • A
      NFS fix cb_sequence error processing · 2c4cdf8f
      Andy Adamson 提交于
      Always assign the cb_process_state nfs_client pointer so a processing error
      in cb_sequence after the nfs_client is found and referenced returns
      a non-NULL cb_process_state nfs_client and the matching nfs_put_client in
      nfs4_callback_compound dereferences the client.
      Signed-off-by: NAndy Adamson <andros@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      2c4cdf8f
    • A
      NFS do not find client in NFSv4 pg_authenticate · 778be232
      Andy Adamson 提交于
      The information required to find the nfs_client cooresponding to the incoming
      back channel request is contained in the NFS layer. Perform minimal checking
      in the RPC layer pg_authenticate method, and push more detailed checking into
      the NFS layer where the nfs_client can be found.
      Signed-off-by: NAndy Adamson <andros@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      778be232
    • C
      NFS: Prevent memory allocation failure in nfsacl_encode() · f61f6da0
      Chuck Lever 提交于
      nfsacl_encode() allocates memory in certain cases.  This of course
      is not guaranteed to work.
      
      Since commit 9f06c719 "SUNRPC: New xdr_streams XDR encoder API", the
      kernel's XDR encoders can't return a result indicating possibly a
      failure, so a memory allocation failure in nfsacl_encode() has become
      fatal (ie, the XDR code Oopses) in some cases.
      
      However, the allocated memory is a tiny fixed amount, on the order
      of 40-50 bytes.  We can easily use a stack-allocated buffer for
      this, with only a wee bit of nose-holding.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      f61f6da0
    • C
      NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!" · ee5dc773
      Chuck Lever 提交于
      Milan Broz <mbroz@redhat.com> reports:
      
      > on today Linus' tree I get OOps if using nfs.
      >
      > server (2.6.36) exports dir:
      > /dir   172.16.1.0/24(rw,async,all_squash,no_subtree_check,anonuid=500,anongid=500)
      >
      > on client it is mounted  in fstab
      > server:/dir  /mnt/tst  nfs  rw,soft 0 0
      >
      > and these commands OOpses it (simplified from a configure script):
      >
      > cd /dir
      > touch x
      > install x y
      >
      > [  105.327701] ------------[ cut here ]------------
      > [  105.327979] kernel BUG at fs/nfs/nfs3xdr.c:1338!
      > [  105.328075] invalid opcode: 0000 [#1] PREEMPT SMP
      > [  105.328223] last sysfs file: /sys/devices/virtual/bdi/0:16/uevent
      > [  105.328349] Modules linked in: usbcore dm_mod
      > [  105.328553]
      > [  105.328678] Pid: 3710, comm: install Not tainted 2.6.37+ #423 440BX Desktop Reference Platform/VMware Virtual Platform
      > [  105.328853] EIP: 0060:[<c116c06c>] EFLAGS: 00010282 CPU: 0
      > [  105.329152] EIP is at nfs3_xdr_enc_setacl3args+0x61/0x98
      > [  105.329249] EAX: ffffffea EBX: ce941d98 ECX: 00000000 EDX: 00000004
      > [  105.329340] ESI: ce941cd0 EDI: 000000a4 EBP: ce941cc0 ESP: ce941cb4
      > [  105.329431]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      > [  105.329525] Process install (pid: 3710, ti=ce940000 task=ced36f20 task.ti=ce940000)
      > [  105.336600] Stack:
      > [  105.336693]  ce941cd0 ce9dc000 00000000 ce941cf8 c12ecd02 c12f43e0 c116c00b cf754158
      > [  105.336982]  ce9dc004 cf754284 ce9dc004 cf7ffee8 ceff9978 ce9dc000 cf7ffee8 ce9dc000
      > [  105.337182]  ce9dc000 ce941d14 c12e698d cf75412c ce941d98 cf7ffee8 cf7fff20 00000000
      > [  105.337405] Call Trace:
      > [  105.337695]  [<c12ecd02>] rpcauth_wrap_req+0x75/0x7f
      > [  105.337806]  [<c12f43e0>] ? xdr_encode_opaque+0x12/0x15
      > [  105.337898]  [<c116c00b>] ? nfs3_xdr_enc_setacl3args+0x0/0x98
      > [  105.337988]  [<c12e698d>] call_transmit+0x17e/0x1e8
      > [  105.338072]  [<c12ec307>] __rpc_execute+0x6d/0x1a6
      > [  105.338155]  [<c12ec474>] rpc_execute+0x34/0x37
      > [  105.338235]  [<c12e738d>] rpc_run_task+0xb5/0xbd
      > [  105.338316]  [<c12e7474>] rpc_call_sync+0x3d/0x58
      > [  105.338402]  [<c116d0c6>] nfs3_proc_setacls+0x18e/0x24f
      > [  105.338493]  [<c10b3f76>] ? __kmalloc+0x148/0x1c4
      > [  105.338579]  [<c10ecd01>] ? posix_acl_alloc+0x12/0x22
      > [  105.338665]  [<c116d5c8>] nfs3_proc_setacl+0xa0/0xca
      > [  105.338748]  [<c116d69c>] nfs3_setxattr+0x62/0x88
      > [  105.338834]  [<c1317042>] ? sub_preempt_count+0x7c/0x89
      > [  105.338926]  [<c116d63a>] ? nfs3_setxattr+0x0/0x88
      > [  105.339026]  [<c10cfa79>] __vfs_setxattr_noperm+0x26/0x95
      > [  105.339114]  [<c10cfb43>] vfs_setxattr+0x5b/0x76
      > [  105.339211]  [<c10cfbfb>] setxattr+0x9d/0xc3
      > [  105.339298]  [<c10a2ea8>] ? handle_pte_fault+0x258/0x5cb
      > [  105.339428]  [<c1091ff6>] ? __free_pages+0x1a/0x23
      > [  105.339517]  [<c10498ea>] ? up_read+0x16/0x2c
      > [  105.339599]  [<c10b8365>] ? fget+0x0/0xa3
      > [  105.339677]  [<c10b8365>] ? fget+0x0/0xa3
      > [  105.339760]  [<c1025d23>] ? get_parent_ip+0xb/0x31
      > [  105.339843]  [<c1317042>] ? sub_preempt_count+0x7c/0x89
      > [  105.339931]  [<c10cfc72>] sys_fsetxattr+0x51/0x79
      > [  105.340014]  [<c1002853>] sysenter_do_call+0x12/0x32
      > [  105.340133] Code: 2e 76 18 00 58 31 d2 8b 7f 28 f6 43 04 01 74 03 8b 53 08 6a 00 8b 46 04 6a 01 8b 0b 52 89 fa e8 85 10 f8 ff 83 c4 0c 85 c0 79 04 <0f> 0b eb fe 31 c9 f6 43 04 04 74 03 8b 4b 0c 68 00 10 00 00 8d
      > [  105.350321] EIP: [<c116c06c>] nfs3_xdr_enc_setacl3args+0x61/0x98 SS:ESP 0068:ce941cb4
      > [  105.364385] ---[ end trace 01fcfe7f0f7f6e4a ]---
      
      nfs3_xdr_enc_setacl3args() is not properly setting up the target
      buffer before nfsacl_encode() attempts to encode the ACL.
      
      Introduced by commit d9c407b1 "NFS: Introduce new-style XDR encoding
      functions for NFSv3."
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      ee5dc773
    • C
      NFS: Fix "kernel BUG at fs/aio.c:554!" · 839f7ad6
      Chuck Lever 提交于
      Nick Piggin reports:
      
      > I'm getting use after frees in aio code in NFS
      >
      > [ 2703.396766] Call Trace:
      > [ 2703.396858]  [<ffffffff8100b057>] ? native_sched_clock+0x27/0x80
      > [ 2703.396959]  [<ffffffff8108509e>] ? put_lock_stats+0xe/0x40
      > [ 2703.397058]  [<ffffffff81088348>] ? lock_release_holdtime+0xa8/0x140
      > [ 2703.397159]  [<ffffffff8108a2a5>] lock_acquire+0x95/0x1b0
      > [ 2703.397260]  [<ffffffff811627db>] ? aio_put_req+0x2b/0x60
      > [ 2703.397361]  [<ffffffff81039701>] ? get_parent_ip+0x11/0x50
      > [ 2703.397464]  [<ffffffff81612a31>] _raw_spin_lock_irq+0x41/0x80
      > [ 2703.397564]  [<ffffffff811627db>] ? aio_put_req+0x2b/0x60
      > [ 2703.397662]  [<ffffffff811627db>] aio_put_req+0x2b/0x60
      > [ 2703.397761]  [<ffffffff811647fe>] do_io_submit+0x2be/0x7c0
      > [ 2703.397895]  [<ffffffff81164d0b>] sys_io_submit+0xb/0x10
      > [ 2703.397995]  [<ffffffff8100307b>] system_call_fastpath+0x16/0x1b
      >
      > Adding some tracing, it is due to nfs completing the request then
      > returning something other than -EIOCBQUEUED, so aio.c
      > also completes the request.
      
      To address this, prevent the NFS direct I/O engine from completing
      async iocbs when the forward path returns an error without starting
      any I/O.
      
      This fix appears to survive ^C during both "xfstest no. 208" and "fsx
      -Z."
      
      It's likely this bug has existed for a very long while, as we are seeing
      very similar symptoms in OEL 5.  Copying stable.
      
      Cc: Stable <stable@kernel.org>
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      839f7ad6
    • J
      NFS4: Avoid potential NULL pointer dereference in decode_and_add_ds(). · ad3d2eed
      Jesper Juhl 提交于
      On Mon, 17 Jan 2011, Mi Jinlong wrote:
      
      >
      >
      > Jesper Juhl:
      > > strrchr() can return NULL if nothing is found. If this happens we'll
      > > dereference a NULL pointer in
      > > fs/nfs/nfs4filelayoutdev.c::decode_and_add_ds().
      > >
      > > I tried to find some other code that guarantees that this can never
      > > happen but I was unsuccessful. So, unless someone else can point to some
      > > code that ensures this can never be a problem, I believe this patch is
      > > needed.
      > >
      > > While I was changing this code I also noticed that all the dprintk()
      > > statements, except one, start with "%s:". The one missing the ":" I added
      > > it to.
      >
      >   Maybe another one also should be changed at decode_and_add_ds() at line 243:
      >
      >    243  printk("%s Decoded address and port %s\n", __func__, buf);
      >
      Missed that one. Thanks.
      Signed-off-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      ad3d2eed
  5. 20 1月, 2011 1 次提交
  6. 16 1月, 2011 3 次提交
    • D
      Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      David Howells 提交于
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea5b778a
    • D
      NFS: Use d_automount() rather than abusing follow_link() · 36d43a43
      David Howells 提交于
      Make NFS use the new d_automount() dentry operation rather than abusing
      follow_link() on directories.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      36d43a43
    • D
      Add a dentry op to allow processes to be held during pathwalk transit · cc53ce53
      David Howells 提交于
      Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
      sleep when it tries to transit away from one of that filesystem's directories
      during a pathwalk.  The operation is keyed off a new dentry flag
      (DCACHE_MANAGE_TRANSIT).
      
      The filesystem is allowed to be selective about which processes it holds and
      which it permits to continue on or prohibits from transiting from each flagged
      directory.  This will allow autofs to hold up client processes whilst letting
      its userspace daemon through to maintain the directory or the stuff behind it
      or mounted upon it.
      
      The ->d_manage() dentry operation:
      
      	int (*d_manage)(struct path *path, bool mounting_here);
      
      takes a pointer to the directory about to be transited away from and a flag
      indicating whether the transit is undertaken by do_add_mount() or
      do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
      
      It should return 0 if successful and to let the process continue on its way;
      -EISDIR to prohibit the caller from skipping to overmounted filesystems or
      automounting, and to use this directory; or some other error code to return to
      the user.
      
      ->d_manage() is called with namespace_sem writelocked if mounting_here is true
      and no other locks held, so it may sleep.  However, if mounting_here is true,
      it may not initiate or wait for a mount or unmount upon the parameter
      directory, even if the act is actually performed by userspace.
      
      Within fs/namei.c, follow_managed() is extended to check with d_manage() first
      on each managed directory, before transiting away from it or attempting to
      automount upon it.
      
      follow_down() is renamed follow_down_one() and should only be used where the
      filesystem deliberately intends to avoid management steps (e.g. autofs).
      
      A new follow_down() is added that incorporates the loop done by all other
      callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
      and CIFS do use it, their use is removed by converting them to use
      d_automount()).  The new follow_down() calls d_manage() as appropriate.  It
      also takes an extra parameter to indicate if it is being called from mount code
      (with namespace_sem writelocked) which it passes to d_manage().  follow_down()
      ignores automount points so that it can be used to mount on them.
      
      __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
      DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
      sleep.  It would be possible to enter d_manage() in rcu-walk mode too, and have
      that determine whether to abort or not itself.  That would allow the autofs
      daemon to continue on in rcu-walk mode.
      
      Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
      required as every tranist from that directory will cause d_manage() to be
      invoked.  It can always be set again when necessary.
      
      ==========================
      WHAT THIS MEANS FOR AUTOFS
      ==========================
      
      Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
      trigger the automounting of indirect mounts, and both of these can be called
      with i_mutex held.
      
      autofs knows that the i_mutex will be held by the caller in lookup(), and so
      can drop it before invoking the daemon - but this isn't so for d_revalidate(),
      since the lock is only held on _some_ of the code paths that call it.  This
      means that autofs can't risk dropping i_mutex from its d_revalidate() function
      before it calls the daemon.
      
      The bug could manifest itself as, for example, a process that's trying to
      validate an automount dentry that gets made to wait because that dentry is
      expired and needs cleaning up:
      
      	mkdir         S ffffffff8014e05a     0 32580  24956
      	Call Trace:
      	 [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
      	 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58
      	 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
      	 [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
      	 [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
      	 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
      	 [<ffffffff80057a2f>] lookup_create+0x46/0x80
      	 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
      
      versus the automount daemon which wants to remove that dentry, but can't
      because the normal process is holding the i_mutex lock:
      
      	automount     D ffffffff8014e05a     0 32581      1              32561
      	Call Trace:
      	 [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
      	 [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
      	 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
      	 [<ffffffff800e6d55>] do_rmdir+0x77/0xde
      	 [<ffffffff8005d229>] tracesys+0x71/0xe0
      	 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      
      which means that the system is deadlocked.
      
      This patch allows autofs to hold up normal processes whilst the daemon goes
      ahead and does things to the dentry tree behind the automouter point without
      risking a deadlock as almost no locks are held in d_manage() and none in
      d_automount().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Was-Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cc53ce53
  7. 14 1月, 2011 2 次提交
  8. 13 1月, 2011 1 次提交
  9. 12 1月, 2011 1 次提交
  10. 11 1月, 2011 1 次提交
    • T
      NFS: Don't use vm_map_ram() in readdir · 6650239a
      Trond Myklebust 提交于
      vm_map_ram() is not available on NOMMU platforms, and causes trouble
      on incoherrent architectures such as ARM when we access the page data
      through both the direct and the virtual mapping.
      
      The alternative is to use the direct mapping to access page data
      for the case when we are not crossing a page boundary, but to copy
      the data into a linear scratch buffer when we are accessing data
      that spans page boundaries.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Tested-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      Cc: stable@kernel.org  [2.6.37]
      6650239a
  11. 07 1月, 2011 17 次提交
    • N
      fs: dcache per-inode inode alias locking · 873feea0
      Nick Piggin 提交于
      dcache_inode_lock can be replaced with per-inode locking. Use existing
      inode->i_lock for this. This is slightly non-trivial because we sometimes
      need to find the inode from the dentry, which requires d_inode to be
      stabilised (either with refcount or d_lock).
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      873feea0
    • N
      fs: provide rcu-walk aware permission i_ops · b74c79e9
      Nick Piggin 提交于
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b74c79e9
    • N
      fs: rcu-walk aware d_revalidate method · 34286d66
      Nick Piggin 提交于
      Require filesystems be aware of .d_revalidate being called in rcu-walk
      mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
      -ECHILD from all implementations.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      34286d66
    • N
      fs: dcache reduce branches in lookup path · fb045adb
      Nick Piggin 提交于
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fb045adb
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
    • N
      fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin 提交于
      dcache_lock no longer protects anything. remove it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • N
      fs: Use rename lock and RCU for multi-step operations · 949854d0
      Nick Piggin 提交于
      The remaining usages for dcache_lock is to allow atomic, multi-step read-side
      operations over the directory tree by excluding modifications to the tree.
      Also, to walk in the leaf->root direction in the tree where we don't have
      a natural d_lock ordering.
      
      This could be accomplished by taking every d_lock, but this would mean a
      huge number of locks and actually gets very tricky.
      
      Solve this instead by using the rename seqlock for multi-step read-side
      operations, retry in case of a rename so we don't walk up the wrong parent.
      Concurrent dentry insertions are not serialised against.  Concurrent deletes
      are tricky when walking up the directory: our parent might have been deleted
      when dropping locks so also need to check and retry for that.
      
      We can also use the rename lock in cases where livelock is a worry (and it
      is introduced in subsequent patch).
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      949854d0
    • N
      fs: scale inode alias list · b23fb0a6
      Nick Piggin 提交于
      Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
      from concurrent modification. d_alias is also protected by d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b23fb0a6
    • N
      fs: dcache scale dentry refcount · b7ab39f6
      Nick Piggin 提交于
      Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
      0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
      we start protecting many other dentry members with d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b7ab39f6
    • N
      fs: change d_delete semantics · fe15ce44
      Nick Piggin 提交于
      Change d_delete from a dentry deletion notification to a dentry caching
      advise, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fe15ce44
    • T
      NFSv4: Ensure continued open and lockowner name uniqueness · d035c36c
      Trond Myklebust 提交于
      In order to enable migration support, we will want to move some of the
      structures that are subject to migration into the struct nfs_server.
      In particular, if we are to move the state_owner and state_owner_id to
      being a per-filesystem structure, then we should label the resulting
      open/lock owners with a per-filesytem label to ensure global uniqueness.
      
      This patch does so by adding the super block s_dev to the open/lock owner
      name.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      d035c36c
    • C
      NFS: Move cl_delegations to the nfs_server struct · d3978bb3
      Chuck Lever 提交于
      Delegations are per-inode, not per-nfs_client.  When a server file
      system is migrated, delegations on the client must be moved from the
      source to the destination nfs_server.  Make it easier to manage a
      mount point's delegation list across a migration event by moving the
      list to the nfs_server struct.
      
      Clean up: I added documenting comments to public functions I changed
      in this patch.  For consistency I added comments to all the other
      public functions in fs/nfs/delegation.c.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      d3978bb3
    • C
      NFS: Introduce nfs_detach_delegations() · dda4b225
      Chuck Lever 提交于
      Clean up:  Refactor code that takes clp->cl_lock and calls
      nfs_detach_delegations_locked() into its own function.
      
      While we're changing the call sites, get rid of the second parameter
      and the logic in nfs_detach_delegations_locked() that uses it, since
      callers always set that parameter of nfs_detach_delegations_locked()
      to NULL.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      dda4b225
    • C
      NFS: Move cl_state_owners and related fields to the nfs_server struct · 24d292b8
      Chuck Lever 提交于
      NFSv4 migration needs to reassociate state owners from the source to
      the destination nfs_server data structures.  To make that easier, move
      the cl_state_owners field to the nfs_server struct.  cl_openowner_id
      and cl_lockowner_id accompany this move, as they are used in
      conjunction with cl_state_owners.
      
      The cl_lock field in the parent nfs_client continues to protect all
      three of these fields.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      24d292b8
    • C
      NFS: Allow walking nfs_client.cl_superblocks list outside client.c · fca5238e
      Chuck Lever 提交于
      We're about to move some fields from struct nfs_client to struct
      nfs_server.  There is a many-to-one relationship between nfs_servers
      and nfs_clients.  After these fields are moved to the nfs_server
      struct, to visit all of the data in these fields that is owned by one
      nfs_client, code will need to visit each nfs_server on the
      cl_superblocks list for that nfs_client.
      
      To serialize changes to the cl_superblocks list during these little
      expeditions, protect the list with RCU.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      fca5238e
    • F
      pnfs: layout roc code · f7e8917a
      Fred Isaman 提交于
      A layout can request return-on-close.  How this interacts with the
      forgetful model of never sending LAYOUTRETURNS is a bit ambiguous.
      We forget any layouts marked roc, and wait for them to be completely
      forgotten before continuing with the close.  In addition, to compensate
      for races with any inflight LAYOUTGETs, and the fact that we do not get
      any layout stateid back from the server, we set the barrier to the worst
      case scenario of current_seqid + number of outstanding LAYOUTGETS.
      Signed-off-by: NFred Isaman <iisaman@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      f7e8917a
    • A
      pnfs: update nfs4_callback_recallany to handle layouts · 36840370
      Alexandros Batsakis 提交于
      While here, update the code a bit.
      Signed-off-by: NAlexandros Batsakis <batsakis@netapp.com>
      Signed-off-by: NFred Isaman <iisaman@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      36840370