1. 11 3月, 2011 1 次提交
    • T
      SUNRPC: Close a race in __rpc_wait_for_completion_task() · bf294b41
      Trond Myklebust 提交于
      Although they run as rpciod background tasks, under normal operation
      (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
      and nfs4_do_close() want to be fully synchronous. This means that when we
      exit, we want all references to the rpc_task to be gone, and we want
      any dentry references etc. held by that task to be released.
      
      For this reason these functions call __rpc_wait_for_completion_task(),
      followed by rpc_put_task() in the expectation that the latter will be
      releasing the last reference to the rpc_task, and thus ensuring that the
      callback_ops->rpc_release() has been called synchronously.
      
      This patch fixes a race which exists due to the fact that
      rpciod calls rpc_complete_task() (in order to wake up the callers of
      __rpc_wait_for_completion_task()) and then subsequently calls
      rpc_put_task() without ensuring that these two steps are done atomically.
      
      In order to avoid adding new spin locks, the patch uses the existing
      waitqueue spin lock to order the rpc_task reference count releases between
      the waiting process and rpciod.
      The common case where nobody is waiting for completion is optimised for by
      checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
      reference count is 1: in those cases we drop trying to grab the spin lock,
      and immediately free up the rpc_task.
      
      Those few processes that need to put the rpc_task from inside an
      asynchronous context and that do not care about ordering are given a new
      helper: rpc_put_task_async().
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      bf294b41
  2. 05 3月, 2011 1 次提交
    • N
      nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) · e9e3d724
      Neil Horman 提交于
      The "bad_page()" page allocator sanity check was reported recently (call
      chain as follows):
      
        bad_page+0x69/0x91
        free_hot_cold_page+0x81/0x144
        skb_release_data+0x5f/0x98
        __kfree_skb+0x11/0x1a
        tcp_ack+0x6a3/0x1868
        tcp_rcv_established+0x7a6/0x8b9
        tcp_v4_do_rcv+0x2a/0x2fa
        tcp_v4_rcv+0x9a2/0x9f6
        do_timer+0x2df/0x52c
        ip_local_deliver+0x19d/0x263
        ip_rcv+0x539/0x57c
        netif_receive_skb+0x470/0x49f
        :virtio_net:virtnet_poll+0x46b/0x5c5
        net_rx_action+0xac/0x1b3
        __do_softirq+0x89/0x133
        call_softirq+0x1c/0x28
        do_softirq+0x2c/0x7d
        do_IRQ+0xec/0xf5
        default_idle+0x0/0x50
        ret_from_intr+0x0/0xa
        default_idle+0x29/0x50
        cpu_idle+0x95/0xb8
        start_kernel+0x220/0x225
        _sinittext+0x22f/0x236
      
      It occurs because an skb with a fraglist was freed from the tcp
      retransmit queue when it was acked, but a page on that fraglist had
      PG_Slab set (indicating it was allocated from the Slab allocator (which
      means the free path above can't safely free it via put_page.
      
      We tracked this back to an nfsv4 setacl operation, in which the nfs code
      attempted to fill convert the passed in buffer to an array of pages in
      __nfs4_proc_set_acl, which gets used by the skb->frags list in
      xs_sendpages.  __nfs4_proc_set_acl just converts each page in the buffer
      to a page struct via virt_to_page, but the vfs allocates the buffer via
      kmalloc, meaning the PG_slab bit is set.  We can't create a buffer with
      kmalloc and free it later in the tcp ack path with put_page, so we need
      to either:
      
      1) ensure that when we create the list of pages, no page struct has
         PG_Slab set
      
       or
      
      2) not use a page list to send this data
      
      Given that these buffers can be multiple pages and arbitrarily sized, I
      think (1) is the right way to go.  I've written the below patch to
      allocate a page from the buddy allocator directly and copy the data over
      to it.  This ensures that we have a put_page free-able page for every
      entry that winds up on an skb frag list, so it can be safely freed when
      the frame is acked.  We do a put page on each entry after the
      rpc_call_sync call so as to drop our own reference count to the page,
      leaving only the ref count taken by tcp_sendpages.  This way the data
      will be properly freed when the ack comes in
      
      Successfully tested by myself to solve the above oops.
      
      Note, as this is the result of a setacl operation that exceeded a page
      of data, I think this amounts to a local DOS triggerable by an
      uprivlidged user, so I'm CCing security on this as well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: security@kernel.org
      CC: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9e3d724
  3. 26 1月, 2011 1 次提交
  4. 12 1月, 2011 1 次提交
  5. 07 1月, 2011 7 次提交
  6. 05 1月, 2011 2 次提交
  7. 22 12月, 2010 2 次提交
  8. 08 12月, 2010 1 次提交
  9. 16 11月, 2010 1 次提交
  10. 29 10月, 2010 1 次提交
  11. 26 10月, 2010 1 次提交
  12. 25 10月, 2010 3 次提交
  13. 24 10月, 2010 3 次提交
    • B
      NFS: Readdir plus in v4 · 82f2e547
      Bryan Schumaker 提交于
      By requsting more attributes during a readdir, we can mimic the readdir plus
      operation that was in NFSv3.
      
      To test, I ran the command `ls -lU --color=none` on directories with various
      numbers of files.  Without readdir plus, I see this:
      
      n files |    100    |   1,000   |  10,000   |  100,000  | 1,000,000
      --------+-----------+-----------+-----------+-----------+----------
      real    | 0m00.153s | 0m00.589s | 0m05.601s | 0m56.691s | 9m59.128s
      user    | 0m00.007s | 0m00.007s | 0m00.077s | 0m00.703s | 0m06.800s
      sys     | 0m00.010s | 0m00.070s | 0m00.633s | 0m06.423s | 1m10.005s
      access  | 3         | 1         | 1         | 4         | 31
      getattr | 2         | 1         | 1         | 1         | 1
      lookup  | 104       | 1,003     | 10,003    | 100,003   | 1,000,003
      readdir | 2         | 16        | 158       | 1,575     | 15,749
      total   | 111       | 1,021     | 10,163    | 101,583   | 1,015,784
      
      With readdir plus enabled, I see this:
      
      n files |    100    |   1,000   |  10,000   |  100,000  | 1,000,000
      --------+-----------+-----------+-----------+-----------+----------
      real    | 0m00.115s | 0m00.206s | 0m01.079s | 0m12.521s | 2m07.528s
      user    | 0m00.003s | 0m00.003s | 0m00.040s | 0m00.290s | 0m03.296s
      sys     | 0m00.007s | 0m00.020s | 0m00.120s | 0m01.357s | 0m17.556s
      access  | 3         | 1         | 1         | 1         | 7
      getattr | 2         | 1         | 1         | 1         | 1
      lookup  | 4         | 3         | 3         | 3         | 3
      readdir | 6         | 62        | 630       | 6,300     | 62,993
      total   | 15        | 67        | 635       | 6,305     | 63,004
      
      Readdir plus disabled has about a 16x increase in the number of rpc calls and
      is 4 - 5 times slower on large directories.
      Signed-off-by: NBryan Schumaker <bjschuma@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      82f2e547
    • B
      NFS: readdir with vmapped pages · 56e4ebf8
      Bryan Schumaker 提交于
      We can use vmapped pages to read more information from the network at once.
      This will reduce the number of calls needed to complete a readdir.
      Signed-off-by: NBryan Schumaker <bjschuma@netapp.com>
      [trondmy: Added #include for linux/vmalloc.h> in fs/nfs/dir.c]
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      56e4ebf8
    • T
      NFSv4: The state manager must ignore EKEYEXPIRED. · 168667c4
      Trond Myklebust 提交于
      Otherwise, we cannot recover state correctly.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      168667c4
  14. 20 10月, 2010 2 次提交
    • T
      NFSv4: Don't call nfs4_state_mark_reclaim_reboot() from error handlers · ae1007d3
      Trond Myklebust 提交于
      In the case of a server reboot, the state recovery thread starts by calling
      nfs4_state_end_reclaim_reboot() in order to avoid edge conditions when
      the server reboots while the client is in the middle of recovery.
      
      However, if the client has already marked the nfs4_state as requiring
      reboot recovery, then the above behaviour will cause the recovery thread to
      treat the open as if it was part of such an edge condition: the open will
      be recovered as if it was part of a lease expiration (and all the locks
      will be lost).
      Fix is to remove the call to nfs4_state_mark_reclaim_reboot from
      nfs4_async_handle_error(), and nfs4_handle_exception(). Instead we leave it
      to the recovery thread to do this for us.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@kernel.org
      ae1007d3
    • T
      NFSv4: Fix open recovery · b0ed9dbc
      Trond Myklebust 提交于
      NFSv4 open recovery is currently broken: since we do not clear the
      state->flags states before attempting recovery, we end up with the
      'can_open_cached()' function triggering. This again leads to no OPEN call
      being put on the wire.
      Reported-by: NSachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@kernel.org
      b0ed9dbc
  15. 24 9月, 2010 1 次提交
  16. 22 9月, 2010 1 次提交
  17. 18 9月, 2010 3 次提交
  18. 17 9月, 2010 6 次提交
  19. 18 8月, 2010 1 次提交
    • T
      NFS: Fix an Oops in the NFSv4 atomic open code · 0a377cff
      Trond Myklebust 提交于
      Adam Lackorzynski reports:
      
      with 2.6.35.2 I'm getting this reproducible Oops:
      
      [  110.825396] BUG: unable to handle kernel NULL pointer dereference at
      (null)
      [  110.828638] IP: [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4
      [  110.828638] PGD be89f067 PUD bf18f067 PMD 0
      [  110.828638] Oops: 0000 [#1] SMP
      [  110.828638] last sysfs file: /sys/class/net/lo/operstate
      [  110.828638] CPU 2
      [  110.828638] Modules linked in: rtc_cmos rtc_core rtc_lib amd64_edac_mod
      i2c_amd756 edac_core i2c_core dm_mirror dm_region_hash dm_log dm_snapshot
      sg sr_mod usb_storage ohci_hcd mptspi tg3 mptscsih mptbase usbcore nls_base
      [last unloaded: scsi_wait_scan]
      [  110.828638]
      [  110.828638] Pid: 11264, comm: setchecksum Not tainted 2.6.35.2 #1
      [  110.828638] RIP: 0010:[<ffffffff811247b7>]  [<ffffffff811247b7>]
      encode_attrs+0x1a/0x2a4
      [  110.828638] RSP: 0000:ffff88003bf5b878  EFLAGS: 00010296
      [  110.828638] RAX: ffff8800bddb48a8 RBX: ffff88003bf5bb18 RCX:
      0000000000000000
      [  110.828638] RDX: ffff8800be258800 RSI: 0000000000000000 RDI:
      ffff88003bf5b9f8
      [  110.828638] RBP: 0000000000000000 R08: ffff8800bddb48a8 R09:
      0000000000000004
      [  110.828638] R10: 0000000000000003 R11: ffff8800be779000 R12:
      ffff8800be258800
      [  110.828638] R13: ffff88003bf5b9f8 R14: ffff88003bf5bb20 R15:
      ffff8800be258800
      [  110.828638] FS:  0000000000000000(0000) GS:ffff880041e00000(0063)
      knlGS:00000000556bd6b0
      [  110.828638] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
      [  110.828638] CR2: 0000000000000000 CR3: 00000000be8ef000 CR4:
      00000000000006e0
      [  110.828638] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [  110.828638] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [  110.828638] Process setchecksum (pid: 11264, threadinfo
      ffff88003bf5a000, task ffff88003f232210)
      [  110.828638] Stack:
      [  110.828638]  0000000000000000 ffff8800bfbcf920 0000000000000000
      0000000000000ffe
      [  110.828638] <0> 0000000000000000 0000000000000000 0000000000000000
      0000000000000000
      [  110.828638] <0> 0000000000000000 0000000000000000 0000000000000000
      0000000000000000
      [  110.828638] Call Trace:
      [  110.828638]  [<ffffffff81124c1f>] ? nfs4_xdr_enc_setattr+0x90/0xb4
      [  110.828638]  [<ffffffff81371161>] ? call_transmit+0x1c3/0x24a
      [  110.828638]  [<ffffffff813774d9>] ? __rpc_execute+0x78/0x22a
      [  110.828638]  [<ffffffff81371a91>] ? rpc_run_task+0x21/0x2b
      [  110.828638]  [<ffffffff81371b7e>] ? rpc_call_sync+0x3d/0x5d
      [  110.828638]  [<ffffffff8111e284>] ? _nfs4_do_setattr+0x11b/0x147
      [  110.828638]  [<ffffffff81109466>] ? nfs_init_locked+0x0/0x32
      [  110.828638]  [<ffffffff810ac521>] ? ifind+0x4e/0x90
      [  110.828638]  [<ffffffff8111e2fb>] ? nfs4_do_setattr+0x4b/0x6e
      [  110.828638]  [<ffffffff8111e634>] ? nfs4_do_open+0x291/0x3a6
      [  110.828638]  [<ffffffff8111ed81>] ? nfs4_open_revalidate+0x63/0x14a
      [  110.828638]  [<ffffffff811056c4>] ? nfs_open_revalidate+0xd7/0x161
      [  110.828638]  [<ffffffff810a2de4>] ? do_lookup+0x1a4/0x201
      [  110.828638]  [<ffffffff810a4733>] ? link_path_walk+0x6a/0x9d5
      [  110.828638]  [<ffffffff810a42b6>] ? do_last+0x17b/0x58e
      [  110.828638]  [<ffffffff810a5fbe>] ? do_filp_open+0x1bd/0x56e
      [  110.828638]  [<ffffffff811cd5e0>] ? _atomic_dec_and_lock+0x30/0x48
      [  110.828638]  [<ffffffff810a9b1b>] ? dput+0x37/0x152
      [  110.828638]  [<ffffffff810ae063>] ? alloc_fd+0x69/0x10a
      [  110.828638]  [<ffffffff81099f39>] ? do_sys_open+0x56/0x100
      [  110.828638]  [<ffffffff81027a22>] ? ia32_sysret+0x0/0x5
      [  110.828638] Code: 83 f1 01 e8 f5 ca ff ff 48 83 c4 50 5b 5d 41 5c c3 41
      57 41 56 41 55 49 89 fd 41 54 49 89 d4 55 48 89 f5 53 48 81 ec 18 01 00 00
      <8b> 06 89 c2 83 e2 08 83 fa 01 19 db 83 e3 f8 83 c3 18 a8 01 8d
      [  110.828638] RIP  [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4
      [  110.828638]  RSP <ffff88003bf5b878>
      [  110.828638] CR2: 0000000000000000
      [  112.840396] ---[ end trace 95282e83fd77358f ]---
      
      We need to ensure that the O_EXCL flag is turned off if the user doesn't
      set O_CREAT.
      
      Cc: stable@kernel.org
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      0a377cff
  20. 12 8月, 2010 1 次提交