1. 16 3月, 2011 1 次提交
    • S
      ceph: preserve I_COMPLETE across rename · 09adc80c
      Sage Weil 提交于
      d_move puts the renamed dentry at the end of d_subdirs, screwing with our
      cached dentry directory offsets.  We were just clearing I_COMPLETE to avoid
      any possibility of trouble.  However, assigning the renamed dentry an
      offset at the end of the directory (to match it's new d_subdirs position)
      is sufficient to maintain correct behavior and hold onto I_COMPLETE.
      
      This is especially important for workloads like rsync, which renames files
      into place.  Before, we would lose I_COMPLETE and do MDS lookups for each
      file.  With this patch we only talk to the MDS on create and rename.
      Signed-off-by: NSage Weil <sage@newdream.net>
      09adc80c
  2. 15 3月, 2011 1 次提交
    • T
      Fix corrupted OSF partition table parsing · 1eafbfeb
      Timo Warns 提交于
      The kernel automatically evaluates partition tables of storage devices.
      The code for evaluating OSF partitions contains a bug that leaks data
      from kernel heap memory to userspace for certain corrupted OSF
      partitions.
      
      In more detail:
      
        for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {
      
      iterates from 0 to d_npartitions - 1, where d_npartitions is read from
      the partition table without validation and partition is a pointer to an
      array of at most 8 d_partitions.
      
      Add the proper and obvious validation.
      Signed-off-by: NTimo Warns <warns@pre-sense.de>
      Cc: stable@kernel.org
      [ Changed the patch trivially to not repeat the whole le16_to_cpu()
        thing, and to use an explicit constant for the magic value '8' ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1eafbfeb
  3. 14 3月, 2011 1 次提交
  4. 12 3月, 2011 7 次提交
    • C
      Btrfs: break out of shrink_delalloc earlier · 36e39c40
      Chris Mason 提交于
      Josef had changed shrink_delalloc to exit after three shrink
      attempts, which wasn't quite enough because new writers could
      race in and steal free space.
      
      But it also fixed deadlocks and stalls as we tried to recover
      delalloc reservations.  The code was tweaked to loop 1024
      times, and would reset the counter any time a small amount
      of progress was made.  This was too drastic, and with a
      lot of writers we can end up stuck in shrink_delalloc forever.
      
      The shrink_delalloc loop is fairly complex because the caller is looping
      too, and the caller will go ahead and force a transaction commit to make
      sure we reclaim space.
      
      This reworks things to exit shrink_delalloc when we've forced some
      writeback and the delalloc reservations have gone down.  This means
      the writeback has not just started but has also finished at
      least some of the metadata changes required to reclaim delalloc
      space.
      
      If we've got this wrong, we're returning ENOSPC too early, which
      is a big improvement over the current behavior of hanging the machine.
      
      Test 224 in xfstests hammers on this nicely, and with 1000 writers
      trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
      other writers are able to continue until we get 100%.
      
      This is a worst case test for btrfs because the 1000 writers are doing
      small IO, and the small FS size means we don't have a lot of room
      for metadata chunks.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      36e39c40
    • C
      NFS: NFSROOT should default to "proto=udp" · 53d47375
      Chuck Lever 提交于
      There have been a number of recent reports that NFSROOT is no longer
      working with default mount options, but fails only with certain NICs.
      
      Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS:
      Use super.c for NFSROOT mount option parsing".  Among other things,
      this commit changes the default mount options for NFSROOT to use TCP
      instead of UDP as the underlying transport.
      
      TCP seems less able to deal with NICs that are slow to initialize.
      The system logs that have accompanied reports of problems all show
      that NFSROOT attempts to establish a TCP connection before the NIC is
      fully initialized, and thus the TCP connection attempt fails.
      
      When a TCP connection attempt fails during a mount operation, the
      NFS stack needs to fail the operation.  Usually user space knows how
      and when to retry it.  The network layer does not report a distinct
      error code for this particular failure mode.  Thus, there isn't a
      clean way for the RPC client to see that it needs to retry in this
      case, but not in others.
      
      Because NFSROOT is used in some environments where it is not possible
      to update the kernel command line to specify "udp", the proper thing
      to do is change NFSROOT to use UDP by default, as it did before commit
      56463e50.
      
      To make it easier to see how to change default mount options for
      NFSROOT and to distinguish default settings from mandatory settings,
      I've adjusted a couple of areas to document the specifics.
      
      root_nfs_cat() is also modified to deal with commas properly when
      concatenating strings containing mount option lists.  This keeps
      root_nfs_cat() call sites simpler, now that we may be concatenating
      multiple mount option strings.
      Tested-by: NBrian Downing <bdowning@lavos.net>
      Tested-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org> # 2.6.37
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      53d47375
    • H
      nfs4: remove duplicated #include · 57df216b
      Huang Weiyi 提交于
      Remove duplicated #include('s) in
        fs/nfs/nfs4proc.c
      Signed-off-by: NHuang Weiyi <weiyi.huang@gmail.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      57df216b
    • T
      NFSv4: nfs4_state_mark_reclaim_nograce() should be static · f9feab1e
      Trond Myklebust 提交于
      There are no more external users of nfs4_state_mark_reclaim_nograce() or
      nfs4_state_mark_reclaim_reboot(), so mark them as static.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      f9feab1e
    • T
      NFSv4: Fix the setlk error handler · ecac799a
      Trond Myklebust 提交于
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      ecac799a
    • T
      NFSv4.1: Fix the handling of the SEQUENCE status bits · b4410c2f
      Trond Myklebust 提交于
      We want SEQUENCE status bits to be handled by the state manager in order
      to avoid threading issues.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      b4410c2f
    • T
      NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses · 0400a6b0
      Trond Myklebust 提交于
      nfs4_schedule_state_recovery() should only be used when we need to force
      the state manager to check the lease. If we just want to start the
      state manager in order to handle a state recovery situation, we should be
      using nfs4_schedule_state_manager().
      
      This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
      its use with a set of helper functions that do the right thing.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      0400a6b0
  5. 11 3月, 2011 10 次提交
  6. 10 3月, 2011 10 次提交
  7. 09 3月, 2011 3 次提交
  8. 08 3月, 2011 3 次提交
    • A
      unfuck proc_sysctl ->d_compare() · dfef6dcd
      Al Viro 提交于
      a) struct inode is not going to be freed under ->d_compare();
      however, the thing PROC_I(inode)->sysctl points to just might.
      Fortunately, it's enough to make freeing that sucker delayed,
      provided that we don't step on its ->unregistering, clear
      the pointer to it in PROC_I(inode) before dropping the reference
      and check if it's NULL in ->d_compare().
      
      b) I'm not sure that we *can* walk into NULL inode here (we recheck
      dentry->seq between verifying that it's still hashed / fetching
      dentry->d_inode and passing it to ->d_compare() and there's no
      negative hashed dentries in /proc/sys/*), but if we can walk into
      that, we really should not have ->d_compare() return 0 on it!
      Said that, I really suspect that this check can be simply killed.
      Nick?
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      dfef6dcd
    • J
      nfsd4: fix bad pointer on failure to find delegation · 32b007b4
      J. Bruce Fields 提交于
      In case of a nonempty list, the return on error here is obviously bogus;
      it ends up being a pointer to the list head instead of to any valid
      delegation on the list.
      
      In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
      then renew_client may oops, and it may take an embarassingly long time to
      figure out why.  Facepalm.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
      IP: [<ffffffff81292965>] nfsd4_delegreturn+0x125/0x200
      ...
      
      Cc: stable@kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      32b007b4
    • C
      Btrfs: deal with short returns from copy_from_user · 31339acd
      Chris Mason 提交于
      When copy_from_user is only able to copy some of the bytes we requested,
      we may end up creating a partially up to date page.  To avoid garbage in
      the page, we need to treat a partial copy as a zero length copy.
      
      This makes the rest of the file_write code drop the page and
      retry the whole copy instead of marking the partially up to
      date page as dirty.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cc: stable@kernel.org
      31339acd
  9. 07 3月, 2011 1 次提交
    • C
      Btrfs: fix regressions in copy_from_user handling · b1bf862e
      Chris Mason 提交于
      Commit 914ee295 fixed deadlocks in
      btrfs_file_write where we would catch page faults on pages we had
      locked.
      
      But, there were a few problems:
      
      1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
      data when the amount to copy is more than 4K and the offset to start
      copying from is not page aligned.  The result was btrfs_file_write
      looping forever retrying the iov_iter_copy_from_user_atomic
      
      We deal with this by changing btrfs_file_write to drop down to single
      page copies when iov_iter_copy_from_user_atomic starts returning failure.
      
      2) The btrfs_file_write code was leaking delalloc reservations when
      iov_iter_copy_from_user_atomic returned zero.  The looping above would
      result in the entire filesystem running out of delalloc reservations and
      constantly trying to flush things to disk.
      
      3) btrfs_file_write will lock down page cache pages, make sure
      any writeback is finished, do the copy_from_user and then release them.
      Before the loop runs we check the first and last pages in the write to
      see if they are only being partially modified.  If the start or end of
      the write isn't aligned, we make sure the corresponding pages are
      up to date so that we don't introduce garbage into the file.
      
      With the copy_from_user changes, we're allowing the VM to reclaim the
      pages after a partial update from copy_from_user, but we're not
      making sure the page cache page is up to date when we loop around to
      resume the write.
      
      We deal with this by pushing the up to date checks down into the page
      prep code.  This fits better with how the rest of file_write works.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      cc: stable@kernel.org
      b1bf862e
  10. 05 3月, 2011 3 次提交
    • N
      nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) · e9e3d724
      Neil Horman 提交于
      The "bad_page()" page allocator sanity check was reported recently (call
      chain as follows):
      
        bad_page+0x69/0x91
        free_hot_cold_page+0x81/0x144
        skb_release_data+0x5f/0x98
        __kfree_skb+0x11/0x1a
        tcp_ack+0x6a3/0x1868
        tcp_rcv_established+0x7a6/0x8b9
        tcp_v4_do_rcv+0x2a/0x2fa
        tcp_v4_rcv+0x9a2/0x9f6
        do_timer+0x2df/0x52c
        ip_local_deliver+0x19d/0x263
        ip_rcv+0x539/0x57c
        netif_receive_skb+0x470/0x49f
        :virtio_net:virtnet_poll+0x46b/0x5c5
        net_rx_action+0xac/0x1b3
        __do_softirq+0x89/0x133
        call_softirq+0x1c/0x28
        do_softirq+0x2c/0x7d
        do_IRQ+0xec/0xf5
        default_idle+0x0/0x50
        ret_from_intr+0x0/0xa
        default_idle+0x29/0x50
        cpu_idle+0x95/0xb8
        start_kernel+0x220/0x225
        _sinittext+0x22f/0x236
      
      It occurs because an skb with a fraglist was freed from the tcp
      retransmit queue when it was acked, but a page on that fraglist had
      PG_Slab set (indicating it was allocated from the Slab allocator (which
      means the free path above can't safely free it via put_page.
      
      We tracked this back to an nfsv4 setacl operation, in which the nfs code
      attempted to fill convert the passed in buffer to an array of pages in
      __nfs4_proc_set_acl, which gets used by the skb->frags list in
      xs_sendpages.  __nfs4_proc_set_acl just converts each page in the buffer
      to a page struct via virt_to_page, but the vfs allocates the buffer via
      kmalloc, meaning the PG_slab bit is set.  We can't create a buffer with
      kmalloc and free it later in the tcp ack path with put_page, so we need
      to either:
      
      1) ensure that when we create the list of pages, no page struct has
         PG_Slab set
      
       or
      
      2) not use a page list to send this data
      
      Given that these buffers can be multiple pages and arbitrarily sized, I
      think (1) is the right way to go.  I've written the below patch to
      allocate a page from the buddy allocator directly and copy the data over
      to it.  This ensures that we have a put_page free-able page for every
      entry that winds up on an skb frag list, so it can be safely freed when
      the frame is acked.  We do a put page on each entry after the
      rpc_call_sync call so as to drop our own reference count to the page,
      leaving only the ref count taken by tcp_sendpages.  This way the data
      will be properly freed when the ack comes in
      
      Successfully tested by myself to solve the above oops.
      
      Note, as this is the result of a setacl operation that exceeded a page
      of data, I think this amounts to a local DOS triggerable by an
      uprivlidged user, so I'm CCing security on this as well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: security@kernel.org
      CC: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9e3d724
    • S
      ceph: no .snap inside of snapped namespace · 455cec0a
      Sage Weil 提交于
      Otherwise you can do things like
      
      # mkdir .snap/foo
      # cd .snap/foo/.snap
      # ls
      <badness>
      Signed-off-by: NSage Weil <sage@newdream.net>
      455cec0a
    • A
      minimal fix for do_filp_open() race · 1858efd4
      Al Viro 提交于
      failure exits on the no-O_CREAT side of do_filp_open() merge with
      those of O_CREAT one; unfortunately, if do_path_lookup() returns
      -ESTALE, we'll get out_filp:, notice that we are about to return
      -ESTALE without having trying to create the sucker with LOOKUP_REVAL
      and jump right into the O_CREAT side of code.  And proceed to try
      and create a file.  Usually that'll fail with -ESTALE again, but
      we can race and get that attempt of pathname resolution to succeed.
      
      open() without O_CREAT really shouldn't end up creating files, races
      or not.  The real fix is to rearchitect the whole do_filp_open(),
      but for now splitting the failure exits will do.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1858efd4