1. 12 3月, 2011 1 次提交
    • C
      Btrfs: break out of shrink_delalloc earlier · 36e39c40
      Chris Mason 提交于
      Josef had changed shrink_delalloc to exit after three shrink
      attempts, which wasn't quite enough because new writers could
      race in and steal free space.
      
      But it also fixed deadlocks and stalls as we tried to recover
      delalloc reservations.  The code was tweaked to loop 1024
      times, and would reset the counter any time a small amount
      of progress was made.  This was too drastic, and with a
      lot of writers we can end up stuck in shrink_delalloc forever.
      
      The shrink_delalloc loop is fairly complex because the caller is looping
      too, and the caller will go ahead and force a transaction commit to make
      sure we reclaim space.
      
      This reworks things to exit shrink_delalloc when we've forced some
      writeback and the delalloc reservations have gone down.  This means
      the writeback has not just started but has also finished at
      least some of the metadata changes required to reclaim delalloc
      space.
      
      If we've got this wrong, we're returning ENOSPC too early, which
      is a big improvement over the current behavior of hanging the machine.
      
      Test 224 in xfstests hammers on this nicely, and with 1000 writers
      trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
      other writers are able to continue until we get 100%.
      
      This is a worst case test for btrfs because the 1000 writers are doing
      small IO, and the small FS size means we don't have a lot of room
      for metadata chunks.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      36e39c40
  2. 11 3月, 2011 2 次提交
  3. 10 3月, 2011 10 次提交
  4. 09 3月, 2011 3 次提交
  5. 08 3月, 2011 3 次提交
    • A
      unfuck proc_sysctl ->d_compare() · dfef6dcd
      Al Viro 提交于
      a) struct inode is not going to be freed under ->d_compare();
      however, the thing PROC_I(inode)->sysctl points to just might.
      Fortunately, it's enough to make freeing that sucker delayed,
      provided that we don't step on its ->unregistering, clear
      the pointer to it in PROC_I(inode) before dropping the reference
      and check if it's NULL in ->d_compare().
      
      b) I'm not sure that we *can* walk into NULL inode here (we recheck
      dentry->seq between verifying that it's still hashed / fetching
      dentry->d_inode and passing it to ->d_compare() and there's no
      negative hashed dentries in /proc/sys/*), but if we can walk into
      that, we really should not have ->d_compare() return 0 on it!
      Said that, I really suspect that this check can be simply killed.
      Nick?
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      dfef6dcd
    • J
      nfsd4: fix bad pointer on failure to find delegation · 32b007b4
      J. Bruce Fields 提交于
      In case of a nonempty list, the return on error here is obviously bogus;
      it ends up being a pointer to the list head instead of to any valid
      delegation on the list.
      
      In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
      then renew_client may oops, and it may take an embarassingly long time to
      figure out why.  Facepalm.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
      IP: [<ffffffff81292965>] nfsd4_delegreturn+0x125/0x200
      ...
      
      Cc: stable@kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      32b007b4
    • C
      Btrfs: deal with short returns from copy_from_user · 31339acd
      Chris Mason 提交于
      When copy_from_user is only able to copy some of the bytes we requested,
      we may end up creating a partially up to date page.  To avoid garbage in
      the page, we need to treat a partial copy as a zero length copy.
      
      This makes the rest of the file_write code drop the page and
      retry the whole copy instead of marking the partially up to
      date page as dirty.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cc: stable@kernel.org
      31339acd
  6. 07 3月, 2011 1 次提交
    • C
      Btrfs: fix regressions in copy_from_user handling · b1bf862e
      Chris Mason 提交于
      Commit 914ee295 fixed deadlocks in
      btrfs_file_write where we would catch page faults on pages we had
      locked.
      
      But, there were a few problems:
      
      1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
      data when the amount to copy is more than 4K and the offset to start
      copying from is not page aligned.  The result was btrfs_file_write
      looping forever retrying the iov_iter_copy_from_user_atomic
      
      We deal with this by changing btrfs_file_write to drop down to single
      page copies when iov_iter_copy_from_user_atomic starts returning failure.
      
      2) The btrfs_file_write code was leaking delalloc reservations when
      iov_iter_copy_from_user_atomic returned zero.  The looping above would
      result in the entire filesystem running out of delalloc reservations and
      constantly trying to flush things to disk.
      
      3) btrfs_file_write will lock down page cache pages, make sure
      any writeback is finished, do the copy_from_user and then release them.
      Before the loop runs we check the first and last pages in the write to
      see if they are only being partially modified.  If the start or end of
      the write isn't aligned, we make sure the corresponding pages are
      up to date so that we don't introduce garbage into the file.
      
      With the copy_from_user changes, we're allowing the VM to reclaim the
      pages after a partial update from copy_from_user, but we're not
      making sure the page cache page is up to date when we loop around to
      resume the write.
      
      We deal with this by pushing the up to date checks down into the page
      prep code.  This fits better with how the rest of file_write works.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      cc: stable@kernel.org
      b1bf862e
  7. 05 3月, 2011 3 次提交
    • N
      nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) · e9e3d724
      Neil Horman 提交于
      The "bad_page()" page allocator sanity check was reported recently (call
      chain as follows):
      
        bad_page+0x69/0x91
        free_hot_cold_page+0x81/0x144
        skb_release_data+0x5f/0x98
        __kfree_skb+0x11/0x1a
        tcp_ack+0x6a3/0x1868
        tcp_rcv_established+0x7a6/0x8b9
        tcp_v4_do_rcv+0x2a/0x2fa
        tcp_v4_rcv+0x9a2/0x9f6
        do_timer+0x2df/0x52c
        ip_local_deliver+0x19d/0x263
        ip_rcv+0x539/0x57c
        netif_receive_skb+0x470/0x49f
        :virtio_net:virtnet_poll+0x46b/0x5c5
        net_rx_action+0xac/0x1b3
        __do_softirq+0x89/0x133
        call_softirq+0x1c/0x28
        do_softirq+0x2c/0x7d
        do_IRQ+0xec/0xf5
        default_idle+0x0/0x50
        ret_from_intr+0x0/0xa
        default_idle+0x29/0x50
        cpu_idle+0x95/0xb8
        start_kernel+0x220/0x225
        _sinittext+0x22f/0x236
      
      It occurs because an skb with a fraglist was freed from the tcp
      retransmit queue when it was acked, but a page on that fraglist had
      PG_Slab set (indicating it was allocated from the Slab allocator (which
      means the free path above can't safely free it via put_page.
      
      We tracked this back to an nfsv4 setacl operation, in which the nfs code
      attempted to fill convert the passed in buffer to an array of pages in
      __nfs4_proc_set_acl, which gets used by the skb->frags list in
      xs_sendpages.  __nfs4_proc_set_acl just converts each page in the buffer
      to a page struct via virt_to_page, but the vfs allocates the buffer via
      kmalloc, meaning the PG_slab bit is set.  We can't create a buffer with
      kmalloc and free it later in the tcp ack path with put_page, so we need
      to either:
      
      1) ensure that when we create the list of pages, no page struct has
         PG_Slab set
      
       or
      
      2) not use a page list to send this data
      
      Given that these buffers can be multiple pages and arbitrarily sized, I
      think (1) is the right way to go.  I've written the below patch to
      allocate a page from the buddy allocator directly and copy the data over
      to it.  This ensures that we have a put_page free-able page for every
      entry that winds up on an skb frag list, so it can be safely freed when
      the frame is acked.  We do a put page on each entry after the
      rpc_call_sync call so as to drop our own reference count to the page,
      leaving only the ref count taken by tcp_sendpages.  This way the data
      will be properly freed when the ack comes in
      
      Successfully tested by myself to solve the above oops.
      
      Note, as this is the result of a setacl operation that exceeded a page
      of data, I think this amounts to a local DOS triggerable by an
      uprivlidged user, so I'm CCing security on this as well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: security@kernel.org
      CC: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9e3d724
    • S
      ceph: no .snap inside of snapped namespace · 455cec0a
      Sage Weil 提交于
      Otherwise you can do things like
      
      # mkdir .snap/foo
      # cd .snap/foo/.snap
      # ls
      <badness>
      Signed-off-by: NSage Weil <sage@newdream.net>
      455cec0a
    • A
      minimal fix for do_filp_open() race · 1858efd4
      Al Viro 提交于
      failure exits on the no-O_CREAT side of do_filp_open() merge with
      those of O_CREAT one; unfortunately, if do_path_lookup() returns
      -ESTALE, we'll get out_filp:, notice that we are about to return
      -ESTALE without having trying to create the sucker with LOOKUP_REVAL
      and jump right into the O_CREAT side of code.  And proceed to try
      and create a file.  Usually that'll fail with -ESTALE again, but
      we can race and get that attempt of pathname resolution to succeed.
      
      open() without O_CREAT really shouldn't end up creating files, races
      or not.  The real fix is to rearchitect the whole do_filp_open(),
      but for now splitting the failure exits will do.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1858efd4
  8. 04 3月, 2011 3 次提交
  9. 03 3月, 2011 9 次提交
  10. 02 3月, 2011 3 次提交
    • J
      ext2: Fix link count corruption under heavy link+rename load · e8a80c6f
      Josh Hunt 提交于
      vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
      i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
      it as reported and analyzed by Josh.
      
      In fact, there is no good reason to mess with i_nlink of the moved file.
      We did it presumably to simulate linking into the new directory and unlinking
      from an old one. But the practical effect of this is disputable because fsck
      can possibly treat file as being properly linked into both directories without
      writing any error which is confusing. So we just stop increment-decrement
      games with i_nlink which also fixes the corruption.
      
      CC: stable@kernel.org
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NJosh Hunt <johunt@akamai.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      e8a80c6f
    • A
      xfs: zero proper structure size for geometry calls · af24ee9e
      Alex Elder 提交于
      Commit 493f3358 added this call to
      xfs_fs_geometry() in order to avoid passing kernel stack data back
      to user space:
      
      +       memset(geo, 0, sizeof(*geo));
      
      Unfortunately, one of the callers of that function passes the
      address of a smaller data type, cast to fit the type that
      xfs_fs_geometry() requires.  As a result, this can happen:
      
      Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
      in: f87aca93
      
      Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358+ #1
      Call Trace:
      
      [<c12991ac>] ? panic+0x50/0x150
      [<c102ed71>] ? __stack_chk_fail+0x10/0x18
      [<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]
      
      Fix this by fixing that one caller to pass the right type and then
      copy out the subset it is interested in.
      
      Note: This patch is an alternative to one originally proposed by
      Eric Sandeen.
      Reported-by: NJeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Tested-by: NJeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
      af24ee9e
    • R
      nilfs2: fix regression that i-flag is not set on changeless checkpoints · 72746ac6
      Ryusuke Konishi 提交于
      According to the report from Jiro SEKIBA titled "regression in
      2.6.37?"  (Message-Id: <8739n8vs1f.wl%jir@sekiba.com>), on 2.6.37 and
      later kernels, lscp command no longer displays "i" flag on checkpoints
      that snapshot operations or garbage collection created.
      
      This is a regression of nilfs2 checkpointing function, and it's
      critical since it broke behavior of a part of nilfs2 applications.
      For instance, snapshot manager of TimeBrowse gets to create
      meaningless snapshots continuously; snapshot creation triggers another
      checkpoint, but applications cannot distinguish whether the new
      checkpoint contains meaningful changes or not without the i-flag.
      
      This patch fixes the regression and brings that application behavior
      back to normal.
      Reported-by: NJiro SEKIBA <jir@unicus.jp>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NJiro SEKIBA <jir@unicus.jp>
      Cc: stable <stable@kernel.org>  [2.6.37]
      72746ac6
  11. 01 3月, 2011 1 次提交
  12. 26 2月, 2011 1 次提交