1. 07 1月, 2016 11 次提交
    • D
      btrfs: constify remaining structs with function pointers · 20e5506b
      David Sterba 提交于
      * struct extent_io_ops
      * struct btrfs_free_space_op
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      20e5506b
    • D
      btrfs tests: replace whole ops structure for free space tests · 28f0779a
      David Sterba 提交于
      Preparatory work for making btrfs_free_space_op constant. In
      test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
      own version thus preventing constification. We can rework it so we
      replace the whole structure with the correct function pointers.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      28f0779a
    • D
      btrfs: don't use slab cache for struct btrfs_delalloc_work · 100d5702
      David Sterba 提交于
      Although we prefer to use separate caches for various structs, it seems
      better not to do that for struct btrfs_delalloc_work. Objects of this
      type are allocated rarely, when transaction commit calls
      btrfs_start_delalloc_roots, requesting delayed iputs.
      
      The objects are temporary (with some IO involved) but still allocated
      and freed within __start_delalloc_inodes. Memory allocation failure is
      handled.
      
      The slab cache is empty most of the time (observed on several systems),
      so if we need to allocate a new slab object, the first one has to
      allocate a full page. In a potential case of low memory conditions this
      might fail with higher probability compared to using the generic slab
      caches.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      100d5702
    • D
      btrfs: drop duplicate prefix from scrub workqueues · 0de270fa
      David Sterba 提交于
      The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0de270fa
    • D
      93a3d467
    • D
      btrfs: handle invalid num_stripes in sys_array · f5cdedd7
      David Sterba 提交于
      We can handle the special case of num_stripes == 0 directly inside
      btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
      catch other unhandled cases where we fail to validate external data.
      
      A crafted or corrupted image crashes at mount time:
      
      BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
      BTRFS info (device loop0): disk space caching is enabled
      BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
      Kernel panic - not syncing: BUG!
      CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
      Stack:
       637af890 60062489 602aeb2e 604192ba
       60387961 00000011 637af8a0 6038a835
       637af9c0 6038776b 634ef32b 00000000
      Call Trace:
       [<6001c86d>] show_stack+0xfe/0x15b
       [<6038a835>] dump_stack+0x2a/0x2c
       [<6038776b>] panic+0x13e/0x2b3
       [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
       [<601cfbbe>] open_ctree+0x192d/0x27af
       [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<600d710b>] do_mount+0xa35/0xbc9
       [<600d7557>] SyS_mount+0x95/0xc8
       [<6001e884>] handle_syscall+0x6b/0x8e
      Reported-by: NJiri Slaby <jslaby@suse.com>
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      CC: stable@vger.kernel.org	# 3.19+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5cdedd7
    • D
      btrfs: better packing of btrfs_delayed_extent_op · 35b3ad50
      David Sterba 提交于
      btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
      and has 8 unused bytes. Reducing the level type to u8 makes it possible
      to squeeze it to the padding byte after key. The bitfields were switched
      to bool as there's space to store the full byte without increasing the
      whole structure, besides that the generated assembly is smaller.
      
      struct btrfs_delayed_extent_op {
      	struct btrfs_disk_key      key;                  /*     0    17 */
      	u8                         level;                /*    17     1 */
      	bool                       update_key;           /*    18     1 */
      	bool                       update_flags;         /*    19     1 */
      	bool                       is_data;              /*    20     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	u64                        flags_to_set;         /*    24     8 */
      
      	/* size: 32, cachelines: 1, members: 6 */
      	/* sum members: 29, holes: 1, sum holes: 3 */
      	/* last cacheline: 32 bytes */
      };
      
      The final size is 32 bytes which gives +26 object per slab page.
      
         text	   data	    bss	    dec	    hex	filename
       938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
       938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      35b3ad50
    • D
      btrfs: put delayed item hook into inode · 8089fe62
      David Sterba 提交于
      Inodes for delayed iput allocate a trivial helper structure, let's place
      the list hook directly into the inode and save a kmalloc (killing a
      __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
      
      The inode can be put into the delayed_iputs list more than once and we
      have to keep the count. This means we can't use the list_splice to
      process a bunch of inodes because we'd lost track of the count if the
      inode is put into the delayed iputs again while it's processed.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8089fe62
    • Z
      btrfs: Support convert to -d dup for btrfs-convert · c5ca8781
      Zhao Lei 提交于
      Since we will add support for -d dup for non-mixed filesystem,
      kernel need to support converting to this raid-type.
      
      This patch remove limitation of above case.
      
      Tested by following script:
      (combination of dup conversion with fsck):
      
      export TEST_DEV='/dev/vdc'
      export TEST_DIR='/var/ltf/tester/mnt'
      
      do_dup_test()
      {
          local m_from="$1"
          local d_from="$2"
          local m_to="$3"
          local d_to="$4"
      
          echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
      
          umount "$TEST_DIR" &>/dev/null
          ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
          mount "$TEST_DEV" "$TEST_DIR" || return 1
      
          cp -a /sbin/* "$TEST_DIR"
      
          [[ "$m_from" != "$m_to" ]] && {
              ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
          }
      
          [[ "$d_from" != "$d_to" ]] && {
      	local opt=()
      	[[ "$d_to" == single ]] && opt+=("-f")
              ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
          }
      
          umount "$TEST_DIR" || return 1
          ./btrfsck "$TEST_DEV" || return 1
          echo
      
          return 0
      }
      
      test_all()
      {
          for m_from in single dup; do
          for d_from in single dup; do
          for m_to in single dup; do
          for d_to in single dup; do
          do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
          done
          done
          done
          done
      }
      
      test_all
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5ca8781
    • J
      Btrfs: igrab inode in writepage · be7bd730
      Josef Bacik 提交于
      We hit this panic on a few of our boxes this week where we have an
      ordered_extent with an NULL inode.  We do an igrab() of the inode in writepages,
      but weren't doing it in writepage which can be called directly from the VM on
      dirty pages.  If the inode has been unlinked then we could have I_FREEING set
      which means igrab() would return NULL and we get this panic.  Fix this by trying
      to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
      and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      be7bd730
    • A
      Btrfs: add missing brelse when superblock checksum fails · b2acdddf
      Anand Jain 提交于
      Looks like oversight, call brelse() when checksum fails. Further down the
      code, in the non error path, we do call brelse() and so we don't see
      brelse() in the goto error paths.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b2acdddf
  2. 19 12月, 2015 1 次提交
  3. 16 12月, 2015 2 次提交
    • C
      Btrfs: check prepare_uptodate_page() error code earlier · bb1591b4
      Chris Mason 提交于
      prepare_pages() may end up calling prepare_uptodate_page() twice if our
      write only spans a single page.  But if the first call returns an error,
      our page will be unlocked and its not safe to call it again.
      
      This bug goes all the way back to 2011, and it's not something commonly
      hit.
      
      While we're here, add a more explicit check for the page being truncated
      away.  The bare lock_page() alone is protected only by good thoughts and
      i_mutex, which we're sure to regret eventually.
      Reported-by: NDave Jones <dsj@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bb1591b4
    • C
      Btrfs: check for empty bitmap list in setup_cluster_bitmaps · 1b9b922a
      Chris Mason 提交于
      Dave Jones found a warning from kasan in setup_cluster_bitmaps()
      
      ==================================================================
      BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
      addr ffff88039bef6828
      Read of size 8 by task nfsd/1009
      page:ffffea000e6fbd80 count:0 mapcount:0 mapping:          (null)
      index:0x0
      flags: 0x8000000000000000()
      page dumped because: kasan: bad access detected
      CPU: 1 PID: 1009 Comm: nfsd Tainted: G        W
      4.4.0-rc3-backup-debug+ #1
       ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
       0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
       ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
      Call Trace:
       [<ffffffffa680a43e>] dump_stack+0x4b/0x6d
       [<ffffffffa62638d1>] kasan_report_error+0x501/0x520
       [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
       [<ffffffffa6263948>] kasan_report+0x58/0x60
       [<ffffffffa6814b00>] ? rb_last+0x10/0x40
       [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
       [<ffffffffa6262ead>] __asan_load8+0x5d/0x70
       [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
       [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
       [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
       [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
       [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
       [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
       [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520
      
      Andrey noticed this was because we were doing list_first_entry on a list
      that might be empty.  Rework the tests a bit so we don't do that.
      Signed-off-by: NChris Mason <clm@fb.com>
      Reprorted-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com>
      Reported-by: NDave Jones <dsj@fb.com>
      1b9b922a
  4. 14 12月, 2015 1 次提交
  5. 13 12月, 2015 2 次提交
    • J
      ocfs2: fix SGID not inherited issue · 854ee2e9
      Junxiao Bi 提交于
      Commit 8f1eb487 ("ocfs2: fix umask ignored issue") introduced an
      issue, SGID of sub dir was not inherited from its parents dir.  It is
      because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
      is overwritten by "mode" which don't have SGID set later.
      
      Fixes: 8f1eb487 ("ocfs2: fix umask ignored issue")
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Acked-by: NSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854ee2e9
    • H
      osd fs: __r4w_get_page rely on PageUptodate for uptodate · 3066a967
      Hugh Dickins 提交于
      Commit 42cb14b1 ("mm: migrate dirty page without
      clear_page_dirty_for_io etc") simplified the migration of a PageDirty
      pagecache page: one stat needs moving from zone to zone and that's about
      all.
      
      It's convenient and safest for it to shift the PageDirty bit from old
      page to new, just before updating the zone stats: before copying data
      and marking the new PageUptodate.  This is all done while both pages are
      isolated and locked, just as before; and just as before, there's a
      moment when the new page is visible in the radix_tree, but not yet
      PageUptodate.  What's new is that it may now be briefly visible as
      PageDirty before it is PageUptodate.
      
      When I scoured the tree to see if this could cause a problem anywhere,
      the only places I found were in two similar functions __r4w_get_page():
      which look up a page with find_get_page() (not using page lock), then
      claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.
      
      I'm not sure whether that was right before, but now it might be wrong
      (on rare occasions): only claim the page is uptodate if PageUptodate.
      Or perhaps the page in question could never be migratable anyway?
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Tested-by: NBoaz Harrosh <ooo@electrozaur.com>
      Cc: Benny Halevy <bhalevy@panasas.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3066a967
  6. 10 12月, 2015 3 次提交
    • H
      btrfs: fix misleading warning when space cache failed to load · 94356889
      Holger Hoffstätte 提交于
      When an inconsistent space cache is detected during loading we log a
      warning that users frequently mistake as instruction to invalidate the
      cache manually, even though this is not required. Fix the message to
      indicate that the cache will be rebuilt automatically.
      Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Acked-by: NFilipe Manana <fdmanana@suse.com>
      94356889
    • F
      Btrfs: fix transaction handle leak in balance · 8a7d656f
      Filipe Manana 提交于
      If we fail to allocate a new data chunk, we were jumping to the error path
      without release the transaction handle we got before. Fix this by always
      releasing it before doing the jump.
      
      Fixes: 2c9fe835 ("btrfs: Fix lost-data-profile caused by balance bg")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      8a7d656f
    • F
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana 提交于
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
      filesysten with "-o discard":
      
       10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       10264  {
       (...)
       10405                  if (trimming) {
       10406                          WARN_ON(!list_empty(&block_group->bg_list));
       10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
       10408                          list_move(&block_group->bg_list,
       10409                                    &trans->transaction->deleted_bgs);
       10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
       10411                          btrfs_get_block_group(block_group);
       10412                  }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
               the transaction's delete_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      348a0013
  7. 09 12月, 2015 2 次提交
    • A
      fix the regression from "direct-io: Fix negative return from dio read beyond eof" · 2d4594ac
      Al Viro 提交于
      Sure, it's better to bail out of past-the-eof read and return 0 than return
      a bogus negative value on such.  Only we'd better make sure we are bailing out
      with 0 and not -ENOMEM...
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2d4594ac
    • A
      9p: ->evict_inode() should kick out ->i_data, not ->i_mapping · 4ad78628
      Al Viro 提交于
      For block devices the pagecache is associated with the inode
      on bdevfs, not with the aliasing ones on the mountable filesystems.
      The latter have its own ->i_data empty and ->i_mapping pointing
      to the (unique per major/minor) bdevfs inode.  That guarantees
      cache coherence between all block device inodes with the same
      device number.
      
      Eviction of an alias inode has no business trying to evict the
      pages belonging to bdevfs one; moreover, ->i_mapping is only
      safe to access when the thing is opened.  At the time of
      ->evict_inode() the victim is definitely *not* opened.  We are
      about to kill the address space embedded into struct inode
      (inode->i_data) and that's what we need to empty of any pages.
      
      9p instance tries to empty inode->i_mapping instead, which is
      both unsafe and bogus - if we have several device nodes with
      the same device number in different places, closing one of them
      should not try to empty the (shared) page cache.
      
      Fortunately, other instances in the tree are OK; they are
      evicting from &inode->i_data instead, as 9p one should.
      
      Cc: stable@vger.kernel.org # v2.6.32+, ones prior to 2.6.36 need only half of that
      Reported-by: N"Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
      Tested-by: N"Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4ad78628
  8. 08 12月, 2015 1 次提交
    • T
      SUNRPC: Fix callback channel · 756b9b37
      Trond Myklebust 提交于
      The NFSv4.1 callback channel is currently broken because the receive
      message will keep shrinking because the backchannel receive buffer size
      never gets reset.
      The easiest solution to this problem is instead of changing the receive
      buffer, to rather adjust the copied request.
      
      Fixes: 38b7631f ("nfs4: limit callback decoding to received bytes")
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      756b9b37
  9. 07 12月, 2015 3 次提交
  10. 05 12月, 2015 2 次提交
    • I
      block: detach bdev inode from its wb in __blkdev_put() · 43d1c0eb
      Ilya Dryomov 提交于
      Since 52ebea74 ("writeback: make backing_dev_info host
      cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
      gets attached to a wb (struct bdi_writeback).  Detaching happens on
      evict, in inode_detach_wb() called from __destroy_inode(), and involves
      updating wb.
      
      However, detaching an internal bdev inode from its wb in
      __destroy_inode() is too late.  Its bdi and by extension root wb are
      embedded into struct request_queue, which has different lifetime rules
      and can be freed long before the final bdput() is called (can be from
      __fput() of a corresponding /dev inode, through dput() - evict() -
      bd_forget().  bdevs hold onto the underlying disk/queue pair only while
      opened; as soon as bdev is closed all bets are off.  In fact,
      disk/queue can be gone before __blkdev_put() even returns:
      
      1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
      1500 {
      ...
      1518         if (bdev->bd_contains == bdev) {
      1519                 if (disk->fops->release)
      1520                         disk->fops->release(disk, mode);
      
      [ Driver puts its references to disk/queue ]
      
      1521         }
      1522         if (!bdev->bd_openers) {
      1523                 struct module *owner = disk->fops->owner;
      1524
      1525                 disk_put_part(bdev->bd_part);
      1526                 bdev->bd_part = NULL;
      1527                 bdev->bd_disk = NULL;
      1528                 if (bdev != bdev->bd_contains)
      1529                         victim = bdev->bd_contains;
      1530                 bdev->bd_contains = NULL;
      1531
      1532                 put_disk(disk);
      
      [ We put ours, the queue is gone
        The last bdput() would result in a write to invalid memory ]
      
      1533                 module_put(owner);
      ...
      1539 }
      
      Since bdev inodes are special anyway, detach them in __blkdev_put()
      after clearing inode's dirty bits, turning the problematic
      inode_detach_wb() in __destroy_inode() into a noop.
      
      add_disk() grabs its disk->queue since 523e1d39 ("block: make
      gendisk hold a reference to its queue"), so the old ->release comment
      is removed in favor of the new inode_detach_wb() comment.
      
      Cc: stable@vger.kernel.org # 4.2+, needs backporting
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Tested-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      43d1c0eb
    • J
      jbd2: fix null committed data return in undo_access · 087ffd4e
      Junxiao Bi 提交于
      introduced jbd2_write_access_granted() to improve write|undo_access
      speed, but missed to check the status of b_committed_data which caused
      a kernel panic on ocfs2.
      
      [ 6538.405938] ------------[ cut here ]------------
      [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
      [ 6538.406686] invalid opcode: 0000 [#1] SMP
      [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
      [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
      [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
      [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
      [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>]  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686] RSP: 0018:ffff880075b7b7f8  EFLAGS: 00010246
      [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
      [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
      [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
      [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
      [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
      [ 6538.406686] FS:  00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
      [ 6538.406686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
      [ 6538.406686] Stack:
      [ 6538.406686]  ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
      [ 6538.406686]  ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
      [ 6538.406686]  ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
      [ 6538.406686] Call Trace:
      [ 6538.406686]  [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
      [ 6538.406686]  [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
      [ 6538.406686]  [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
      [ 6538.406686]  [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
      [ 6538.406686]  [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
      [ 6538.406686]  [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2]
      [ 6538.406686]  [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
      [ 6538.406686]  [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2]
      [ 6538.406686]  [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50
      [ 6538.406686]  [<ffffffff812124b7>] notify_change+0x1d7/0x340
      [ 6538.406686]  [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80
      [ 6538.406686]  [<ffffffff811f5876>] do_truncate+0x66/0x90
      [ 6538.406686]  [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110
      [ 6538.406686]  [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120
      [ 6538.406686]  [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10
      [ 6538.406686]  [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
      [ 6538.406686] RIP  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686]  RSP <ffff880075b7b7f8>
      [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
      [ 6538.694492] Kernel panic - not syncing: Fatal exception
      [ 6538.695484] Kernel Offset: disabled
      
      Fixes: de92c8ca("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      087ffd4e
  11. 02 12月, 2015 1 次提交
    • E
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet 提交于
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cd3e072
  12. 01 12月, 2015 1 次提交
    • J
      direct-io: Fix negative return from dio read beyond eof · 74cedf9b
      Jan Kara 提交于
      Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
      we issue direct IO read at offset 1024, blockdev_direct_IO() reads the
      tail of the last block and the logic for handling short DIO reads in
      dio_complete() results in a return value -24 (1000 - 1024) which
      obviously confuses userspace.
      
      Fix the problem by bailing out early once we sample i_size and can
      reliably check that direct IO read starts beyond i_size.
      Reported-by: NAvi Kivity <avi@scylladb.com>
      Fixes: 9fe55eea
      CC: stable@vger.kernel.org
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      74cedf9b
  13. 27 11月, 2015 3 次提交
  14. 26 11月, 2015 3 次提交
  15. 25 11月, 2015 4 次提交
    • H
      btrfs: fix balance range usage filters in 4.4-rc · dba72cb3
      Holger Hoffstätte 提交于
      There's a regression in 4.4-rc since commit bc309467
      (btrfs: extend balance filter usage to take minimum and maximum) in that
      existing (non-ranged) balance with -dusage=x no longer works; all chunks
      are skipped.
      
      After staring at the code for a while and wondering why a non-ranged
      balance would even need min and max thresholds (..which then were not
      set correctly, leading to the bug) I realized that the only problem
      was the fact that the filter functions were named wrong, thanks to
      patching copypasta. Simply renaming both functions lets the existing
      btrfs-progs call balance with -dusage=x and now the non-ranged filter
      function is invoked, properly using only a single chunk limit.
      Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dba72cb3
    • M
      btrfs: qgroup: account shared subtree during snapshot delete · 82bd101b
      Mark Fasheh 提交于
      Commit 0ed4792a ('btrfs: qgroup: Switch to new extent-oriented qgroup
      mechanism.') removed our qgroup accounting during
      btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
      going bad shortly after a snapshot is removed.
      
      Fix this by adding a dirty extent record when we encounter extents during
      our shared subtree walk. This effectively restores the functionality we had
      with the original shared subtree walking code in 1152651a (btrfs: qgroup:
      account shared subtrees during snapshot delete).
      
      The idea with the original patch (and this one) is that shared subtrees can
      get skipped during drop_snapshot. The shared subtree walk then allows us a
      chance to visit those extents and add them to the qgroup work for later
      processing. This ultimately makes the accounting for drop snapshot work.
      
      The new qgroup code nicely handles all the other extents during the tree
      walk via the ref dec/inc functions so we don't have to add actions beyond
      what we had originally.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      82bd101b
    • J
      Btrfs: use btrfs_get_fs_root in resolve_indirect_ref · 2d9e9776
      Josef Bacik 提交于
      The backref code will look up the fs_root we're trying to resolve our indirect
      refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
      if the ref is 0.  This isn't helpful for the qgroup stuff with snapshot delete
      as it won't be able to search down the snapshot we are deleting, which will
      cause us to miss roots.  So use btrfs_get_fs_root and send false for check_ref
      so we can always get the root we're looking for.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      2d9e9776
    • J
      btrfs: qgroup: fix quota disable during rescan · 967ef513
      Justin Maggard 提交于
      There's a race condition that leads to a NULL pointer dereference if you
      disable quotas while a quota rescan is running.  To fix this, we just need
      to wait for the quota rescan worker to actually exit before tearing down
      the quota structures.
      Signed-off-by: NJustin Maggard <jmaggard@netgear.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      967ef513