1. 09 Dec, 2015 2 commits
    • fix the regression from "direct-io: Fix negative return from dio read beyond eof" · 2d4594ac
      Committed by Al Viro
      Sure, it's better to bail out of past-the-eof read and return 0 than return
      a bogus negative value on such.  Only we'd better make sure we are bailing out
      with 0 and not -ENOMEM...
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2d4594ac
    • 9p: ->evict_inode() should kick out ->i_data, not ->i_mapping · 4ad78628
      Committed by Al Viro
      For block devices the pagecache is associated with the inode
      on bdevfs, not with the aliasing ones on the mountable filesystems.
      The latter have their own ->i_data empty and ->i_mapping pointing
      to the (unique per major/minor) bdevfs inode.  That guarantees
      cache coherence between all block device inodes with the same
      device number.
      
      Eviction of an alias inode has no business trying to evict the
      pages belonging to bdevfs one; moreover, ->i_mapping is only
      safe to access when the thing is opened.  At the time of
      ->evict_inode() the victim is definitely *not* opened.  We are
      about to kill the address space embedded into struct inode
      (inode->i_data) and that's what we need to empty of any pages.
      
      9p instance tries to empty inode->i_mapping instead, which is
      both unsafe and bogus - if we have several device nodes with
      the same device number in different places, closing one of them
      should not try to empty the (shared) page cache.
      
      Fortunately, other instances in the tree are OK; they are
      evicting from &inode->i_data instead, as 9p one should.
      
      Cc: stable@vger.kernel.org # v2.6.32+, ones prior to 2.6.36 need only half of that
      Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
      Tested-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4ad78628
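The difference between the embedded ->i_data and the ->i_mapping pointer can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct model_mapping`, `struct model_inode`, `model_alias()`, `evict_buggy()` and `evict_fixed()` are hypothetical names invented for illustration, and a page count stands in for the real page cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an address space: just a page count. */
struct model_mapping {
    unsigned long nrpages;
};

/* Hypothetical stand-in for struct inode. */
struct model_inode {
    struct model_mapping i_data;     /* embedded, dies with the inode */
    struct model_mapping *i_mapping; /* may point at another inode's i_data */
};

/* Make 'alias' share the page cache owned by 'bdev', as device nodes do. */
static void model_alias(struct model_inode *alias, struct model_inode *bdev)
{
    assert(alias != bdev);
    alias->i_data.nrpages = 0;
    alias->i_mapping = &bdev->i_data;
}

/* Behaviour the commit calls bogus: empties whatever i_mapping points to. */
static void evict_buggy(struct model_inode *inode)
{
    inode->i_mapping->nrpages = 0;
}

/* Behaviour the commit asks for: empty only the embedded i_data. */
static void evict_fixed(struct model_inode *inode)
{
    inode->i_data.nrpages = 0;
}
```

In this model, evicting an alias with `evict_fixed()` leaves the shared cache owned by the bdev inode untouched, while `evict_buggy()` empties it for every other alias as well.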
  2. 07 Dec, 2015 3 commits
  3. 05 Dec, 2015 1 commit
    • jbd2: fix null committed data return in undo_access · 087ffd4e
      Committed by Junxiao Bi
      Commit de92c8ca ("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
      introduced jbd2_write_access_granted() to improve write|undo_access
      speed, but failed to check the status of b_committed_data, which caused
      a kernel panic on ocfs2.
      
      [ 6538.405938] ------------[ cut here ]------------
      [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
      [ 6538.406686] invalid opcode: 0000 [#1] SMP
      [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
      [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
      [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
      [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
      [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>]  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686] RSP: 0018:ffff880075b7b7f8  EFLAGS: 00010246
      [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
      [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
      [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
      [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
      [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
      [ 6538.406686] FS:  00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
      [ 6538.406686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
      [ 6538.406686] Stack:
      [ 6538.406686]  ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
      [ 6538.406686]  ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
      [ 6538.406686]  ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
      [ 6538.406686] Call Trace:
      [ 6538.406686]  [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
      [ 6538.406686]  [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
      [ 6538.406686]  [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
      [ 6538.406686]  [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
      [ 6538.406686]  [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
      [ 6538.406686]  [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2]
      [ 6538.406686]  [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
      [ 6538.406686]  [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2]
      [ 6538.406686]  [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50
      [ 6538.406686]  [<ffffffff812124b7>] notify_change+0x1d7/0x340
      [ 6538.406686]  [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80
      [ 6538.406686]  [<ffffffff811f5876>] do_truncate+0x66/0x90
      [ 6538.406686]  [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110
      [ 6538.406686]  [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120
      [ 6538.406686]  [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10
      [ 6538.406686]  [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
      [ 6538.406686] RIP  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686]  RSP <ffff880075b7b7f8>
      [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
      [ 6538.694492] Kernel panic - not syncing: Fatal exception
      [ 6538.695484] Kernel Offset: disabled
      
      Fixes: de92c8ca("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      087ffd4e
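The missing check can be pictured with a reduced user-space model of a fast-path test. This is a sketch, not the kernel code: `struct model_jh` and `model_access_granted()` are hypothetical, and the `check_committed_data` switch exists only to contrast the behaviour before and after the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a journal head. */
struct model_jh {
    const void *b_transaction;    /* transaction that owns the buffer */
    const void *b_committed_data; /* copy that undo access relies on */
};

/*
 * Fast-path test: may the caller skip the slow access path?
 * With check_committed_data == false this models the regression:
 * undo access is granted even though no committed copy exists.
 */
static bool model_access_granted(const struct model_jh *jh,
                                 const void *handle_transaction,
                                 bool undo, bool check_committed_data)
{
    assert(jh != NULL);
    if (jh->b_transaction != handle_transaction)
        return false;
    if (check_committed_data && undo && jh->b_committed_data == NULL)
        return false; /* not granted: caller must take the slow path */
    return true;
}
```

With the check enabled, an undo request on a buffer without committed data is no longer short-circuited, so the caller never sees a NULL b_committed_data after a "granted" answer.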
  4. 02 Dec, 2015 1 commit
    • net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Committed by Eric Dumazet
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int nr, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cd3e072
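The point of routing every flag update through two helpers is that the storage location can later change in one place. A user-space sketch of that idea follows; the `model_` types, the bit numbers, and `model_sk_test_bit()` are assumptions made for illustration, and plain (non-atomic) bit operations are used here for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced models of struct socket and struct sock. */
struct model_socket {
    unsigned long flags;
};

struct model_sock {
    struct model_socket *sk_socket;
};

/* Illustrative bit numbers only; not taken from the kernel headers. */
enum { MODEL_ASYNC_NOSPACE = 0, MODEL_ASYNC_WAITDATA = 1 };

/* Callers go through these helpers, so only the helpers know where flags live. */
static void model_sk_set_bit(int nr, struct model_sock *sk)
{
    assert(nr >= 0 && sk->sk_socket != NULL);
    sk->sk_socket->flags |= 1UL << nr;
}

static void model_sk_clear_bit(int nr, struct model_sock *sk)
{
    assert(nr >= 0 && sk->sk_socket != NULL);
    sk->sk_socket->flags &= ~(1UL << nr);
}

/* Extra helper for the sketch, used to observe the flag state. */
static bool model_sk_test_bit(int nr, const struct model_sock *sk)
{
    return (sk->sk_socket->flags >> nr) & 1UL;
}
```

If the flags word later moves to a different structure, only the bodies of the helpers need to change; call sites stay as they are.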
  5. 01 Dec, 2015 1 commit
    • direct-io: Fix negative return from dio read beyond eof · 74cedf9b
      Committed by Jan Kara
      Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
      we issue direct IO read at offset 1024, blockdev_direct_IO() reads the
      tail of the last block and the logic for handling short DIO reads in
      dio_complete() results in a return value -24 (1000 - 1024) which
      obviously confuses userspace.
      
      Fix the problem by bailing out early once we sample i_size and can
      reliably check that direct IO read starts beyond i_size.
      Reported-by: Avi Kivity <avi@scylladb.com>
      Fixes: 9fe55eea
      CC: stable@vger.kernel.org
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      74cedf9b
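The arithmetic behind the -24 return value can be reproduced in a few lines. The sketch below is a simplified model, not the kernel implementation: both function names are hypothetical, and the number of bytes transferred for the tail of the block is an assumed value.

```c
#include <assert.h>

/*
 * Hypothetical model of the short-read adjustment described above:
 * if the transfer runs past i_size, the result is clamped to
 * i_size - offset, which goes negative when offset > i_size.
 */
static long long model_dio_read_result(long long i_size, long long offset,
                                       long long transferred)
{
    assert(transferred >= 0);
    if (offset + transferred > i_size)
        transferred = i_size - offset;
    return transferred;
}

/*
 * Model of the fix: bail out with 0 before doing any I/O when the
 * read starts at or beyond i_size.
 */
static long long model_dio_read_fixed(long long i_size, long long offset,
                                      long long transferred)
{
    if (offset >= i_size)
        return 0;
    return model_dio_read_result(i_size, offset, transferred);
}
```

For the case in the commit message (i_size 1000, offset 1024), the first function yields 1000 - 1024 = -24 for any positive transfer, while the second returns 0; reads that start inside the file are still clamped to the bytes remaining before EOF.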
  6. 27 Nov, 2015 3 commits
  7. 26 Nov, 2015 3 commits
  8. 25 Nov, 2015 16 commits
    • btrfs: fix balance range usage filters in 4.4-rc · dba72cb3
      Committed by Holger Hoffstätte
      There's a regression in 4.4-rc since commit bc309467
      (btrfs: extend balance filter usage to take minimum and maximum) in that
      existing (non-ranged) balance with -dusage=x no longer works; all chunks
      are skipped.
      
      After staring at the code for a while and wondering why a non-ranged
      balance would even need min and max thresholds (..which then were not
      set correctly, leading to the bug) I realized that the only problem
      was the fact that the filter functions were named wrong, thanks to
      patching copypasta. Simply renaming both functions lets the existing
      btrfs-progs call balance with -dusage=x and now the non-ranged filter
      function is invoked, properly using only a single chunk limit.
      Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      dba72cb3
    • btrfs: qgroup: account shared subtree during snapshot delete · 82bd101b
      Committed by Mark Fasheh
      Commit 0ed4792a ('btrfs: qgroup: Switch to new extent-oriented qgroup
      mechanism.') removed our qgroup accounting during
      btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
      going bad shortly after a snapshot is removed.
      
      Fix this by adding a dirty extent record when we encounter extents during
      our shared subtree walk. This effectively restores the functionality we had
      with the original shared subtree walking code in 1152651a (btrfs: qgroup:
      account shared subtrees during snapshot delete).
      
      The idea with the original patch (and this one) is that shared subtrees can
      get skipped during drop_snapshot. The shared subtree walk then allows us a
      chance to visit those extents and add them to the qgroup work for later
      processing. This ultimately makes the accounting for drop snapshot work.
      
      The new qgroup code nicely handles all the other extents during the tree
      walk via the ref dec/inc functions so we don't have to add actions beyond
      what we had originally.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      82bd101b
    • Btrfs: use btrfs_get_fs_root in resolve_indirect_ref · 2d9e9776
      Committed by Josef Bacik
      The backref code looks up the fs_root we're trying to resolve our indirect
      refs for. Unfortunately, we use btrfs_read_fs_root_no_name, which returns -ENOENT
      if the ref is 0.  This isn't helpful for the qgroup stuff with snapshot delete,
      as it won't be able to search down the snapshot we are deleting, which will
      cause us to miss roots.  So use btrfs_get_fs_root and send false for check_ref
      so we can always get the root we're looking for.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      2d9e9776
    • btrfs: qgroup: fix quota disable during rescan · 967ef513
      Committed by Justin Maggard
      There's a race condition that leads to a NULL pointer dereference if you
      disable quotas while a quota rescan is running.  To fix this, we just need
      to wait for the quota rescan worker to actually exit before tearing down
      the quota structures.
      Signed-off-by: Justin Maggard <jmaggard@netgear.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      967ef513
    • Btrfs: fix race between cleaner kthread and space cache writeout · 036a9348
      Committed by Filipe Manana
      When a block group becomes unused and the cleaner kthread is currently
      running, we can end up getting the current transaction aborted with error
      -ENOENT when we try to commit the transaction, leading to the following
      trace:
      
        [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
        [59779.272594] BTRFS: Transaction aborted (error -2)
        (...)
        [59779.291137] Call Trace:
        [59779.291621]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
        [59779.292543]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
        [59779.293435]  [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
        [59779.295000]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
        [59779.296138]  [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
        [59779.297663]  [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
        [59779.299141]  [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs]
        [59779.300359]  [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs]
        [59779.301805]  [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
        [59779.302893]  [<ffffffff81196634>] sync_filesystem+0x7f/0x93
        (...)
        [59779.318186] ---[ end trace 577e2daff90da33a ]---
      
      The following diagram illustrates a sequence of steps leading to this
      problem:
      
             CPU 1                                             CPU 2
      
                                 <at transaction N>
      
                                                              adds bg A to list
                                                              fs_info->unused_bgs
      
                                                              adds bg B to list
                                                              fs_info->unused_bgs
      
                                 <transaction kthread
                                  commits transaction N
                                  and wakes up the
                                  cleaner kthread>
      
        cleaner kthread
          delete_unused_bgs()
      
            sees bg A in list
            fs_info->unused_bgs
      
            btrfs_start_transaction()
      
                                 <transaction N + 1 starts>
      
            deletes bg A
      
                                                              update_block_group(bg C)
      
                                                                --> adds bg C to list
                                                                    fs_info->unused_bgs
      
            deletes bg B
      
            sees bg C in the list
            fs_info->unused_bgs
      
            btrfs_remove_chunk(bg C)
              btrfs_remove_block_group(bg C)
      
                --> checks if the block group
                    is in a dirty list, and
                    because it isn't now, it
                    does nothing
      
                --> the block group item
                    is deleted from the
                    extent tree
      
                                                                --> adds bg C to list
                                                                    transaction->dirty_bgs
      
                                                               some task calls
                                                               btrfs_commit_transaction(t N + 1)
                                                                 commit_cowonly_roots()
                                                                   btrfs_write_dirty_block_groups()
                                                                     --> sees bg C in cur_trans->dirty_bgs
                                                                     --> calls write_one_cache_group()
                                                                         which returns -ENOENT because
                                                                         it did not find the block group
                                                                         item in the extent tree
                                                                     --> transaction aborted with -ENOENT
                                                                         because write_one_cache_group()
                                                                         returned that error
      
      So fix this by adding a block group to the list of dirty block groups
      before adding it to the list of unused block groups.
      
      This happened on a stress test using fsstress plus concurrent calls to
      fallocate 20G and truncate (releasing part of the space allocated with
      fallocate).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      036a9348
    • Btrfs: fix scrub preventing unused block groups from being deleted · 758f2dfc
      Committed by Filipe Manana
      Currently scrub can race with the cleaner kthread when the latter attempts
      to delete an unused block group, and the result is that the cleaner
      kthread is prevented from ever deleting the block group later, unless the
      block group becomes used and unused again. The following diagram
      illustrates that race:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs and
           removes it from that list
      
                                                   scrub_enumerate_chunks()
      
                                                     searches device tree using
                                                     its commit root
      
                                                     finds device extent for
                                                     block group X
      
                                                     gets block group X from the tree
                                                     fs_info->block_group_cache_tree
                                                     (via btrfs_lookup_block_group())
      
                                                     sets bg X to RO
      
           sees the block group is
           already RO and therefore
           doesn't delete it nor adds
           it back to unused list
      
      So fix this by making scrub add the block group again to the list of
      unused block groups if the block group is still unused when it finished
      scrubbing it and it hasn't been removed already.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      758f2dfc
    • Btrfs: fix race between scrub and block group deletion · 020d5b73
      Committed by Filipe Manana
      Scrub can race with the cleaner kthread deleting block groups that are
      unused (and with relocation too) leading to a failure with error -EINVAL
      that gets returned to user space.
      
      The following diagram illustrates how it happens:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs
      
           sets block group to RO
      
             btrfs_remove_chunk(bg X)
      
               deletes device extents
      
                                               scrub_enumerate_chunks()
      
                                                 searches device tree using
                                                 its commit root
      
                                                 finds device extent for
                                                 block group X
      
                                                 gets block group X from the tree
                                                 fs_info->block_group_cache_tree
                                                 (via btrfs_lookup_block_group())
      
                                                 sets bg X to RO (again)
      
                btrfs_remove_block_group(bg X)
      
                  deletes block group from
                  fs_info->block_group_cache_tree
      
                  removes extent map from
                  fs_info->mapping_tree
      
                                                     scrub_chunk(offset X)
      
                                                       searches fs_info->mapping_tree
                                                       for extent map starting at
                                                       offset X
      
                                                          --> doesn't find any such
                                                              extent map
                                                          --> returns -EINVAL and scrub
                                                              errors out to userspace
                                                              with -EINVAL
      
      Fix this by dealing with an extent map lookup failure as an indicator of
      block group deletion.
      Issue reproduced with fstest btrfs/071.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      020d5b73
    • btrfs: fix rcu warning during device replace · 31388ab2
      Committed by David Sterba
      The test btrfs/011 triggers a rcu warning
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.4.0-rc1-default+ #286 Tainted: G        W
      -------------------------------
      fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      4 locks held by btrfs/28786:
      
      0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
      1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
      2:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
      3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 28786 Comm: btrfs Tainted: G        W       4.4.0-rc1-default+ #286
      Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
      0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
      0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
      ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
      Call Trace:
      [<ffffffff8141d47b>] dump_stack+0x4f/0x74
      [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
      [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
      [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
      [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
      [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
      [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
      [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
      [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
      [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
      [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
      [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
      [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
      [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
      [<ffffffff811e3336>] ? __fget_light+0x86/0xb0
      [<ffffffff811e3369>] ? __fdget+0x9/0x20
      [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
      [<ffffffff811d7483>] SyS_ioctl+0x53/0x80
      [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This is because of unprotected use of rcu_dereference in
      btrfs_scratch_superblocks. We can't add rcu locks around the whole
      function because we read the superblock.
      
      The fix will use the rcu string buffer directly without the rcu locking.
      This is safe as the device will not go away in the meantime. We're
      holding the device list mutexes.
      
      Restructuring the code to narrow down the rcu section turned out to be
      impossible: we need to call filp_open (through update_dev_time) on the
      buffer and this could call kmalloc/__might_sleep. We could call kstrdup
      with GFP_ATOMIC but it's not absolutely necessary.
      
      Fixes: 12b1c263 (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      31388ab2
    • btrfs: Continue replace when set_block_ro failed · 76a8efa1
      Committed by Zhaolei
      xfstests/011 failed on a node with a small filesystem.
      It can be reproduced with the following script:
        DEV_LIST=(/dev/vdd /dev/vde)
        DEV_REPLACE="/dev/vdf"
      
        do_test()
        {
            local mkfs_opt="$1"
            local size="$2"
      
            dmesg -c >/dev/null
            umount $SCRATCH_MNT &>/dev/null
      
            echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
            mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
            mount "${DEV_LIST[0]}" $SCRATCH_MNT
      
            echo -n "Writing big files"
            dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
            for ((i = 1; i <= size; i++)); do
                echo -n .
                /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
            done
            echo
      
            echo Start replace
            btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
                dmesg
                return 1
            }
            return 0
        }
      
        # Set size to value near fs size
        # for example, 1897 can trigger this bug in 2.6G device.
        #
        do_test "-d raid1 -m raid1" 1897
      
      The system reports that the replace failed, with the following warning in dmesg:
       [  134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
       [  135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
       [  135.543505] ------------[ cut here ]------------
       [  135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
       [  135.545276] Modules linked in:
       [  135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
       [  135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
       [  135.547798]  ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
       [  135.548774]  ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
       [  135.549774]  ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
       [  135.550758] Call Trace:
       [  135.551086]  [<ffffffff817fe7b5>] dump_stack+0x44/0x55
       [  135.551737]  [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
       [  135.552487]  [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
       [  135.553211]  [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
       [  135.554051]  [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
       [  135.554722]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
       [  135.555506]  [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
       [  135.556304]  [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
       [  135.557009]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
       [  135.557855]  [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
       [  135.558669]  [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
       [  135.559374]  [<ffffffff81202124>] SyS_ioctl+0x74/0x80
       [  135.559987]  [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
       [  135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
      
      Reason:
       When a large amount of data is written to the fs, all free space
       gets allocated to data chunks.
       Operations such as scrub need to call set_block_ro(), and when there is
       only one metadata chunk in the system (or all other metadata chunks
       are full), the function tries to allocate a new chunk
       and fails because there is no space left on the device.
      
      Fix:
       A set_block_ro failure for a metadata chunk is not a problem,
       because scrub_lock pauses commit_transaction at the same time and
       metadata is always COWed, so in-flight writepages will not
       write data to the same place as scrub/replace.
       Letting replace continue in this case is safe.
      
      Tested by above script, and xfstests/011, plus 100 times xfstests/070.
      
      Changelog v1->v2:
      1: Add detail comments in source and commit-message.
      2: Add dmesg detail into commit-message.
      3: Limit return value of -ENOSPC to be passed.
      All suggested by: Filipe Manana <fdmanana@gmail.com>
      Suggested-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      76a8efa1
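The behaviour described in the changelog ("Limit return value of -ENOSPC to be passed") amounts to treating one specific error as non-fatal. A minimal sketch of that decision follows; `model_can_continue()` is a hypothetical helper, not a function from the btrfs sources, and it ignores the data/metadata distinction discussed in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical decision helper: may scrub/replace go on after the
 * attempt to set a block group read-only returned ro_ret
 * (0 on success, negative errno on failure)?
 */
static bool model_can_continue(int ro_ret)
{
    assert(ro_ret <= 0);
    /* Success, or the one failure the commit considers harmless. */
    return ro_ret == 0 || ro_ret == -ENOSPC;
}
```

Any other error code still stops the operation, which keeps the exemption as narrow as the changelog asks for.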
    • btrfs: fix clashing number of the enhanced balance usage filter · da02c689
      Committed by David Sterba
      I've accidentally picked an already used number for the enhanced usage
      filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with
      BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase,
      no backward compatibility issues.
      Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      da02c689
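A clash of this kind, where two flags are assigned the same bit, can be caught mechanically. The helper below is a generic sketch: `model_flags_clash()` is a hypothetical name, and the caller supplies the flag values, so nothing here reflects the actual numbers in the btrfs headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if any two flag values in the table share a bit. */
static bool model_flags_clash(const unsigned long long *flags, size_t n)
{
    unsigned long long seen = 0;

    assert(flags != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (seen & flags[i])
            return true;  /* this bit was already claimed by an earlier flag */
        seen |= flags[i];
    }
    return false;
}
```

When two flags share a bit, code that tests for one of them cannot tell it apart from the other, which is why the filter needed a number of its own.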
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana committed
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
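
      The accounting described above can be sketched as follows. This is a
      user-space illustration only; bg_removal_trans_units() is a
      hypothetical helper name, not a function from the kernel.

```c
/*
 * Hypothetical helper: number of transaction units to reserve when
 * removing a block group that has num_stripes stripes.
 *
 *   1 unit  - add the orphan item for the free space cache inode
 *   1 unit  - delete the block group item from the extent tree
 *   1 unit  - delete the free space item from the tree of tree roots
 *   N units - delete one device extent item per stripe
 */
static inline unsigned int bg_removal_trans_units(unsigned int num_stripes)
{
	return 3 + num_stripes;
}
```

      For a block group with two stripes this gives 5 units, compared with
      the single unit reserved before the fix.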
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible height (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However after adding the orphan item, writepages() can be called by the VM
      subsystem against the btree inode when we are under memory pressure, which
      causes writeback to start for the nodes we COWed before. This forces the
      operation to remove the free space item to COW again some (or all of) the
      same nodes (in the tree of tree roots). Even without writepages() being
      called, we could fail with ENOSPC because these items are located in
      multiple trees and one of them might have a higher height and require
      node/leaf splits at many levels, exhausting all the reserved space before
      removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple of user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      Reported-by: Stefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7fd01182
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana committed
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Unmounting and remounting the filesystem didn't help
      either. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      8eab77ff
    • D
      Btrfs: tests: checking for NULL instead of IS_ERR() · 89b6c8d1
      Dan Carpenter committed
      btrfs_alloc_dummy_root() returns an error pointer on failure; it never
      returns NULL.
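
      To show why a NULL check misses such failures, here is a small
      user-space sketch that mimics the kernel's error-pointer convention.
      ERR_PTR(), PTR_ERR() and IS_ERR() are simplified re-implementations
      written for this example, and demo_alloc() is a hypothetical stand-in
      for an allocator that reports failure through an error pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space versions of the kernel helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* Error pointers occupy the last MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator: returns ERR_PTR(-ENOMEM) on failure, never NULL. */
static inline void *demo_alloc(int fail)
{
	static int object;

	return fail ? ERR_PTR(-ENOMEM) : (void *)&object;
}
```

      A caller that only tests the result against NULL treats the failed
      allocation as valid, because ERR_PTR(-ENOMEM) is a non-NULL value;
      testing with IS_ERR() catches it.
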
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      89b6c8d1
    • D
      btrfs: fix signed overflows in btrfs_sync_file · 9dcbeed4
      David Sterba committed
      The calculation of range length in btrfs_sync_file leads to signed
      overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.
      
      https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      
      The fsync call passes 0 and LLONG_MAX, so the range length does not fit
      in loff_t and overflows, but the value is converted to u64 so it silently
      works as expected.
      
      The minimal fix is a typecast to u64; switching functions to take
      (start, end) instead of (start, len) would be more intrusive.
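
      A minimal user-space sketch of the typecast approach, assuming an
      inclusive [start, end] byte range. demo_range_len() is a hypothetical
      name, and long long stands in for loff_t here.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Length of the inclusive range [start, end]. Casting each operand to an
 * unsigned 64-bit type before the arithmetic avoids signed overflow: for
 * start = 0 and end = LLONG_MAX the length is 2^63, which does not fit in
 * a signed 64-bit type but does fit in uint64_t.
 */
static inline uint64_t demo_range_len(long long start, long long end)
{
	return (uint64_t)end - (uint64_t)start + 1;
}
```

      Computing end - start + 1 directly in the signed type would overflow
      for the fsync case, which is undefined behavior in C.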
      
      A Coccinelle script found one more open-coded calculation of the
      length.
      
      <smpl>
      @@
      loff_t start, end;
      @@
      * end - start
      </smpl>
      
      CC: stable@vger.kernel.org
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9dcbeed4
    • J
      jbd2: Fix unreclaimed pages after truncate in data=journal mode · bc23f0c8
      Jan Kara committed
      Ted and Namjae have reported that truncated pages don't get timely
      reclaimed after being truncated in data=journal mode. The following test
      triggers the issue easily:
      
      for (i = 0; i < 1000; i++) {
      	pwrite(fd, buf, 1024*1024, 0);
      	fsync(fd);
      	fsync(fd);
      	ftruncate(fd, 0);
      }
      
      The reason is that journal_unmap_buffer() finds that truncated buffers
      are not journalled (jh->b_transaction == NULL), they are part of
      checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
      been already written out (!buffer_dirty(bh)). We clean such buffers but
      we leave them in the checkpoint list. Since checkpoint transaction holds
      a reference to the journal head, these buffers cannot be released until
      the checkpoint transaction is cleaned up. And at that point we don't
      call release_buffer_page() anymore so pages detached from mapping are
      lingering in the system waiting for reclaim to find them and free them.
      
      Fix the problem by removing buffers from transaction checkpoint lists
      when journal_unmap_buffer() finds out they don't have to be there
      anymore.
      Reported-and-tested-by: Namjae Jeon <namjae.jeon@samsung.com>
      Fixes: de1b7941
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      bc23f0c8
    • D
      ext4: Fix handling of extended tv_sec · a4dad1ae
      David Turner committed
      In ext4, the bottom two bits of {a,c,m}time_extra are used to extend
      the {a,c,m}time fields, deferring the year 2038 problem to the year
      2446.
      
      When decoding these extended fields, for times whose bottom 32 bits
      would represent a negative number, sign extension causes the 64-bit
      extended timestamp to be negative as well, which is not what's
      intended.  This patch corrects that issue, so that the only negative
      {a,c,m}times are those between 1901 and 1970 (as per 32-bit signed
      timestamps).
      
      Some older kernels might have written pre-1970 dates with 1,1 in the
      extra bits.  This patch treats those incorrectly-encoded dates as
      pre-1970, instead of post-2311, until kernel 4.20 is released.
      Hopefully by then e2fsck will have fixed up the bad data.
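
      The decoding described above can be sketched in user space as follows.
      This is an illustration written from the description in this message,
      not the patch itself. The DEMO_* macros, ext4_demo_decode_seconds()
      and the legacy_compat parameter are hypothetical names.

```c
#include <stdint.h>

#define DEMO_EPOCH_BITS 2
#define DEMO_EPOCH_MASK ((UINT32_C(1) << DEMO_EPOCH_BITS) - 1)

/*
 * Combine the signed 32-bit on-disk seconds field with the bottom two
 * bits of the "extra" field. The 32-bit value is sign-extended first and
 * the epoch bits then add a multiple of 2^32, so only epoch 0 with a
 * negative low part (1901..1970) decodes to a negative time.
 *
 * With legacy_compat set, the pattern "epoch bits 1,1 plus negative low
 * part" is treated as a pre-1970 date written by an older kernel.
 */
static inline int64_t ext4_demo_decode_seconds(int32_t low, uint32_t extra,
					       int legacy_compat)
{
	uint64_t epoch = extra & DEMO_EPOCH_MASK;

	if (legacy_compat && epoch == 3 && low < 0)
		epoch = 0;

	return (int64_t)low + (int64_t)(epoch << 32);
}
```

      For example, a low part of INT32_MIN with epoch bits 0,1 decodes to
      2^31, the first second after the 32-bit signed range ends in 2038,
      rather than to a negative value.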
      
      Also add a comment explaining the encoding of ext4's extra {a,c,m}time
      bits.
      Signed-off-by: David Turner <novalis@novalis.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reported-by: Mark Harris <mh8928@yahoo.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732
      Cc: stable@vger.kernel.org
      a4dad1ae
  9. 24 Nov, 2015 10 commits