1. 27 5月, 2015 15 次提交
  2. 21 5月, 2015 1 次提交
  3. 20 5月, 2015 2 次提交
    • F
      Btrfs: fix racy system chunk allocation when setting block group ro · a9629596
      Filipe Manana 提交于
      If while setting a block group read-only we end up allocating a system
      chunk, through check_system_chunk(), we were not doing it while holding
      the chunk mutex which is a problem if a concurrent chunk allocation is
      happening, through do_chunk_alloc(), as it means both block groups can
      end up using the same logical addresses and physical regions in the
      device(s). So make sure we hold the chunk mutex.
      
      Cc: stable@vger.kernel.org  # 4.0+
      Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when
                            setting block group ro")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a9629596
    • M
      btrfs: clear 'ret' in btrfs_check_shared() loop · 2c2ed5aa
      Mark Fasheh 提交于
      btrfs_check_shared() is leaking a return value of '1' from
      find_parent_nodes(). As a result, callers (in this case, extent_fiemap())
      are told extents are shared when they are not. This in turn broke fiemap on
      btrfs for kernels v3.18 and up.
      
      The fix is simple - we just have to clear 'ret' after we are done processing
      the results of find_parent_nodes().
      
      It wasn't clear to me at first what was happening with return values in
      btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help
      on irc. I added documentation to both functions to make things more clear
      for the next hacker who might come across them.
      
      If we could queue this up for -stable too that would be great.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c2ed5aa
  4. 15 5月, 2015 8 次提交
    • T
      ext4: fix an ext3 collapse range regression in xfstests · b9576fc3
      Theodore Ts'o 提交于
      The xfstests test suite assumes that an attempt to collapse range on
      the range (0, 1) will return EOPNOTSUPP if the file system does not
      support collapse range.  Commit 280227a7: "ext4: move check under
      lock scope to close a race" broke this, and this caused xfstests to
      fail when run when testing file systems that did not have the extents
      feature enabled.
      Reported-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b9576fc3
    • V
      kernfs: do not account ino_ida allocations to memcg · 499611ed
      Vladimir Davydov 提交于
      root->ino_ida is used for kernfs inode number allocations. Since IDA has
      a layered structure, different IDs can reside on the same layer, which
      is currently accounted to some memory cgroup. The problem is that each
      kmem cache of a memory cgroup has its own directory on sysfs (under
      /sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
      directory or any file in it gets allocated from a layer accounted to the
      cgroup which the cache is created for, the cgroup will get pinned for
      good, because one has to free all kmem allocations accounted to a cgroup
      in order to release it and destroy all its kmem caches. That said we
      must not account layers of ino_ida to any memory cgroup.
      
      Since per net init operations may create new sysfs entries directly
      (e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
      per each namespace, which, in turn, creates new sysfs entries), an easy
      way to reproduce this issue is by creating network namespace(s) from
      inside a kmem-active memory cgroup.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.0.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      499611ed
    • D
      jbd2: fix r_count overflows leading to buffer overflow in journal recovery · e531d0bc
      Darrick J. Wong 提交于
      The journal revoke block recovery code does not check r_count for
      sanity, which means that an evil value of r_count could result in
      the kernel reading off the end of the revoke table and into whatever
      garbage lies beyond.  This could crash the kernel, so fix that.
      
      However, in testing this fix, I discovered that the code to write
      out the revoke tables also was not correctly checking to see if the
      block was full -- the current offset check is fine so long as the
      revoke table space size is a multiple of the record size, but this
      is not true when either journal_csum_v[23] are set.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      e531d0bc
    • E
      ext4: check for zero length extent explicitly · 2f974865
      Eryu Guan 提交于
      The following commit introduced a bug when checking for zero length extent
      
      5946d089 ext4: check for overlapping extents in ext4_valid_extent_entries()
      
      Zero length extent could pass the check if lblock is zero.
      
      Adding the explicit check for zero length back.
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      2f974865
    • L
      ext4: fix NULL pointer dereference when journal restart fails · 9d506594
      Lukas Czerner 提交于
      Currently when journal restart fails, we'll have the h_transaction of
      the handle set to NULL to indicate that the handle has been effectively
      aborted. We handle this situation quietly in the jbd2_journal_stop() and just
      free the handle and exit because everything else has been done before we
      attempted (and failed) to restart the journal.
      
      Unfortunately there are a number of problems with that approach
      introduced with commit
      
      41a5b913 "jbd2: invalidate handle if jbd2_journal_restart()
      fails"
      
      First of all in ext4 jbd2_journal_stop() will be called through
      __ext4_journal_stop() where we would try to get a hold of the superblock
      by dereferencing h_transaction which in this case would lead to NULL
      pointer dereference and crash.
      
      In addition we're going to free the handle regardless of the refcount
      which is bad as well, because others up the call chain will still
      reference the handle so we might potentially reference already freed
      memory.
      
      Moreover it's expected that we'll get aborted handle as well as detached
      handle in some of the journalling function as the error propagates up
      the stack, so it's unnecessary to call WARN_ON every time we get
      detached handle.
      
      And finally we might leak some memory by forgetting to free reserved
      handle in jbd2_journal_stop() in the case where handle was detached from
      the transaction (h_transaction is NULL).
      
      Fix the NULL pointer dereference in __ext4_journal_stop() by just
      calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
      the potential memory leak in jbd2_journal_stop() and use proper
      handle refcounting before we attempt to free it to avoid use-after-free
      issues.
      
      And finally remove all WARN_ON(!transaction) from the code so that we do
      not get random traces when something goes wrong because when journal
      restart fails we will get to some of those functions.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      9d506594
    • T
      ext4: remove unused function prototype from ext4.h · 92c82639
      Theodore Ts'o 提交于
      The ext4_extent_tree_init() function hasn't been in the ext4 code for
      a long time ago, except in an unused function prototype in ext4.h
      
      Google-Bug-Id: 4530137
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      92c82639
    • T
      ext4: don't save the error information if the block device is read-only · 1b46617b
      Theodore Ts'o 提交于
      Google-Bug-Id: 20939131
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1b46617b
    • T
      ext4: fix lazytime optimization · 8f4d8558
      Theodore Ts'o 提交于
      We had a fencepost error in the lazytime optimization which means that
      timestamp would get written to the wrong inode.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      8f4d8558
  5. 14 5月, 2015 2 次提交
    • J
      nfs: take extra reference to fl->fl_file when running a setlk · feaff8e5
      Jeff Layton 提交于
      We had a report of a crash while stress testing the NFS client:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
          IP: [<ffffffff8127b698>] locks_get_lock_context+0x8/0x90
          PGD 0
          Oops: 0000 [#1] SMP
          Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6table_security ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_filter ip6_tables iptable_security iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw coretemp crct10dif_pclmul ppdev crc32_pclmul crc32c_intel ghash_clmulni_intel vmw_balloon serio_raw vmw_vmci i2c_piix4 shpchp parport_pc acpi_cpufreq parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih mptbase e1000 ata_generic pata_acpi
          CPU: 1 PID: 399 Comm: kworker/1:1H Not tainted 4.1.0-0.rc1.git0.1.fc23.x86_64 #1
          Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
          Workqueue: rpciod rpc_async_schedule [sunrpc]
          task: ffff880036aea7c0 ti: ffff8800791f4000 task.ti: ffff8800791f4000
          RIP: 0010:[<ffffffff8127b698>]  [<ffffffff8127b698>] locks_get_lock_context+0x8/0x90
          RSP: 0018:ffff8800791f7c00  EFLAGS: 00010293
          RAX: ffff8800791f7c40 RBX: ffff88001f2ad8c0 RCX: ffffe8ffffc80305
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
          RBP: ffff8800791f7c88 R08: ffff88007fc971d8 R09: 279656d600000000
          R10: 0000034a01000000 R11: 279656d600000000 R12: ffff88001f2ad918
          R13: ffff88001f2ad8c0 R14: 0000000000000000 R15: 0000000100e73040
          FS:  0000000000000000(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000150 CR3: 0000000001c0b000 CR4: 00000000000407e0
          Stack:
           ffffffff8127c5b0 ffff8800791f7c18 ffffffffa0171e29 ffff8800791f7c58
           ffffffffa0171ef8 ffff8800791f7c78 0000000000000246 ffff88001ea0ba00
           ffff8800791f7c40 ffff8800791f7c40 00000000ff5d86a3 ffff8800791f7ca8
          Call Trace:
           [<ffffffff8127c5b0>] ? __posix_lock_file+0x40/0x760
           [<ffffffffa0171e29>] ? rpc_make_runnable+0x99/0xa0 [sunrpc]
           [<ffffffffa0171ef8>] ? rpc_wake_up_task_queue_locked.part.35+0xc8/0x250 [sunrpc]
           [<ffffffff8127cd3a>] posix_lock_file_wait+0x4a/0x120
           [<ffffffffa03e4f12>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4]
           [<ffffffffa03bf108>] ? nfs41_sequence_done+0xd8/0x2d0 [nfsv4]
           [<ffffffffa03c116d>] do_vfs_lock+0x2d/0x30 [nfsv4]
           [<ffffffffa03c251d>] nfs4_lock_done+0x1ad/0x210 [nfsv4]
           [<ffffffffa0171a30>] ? __rpc_sleep_on_priority+0x390/0x390 [sunrpc]
           [<ffffffffa0171a30>] ? __rpc_sleep_on_priority+0x390/0x390 [sunrpc]
           [<ffffffffa0171a5c>] rpc_exit_task+0x2c/0xa0 [sunrpc]
           [<ffffffffa0167450>] ? call_refreshresult+0x150/0x150 [sunrpc]
           [<ffffffffa0172640>] __rpc_execute+0x90/0x460 [sunrpc]
           [<ffffffffa0172a25>] rpc_async_schedule+0x15/0x20 [sunrpc]
           [<ffffffff810baa1b>] process_one_work+0x1bb/0x410
           [<ffffffff810bacc3>] worker_thread+0x53/0x480
           [<ffffffff810bac70>] ? process_one_work+0x410/0x410
           [<ffffffff810bac70>] ? process_one_work+0x410/0x410
           [<ffffffff810c0b38>] kthread+0xd8/0xf0
           [<ffffffff810c0a60>] ? kthread_worker_fn+0x180/0x180
           [<ffffffff817a1aa2>] ret_from_fork+0x42/0x70
           [<ffffffff810c0a60>] ? kthread_worker_fn+0x180/0x180
      
      Jean says:
      
      "Running locktests with a large number of iterations resulted in a
       client crash.  The test run took a while and hasn't finished after close
       to 2 hours. The crash happened right after I gave up and killed the test
       (after 107m) with Ctrl+C."
      
      The crash happened because a NULL inode pointer got passed into
      locks_get_lock_context. The call chain indicates that file_inode(filp)
      returned NULL, which means that f_inode was NULL. Since that's zeroed
      out in __fput, that suggests that this filp pointer outlived the last
      reference.
      
      Looking at the code, that seems possible. We copy the struct file_lock
      that's passed in, but if the task is signalled at an inopportune time we
      can end up trying to use that file_lock in rpciod context after the process
      that requested it has already returned (and possibly put its filp
      reference).
      
      Fix this by taking an extra reference to the filp when we allocate the
      lock info, and put it in nfs4_lock_release.
      Reported-by: NJean Spector <jean@primarydata.com>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      feaff8e5
    • C
      nfs: stat(2) fails during cthon04 basic test5 on NFSv4.0 · 6b196875
      Chuck Lever 提交于
      When running the Connectathon basic tests against a Solaris NFS
      server over NFSv4.0, test5 reports that stat(2) returns a file size
      of zero instead of 1MB.
      
      On success, nfs_commit_inode() can return a positive result; see
      other call sites such as nfs_file_fsync_commit() and
      nfs_commit_unstable_pages().
      
      The call site recently added in nfs_wb_all() does not prevent that
      positive return value from leaking to its callers. If it leaks
      through nfs_sync_inode() back to nfs_getattr(), that causes stat(2)
      to return a positive return value to user space while also not
      filling in the passed-in struct stat.
      
      Additional clean up: the new logic in nfs_wb_all() is rewritten in
      bfields-normal form.
      
      Fixes: 5bb89b47 ("NFSv4.1/pnfs: Separate out metadata . . .")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      6b196875
  6. 13 5月, 2015 1 次提交
    • H
      parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures · d045c77c
      Helge Deller 提交于
      On architectures where the stack grows upwards (CONFIG_STACK_GROWSUP=y,
      currently parisc and metag only) stack randomization sometimes leads to crashes
      when the stack ulimit is set to lower values than STACK_RND_MASK (which is 8 MB
      by default if not defined in arch-specific headers).
      
      The problem is, that when the stack vm_area_struct is set up in fs/exec.c, the
      additional space needed for the stack randomization (as defined by the value of
      STACK_RND_MASK) was not taken into account yet and as such, when the stack
      randomization code added a random offset to the stack start, the stack
      effectively got smaller than what the user defined via rlimit_max(RLIMIT_STACK)
      which then sometimes leads to out-of-stack situations and crashes.
      
      This patch fixes it by adding the maximum possible amount of memory (based on
      STACK_RND_MASK) which theoretically could be added by the stack randomization
      code to the initial stack size. That way, the user-defined stack size is always
      guaranteed to be at minimum what is defined via rlimit_max(RLIMIT_STACK).
      
      This bug is currently not visible on the metag architecture, because on metag
      STACK_RND_MASK is defined to 0 which effectively disables stack randomization.
      
      The changes to fs/exec.c are inside an "#ifdef CONFIG_STACK_GROWSUP"
      section, so it does not affect other platformws beside those where the
      stack grows upwards (parisc and metag).
      Signed-off-by: NHelge Deller <deller@gmx.de>
      Cc: linux-parisc@vger.kernel.org
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: linux-metag@vger.kernel.org
      Cc: stable@vger.kernel.org # v3.16+
      d045c77c
  7. 11 5月, 2015 4 次提交
    • F
      Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON · 062c19e9
      Filipe Manana 提交于
      There's a race between releasing extent buffers that are flagged as stale
      and recycling them that makes us it the following BUG_ON at
      btrfs_release_extent_buffer_page:
      
          BUG_ON(extent_buffer_under_io(eb))
      
      The BUG_ON is triggered because the extent buffer has the flag
      EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
      dirty by another concurrent task.
      
      Here follows a sequence of steps that leads to the BUG_ON.
      
            CPU 0                                                    CPU 1                                                CPU 2
      
      path->nodes[0] == eb X
      X->refs == 2 (1 for the tree, 1 for the path)
      btrfs_header_generation(X) == current trans id
      flag EXTENT_BUFFER_DIRTY set on X
      
      btrfs_release_path(path)
          unlocks X
      
                                                            reads eb X
                                                               X->refs incremented to 3
                                                            locks eb X
                                                            btrfs_del_items(X)
                                                               X becomes empty
                                                               clean_tree_block(X)
                                                                   clear EXTENT_BUFFER_DIRTY from X
                                                               btrfs_del_leaf(X)
                                                                   unlocks X
                                                                   extent_buffer_get(X)
                                                                      X->refs incremented to 4
                                                                   btrfs_free_tree_block(X)
                                                                      X's range is not pinned
                                                                      X's range added to free
                                                                        space cache
                                                                   free_extent_buffer_stale(X)
                                                                      lock X->refs_lock
                                                                      set EXTENT_BUFFER_STALE on X
                                                                      release_extent_buffer(X)
                                                                          X->refs decremented to 3
                                                                          unlocks X->refs_lock
                                                            btrfs_release_path()
                                                               unlocks X
                                                               free_extent_buffer(X)
                                                                   X->refs becomes 2
      
                                                                                                            __btrfs_cow_block(Y)
                                                                                                                btrfs_alloc_tree_block()
                                                                                                                    btrfs_reserve_extent()
                                                                                                                        find_free_extent()
                                                                                                                            gets offset == X->start
                                                                                                                    btrfs_init_new_buffer(X->start)
                                                                                                                        btrfs_find_create_tree_block(X->start)
                                                                                                                            alloc_extent_buffer(X->start)
                                                                                                                                find_extent_buffer(X->start)
                                                                                                                                    finds eb X in radix tree
      
          free_extent_buffer(X)
              lock X->refs_lock
                  test X->refs == 2
                  test bit EXTENT_BUFFER_STALE is set
                  test !extent_buffer_under_io(eb)
      
                                                                                                                                    increments X->refs to 3
                                                                                                                                    mark_extent_buffer_accessed(X)
                                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                                          --> does nothing,
                                                                                                                                              X->refs >= 2 and
                                                                                                                                              EXTENT_BUFFER_TREE_REF
                                                                                                                                              is set in X
                                                                                                                    clear EXTENT_BUFFER_STALE from X
                                                                                                                    locks X
                                                                                                                btrfs_mark_buffer_dirty()
                                                                                                                    set_extent_buffer_dirty(X)
                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                           --> does nothing, X->refs >= 2 and
                                                                                                                               EXTENT_BUFFER_TREE_REF is set
                                                                                                                        sets EXTENT_BUFFER_DIRTY on X
      
                  test and clear EXTENT_BUFFER_TREE_REF
                  decrements X->refs to 2
              release_extent_buffer(X)
                  decrements X->refs to 1
                  unlock X->refs_lock
      
                                                                                                            unlock X
                                                                                                            free_extent_buffer(X)
                                                                                                                lock X->refs_lock
                                                                                                                release_extent_buffer(X)
                                                                                                                    decrements X->refs to 0
                                                                                                                    btrfs_release_extent_buffer_page(X)
                                                                                                                         BUG_ON(extent_buffer_under_io(X))
                                                                                                                             --> EXTENT_BUFFER_DIRTY set on X
      
      Fix this by making find_extent buffer wait for any ongoing task currently
      executing free_extent_buffer()/free_extent_buffer_stale() if the extent
      buffer has the stale flag set.
      A more clean alternative would be to always increment the extent buffer's
      reference count while holding its refs_lock spinlock but find_extent_buffer
      is a performance critical area and that would cause lock contention whenever
      multiple tasks search for the same extent buffer concurrently.
      
      A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
      btrfs patches backported from newer kernels) was hitting this often:
      
      [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
      (...)
      [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
      [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
      [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
      [1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
      (...)
      [1212302.630008] Call Trace:
      [1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
      [1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
      [1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
      [1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
      [1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
      [1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
      [1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
      [1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
      [1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
      [1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0
      
      I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
      integration branch from about a month ago, running the following stress
      test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):
      
        while true; do
           mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
           mount /dev/sdd /mnt
           snapshot_cmd="btrfs subvolume snapshot -r /mnt"
           snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
           fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
           umount /mnt
        done
      
      Which usually triggers the BUG_ON within less than 24 hours:
      
      [49558.618097] ------------[ cut here ]------------
      [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
      (...)
      [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
      [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
      [49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
      (...)
      [49558.620031] Call Trace:
      [49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
      [49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
      [49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
      [49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
      [49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
      [49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
      [49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
      [49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
      [49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
      [49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
      [49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
      [49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
      [49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
      [49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
      [49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
      [49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      062c19e9
    • F
      Btrfs: fix race between block group creation and their cache writeout · ff1f8250
      Filipe Manana 提交于
      So creating a block group has 2 distinct phases:
      
      Phase 1 - creates the btrfs_block_group_cache item and adds it to the
      rbtree fs_info->block_group_cache_tree and to the corresponding list
      space_info->block_groups[];
      
      Phase 2 - adds the block group item to the extent tree and corresponding
      items to the chunk tree.
      
      The first phase adds the block_group_cache_item to a list of pending block
      groups in the transaction handle, and phase 2 happens when
      btrfs_end_transaction() is called against the transaction handle.
      
      It happens that once phase 1 completes, other concurrent tasks that use
      their own transaction handle, but points to the same running transaction
      (struct btrfs_trans_handle->transaction), can use this block group for
      space allocations and therefore mark it dirty. Dirty block groups are
      tracked in a list belonging to the currently running transaction (struct
      btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).
      
      This is a problem because once a task calls btrfs_commit_transaction(),
      it calls btrfs_start_dirty_block_groups() which will see all dirty block
      groups and attempt to start their writeout, including those that are
      still attached to the transaction handle of some concurrent task that
      hasn't called btrfs_end_transaction() yet - which means those block
      groups haven't gone through phase 2 yet and therefore when
      write_one_cache_group() is called, it won't find the block group items
      in the extent tree and abort the current transaction with -ENOENT,
      turning the fs into readonly mode and require a remount.
      
      Fix this by ignoring -ENOENT when looking for block group items in the
      extent tree when we attempt to start the writeout of the block group
      caches outside the critical section of the transaction commit. We will
      try again later during the critical section and if there we still don't
      find the block group item in the extent tree, we then abort the current
      transaction.
      
      This issue happened twice, once while running fstests btrfs/067 and once
      for btrfs/078, which produced the following trace:
      
      [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [ 3278.707329] BTRFS: Transaction aborted (error -2)
      (...)
      [ 3278.731555] Call Trace:
      [ 3278.732396]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [ 3278.733860]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [ 3278.735312]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [ 3278.736874]  [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.738302]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [ 3278.739520]  [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.741222]  [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs]
      [ 3278.742797]  [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
      [ 3278.744492]  [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [ 3278.746084]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [ 3278.747249]  [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs]
      [ 3278.748744]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [ 3278.749958]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 3278.751218]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [ 3278.754197]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [ 3278.755192]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [ 3278.756236]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ff1f8250
    • F
      Btrfs: fix panic when starting bg cache writeout after IO error · 28aeeac1
      Filipe Manana 提交于
      When waiting for the writeback of block group cache we returned
      immediately if there was an error during writeback without waiting
      for the ordered extent to complete. This left a short time window
      where if some other task attempts to start the writeout for the same
      block group cache it can attempt to add a new ordered extent, starting
      at the same offset (0) before the previous one is removed from the
      ordered tree, causing an ordered tree panic (calls BUG()).
      
      This normally doesn't happen in other write paths, such as buffered
      writes or direct IO writes for regular files, since before marking
      page ranges dirty we lock the ranges and wait for any ordered extents
      within the range to complete first.
      
      Fix this by making btrfs_wait_ordered_range() not return immediately
      if it gets an error from the writeback, waiting for all ordered extents
      to complete first.
      
      This issue happened often when running the fstest btrfs/088 and it's
      easy to trigger it by running in a loop until the panic happens:
      
        for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done
      
      [17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
      [17156.864052] ------------[ cut here ]------------
      [17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70!
      (...)
      [17156.864052] Call Trace:
      [17156.864052]  [<ffffffffa03876e3>] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
      [17156.864052]  [<ffffffffa03787e2>] run_delalloc_nocow+0x5bf/0x747 [btrfs]
      [17156.864052]  [<ffffffffa03789ff>] run_delalloc_range+0x95/0x353 [btrfs]
      [17156.864052]  [<ffffffffa038b7fe>] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
      [17156.864052]  [<ffffffffa038d75b>] __extent_writepage+0x129/0x1f7 [btrfs]
      [17156.864052]  [<ffffffffa038da5a>] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
      [17156.864052]  [<ffffffff810ad2af>] ? __module_text_address+0x12/0x59
      [17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [17156.864052]  [<ffffffffa038df76>] extent_writepages+0x4b/0x5c [btrfs]
      [17156.864052]  [<ffffffff81144431>] ? kmem_cache_free+0x9b/0xce
      [17156.864052]  [<ffffffffa0376a46>] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
      [17156.864052]  [<ffffffffa0389cd6>] ? free_extent_state+0x8c/0xc1 [btrfs]
      [17156.864052]  [<ffffffffa0374871>] btrfs_writepages+0x28/0x2a [btrfs]
      [17156.864052]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [17156.864052]  [<ffffffff81102f36>] __filemap_fdatawrite_range+0x5a/0x61
      [17156.864052]  [<ffffffff81102f6e>] filemap_fdatawrite_range+0x13/0x15
      [17156.864052]  [<ffffffffa0383ef7>] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
      [17156.864052]  [<ffffffffa03ab89e>] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
      [17156.864052]  [<ffffffffa03ac1ab>] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
      [17156.864052]  [<ffffffffa03ac1fd>] btrfs_write_out_cache+0x93/0xdc [btrfs]
      [17156.864052]  [<ffffffffa0363847>] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
      [17156.864052]  [<ffffffffa03638e6>] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
      [17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [17156.864052]  [<ffffffffa037209e>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [17156.864052]  [<ffffffffa034c748>] btrfs_sync_fs+0xe1/0x12d [btrfs]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      28aeeac1
    • F
      Btrfs: fix crash after inode cache writeback failure · e43699d4
      Filipe Manana 提交于
      If the writeback of an inode cache failed we were unnecessarilly
      attempting to release again the delalloc metadata that we previously
      reserved. However attempting to do this a second time triggers an
      assertion at drop_outstanding_extent() because we have no more
      outstanding extents for our inode cache's inode. If we were able
      to start writeback of the cache the reserved metadata space is
      released at btrfs_finished_ordered_io(), even if an error happens
      during writeback.
      
      So make sure we don't repeat the metadata space release if writeback
      started for our inode cache.
      
      This issue was trivial to reproduce by running the fstest btrfs/088
      with "-o inode_cache", which triggered the assertion leading to a
      BUG() call and requiring a reboot in order to run the remaining
      fstests. Trace produced by btrfs/088:
      
      [255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
      [255289.388094] ------------[ cut here ]------------
      [255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
      [255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [255289.392068] Call Trace:
      [255289.392068]  [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs]
      [255289.392068]  [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
      [255289.392068]  [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
      [255289.392068]  [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
      [255289.392068]  [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
      [255289.392068]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [255289.392068]  [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
      [255289.392068]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [255289.392068]  [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
      (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e43699d4
  8. 10 5月, 2015 1 次提交
  9. 09 5月, 2015 2 次提交
  10. 07 5月, 2015 1 次提交
  11. 06 5月, 2015 3 次提交
    • C
      splice: sendfile() at once fails for big files · 0ff28d9f
      Christophe Leroy 提交于
      Using sendfile with below small program to get MD5 sums of some files,
      it appear that big files (over 64kbytes with 4k pages system) get a
      wrong MD5 sum while small files get the correct sum.
      This program uses sendfile() to send a file to an AF_ALG socket
      for hashing.
      
      /* md5sum2.c */
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <string.h>
      #include <fcntl.h>
      #include <sys/socket.h>
      #include <sys/stat.h>
      #include <sys/types.h>
      #include <linux/if_alg.h>
      
      int main(int argc, char **argv)
      {
      	int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
      	struct stat st;
      	struct sockaddr_alg sa = {
      		.salg_family = AF_ALG,
      		.salg_type = "hash",
      		.salg_name = "md5",
      	};
      	int n;
      
      	bind(sk, (struct sockaddr*)&sa, sizeof(sa));
      
      	for (n = 1; n < argc; n++) {
      		int size;
      		int offset = 0;
      		char buf[4096];
      		int fd;
      		int sko;
      		int i;
      
      		fd = open(argv[n], O_RDONLY);
      		sko = accept(sk, NULL, 0);
      		fstat(fd, &st);
      		size = st.st_size;
      		sendfile(sko, fd, &offset, size);
      		size = read(sko, buf, sizeof(buf));
      		for (i = 0; i < size; i++)
      			printf("%2.2x", buf[i]);
      		printf("  %s\n", argv[n]);
      		close(fd);
      		close(sko);
      	}
      	exit(0);
      }
      
      Test below is done using official linux patch files. First result is
      with a software based md5sum. Second result is with the program above.
      
      root@vgoip:~# ls -l patch-3.6.*
      -rw-r--r--    1 root     root         64011 Aug 24 12:01 patch-3.6.2.gz
      -rw-r--r--    1 root     root         94131 Aug 24 12:01 patch-3.6.3.gz
      
      root@vgoip:~# md5sum patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz
      
      root@vgoip:~# ./md5sum2 patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      5fd77b24e68bb24dcc72d6e57c64790e  patch-3.6.3.gz
      
      After investivation, it appears that sendfile() sends the files by blocks
      of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each
      block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
      is reset as if it was the end of the file.
      
      This patch adds SPLICE_F_MORE to the flags when more data is pending.
      
      With the patch applied, we get the correct sums:
      
      root@vgoip:~# md5sum patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz
      
      root@vgoip:~# ./md5sum2 patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0ff28d9f
    • J
      ocfs2: dlm: fix race between purge and get lock resource · b1432a2a
      Junxiao Bi 提交于
      There is a race window in dlm_get_lock_resource(), which may return a
      lock resource which has been purged.  This will cause the process to
      hang forever in dlmlock() as the ast msg can't be handled due to its
      lock resource not existing.
      
          dlm_get_lock_resource {
              ...
              spin_lock(&dlm->spinlock);
              tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
              if (tmpres) {
                   spin_unlock(&dlm->spinlock);
                   >>>>>>>> race window, dlm_run_purge_list() may run and purge
                                    the lock resource
                   spin_lock(&tmpres->spinlock);
                   ...
                   spin_unlock(&tmpres->spinlock);
              }
          }
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1432a2a
    • R
      nilfs2: fix sanity check of btree level in nilfs_btree_root_broken() · d8fd150f
      Ryusuke Konishi 提交于
      The range check for b-tree level parameter in nilfs_btree_root_broken()
      is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even
      though the level is limited to values in the range of 0 to
      (NILFS_BTREE_LEVEL_MAX - 1).
      
      Since the level parameter is read from storage device and used to index
      nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it
      can cause memory overrun during btree operations if the boundary value
      is set to the level parameter on device.
      
      This fixes the broken sanity check and adds a comment to clarify that
      the upper bound NILFS_BTREE_LEVEL_MAX is exclusive.
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8fd150f