1. 20 6月, 2017 27 次提交
  2. 10 6月, 2017 3 次提交
    • O
      Btrfs: fix delalloc accounting leak caused by u32 overflow · 70e7af24
      Omar Sandoval 提交于
      btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
      which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
      For a nodesize of 16kB, this overflow happens at 16k items. Usually,
      num_items is a small constant passed to btrfs_start_transaction(), but
      we also use btrfs_calc_trans_metadata_size() for metadata reservations
      for extent items in btrfs_delalloc_{reserve,release}_metadata().
      
      In drop_outstanding_extents(), num_items is calculated as
      inode->reserved_extents - inode->outstanding_extents. The difference
      between these two counters is usually small, but if many delalloc
      extents are reserved and then the outstanding extents are merged in
      btrfs_merge_extent_hook(), the difference can become large enough to
      overflow in btrfs_calc_trans_metadata_size().
      
      The overflow manifests itself as a leak of a multiple of 4 GB in
      delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
      can cause early ENOSPC errors. Additionally, these WARN_ONs in
      extent-tree.c will be hit when unmounting:
      
          WARN_ON(fs_info->delalloc_block_rsv.size > 0);
          WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
          WARN_ON(space_info->bytes_pinned > 0 ||
                  space_info->bytes_reserved > 0 ||
                  space_info->bytes_may_use > 0);
      
      Fix it by casting nodesize to a u64 so that
      btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
      While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
      can't overflow with any existing uses, but it's better to be safe here
      than have another hard-to-debug problem later on.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      70e7af24
    • L
      Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io · 452e62b7
      Liu Bo 提交于
      Before this, we use 'filled' mode here, ie. if all range has been
      filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag
      range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG
      bits in extent_state until releasing this inode's pages, and that
      prevents extent_data from being freed.
      
      This clears the bit if any was found within the ordered extent.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      452e62b7
    • S
      btrfs: tree-log.c: Wrong printk information about namelen · 286b92f4
      Su Yue 提交于
      In verify_dir_item, it wants to printk name_len of dir_item but
      printk data_len acutally.
      
      Fix it by calling btrfs_dir_name_len instead of btrfs_dir_data_len.
      Signed-off-by: NSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      286b92f4
  3. 08 6月, 2017 1 次提交
    • D
      crypto: Work around deallocated stack frame reference gcc bug on sparc. · d41519a6
      David Miller 提交于
      On sparc, if we have an alloca() like situation, as is the case with
      SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
      memory.  The result can be that the value is clobbered if a trap
      or interrupt arrives at just the right instruction.
      
      It only occurs if the function ends returning a value from that
      alloca() area and that value can be placed into the return value
      register using a single instruction.
      
      For example, in lib/libcrc32c.c:crc32c() we end up with a return
      sequence like:
      
              return  %i7+8
               lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],
      
      %o5 holds the base of the on-stack area allocated for the shash
      descriptor.  But the return released the stack frame and the
      register window.
      
      So if an intererupt arrives between 'return' and 'lduw', then
      the value read at %o5+16 can be corrupted.
      
      Add a data compiler barrier to work around this problem.  This is
      exactly what the gcc fix will end up doing as well, and it absolutely
      should not change the code generated for other cpus (unless gcc
      on them has the same bug :-)
      
      With crucial insight from Eric Sandeen.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: NAnatoly Pugachev <matorola@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      d41519a6
  4. 01 6月, 2017 3 次提交
    • J
      btrfs: fix race with relocation recovery and fs_root setup · a9b3311e
      Jeff Mahoney 提交于
      If we have to recover relocation during mount, we'll ultimately have to
      evict the orphan inode.  That goes through the reservation dance, where
      priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
      to be valid.  That's the next thing to be set up during mount, so we
      crash, almost always in flush_space trying to join the transaction
      but priority_reclaim_metadata_space is possible as well.  This call
      path has been problematic in the past WRT whether ->fs_root is valid
      yet.  Commit 957780eb (Btrfs: introduce ticketed enospc
      infrastructure) added new users that are called in the direct path
      instead of the async path that had already been worked around.
      
      The thing is that we don't actually need the fs_root, specifically, for
      anything.  We either use it to determine whether the root is the
      chunk_root for use in choosing an allocation profile or as a root to pass
      btrfs_join_transaction before immediately committing it.  Anything that
      isn't the chunk root works in the former case and any root works in
      the latter.
      
      A simple fix is to use a root we know will always be there: the
      extent_root.
      
      Cc: <stable@vger.kernel.org> # v4.8+
      Fixes: 957780eb (Btrfs: introduce ticketed enospc infrastructure)
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a9b3311e
    • J
      btrfs: fix memory leak in update_space_info failure path · 896533a7
      Jeff Mahoney 提交于
      If we fail to add the space_info kobject, we'll leak the memory
      for the percpu counter.
      
      Fixes: 6ab0a202 (btrfs: publish allocation data in sysfs)
      Cc: <stable@vger.kernel.org> # v3.14+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      896533a7
    • D
      btrfs: use correct types for page indices in btrfs_page_exists_in_range · cc2b702c
      David Sterba 提交于
      Variables start_idx and end_idx are supposed to hold a page index
      derived from the file offsets. The int type is not the right one though,
      offsets larger than 1 << 44 will get silently trimmed off the high bits.
      (1 << 44 is 16TiB)
      
      What can go wrong, if start is below the boundary and end gets trimmed:
      - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
      - the final check "if (page->index <= end_idx)" will unexpectedly fail
      
      The function will return false, ie. "there's no page in the range",
      although there is at least one.
      
      btrfs_page_exists_in_range is used to prevent races in:
      
      * in hole punching, where we make sure there are not pages in the
        truncated range, otherwise we'll wait for them to finish and redo
        truncation, but we're going to replace the pages with holes anyway so
        the only problem is the intermediate state
      
      * lock_extent_direct: we want to make sure there are no pages before we
        lock and start DIO, to prevent stale data reads
      
      For practical occurence of the bug, there are several constaints.  The
      file must be quite large, the affected range must cross the 16TiB
      boundary and the internal state of the file pages and pending operations
      must match.  Also, we must not have started any ordered data in the
      range, otherwise we don't even reach the buggy function check.
      
      DIO locking tries hard in several places to avoid deadlocks with
      buffered IO and avoids waiting for ranges. The worst consequence seems
      to be stale data read.
      
      CC: Liu Bo <bo.li.liu@oracle.com>
      CC: stable@vger.kernel.org	# 3.16+
      Fixes: fc4adbff ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cc2b702c
  5. 16 5月, 2017 3 次提交
    • C
      btrfs: fix incorrect error return ret being passed to mapping_set_error · bff5baf8
      Colin Ian King 提交于
      The setting of return code ret should be based on the error code
      passed into function end_extent_writepage and not on ret. Thanks
      to Liu Bo for spotting this mistake in the original fix I submitted.
      
      Detected by CoverityScan, CID#1414312 ("Logically dead code")
      
      Fixes: 5dca6eea ("Btrfs: mark mapping with error flag to report errors to userspace")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bff5baf8
    • J
      btrfs: Make flush bios explicitely sync · 8d910125
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      CC: David Sterba <dsterba@suse.com>
      CC: linux-btrfs@vger.kernel.org
      Fixes: b685d3d6Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8d910125
    • Q
      btrfs: fiemap: Cache and merge fiemap extent before submit it to user · 4751832d
      Qu Wenruo 提交于
      [BUG]
      Cycle mount btrfs can cause fiemap to return different result.
      Like:
       # mount /dev/vdb5 /mnt/btrfs
       # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        25088..25215       128   0x1
       # umount /mnt/btrfs
       # mount /dev/vdb5 /mnt/btrfs
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..31]:         25088..25119        32   0x0
         1: [32..63]:        25120..25151        32   0x0
         2: [64..95]:        25152..25183        32   0x0
         3: [96..127]:       25184..25215        32   0x1
      But after above fiemap, we get correct merged result if we call fiemap
      again.
       # xfs_io -c "fiemap -v" /mnt/btrfs/file
       /mnt/test/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        25088..25215       128   0x1
      
      [REASON]
      Btrfs will try to merge extent map when inserting new extent map.
      
      btrfs_fiemap(start=0 len=(u64)-1)
      |- extent_fiemap(start=0 len=(u64)-1)
         |- get_extent_skip_holes(start=0 len=64k)
         |  |- btrfs_get_extent_fiemap(start=0 len=64k)
         |     |- btrfs_get_extent(start=0 len=64k)
         |        |  Found on-disk (ino, EXTENT_DATA, 0)
         |        |- add_extent_mapping()
         |        |- Return (em->start=0, len=16k)
         |
         |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
         |
         |- get_extent_skip_holes(start=0 len=64k)
         |  |- btrfs_get_extent_fiemap(start=0 len=64k)
         |     |- btrfs_get_extent(start=16k len=48k)
         |        |  Found on-disk (ino, EXTENT_DATA, 16k)
         |        |- add_extent_mapping()
         |        |  |- try_merge_map()
         |        |     Merge with previous em start=0 len=16k
         |        |     resulting em start=0 len=32k
         |        |- Return (em->start=0, len=32K)    << Merged result
         |- Stripe off the unrelated range (0~16K) of return em
         |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
            ^^^ Causing split fiemap extent.
      
      And since in add_extent_mapping(), em is already merged, in next
      fiemap() call, we will get merged result.
      
      [FIX]
      Here we introduce a new structure, fiemap_cache, which records previous
      fiemap extent.
      
      And will always try to merge current fiemap_cache result before calling
      fiemap_fill_next_extent().
      Only when we failed to merge current fiemap extent with cached one, we
      will call fiemap_fill_next_extent() to submit cached one.
      
      So by this method, we can merge all fiemap extents.
      
      It can also be done in fs/ioctl.c, however the problem is if
      fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
      extent.
      So I choose to merge it in btrfs.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4751832d
  6. 09 5月, 2017 2 次提交
    • M
      mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Michal Hocko 提交于
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really want
      to give __GFP_HIGHMEM along with other flags because there is really no
      reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
      which are mapped to the kernel vmalloc space.  About half of users don't
      use this flag, though.  This signals that we make the API unnecessarily
      too complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Cristopher Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19809c2d
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  7. 05 5月, 2017 1 次提交