1. 17 Aug 2013 (1 commit)
    • jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara authored
      Commit 0713ed0c added a
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However, that function also gets called from the truncate path, where
      the inode need not have a jinode attached - that only happens in
      ext4_file_open(), and the file need not have been opened since mount.
      Calling jbd2_journal_file_inode() without a jinode attached results in
      an oops.
      
      We fix the problem by attaching jinode to inode also in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a361293f
  2. 16 Aug 2013 (1 commit)
    • Fix TLB gather virtual address range invalidation corner cases · 2b047252
      Linus Torvalds authored
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
        repository fails in 9/10 cases on git-fsck due to an SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
      
      Ben verified that this fixes his problem.
      Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
      Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b047252
  3. 14 Aug 2013 (7 commits)
    • fs/proc/task_mmu.c: fix buffer overflow in add_page_map() · 8c829622
      Yonghua Zheng authored
      Recently we met quite a lot of random kernel panic issues after enabling
      CONFIG_PROC_PAGE_MONITOR.  After debugging we found this has something
      to do with the following bug in pagemap:
      
      In struct pagemapread:
      
        struct pagemapread {
            int pos, len;
            pagemap_entry_t *buffer;
            bool v2;
        };
      
      pos is the number of PM_ENTRY_BYTES-sized entries in buffer, while len
      is the size of buffer in bytes, so it is a mistake to compare pos
      against len in add_page_map() to check whether the buffer is full; this
      can lead to a buffer overflow and random kernel panics.
      
      Correct len to be the total number of PM_ENTRY_BYTES-sized entries that
      fit in buffer.
      
      [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
      Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c829622
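      For context, /proc/<pid>/pagemap (the interface this code implements) exposes
      one 8-byte entry per virtual page, which is exactly the PM_ENTRY_BYTES unit the
      fix counts in.  A minimal user-space sketch that reads the entry for one address
      (illustrative only, not part of the patch; needs CONFIG_PROC_PAGE_MONITOR):

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <unistd.h>

        int main(void)
        {
                long pagesize = sysconf(_SC_PAGESIZE);
                char *buf = malloc(pagesize);   /* an address to look up */
                buf[0] = 1;                     /* make sure the page is populated */

                int fd = open("/proc/self/pagemap", O_RDONLY);
                if (fd < 0) { perror("open"); return 1; }

                /* one 8-byte entry per virtual page */
                off_t offset = ((uintptr_t)buf / pagesize) * 8;
                uint64_t entry = 0;
                if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
                        perror("pread");
                        return 1;
                }
                printf("pagemap entry for %p: 0x%016llx (present=%llu)\n",
                       (void *)buf, (unsigned long long)entry,
                       (unsigned long long)(entry >> 63));
                close(fd);
                return 0;
        }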
    • ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id() · d6394b59
      Jeff Liu authored
      Fix a NULL pointer dereference when removing an empty directory, which
      was introduced by commit 3704412b ("[readdir] convert ocfs2").
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<(null)>]           (null)
        PGD 6da85067 PUD 6da89067 PMD 0
        Oops: 0010 [#1] SMP
        CPU: 0 PID: 6564 Comm: rmdir Tainted: G           O 3.11.0-rc1 #4
        RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
        Call Trace:
          ocfs2_dir_foreach+0x49/0x50 [ocfs2]
          ocfs2_empty_dir+0x12c/0x3e0 [ocfs2]
          ocfs2_unlink+0x56e/0xc10 [ocfs2]
          vfs_rmdir+0xd5/0x140
          do_rmdir+0x1cb/0x1e0
          SyS_rmdir+0x16/0x20
          system_call_fastpath+0x16/0x1b
        Code:  Bad RIP value.
        RIP  [<          (null)>]           (null)
        RSP <ffff88006daddc10>
        CR2: 0000000000000000
      
      [dan.carpenter@oracle.com: fix pointer math]
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reported-by: David Weber <wb@munzinger.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6394b59
    • ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page · c7dd3392
      Tiger Yang authored
      Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL
      struct file pointer, it finally results in a NULL pointer dereference
      in ocfs2_duplicate_clusters_by_page.
      
      This patch replaces the file pointer with an inode pointer in
      cow_duplicate_clusters to fix this issue.
      
      [jeff.liu@oracle.com: rebased patch against linux-next tree]
      Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Tao Ma <tm@tao.ma>
      Tested-by: David Weber <wb@munzinger.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7dd3392
    • ocfs2: Revert 40bd62eb to avoid regression in extended allocation · 6115ea28
      Jie Liu authored
      Revert commit 40bd62eb ("fs/ocfs2/journal.h: add bits_wanted while
      calculating credits in ocfs2_calc_extend_credits").
      
      Unfortunately this change broke fallocate: it now fails even when there
      is sufficient disk space for the preallocation, which is a serious
      problem.
      
        # df -h
        /dev/sda8        22G  1.2G   21G   6% /ocfs2
        # fallocate -o 0 -l 200M /ocfs2/testfile
        fallocate: /ocfs2/test: fallocate failed: No space left on device
      
      and a kernel warning:
      
        CPU: 3 PID: 3656 Comm: fallocate Tainted: G        W  O 3.11.0-rc3 #2
        Call Trace:
          dump_stack+0x77/0x9e
          warn_slowpath_common+0xc4/0x110
          warn_slowpath_null+0x2a/0x40
          start_this_handle+0x6c/0x640 [jbd2]
          jbd2__journal_start+0x138/0x300 [jbd2]
          jbd2_journal_start+0x23/0x30 [jbd2]
          ocfs2_start_trans+0x166/0x300 [ocfs2]
          __ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2]
          ocfs2_allocate_unwritten_extents+0x3c9/0x520
          __ocfs2_change_file_space+0x5e0/0xa60 [ocfs2]
          ocfs2_fallocate+0xb1/0xe0 [ocfs2]
          do_fallocate+0x1cb/0x220
          SyS_fallocate+0x6f/0xb0
          system_call_fastpath+0x16/0x1b
        JBD2: fallocate wants too many credits (51216 > 4381)
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6115ea28
    • hugetlb: fix lockdep splat caused by pmd sharing · b610ded7
      Michal Hocko authored
      Dave has reported the following lockdep splat:
      
        =================================
        [ INFO: inconsistent lock state ]
        3.11.0-rc1+ #9 Not tainted
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
        {RECLAIM_FS-ON-W} state was registered at:
           mark_held_locks+0x81/0xe7
           lockdep_trace_alloc+0x5e/0xbc
           __alloc_pages_nodemask+0x8b/0x9b6
           __get_free_pages+0x20/0x31
           get_zeroed_page+0x12/0x14
           __pmd_alloc+0x1c/0x6b
           huge_pmd_share+0x265/0x283
           huge_pte_alloc+0x5d/0x71
           hugetlb_fault+0x7c/0x64a
           handle_mm_fault+0x255/0x299
           __do_page_fault+0x142/0x55c
           do_page_fault+0xd/0x16
           error_code+0x6c/0x74
        irq event stamp: 3136917
        hardirqs last  enabled at (3136917):  _raw_spin_unlock_irq+0x27/0x50
        hardirqs last disabled at (3136916):  _raw_spin_lock_irq+0x15/0x78
        softirqs last  enabled at (3136180):  __do_softirq+0x137/0x30f
        softirqs last disabled at (3136175):  irq_exit+0xa8/0xaa
        other info that might help us debug this:
         Possible unsafe locking scenario:
               CPU0
               ----
          lock(&mapping->i_mmap_mutex);
          <Interrupt>
            lock(&mapping->i_mmap_mutex);
      
        *** DEADLOCK ***
        no locks held by kswapd0/49.
      
        stack backtrace:
        CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
        Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
        Call Trace:
          dump_stack+0x4b/0x79
          print_usage_bug+0x1d9/0x1e3
          mark_lock+0x1e0/0x261
          __lock_acquire+0x623/0x17f2
          lock_acquire+0x7d/0x195
          mutex_lock_nested+0x6c/0x3a7
          page_referenced+0x87/0x5e3
          shrink_page_list+0x3d9/0x947
          shrink_inactive_list+0x155/0x4cb
          shrink_lruvec+0x300/0x5ce
          shrink_zone+0x53/0x14e
          kswapd+0x517/0xa75
          kthread+0xa8/0xaa
          ret_from_kernel_thread+0x1b/0x28
      
      which is a false positive caused by the hugetlb pmd sharing code, which
      allocates a new pmd from within mapping->i_mmap_mutex.  If this
      allocation causes reclaim then the lockdep detector complains that we
      might self-deadlock.
      
      This is not correct though, because hugetlb pages are not reclaimable so
      their mapping will never be touched from the reclaim path.
      
      The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
      special by assigning it a separate lockdep class, so it won't report
      possible deadlocks on unrelated mappings.
      
      [peterz@infradead.org: comment for annotation]
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b610ded7
    • mm: save soft-dirty bits on file pages · 41bb3476
      Cyrill Gorcunov authored
      Andy reported that if a file page gets reclaimed we lose the soft-dirty
      bit if it was set, so save the _PAGE_BIT_SOFT_DIRTY bit when the page
      address gets encoded into the pte entry.  Thus when a page fault happens
      on such a non-present pte we can restore it.
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41bb3476
    • mm: save soft-dirty bits on swapped pages · 179ef71c
      Cyrill Gorcunov authored
      Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit
      set gets swapped out, the bit is lost and is no longer available when
      the pte is read back.
      
      To resolve this we introduce a _PTE_SWP_SOFT_DIRTY bit which is saved in
      the pte entry for the page being swapped out.  When such a page is to be
      read back from the swap cache we check for the bit's presence and, if
      it's there, we clear it and restore the former _PAGE_SOFT_DIRTY bit.
      
      One of the problems was to find a place in the pte entry where we can
      save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap.  The
      _PAGE_PSE bit was chosen for that; it doesn't intersect with the swap
      entry format stored in the pte.
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      179ef71c
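      The soft-dirty bit these two patches preserve is visible from user space via
      /proc/<pid>/clear_refs and bit 55 of each /proc/<pid>/pagemap entry.  A rough
      sketch of that tracking loop (illustrative only; assumes CONFIG_MEM_SOFT_DIRTY
      and sufficient privileges, and omits error checking):

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        static uint64_t pagemap_entry(int fd, void *addr, long pagesize)
        {
                uint64_t e = 0;
                pread(fd, &e, sizeof(e), ((uintptr_t)addr / pagesize) * 8);
                return e;
        }

        int main(void)
        {
                long pagesize = sysconf(_SC_PAGESIZE);
                char *p = malloc(pagesize);
                p[0] = 1;                               /* populate the page */

                /* clear all soft-dirty bits for this process */
                int clear = open("/proc/self/clear_refs", O_WRONLY);
                write(clear, "4", 1);
                close(clear);

                int pm = open("/proc/self/pagemap", O_RDONLY);
                printf("before write: soft-dirty=%llu\n",
                       (unsigned long long)((pagemap_entry(pm, p, pagesize) >> 55) & 1));

                p[0] = 2;                               /* dirty the page again */
                printf("after write:  soft-dirty=%llu\n",
                       (unsigned long long)((pagemap_entry(pm, p, pagesize) >> 55) & 1));
                close(pm);
                return 0;
        }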
  4. 12 Aug 2013 (2 commits)
    • jbd2: Fix use after free after error in jbd2_journal_dirty_metadata() · 91aa11fa
      Jan Kara authored
      When jbd2_journal_dirty_metadata() returns an error,
      __ext4_handle_dirty_metadata() stops the handle.  However, callers of
      this function do not account for that fact and happily keep using the
      now-freed handle.  This use after free can result in various issues,
      but most likely we oops soon afterwards.
      
      The motivation for adding __ext4_journal_stop() to
      __ext4_handle_dirty_metadata() in commit 9ea7a0df seems to have been
      only to improve error reporting.  So replace __ext4_journal_stop() with
      ext4_journal_abort_handle(), which was there before that commit, and
      add WARN_ON_ONCE() to dump the stack and provide useful information.
      Reported-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org	# 3.2+
      91aa11fa
    • ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT · cde2d7a7
      Theodore Ts'o authored
      Previously we were swapping only some of the extent_status LRU
      fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl.  The
      much safer thing to do is to just completely flush the extent status
      tree when doing the swap.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      Cc: stable@vger.kernel.org
      cde2d7a7
  5. 10 Aug 2013 (11 commits)
    • btrfs: don't loop on large offsets in readdir · db62efbb
      Zach Brown authored
      When btrfs readdir() hits the last entry it sets the readdir offset to a
      huge value to stop buggy apps from breaking when the same name is
      returned by readdir() with concurrent rename()s.
      
      But unconditionally setting the offset to INT_MAX causes readdir() to
      loop returning any entries with offsets past INT_MAX.  It only takes a
      few hours of constant file creation and removal to create entries past
      INT_MAX.
      
      So let's set the huge offset to LLONG_MAX if the last entry has already
      overflowed 32bit loff_t.   Without large offsets behaviour is identical.
      With large offsets 64bit apps will work and 32bit apps will be no more
      broken than they currently are if they see large offsets.
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      db62efbb
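      A standalone sketch of the clamping rule described above (the helper name is
      hypothetical; this is not the btrfs code itself):

        #include <limits.h>
        #include <stdio.h>

        /* Pick the synthetic "end of directory" offset to report.  If real
         * entries have already passed INT_MAX, reporting INT_MAX again would
         * sort before them and readdir() would loop; report LLONG_MAX instead,
         * so 64-bit readers keep working. */
        static long long readdir_end_offset(long long last_real_offset)
        {
                return last_real_offset < INT_MAX ? INT_MAX : LLONG_MAX;
        }

        int main(void)
        {
                printf("%lld\n", readdir_end_offset(42LL));          /* INT_MAX */
                printf("%lld\n", readdir_end_offset(3000000000LL));  /* LLONG_MAX */
                return 0;
        }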
    • Btrfs: check to see if root_list is empty before adding it to dead roots · cfad392b
      Josef Bacik authored
      A user reported a panic when running with autodefrag and deleting
      snapshots.  This is because we could end up trying to add the root to
      the dead roots list twice.  To fix this, check whether our root_list is
      empty before adding ourselves to the dead roots list.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      cfad392b
    • Btrfs: release both paths before logging dir/changed extents · f3b15ccd
      Josef Bacik authored
      The ceph guys tripped over this bug where we were still holding onto the
      original path that we used to copy the inode with when logging.  This is based
      on Chris's fix which was reported to fix the problem.  We need to drop the paths
      in two cases anyway so just move the drop up so that we don't have duplicate
      code.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      f3b15ccd
    • Btrfs: allow splitting of hole em's when dropping extent cache · ee20a983
      Josef Bacik authored
      I noticed while running multi-threaded fsync tests that sometimes fsck
      would complain about an improper gap.  This happens because we fail to
      add a hole extent to the file, which was happening when we'd split a
      hole EM because btrfs_drop_extent_cache was just discarding the whole em
      instead of splitting it.  So this patch fixes this by allowing us to
      split a hole em properly, which means that added holes actually get
      logged properly and we no longer see this fsck error.  Thankfully we're
      tolerant of this sort of problem, so a user would not see any adverse
      effects of this bug other than fsck complaining.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      ee20a983
    • Btrfs: make sure the backref walker catches all refs to our extent · ed8c4913
      Josef Bacik authored
      Because we don't mess with the offset into the extent for compressed
      extents, we will properly find both extents for this case
      
      [extent a][extent b][rest of extent a]
      
      but because we already added a ref for the front half we won't add the inode
      information for the second half.  This causes us to leak that memory and not
      print out the other offset when we do logical-resolve.  So fix this by calling
      ulist_add_merge and then add our eie to the existing entry if there is one.
      With this patch we get both offsets out of logical-resolve.  With this and the
      other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
      set.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      ed8c4913
    • Btrfs: fix backref walking when we hit a compressed extent · 8ca15e05
      Josef Bacik authored
      If you do btrfs inspect-internal logical-resolve on a compressed extent that has
      been partly overwritten it won't find anything.  This is because we try and
      match the extent offset we've searched for based on the extent offset in the
      data extent entry.  However this doesn't work for compressed extents because the
      offsets are for the uncompressed size, not the compressed size.  So instead only
      do this check if we are not compressed, that way we can get an actual entry for
      the physical offset rather than nothing for compressed.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      8ca15e05
    • Btrfs: do not offset physical if we're compressed · b76bb701
      Josef Bacik authored
      xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
      offsetting the physical based on the extent offset.  This is perfectly fine with
      uncompressed extents, however the extent offset is into the uncompressed area,
      not the compressed.  So we can return a physical value that isn't at all within
      the area we have allocated on disk.  Fix this by returning the start of the
      extent if it is compressed no matter what the offset.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b76bb701
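      The physical offsets in question are the ones FIEMAP reports to user space;
      a compressed extent shows up there with the FIEMAP_EXTENT_ENCODED flag.  A
      minimal sketch that dumps a file's extents (illustrative, not the patched
      kernel code):

        #include <fcntl.h>
        #include <linux/fiemap.h>
        #include <linux/fs.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
                int fd = open(argv[1], O_RDONLY);
                if (fd < 0) { perror("open"); return 1; }

                /* room for up to 32 extents in one call */
                size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
                struct fiemap *fm = calloc(1, sz);
                fm->fm_start = 0;
                fm->fm_length = FIEMAP_MAX_OFFSET;
                fm->fm_extent_count = 32;

                if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

                for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                        struct fiemap_extent *fe = &fm->fm_extents[i];
                        printf("logical %llu physical %llu len %llu%s\n",
                               (unsigned long long)fe->fe_logical,
                               (unsigned long long)fe->fe_physical,
                               (unsigned long long)fe->fe_length,
                               (fe->fe_flags & FIEMAP_EXTENT_ENCODED) ?
                                       " (encoded/compressed)" : "");
                }
                free(fm);
                close(fd);
                return 0;
        }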
    • Btrfs: fix extent buffer leak after backref walking · b5b9b5b3
      Liu Bo authored
      Commit 47fb091f ("Btrfs: fix unlock after free on rewinded tree blocks")
      takes an extra increment on the reference of the allocated dummy extent
      buffer, so now we cannot free this dummy one, and we end up with an
      extent buffer leak.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b5b9b5b3
    • Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents · e68afa49
      Liu Bo authored
      For partial extents, snapshot-aware defrag does not work as expected,
      since
      a) we use the wrong logical offset to search for parents, which should be
         disk_bytenr + extent_offset, not just disk_bytenr,
      b) 'offset' returned by the backref walking just refers to key.offset, not
         the 'offset' stored in btrfs_extent_data_ref which is
         (key.offset - extent_offset).
      
      The reproducer:
      $ mkfs.btrfs sda
      $ mount sda /mnt
      $ btrfs sub create /mnt/sub
      $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
      $ btrfs sub snap /mnt/sub /mnt/snap1
      $ btrfs sub snap /mnt/sub /mnt/snap2
      $ sync; btrfs filesystem defrag /mnt/sub/foo;
      $ umount /mnt
      $ btrfs-debug-tree sda   (here we can check whether the defrag operation is snapshot-aware)
      
      This addresses the above two problems.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      e68afa49
    • btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified · 7cddc193
      Jie Liu authored
      Create a small file and fallocate it to a big size with the
      FALLOC_FL_KEEP_SIZE option, then truncate it back to the small size
      again; the disk free space is not given back in this case.  i.e,
      
      total 4
      -rw-r--r-- 1 root root 512 Jun 28 11:35 test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      
      -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      With this fix, the fallocated space is given back after the truncate:
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      7cddc193
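      The behaviour being fixed is easy to observe from user space: preallocate
      with FALLOC_FL_KEEP_SIZE, truncate back down, and compare the block counts.
      A small sketch (illustrative; on an affected kernel st_blocks stays large
      after the truncate):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
                struct stat st;
                int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
                if (fd < 0) { perror("open"); return 1; }

                write(fd, "hello", 5);                          /* small file */

                /* preallocate 200 MiB without changing i_size */
                if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 200 << 20) < 0)
                        perror("fallocate");
                fstat(fd, &st);
                printf("after fallocate: size=%lld blocks=%lld\n",
                       (long long)st.st_size, (long long)st.st_blocks);

                ftruncate(fd, 5);                               /* truncate back down */
                fstat(fd, &st);
                printf("after truncate:  size=%lld blocks=%lld\n",
                       (long long)st.st_size, (long long)st.st_blocks);

                close(fd);
                unlink("testfile");
                return 0;
        }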
    • dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() · 201d3dfa
      Oleg Nesterov authored
      device_close()->recalc_sigpending() is not needed, sigprocmask() takes
      care of TIF_SIGPENDING correctly.
      
      And without ->siglock it is racy and wrong, it can wrongly clear
      TIF_SIGPENDING and miss a signal.
      
      But even with this patch device_close() is still buggy:
      
        1. sigprocmask() should not be used, we have set_task_blocked(),
           but this is minor.
      
        2. We should never block SIGKILL or SIGSTOP, and this is what
           the code tries to do.
      
        3. This can't protect against SIGKILL or SIGSTOP anyway. Another
           thread can do signal_wake_up(), say, do_signal_stop() or
           complete_signal() or debugger.
      
        4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clear
           TIF_SIGPENDING, say, freezing() or ->jobctl.
      
        5. device_write() looks equally wrong for the same reason.
      
      Looks like, this tries to protect some wait_event_interruptible() logic
      from signals, it should be turned into uninterruptible wait.  Or we need
      to implement something like signals_stop/start for such a use-case.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      201d3dfa
  6. 09 Aug 2013 (2 commits)
  7. 08 Aug 2013 (6 commits)
  8. 07 Aug 2013 (1 commit)
  9. 06 Aug 2013 (1 commit)
    • LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs · 9a1b6bf8
      Trond Myklebust authored
      Firstly, nlmclnt_setlockargs can be called from a reclaimer thread, in
      which case we're in entirely the wrong namespace.
      
      Secondly, commit 8aac6270 (move
      exit_task_namespaces() outside of exit_notify()) now means that
      exit_task_work() is called after exit_task_namespaces(), which
      triggers an Oops when we're freeing up the locks.
      
      Fix this by ensuring that we initialise the nlm_host's rpc_client at mount
      time, so that the cl_nodename field is initialised to the value of
      utsname()->nodename that the net namespace uses. Then replace the
      lockd callers of utsname()->nodename.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Toralf Förster <toralf.foerster@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: stable@vger.kernel.org # 3.10.x
      9a1b6bf8
  10. 05 Aug 2013 (4 commits)
    • vfs: add missing check for __O_TMPFILE in fcntl_init() · 3d62c45b
      Zheng Liu authored
      As the comment in include/uapi/asm-generic/fcntl.h describes, when
      introducing new O_* bits we need to check their uniqueness in
      fcntl_init().  But the __O_TMPFILE bit is missing, so fix it.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3d62c45b
    • fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink · bb2314b4
      Andy Lutomirski authored
      Every now and then someone proposes a new flink syscall, and this spawns
      a long discussion of whether it would be a security problem.  I think
      that this is missing the point: flink is *already* allowed without
      privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.
      
      Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
      write it, and link it in is very convenient.  The only problem is that
      it requires that /proc be mounted so that you can do:
      
      linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_FOLLOW)
      
      This sucks -- it's much nicer to do:
      
      linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)
      
      Let's allow it.
      
      If this turns out to be excessively scary, we could instead require
      that the inode in question be I_LINKABLE, but this seems pointless given
      the /proc situation.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bb2314b4
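      For reference, the /proc-based route the message describes (the one that
      already works without this patch) looks roughly like this; a hedged sketch
      assuming an O_TMPFILE-capable filesystem and a mounted /proc:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                /* unnamed file on the current directory's filesystem */
                int fd = open(".", O_TMPFILE | O_RDWR, 0644);
                if (fd < 0) { perror("open(O_TMPFILE)"); return 1; }

                write(fd, "temp data\n", 10);

                /* link it into the namespace via the magic /proc symlink */
                char proc_path[64];
                snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
                if (linkat(AT_FDCWD, proc_path, AT_FDCWD, "now-visible.txt",
                           AT_SYMLINK_FOLLOW) < 0) {
                        perror("linkat");
                        return 1;
                }
                close(fd);
                return 0;
        }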
    • fs: Fix file mode for O_TMPFILE · e305f48b
      Andy Lutomirski authored
      O_TMPFILE, like O_CREAT, should respect the requested mode and should
      create regular files.
      
      This fixes two bugs: O_TMPFILE required privilege (because the mode
      ended up as 000) and it produced bogus inodes with no type.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e305f48b
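      Both symptoms are easy to check from user space: with the fix applied the
      descriptor refers to a regular file and carries the requested mode.  A small
      sketch (assumes a filesystem with O_TMPFILE support; the mode is still masked
      by umask, as with O_CREAT):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/tmp", O_TMPFILE | O_WRONLY, 0600);
                if (fd < 0) { perror("open(O_TMPFILE)"); return 1; }

                struct stat st;
                fstat(fd, &st);
                printf("regular file: %s, mode: %04o\n",
                       S_ISREG(st.st_mode) ? "yes" : "no",
                       st.st_mode & 07777);
                close(fd);
                return 0;
        }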
    • reiserfs: fix deadlock in umount · 672fe15d
      Al Viro authored
      Since remove_proc_entry() started to wait for IO in progress (i.e.
      since 2007 or so), the locking in fs/reiserfs/proc.c became wrong;
      if procfs read happens between the moment when umount() locks the
      victim superblock and removal of /proc/fs/reiserfs/<device>/*,
      we'll get a deadlock - read will wait for s_umount (in sget(),
      called by r_start()), while umount will wait in remove_proc_entry()
      for that read to finish, holding s_umount all along.
      
      Fortunately, the same change allows a much simpler race avoidance -
      all we need to do is remove the procfs entries at the very beginning
      of reiserfs ->kill_sb(); that'll guarantee that the pointer to the
      superblock will remain valid for the duration of procfs IO, so we don't
      need sget() to keep the sucker alive.  As a matter of fact, we can
      get rid of the home-grown iterator completely, and use single_open()
      instead.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      672fe15d
  11. 01 Aug 2013 (4 commits)
    • ocfs2/refcounttree: add the missing NULL check of the return value of find_or_create_page() · 62c61046
      Gu Zheng authored
      Add the missing NULL check of the return value of find_or_create_page() in
      function ocfs2_duplicate_clusters_by_page().
      
      [akpm@linux-foundation.org: fix layout, per Joel]
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62c61046
    • cifs: set sb->s_d_op before calling d_make_root() · 66ffd113
      Jeff Layton authored
      Currently, the s_root dentry doesn't get its d_op pointer set to
      anything. This breaks lookups in the root of case-insensitive mounts
      since that relies on having d_hash and d_compare routines that know to
      treat the filename as case-insensitive.
      
      cifs.ko has been broken this way for a long time, but commit 1c929cfe
      ("switch cifs") added a cryptic comment (removed in the patch below)
      that makes me wonder if this was done deliberately for some reason.
      It's not clear to me why we'd want the s_root not to have d_op set
      properly.
      
      It may have something to do with d_automount or d_revalidate on the
      root, but my suspicion in looking over the code is that Al was just
      trying to preserve the existing behavior when changing this code over to
      use s_d_op.
      
      This patch changes it so that we set s_d_op before calling d_make_root
      and removes the comment. I tested mounting, accessing and unmounting
      several types of shares (including DFS referrals) and everything still
      seemed to work OK afterward. I could be missing something however, so
      please do let me know if I am.
      Reported-by: Jan-Marek Glogowski <glogow@fbihome.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ian Kent <raven@themaw.net>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      66ffd113
    • cifs: fix bad error handling in crypto code · ba482029
      Jeff Layton authored
      Jarod reported an Oops like the following when testing with fips=1:
      
      CIFS VFS: could not allocate crypto hmacmd5
      CIFS VFS: could not crypto alloc hmacmd5 rc -2
      CIFS VFS: Error -2 during NTLMSSP authentication
      CIFS VFS: Send error in SessSetup = -2
      BUG: unable to handle kernel NULL pointer dereference at 000000000000004e
      IP: [<ffffffff812b5c7a>] crypto_destroy_tfm+0x1a/0x90
      PGD 0
      Oops: 0000 [#1] SMP
      Modules linked in: md4 nls_utf8 cifs dns_resolver fscache kvm serio_raw virtio_balloon virtio_net mperf i2c_piix4 cirrus drm_kms_helper ttm drm i2c_core virtio_blk ata_generic pata_acpi
      CPU: 1 PID: 639 Comm: mount.cifs Not tainted 3.11.0-0.rc3.git0.1.fc20.x86_64 #1
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      task: ffff88007bf496e0 ti: ffff88007b080000 task.ti: ffff88007b080000
      RIP: 0010:[<ffffffff812b5c7a>]  [<ffffffff812b5c7a>] crypto_destroy_tfm+0x1a/0x90
      RSP: 0018:ffff88007b081d10  EFLAGS: 00010282
      RAX: 0000000000001f1f RBX: ffff880037422000 RCX: ffff88007b081fd8
      RDX: 000000000000001f RSI: 0000000000000006 RDI: fffffffffffffffe
      RBP: ffff88007b081d30 R08: ffff880037422000 R09: ffff88007c090100
      R10: 0000000000000000 R11: 00000000fffffffe R12: fffffffffffffffe
      R13: ffff880037422000 R14: ffff880037422000 R15: 00000000fffffffe
      FS:  00007fc322f4f780(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 000000000000004e CR3: 000000007bdaa000 CR4: 00000000000006e0
      Stack:
       ffffffff81085845 ffff880037422000 ffff8800375e7400 ffff880037422000
       ffff88007b081d48 ffffffffa0176022 ffff880037422000 ffff88007b081d60
       ffffffffa015c07b ffff880037600600 ffff88007b081dc8 ffffffffa01610e1
      Call Trace:
       [<ffffffff81085845>] ? __cancel_work_timer+0x75/0xf0
       [<ffffffffa0176022>] cifs_crypto_shash_release+0x82/0xf0 [cifs]
       [<ffffffffa015c07b>] cifs_put_tcp_session+0x8b/0xe0 [cifs]
       [<ffffffffa01610e1>] cifs_mount+0x9d1/0xad0 [cifs]
       [<ffffffffa014ff50>] cifs_do_mount+0xa0/0x4d0 [cifs]
       [<ffffffff811ab6e9>] mount_fs+0x39/0x1b0
       [<ffffffff811c466f>] vfs_kern_mount+0x5f/0xf0
       [<ffffffff811c6a9e>] do_mount+0x23e/0xa20
       [<ffffffff811c66e6>] ? copy_mount_options+0x36/0x170
       [<ffffffff811c7303>] SyS_mount+0x83/0xc0
       [<ffffffff8165c8d9>] system_call_fastpath+0x16/0x1b
      Code: eb 9e 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 49 89 fc 53 48 83 ec 08 48 85 ff 74 46 <48> 83 7e 48 00 48 8b 5e 50 74 4b 48 89 f7 e8 83 fc ff ff 4c 8b
      RIP  [<ffffffff812b5c7a>] crypto_destroy_tfm+0x1a/0x90
       RSP <ffff88007b081d10>
      CR2: 000000000000004e
      
      The cifs code allocates some crypto structures. If that fails, it
      returns an error, but it leaves the pointers set to their PTR_ERR
      values. Then later when it tries to clean up, it sees that those values
      are non-NULL and then passes them to the routine that frees them.
      
      Fix this by setting the pointers to NULL after collecting the error code
      in this situation.
      
      Cc: Sachin Prabhu <sprabhu@redhat.com>
      Reported-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      ba482029
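      The underlying pattern is generic kernel error handling: crypto_alloc_shash()
      returns an ERR_PTR()-encoded value on failure, and if that value is left in
      the pointer a later cleanup path mistakes it for a live object.  A small
      user-space model of the bug and the fix (the IS_ERR()/ERR_PTR() helpers and
      the alloc/destroy functions are re-implemented here purely for illustration;
      this is not the cifs code):

        #include <stdio.h>
        #include <stdlib.h>

        /* toy versions of the kernel's ERR_PTR()/IS_ERR() helpers */
        #define MAX_ERRNO 4095
        static void *ERR_PTR(long err)  { return (void *)err; }
        static long  PTR_ERR(void *ptr) { return (long)ptr; }
        static int   IS_ERR(void *ptr)  { return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO; }

        static void *alloc_tfm(int fail)    /* stand-in for crypto_alloc_shash() */
        {
                return fail ? ERR_PTR(-2) : malloc(16);
        }

        static void destroy_tfm(void *tfm)  /* stand-in for crypto_free_shash() */
        {
                free(tfm);  /* would crash or corrupt if handed an ERR_PTR value */
        }

        int main(void)
        {
                void *tfm = alloc_tfm(1);
                int rc = 0;

                if (IS_ERR(tfm)) {
                        rc = PTR_ERR(tfm);
                        tfm = NULL;     /* the fix: don't leave the error value behind */
                }

                /* shared cleanup path, as in cifs_crypto_shash_release() */
                if (tfm)
                        destroy_tfm(tfm);

                printf("rc = %d\n", rc);
                return 0;
        }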
    • debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs) · 776164c1
      Oleg Nesterov authored
      debugfs_remove_recursive() is wrong,
      
      1. it wrongly assumes that !list_empty(d_subdirs) means that this
         dir should be removed.
      
         This is not that bad by itself, but:
      
      2. if d_subdirs does not become empty after __debugfs_remove()
         it gives up and silently fails; it doesn't even try to remove
         other entries.
      
         However ->d_subdirs can be non-empty because it still has the
         already deleted !debugfs_positive() entries.
      
      3. simple_release_fs() is called even if __debugfs_remove() fails.
      
      Suppose we have
      
      	dir1/
      		dir2/
      			file2
      		file1
      
      and someone opens dir1/dir2/file2.
      
      Now, debugfs_remove_recursive(dir1/dir2) succeeds, and dir1/dir2 goes
      away.
      
      But debugfs_remove_recursive(dir1) silently fails and doesn't remove
      this directory. Because it tries to delete (the already deleted)
      dir1/dir2/file2 again and then fails due to "Avoid infinite loop"
      logic.
      
      Test-case:
      
      	#!/bin/sh
      
      	cd /sys/kernel/debug/tracing
      	echo 'p:probe/sigprocmask sigprocmask' >> kprobe_events
      	sleep 1000 < events/probe/sigprocmask/id &
      	echo -n >| kprobe_events
      
      	[ -d events/probe ] && echo "ERR!! failed to rm probe"
      
      And after that it is not possible to create another probe entry.
      
      With this patch debugfs_remove_recursive() skips !debugfs_positive()
      files although this is not strictly needed. The most important change
      is that it does not try to make ->d_subdirs empty, it simply scans
      the whole list(s) recursively and removes as much as possible.
      
      Link: http://lkml.kernel.org/r/20130726151256.GC19472@redhat.com
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      776164c1