1. 20 May, 2016 6 commits
    • dax: Use radix tree entry lock to protect cow faults · bc2466e4
      Jan Kara committed
      When doing cow faults, we cannot directly fill in the PTE as we do for
      other faults, because we rely on generic code to do proper accounting of
      the cowed page. We also have no page to lock to protect against races
      with truncate, as other faults have, and we need the protection to extend
      until the moment generic code inserts the cowed page into the PTE; at that
      point we no longer have the protection of the fs-specific i_mmap_sem. So
      far we relied on i_mmap_lock for this protection, but that is completely
      special to cow faults. To make fault locking more uniform, use the DAX
      entry lock instead.
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    • dax: New fault locking · ac401cc7
      Jan Kara committed
      Currently DAX page fault locking is racy.
      
      CPU0 (write fault)		CPU1 (read fault)
      
      __dax_fault()			__dax_fault()
        get_block(inode, block, &bh, 0) -> not mapped
      				  get_block(inode, block, &bh, 0)
      				    -> not mapped
        if (!buffer_mapped(&bh))
          if (vmf->flags & FAULT_FLAG_WRITE)
            get_block(inode, block, &bh, 1) -> allocates blocks
        if (page) -> no
      				  if (!buffer_mapped(&bh))
      				    if (vmf->flags & FAULT_FLAG_WRITE) {
      				    } else {
      				      dax_load_hole();
      				    }
        dax_insert_mapping()
      
      And we are in a situation where we fail in dax_radix_entry() with -EIO.
      
      Another problem with the current DAX page fault locking is that there is
      no race-free way to clear dirty tag in the radix tree. We can always
      end up with clean radix tree and dirty data in CPU cache.
      
      We fix the first problem by introducing locking of exceptional radix
      tree entries in DAX mappings acting very similarly to page lock and thus
      synchronizing properly faults against the same mapping index. The same
      lock can later be used to avoid races when clearing radix tree dirty
      tag.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
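      The idea in ac401cc7 of a per-entry lock that behaves much like a page
      lock can be sketched in userspace as follows. This is only an
      illustration: the SK_ENTRY_LOCK value and the sk_* helpers are
      hypothetical, and C11 atomics are swapped in for whatever locking and
      waiting the kernel actually uses around its radix tree.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock bit inside an entry word; not the kernel's definition. */
#define SK_ENTRY_LOCK 0x4UL

/* Try to take the per-entry lock. Returns 1 on success, 0 if already held. */
int sk_entry_trylock(atomic_ulong *slot)
{
    unsigned long old = atomic_fetch_or(slot, SK_ENTRY_LOCK);

    return (old & SK_ENTRY_LOCK) == 0;
}

/* Release the lock; the caller must currently hold it. */
void sk_entry_unlock(atomic_ulong *slot)
{
    unsigned long old = atomic_fetch_and(slot, ~SK_ENTRY_LOCK);

    assert(old & SK_ENTRY_LOCK);
    (void)old;
}
```

      Two faults on the same mapping index would contend on the same slot, so
      the second one cannot proceed until the first unlocks. A real
      implementation would sleep instead of failing the trylock.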
    • dax: Allow DAX code to replace exceptional entries · 4f622938
      Jan Kara committed
      Currently we forbid page_cache_tree_insert() to replace exceptional radix
      tree entries for DAX inodes. However to make DAX faults race free we will
      lock radix tree entries and when hole is created, we need to replace
      such locked radix tree entry with a hole page. So modify
      page_cache_tree_insert() to allow that.
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    • dax: Define DAX lock bit for radix tree exceptional entry · e804315d
      Jan Kara committed
      We will use lowest available bit in the radix tree exceptional entry for
      locking of the entry. Define it. Also clean up definitions of DAX entry
      type bits in DAX exceptional entries to use defined constants instead of
      hardcoding numbers and cleanup checking of these bits to not rely on how
      other bits in the entry are set.
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
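      A minimal sketch of what e804315d describes: named constants for the
      lock and type bits, and type checks that mask out everything else. The
      bit layout and all SK_* / sk_* names below are assumptions made for
      illustration; they are not copied from the kernel headers.

```c
#include <assert.h>

/* Hypothetical layout: bit 1 marks an exceptional entry, bit 2 is the
 * lock, bits 3-4 encode the entry type, the payload sits above that. */
#define SK_EXCEPTIONAL   0x2UL
#define SK_SHIFT         2
#define SK_LOCK          (1UL << SK_SHIFT)
#define SK_PTE           (1UL << (SK_SHIFT + 1))
#define SK_PMD           (1UL << (SK_SHIFT + 2))
#define SK_TYPE_MASK     (SK_PTE | SK_PMD)
#define SK_PAYLOAD_SHIFT (SK_SHIFT + 3)

/* Build an unlocked entry from a sector number and one type bit. */
unsigned long sk_make_entry(unsigned long sector, unsigned long type)
{
    assert(type == SK_PTE || type == SK_PMD);
    return (sector << SK_PAYLOAD_SHIFT) | type | SK_EXCEPTIONAL;
}

/* Masked accessors: none of them depends on how other bits are set. */
unsigned long sk_entry_type(unsigned long e)   { return e & SK_TYPE_MASK; }
int sk_entry_locked(unsigned long e)           { return (e & SK_LOCK) != 0; }
unsigned long sk_entry_sector(unsigned long e) { return e >> SK_PAYLOAD_SHIFT; }
```

      Because the type is read through SK_TYPE_MASK, setting the lock bit does
      not change what sk_entry_type() reports, which is the property the
      cleanup is after.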
    • dax: Make huge page handling depend of CONFIG_BROKEN · 348e967a
      Jan Kara committed
      Currently the handling of huge pages for DAX is racy. For example the
      following can happen:
      
      CPU0 (THP write fault)			CPU1 (normal read fault)
      
      __dax_pmd_fault()			__dax_fault()
        get_block(inode, block, &bh, 0) -> not mapped
      					get_block(inode, block, &bh, 0)
      					  -> not mapped
        if (!buffer_mapped(&bh) && write)
          get_block(inode, block, &bh, 1) -> allocates blocks
        truncate_pagecache_range(inode, lstart, lend);
      					dax_load_hole();
      
      This results in data corruption since the process on CPU1 won't see the
      changes made to the file by CPU0.
      
      The race can happen even if two normal faults race however with THP the
      situation is even worse because the two faults don't operate on the same
      entries in the radix tree and we want to use these entries for
      serialization. So make THP support in DAX code depend on CONFIG_BROKEN
      for now.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    • dax: Fix condition for filling of PMD holes · b9953536
      Jan Kara committed
      Currently dax_pmd_fault() decides to fill a PMD-sized hole only if the
      returned buffer has BH_Uptodate set. However, that flag doesn't get set
      for any mapping buffer, so that branch is actually dead code. The
      BH_Uptodate check doesn't make any sense, so just remove it.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
  2. 19 May, 2016 5 commits
  3. 17 May, 2016 15 commits
  4. 13 May, 2016 5 commits
    • ext4: pre-zero allocated blocks for DAX IO · 12735f88
      Jan Kara committed
      Currently ext4 treats DAX IO the same way as direct IO. I.e., it
      allocates unwritten extents before IO is done and converts unwritten
      extents afterwards. However this way DAX IO can race with page fault to
      the same area:
      
      ext4_ext_direct_IO()				dax_fault()
        dax_io()
          get_block() - allocates unwritten extent
          copy_from_iter_pmem()
      						  get_block() - converts
      						    unwritten block to
      						    written and zeroes it
      						    out
        ext4_convert_unwritten_extents()
      
      So data written with DAX IO gets lost. Similarly dax_new_buf() called
      from dax_io() can overwrite data that has been already written to the
      block via mmap.
      
      Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
      use them for DAX mmap. The downside of this solution is that every
      allocating write writes each block twice (once zeros, once data). Fixing
      the race with locking is possible as well however we would need to
      lock-out faults for the whole range written to by DAX IO. And that is
      not easy to do without locking-out faults for the whole file which seems
      too aggressive.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: refactor direct IO code · 914f82a3
      Jan Kara committed
      Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
      and ext4_ind_direct_IO(). However, the extent based function calls into
      the indirect based one for some cases; for example, it is not able to
      handle file extending. Previously it also did not properly handle
      retries in case of ENOSPC errors. With DAX things would get even more
      contrived, so just refactor the direct IO code and, instead of the
      indirect / extent split, split it into reads vs writes.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix race in transient ENOSPC detection · dbc427ce
      Jan Kara committed
      When there are blocks to free in the running transaction, block
      allocator can return ENOSPC although the filesystem has some blocks to
      free. We use ext4_should_retry_alloc() to force commit of the current
      transaction and return whether anything was committed so that it makes
      sense to retry the allocation. However the transaction may get committed
      after block allocation fails but before we call
      ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
      because there is nothing to commit and we wrongly return ENOSPC.
      
      Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
      when we tried to commit a transaction. This should not add any
      unnecessary retries since we had a transaction running a while ago when
      trying to allocate blocks and we want to retry the allocation once that
      transaction has committed anyway.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
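      The race in dbc427ce and its fix can be modelled with a small userspace
      sketch. Everything here is hypothetical (struct sk_fs, the sk_* helpers,
      the retry limit of 3); it only shows why "retry whenever a commit was
      attempted" is safer than "retry only if the commit did something".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy filesystem: blocks become allocatable only after a commit. */
struct sk_fs {
    int free_blocks;   /* allocatable right now */
    int pending_free;  /* freed in the running transaction */
};

void sk_commit(struct sk_fs *fs)
{
    fs->free_blocks += fs->pending_free;
    fs->pending_free = 0;
}

int sk_alloc_block(struct sk_fs *fs)
{
    if (fs->free_blocks == 0)
        return -ENOSPC;
    fs->free_blocks--;
    return 0;
}

/* Fixed behaviour: once we tried to commit, always ask for a retry,
 * even if someone else committed first and our commit was a no-op. */
int sk_should_retry(struct sk_fs *fs, int *retries)
{
    if (*retries >= 3)
        return 0;
    (*retries)++;
    sk_commit(fs);
    return 1;
}

/* `between` models a commit racing in after the failed allocation. */
int sk_alloc_with_retry(struct sk_fs *fs, void (*between)(struct sk_fs *))
{
    int retries = 0;

    for (;;) {
        int err = sk_alloc_block(fs);

        if (err != -ENOSPC)
            return err;
        if (between != NULL)
            between(fs);
        if (!sk_should_retry(fs, &retries))
            return err;
    }
}
```

      Calling sk_alloc_with_retry(&fs, sk_commit) on a filesystem with only
      pending frees succeeds, although the commit inside sk_should_retry()
      finds nothing left to do; a genuinely full filesystem still ends with
      -ENOSPC after the retry limit.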
    • ext4: handle transient ENOSPC properly for DAX · 7cb476f8
      Jan Kara committed
      ext4_dax_get_blocks() was accidentally omitted when fixing the get blocks
      handlers to properly handle transient ENOSPC errors. Fix it now to use
      the ext4_get_blocks_trans() helper, which takes care of these errors.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • dax: call get_blocks() with create == 1 for write faults to unwritten extents · aef39ab1
      Jan Kara committed
      Currently, __dax_fault() does not call get_blocks() callback with create
      argument set, when we got back unwritten extent from the initial
      get_blocks() call during a write fault. This is because originally
      filesystems were supposed to convert unwritten extents to written ones
      using complete_unwritten() callback. Later this was abandoned in favor of
      using pre-zeroed blocks however the condition whether get_blocks() needs
      to be called with create == 1 remained.
      
      Fix the condition so that filesystems are not forced to zero-out and
      convert unwritten extents when get_blocks() is called with create == 0
      (which introduces unnecessary overhead for read faults and can be
      problematic as the filesystem may possibly be read-only).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  5. 06 May, 2016 5 commits
    • ext4: remove unmeetable inconsisteny check from ext4_find_extent() · 816cd71b
      Nicolai Stange committed
      ext4_find_extent(), stripped down to the parts relevant to this patch,
      reads as
      
        ppos = 0;
        i = depth;
        while (i) {
          --i;
          ++ppos;
          if (unlikely(ppos > depth)) {
            ...
            ret = -EFSCORRUPTED;
            goto err;
          }
        }
      
      Due to the loop's bounds, the condition ppos > depth can never be met.
      
      Remove this dead code.
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
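      The claim in 816cd71b that ppos > depth can never hold is easy to check
      with a standalone version of the stripped-down loop. The function name
      sk_walk_levels is hypothetical, and the sketch assumes depth is
      non-negative, as a tree depth would be.

```c
#include <assert.h>

/* Runs the same counting loop and returns the final ppos.
 * The loop body executes exactly `depth` times, so ppos ends at depth. */
int sk_walk_levels(int depth)
{
    int ppos = 0;
    int i = depth;

    if (depth < 0)
        return -1; /* outside the assumed input range */
    while (i) {
        --i;
        ++ppos;
        /* The removed check would have tested the opposite condition. */
        assert(ppos <= depth);
    }
    return ppos;
}
```

      Since i counts down from depth to 0 while ppos counts up from 0, the two
      always sum to depth, which is why the corruption branch was unreachable.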
    • jbd2: remove excess descriptions for handle_s · 466c3fb6
      Luis de Bethencourt committed
      Commit bf699327 ("jbd2: Use tracepoints for history file")
      removed the members j_history, j_history_max and j_history_cur from struct
      handle_s but their descriptions lingered. Remove them.
      Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • ext4: remove unnecessary bio get/put · 32157de2
      Jens Axboe committed
      ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
      which is why it had to get an extra reference to the bio before
      submitting it. But since we no longer touch the bio after submission,
      get rid of the redundant get/put of the bio. If we do get the extra
      reference, we enter the slower path of having to flag this bio as now
      having external references.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: silence UBSAN in ext4_mb_init() · 935244cd
      Nicolai Stange committed
      Currently, in ext4_mb_init(), there's a loop like the following:
      
        do {
          ...
          offset += 1 << (sb->s_blocksize_bits - i);
          i++;
        } while (i <= sb->s_blocksize_bits + 1);
      
      Note that the updated offset is used in the loop's next iteration only.
      
      However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
      the shift count becomes equal to (unsigned)-1 > 31 (cf. C99 6.5.7(3))
      and UBSAN reports
      
        UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
        shift exponent 4294967295 is too large for 32-bit type 'int'
        [...]
        Call Trace:
         [<ffffffff818c4d25>] dump_stack+0xbc/0x117
         [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
         [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
         [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
         [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
         [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
         [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
         [<ffffffff814293c7>] ? create_cache+0x57/0x1f0
         [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
         [<ffffffff821c2168>] ? mutex_lock+0x38/0x60
         [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
         [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
         [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
         [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
         [...]
      
      Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.
      
      Unless compilers start to do some fancy transformations (which at least
      GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
      value of offset calculated this way is never used again.
      
      Silence UBSAN by introducing another variable, offset_incr, holding the
      next increment to apply to offset and adjust that one by right shifting it
      by one position per loop iteration.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
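      The offset_incr approach from 935244cd can be sketched as a standalone
      function. The starting values (i = 1, offset = 0) and the function name
      are assumptions for illustration; only the shape of the loop and the
      name offset_incr come from the commit message.

```c
#include <assert.h>

/* Accumulates the per-order increments without ever evaluating
 * 1 << (blocksize_bits - i) for i == blocksize_bits + 1. */
unsigned int sk_sum_offsets(unsigned int blocksize_bits)
{
    unsigned int i = 1;
    unsigned int offset = 0;
    unsigned int offset_incr;

    assert(blocksize_bits >= 1 && blocksize_bits <= 31);
    offset_incr = 1U << (blocksize_bits - 1);

    do {
        /* The real loop body would consume `offset` here. */
        offset += offset_incr;
        offset_incr >>= 1;   /* becomes 0 for the final iteration */
        i++;
    } while (i <= blocksize_bits + 1);

    return offset;
}
```

      The increment is halved once per iteration instead of being recomputed
      from a shift whose count goes out of range, so the final iteration adds
      0 rather than invoking undefined behaviour.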
    • ext4: address UBSAN warning in mb_find_order_for_block() · b5cb316c
      Nicolai Stange committed
      Currently, in mb_find_order_for_block(), there's a loop like the following:
      
        while (order <= e4b->bd_blkbits + 1) {
          ...
          bb += 1 << (e4b->bd_blkbits - order);
        }
      
      Note that the updated bb is used in the loop's next iteration only.
      
      However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
      the shift count becomes negative (cf. C99 6.5.7(3)) and UBSAN reports
      
        UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
        shift exponent -1 is negative
        [...]
        Call Trace:
         [<ffffffff818c4d35>] dump_stack+0xbc/0x117
         [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
         [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
         [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
         [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
         [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
         [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
         [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
         [...]
      
      Unless compilers start to do some fancy transformations (which at least
      GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
      value of bb calculated this way is never used again.
      
      Silence UBSAN by introducing another variable, bb_incr, holding the next
      increment to apply to bb and adjust that one by right shifting it by one
      position per loop iteration.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  6. 05 May, 2016 2 commits
    • ext4: fix oops on corrupted filesystem · 74177f55
      Jan Kara committed
      When the filesystem is corrupted in the right way, it can happen that
      ext4_mark_iloc_dirty() in ext4_orphan_add() returns an error and we
      subsequently remove the inode from the in-memory orphan list. However,
      this deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
      leave the i_orphan list_head with stale content. Later we can look at
      this content, causing list corruption, an oops, or other issues. The
      reported trace looked like:
      
      WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
      list_del corruption, 0000000061c1d6e0->next is LIST_POISON1 (0000000000100100)
      CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
      Stack:
       60462947 62219960 602ede24 62219960
       602ede24 603ca293 622198f0 602f02eb
       62219950 6002c12c 62219900 601b4d6b
      Call Trace:
       [<6005769c>] ? vprintk_emit+0x2dc/0x5c0
       [<602ede24>] ? printk+0x0/0x94
       [<600190bc>] show_stack+0xdc/0x1a0
       [<602ede24>] ? printk+0x0/0x94
       [<602ede24>] ? printk+0x0/0x94
       [<602f02eb>] dump_stack+0x2a/0x2c
       [<6002c12c>] warn_slowpath_common+0x9c/0xf0
       [<601b4d6b>] ? __list_del_entry+0x6b/0x100
       [<6002c254>] warn_slowpath_fmt+0x94/0xa0
       [<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
       [<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
       [<60023ebf>] ? set_signals+0x3f/0x50
       [<600a205a>] ? kmem_cache_free+0x10a/0x180
       [<602f4e88>] ? mutex_lock+0x18/0x30
       [<601b4d6b>] __list_del_entry+0x6b/0x100
       [<601177ec>] ext4_orphan_del+0x22c/0x2f0
       [<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
       [<6010b973>] ? ext4_truncate+0x383/0x390
       [<6010bc8b>] ext4_write_begin+0x30b/0x4b0
       [<6001bb50>] ? copy_from_user+0x0/0xb0
       [<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
       [<60072c4f>] generic_perform_write+0xaf/0x1e0
       [<600c4166>] ? file_update_time+0x46/0x110
       [<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
       [<6010030f>] ext4_file_write_iter+0x15f/0x470
       [<60094e10>] ? unlink_file_vma+0x0/0x70
       [<6009b180>] ? unlink_anon_vmas+0x0/0x260
       [<6008f169>] ? free_pgtables+0xb9/0x100
       [<600a6030>] __vfs_write+0xb0/0x130
       [<600a61d5>] vfs_write+0xa5/0x170
       [<600a63d6>] SyS_write+0x56/0xe0
       [<6029fcb0>] ? __libc_waitpid+0x0/0xa0
       [<6001b698>] handle_syscall+0x68/0x90
       [<6002633d>] userspace+0x4fd/0x600
       [<6002274f>] ? save_registers+0x1f/0x40
       [<60028bd7>] ? arch_prctl+0x177/0x1b0
       [<60017bd5>] fork_handler+0x85/0x90
      
      Fix the problem by using list_del_init() as we always should with
      i_orphan list.
      
      CC: stable@vger.kernel.org
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
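      The difference that 74177f55 relies on can be shown with a userspace
      imitation of the kernel's doubly linked list. The sk_* names and poison
      values are hypothetical stand-ins; the point is only that a plain delete
      leaves the entry pointing at poison, while delete-and-reinit leaves it
      as a valid empty list that is safe to test or delete again.

```c
#include <assert.h>

struct sk_list { struct sk_list *next, *prev; };

/* Stand-in poison pointers; they are only stored, never dereferenced. */
#define SK_POISON1 ((struct sk_list *)0x100)
#define SK_POISON2 ((struct sk_list *)0x200)

void sk_list_init(struct sk_list *h) { h->next = h; h->prev = h; }
int sk_list_empty(const struct sk_list *h) { return h->next == h; }

void sk_list_add(struct sk_list *entry, struct sk_list *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

static void sk_unlink(struct sk_list *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Plain delete: the entry is left with stale (poisoned) pointers. */
void sk_list_del(struct sk_list *e)
{
    sk_unlink(e);
    e->next = SK_POISON1;
    e->prev = SK_POISON2;
}

/* Delete and reinitialize: the entry becomes an empty list of its own. */
void sk_list_del_init(struct sk_list *e)
{
    sk_unlink(e);
    sk_list_init(e);
}
```

      After sk_list_del_init(), a later "is this inode on the orphan list?"
      style check via sk_list_empty() gives a sane answer, and a second
      removal is harmless. After sk_list_del(), both would touch the poison
      values, which matches the LIST_POISON1 warning in the trace.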
    • ext4: fix check of dqget() return value in ext4_ioctl_setproject() · ff0bc084
      Seth Forshee committed
      A failed call to dqget() returns an ERR_PTR() and not null. Fix
      the check in ext4_ioctl_setproject() to handle this correctly.
      
      Fixes: 9b7365fc ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
      Cc: stable@vger.kernel.org # v4.5
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
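      The bug class fixed in ff0bc084 is checking an error-pointer return
      against null. A userspace sketch of the convention follows; the sk_*
      helpers imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idea but are
      hypothetical, as are sk_lookup() and the choice of -ESRCH as the error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value and decode it again. */
void *sk_err_ptr(long err) { return (void *)(intptr_t)err; }
long sk_ptr_err(const void *p) { return (long)(intptr_t)p; }
int sk_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SK_MAX_ERRNO;
}

/* Hypothetical lookup that fails with an error pointer, never NULL. */
void *sk_lookup(int fail)
{
    static int object;

    return fail ? sk_err_ptr(-ESRCH) : (void *)&object;
}

/* Correct caller: tests with sk_is_err(), not with a null check. */
int sk_use_lookup(int fail)
{
    void *p = sk_lookup(fail);

    if (sk_is_err(p))
        return (int)sk_ptr_err(p);
    return 0;
}
```

      A caller that wrote `if (!p)` instead would treat the failed lookup as
      success, because the encoded error is a non-null pointer value.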
  7. 30 April, 2016 2 commits
    • ext4: clean up error handling when orphan list is corrupted · 7827a7f6
      Theodore Ts'o committed
      Instead of just printing warning messages, if the orphan list is
      corrupted, declare the file system is corrupted.  If there are any
      reserved inodes in the orphaned inode list, declare the file system
      corrupted and stop right away to avoid doing more potential damage to
      the file system.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix hang when processing corrupted orphaned inode list · c9eb13a9
      Theodore Ts'o committed
      If the orphaned inode list contains inode #5, ext4_iget() returns a
      bad inode (since the bootloader inode should never be referenced
      directly).  Because of the bad inode, we end up processing the inode
      repeatedly and this hangs the machine.
      
      This can be reproduced via:
      
         mke2fs -t ext4 /tmp/foo.img 100
         debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
         mount -o loop /tmp/foo.img /mnt
      
      (But don't do this if you are using an unpatched kernel if you care
      about the system staying functional.  :-)
      
      This bug was found by the port of American Fuzzy Lop into the kernel
      to find file system problems[1].  (Since it *only* happens if inode #5
      shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
      surprising that AFL needed two hours before it found it.)
      
      [1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf
      
      Cc: stable@vger.kernel.org
      Reported by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>