1. 29 Sep, 2014 3 commits
  2. 09 Sep, 2014 11 commits
    • xfs: remove rbpp check from xfs_rtmodify_summary_int · ab6978c2
      Eric Sandeen committed
      rbpp is always passed into xfs_rtmodify_summary
      and xfs_rtget_summary, so there is no need to
      test for it in xfs_rtmodify_summary_int.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ab6978c2
    • xfs: combine xfs_rtmodify_summary and xfs_rtget_summary · afabfd30
      Eric Sandeen committed
      xfs_rtmodify_summary and xfs_rtget_summary are almost identical;
      fold them into xfs_rtmodify_summary_int(), with wrappers for each of
      the original calls.
      
      The _int function modifies if a delta is passed, and returns a
      summary pointer if *sum is passed.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      afabfd30
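The folding described in afabfd30 can be sketched in userspace C. This is a toy model, not the kernel code: the summary is a flat array, and the names `summary_int`, `summary_modify`, `summary_get` and `NSUM` are hypothetical. The real function also deals with the mount, the transaction and summary buffers, all of which are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the realtime summary: a flat array. */
#define NSUM 8
static int32_t summary[NSUM];

/*
 * Shared helper: applies delta when it is non-zero and reports the
 * resulting counter through *sum when sum is non-NULL.
 * Returns 0 on success, -1 if idx is out of range.
 */
static int summary_int(size_t idx, int32_t delta, int32_t *sum)
{
	if (idx >= NSUM)
		return -1;
	if (delta)
		summary[idx] += delta;
	if (sum)
		*sum = summary[idx];
	return 0;
}

/* Wrapper for the "modify" call: passes a delta, wants no value back. */
static int summary_modify(size_t idx, int32_t delta)
{
	return summary_int(idx, delta, NULL);
}

/* Wrapper for the "get" call: passes no delta, wants the value back. */
static int summary_get(size_t idx, int32_t *sum)
{
	return summary_int(idx, 0, sum);
}
```

Both wrappers keep their old call shape, so callers do not need to know that a single helper now does the work.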
    • xfs: combine xfs_dir_canenter into xfs_dir_createname · b16ed7c1
      Eric Sandeen committed
      xfs_dir_canenter and xfs_dir_createname are
      almost identical.
      
      Fold the former into the latter, with a helpful
      wrapper for the former.  If createname is called without
      an inode number, it now only checks for space, and does
      not actually add the entry.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b16ed7c1
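A minimal sketch of the "inode number 0 means check only" convention from b16ed7c1, assuming a toy directory with a fixed number of slots. `toy_dir`, `toy_dir_createname` and `toy_dir_canenter` are hypothetical names; the real functions take transaction, name and block-reservation arguments that are omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical toy directory: fixed number of slots, 0 means "free". */
#define DIR_SLOTS 4
struct toy_dir {
	uint64_t slot[DIR_SLOTS];
};

/*
 * Add an entry for inode number inum. When inum is 0, only check
 * that a free slot exists and leave the directory untouched.
 */
static int toy_dir_createname(struct toy_dir *dp, uint64_t inum)
{
	for (int i = 0; i < DIR_SLOTS; i++) {
		if (dp->slot[i] == 0) {
			if (inum)
				dp->slot[i] = inum;
			return 0;
		}
	}
	return -ENOSPC;
}

/* The former standalone check becomes a thin wrapper. */
static int toy_dir_canenter(struct toy_dir *dp)
{
	return toy_dir_createname(dp, 0);
}
```

The design choice is that the space check and the real insertion walk the same code path, so they cannot drift apart.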
    • xfs: check resblks before calling xfs_dir_canenter · 94f3cad5
      Eric Sandeen committed
      Move the resblks test out of the xfs_dir_canenter,
      and into the caller.
      
      This makes a little more sense on the face of it;
      xfs_dir_canenter immediately returns if resblks != 0;
      and given some of the comments preceding the calls:
      
       * Check for ability to enter directory entry, if no space reserved.
      
      even more so.
      
      It also facilitates the next patch.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      94f3cad5
    • xfs: deduplicate xlog_do_recovery_pass() · 970fd3f0
      Eric Sandeen committed
      In xlog_do_recovery_pass(), there are 2 distinct cases:
      non-wrapped and wrapped log recovery.
      
      If we find a wrapped log, we recover around the end
      of the log, and then handle the rest of recovery
      exactly as in the non-wrapped case - using exactly the same
      (duplicated) code.
      
      Rather than having the same code in both cases, we can
      get the wrapped portion out of the way first if needed,
      and then recover the non-wrapped portion of the log.
      
      There should be no functional change here, just code
      reorganization & deduplication.
      
      The patch looks a bit bigger than it really is; the last
      hunk is whitespace changes (un-indenting).
      
      Tested with xfstests "check -g log" on a stock configuration.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      970fd3f0
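The restructuring in 970fd3f0 (handle the wrapped portion first, then fall through to the common path) can be illustrated with a toy circular log. This is only a sketch of the control flow: `LOG_BLOCKS`, `process_block` and `recover_pass` are hypothetical, and the real function reads and validates record headers rather than single block numbers.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_BLOCKS 8

/* Hypothetical per-block work: record the visit order in out[]. */
static size_t process_block(int blk, int *out, size_t n)
{
	out[n] = blk;
	return n + 1;
}

/*
 * Visit every block from tail up to (not including) head in a circular
 * log. out[] must have room for LOG_BLOCKS entries. Returns the count.
 */
static size_t recover_pass(int tail, int head, int *out)
{
	size_t n = 0;
	int blk = tail;

	if (tail > head) {
		/* Wrapped: first consume the blocks up to the physical end. */
		while (blk < LOG_BLOCKS)
			n = process_block(blk++, out, n);
		blk = 0;
	}

	/* Common path, shared by the wrapped and non-wrapped cases. */
	while (blk < head)
		n = process_block(blk++, out, n);

	return n;
}
```

The second loop exists only once, which is the deduplication the commit message describes.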
    • xfs: lseek: the "whence" argument is called "whence" · 59f9c004
      Eric Sandeen committed
      For some reason, the older commit:
      
          965c8e59 lseek: the "whence" argument is called "whence"
      
          lseek: the "whence" argument is called "whence"
      
          But the kernel decided to call it "origin" instead.
          Fix most of the sites.
      
      left out xfs.  So fix xfs.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      59f9c004
    • xfs: combine xfs_seek_hole & xfs_seek_data · 49c69591
      Eric Sandeen committed
      xfs_seek_hole & xfs_seek_data are remarkably similar;
      so much so that they can be combined, saving a fair
      bit of semi-complex code duplication.
      
      The following patch passes generic/285 and generic/286,
      which specifically test seek behavior.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      49c69591
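To show why the two seek routines in 49c69591 fold together so easily, here is a toy version that works on a per-block boolean map. All names (`toy_whence`, `toy_seek_hole_data`) are hypothetical, and -1 stands in for the ENXIO error; the real code walks extent mappings and the page cache instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_whence { TOY_SEEK_DATA, TOY_SEEK_HOLE };

/*
 * map[i] is true when block i holds data, false when it is a hole.
 * Returns the first block at or after start that matches whence,
 * or -1 when there is none.
 */
static long toy_seek_hole_data(const bool *map, size_t nblocks,
			       size_t start, enum toy_whence whence)
{
	bool want_data = (whence == TOY_SEEK_DATA);

	for (size_t i = start; i < nblocks; i++)
		if (map[i] == want_data)
			return (long)i;

	/* End of file counts as an implicit hole; there is no data there. */
	if (start < nblocks && whence == TOY_SEEK_HOLE)
		return (long)nblocks;
	return -1;
}
```

The only difference between the two searches is the value being matched and the end-of-file rule, so one loop with a `whence` parameter covers both.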
    • xfs: export log_recovery_delay to delay mount time log recovery · 2e227178
      Brian Foster committed
      XFS log recovery has been discovered to have race conditions with
      buffers when I/O errors occur. External tools are available to simulate
      I/O errors to XFS, but this alone is not sufficient for testing log
      recovery. XFS unconditionally resets the inactive region of the log
      prior to log recovery to avoid confusion over processing any partially
      written log records that might have been written before an unclean
      shutdown. Therefore, unconditional write I/O failures at mount time are
      caught by the reset sequence rather than log recovery and hinder the
      ability to test the latter.
      
      The device-mapper dm-flakey module uses an up/down timer to define a
      cycle for when to fail I/Os. Create a pre log recovery delay tunable
      that can be used to coordinate XFS log recovery with I/O errors
      simulated by dm-flakey. This facilitates coordination in userspace that
      allows the reset of stale log blocks to succeed and writes due to log
      recovery to fail. For example, define a dm-flakey instance with an
      uptime long enough to allow log reset to succeed and a log recovery
      delay long enough to allow the dm-flakey uptime to expire.
      
      The 'log_recovery_delay' sysfs tunable is exported under
      /sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug
      mode. The value is exported in units of seconds and allows for a delay
      of up to 60 seconds. Note that this is for XFS debug and test
      instrumentation purposes only and should not be used by applications. No
      delay is enabled by default.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2e227178
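The constraints stated in 2e227178 (value in seconds, at most 60, default 0) can be modelled by a small userspace store routine. This is a sketch under assumptions: `toy_delay_store`, `toy_log_recovery_delay` and `TOY_MAX_DELAY` are hypothetical names, and the kernel's actual sysfs handler may parse and report errors differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define TOY_MAX_DELAY 60   /* seconds, from the description above */

static int toy_log_recovery_delay;   /* 0 means no delay (the default) */

/* Parse buf as a decimal number of seconds; reject values outside 0..60. */
static int toy_delay_store(const char *buf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')          /* tolerate the newline that echo appends */
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (val < 0 || val > TOY_MAX_DELAY)
		return -EINVAL;

	toy_log_recovery_delay = (int)val;
	return 0;
}
```

A rejected write leaves the previous value in place, so a test script can tell a refused setting from an accepted one by checking the return value.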
    • xfs: add debug sysfs attribute set · 65b65735
      Brian Foster committed
      Create a top-level debug directory for global debug sysfs attributes.
      This directory is added and removed on XFS module initialization and
      removal respectively for DEBUG mode kernels only. It typically resides
      at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
      hierarchy as attributes might define global behavior or behavior that
      must be configured before an xfs mount is available (e.g., log recovery
      behavior).
      
      Define the global debug kobject that represents the debug sysfs
      directory and add generic attribute show/store helpers to support future
      attributes. No debug attributes are exported as of yet.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      65b65735
    • xfs: add a few more verifier tests · e1b05723
      Eric Sandeen committed
      These were exposed by fsfuzzer runs; without them we fail
      in various exciting and sometimes convoluted ways when we
      encounter disk corruption.
      
      Without the MAXLEVELS tests we tend to walk off the end of
      an array in a loop like this:
      
              for (i = 0; i < cur->bc_nlevels; i++) {
                      if (cur->bc_bufs[i])
      
      Without the dirblklog test we try to allocate more memory
      than we could possibly hope for and loop forever:
      
      xfs_dabuf_map()
      	nfsb = mp->m_dir_geo->fsbcount;
      	irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...
      
      As for the logbsize check, that's the convoluted one.
      
      If logbsize is specified at mount time, it's sanitized
      in xfs_parseargs; in particular it makes sure that it's
      not > XLOG_MAX_RECORD_BSIZE.
      
      If not specified at mount time, it comes from the superblock
      via sb_logsunit; this is limited to 256k at mkfs time as well;
      it's copied into m_logbsize in xfs_finish_flags().
      
      However, if for some reason the on-disk value is corrupt and
      too large, nothing catches it.  It's a circuitous path, but
      that size eventually finds its way to places that make the kernel
      very unhappy, leading to oopses in xlog_pack_data() because we
      use the size as an index into iclog->ic_data, but the array
      is not necessarily that big.
      
      Anyway - bounds checking when we read from disk is a good thing!
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e1b05723
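The kind of bounds checking e1b05723 argues for can be sketched as a verifier that rejects out-of-range on-disk values before anything uses them as a loop bound, allocation size or index. The struct, function names and limits below are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative limits; the real kernel constants may differ. */
#define TOY_MAXLEVELS        8
#define TOY_MAX_DIRBLKLOG    16
#define TOY_MAX_RECORD_BSIZE (256 * 1024)

/* Hypothetical subset of superblock fields read from disk. */
struct toy_sb {
	uint8_t  dirblklog;   /* log2 of directory block size in fs blocks */
	uint32_t logsunit;    /* log stripe unit in bytes */
};

/* Reject values that would later size an allocation or index a buffer. */
static bool toy_sb_verify(const struct toy_sb *sb)
{
	if (sb->dirblklog > TOY_MAX_DIRBLKLOG)
		return false;
	if (sb->logsunit > TOY_MAX_RECORD_BSIZE)
		return false;
	return true;
}

/* A btree level count must fit the fixed-size per-level cursor arrays. */
static bool toy_btree_levels_ok(unsigned int nlevels)
{
	return nlevels >= 1 && nlevels <= TOY_MAXLEVELS;
}
```

Checking once at read time means later code can trust these fields without repeating the test at every use.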
    • xfs: mark all internal workqueues as freezable · 8018ec08
      Brian Foster committed
      Workqueues must be explicitly set as freezable to ensure they are frozen
      in the associated part of the hibernation/suspend sequence. Freezing of
      workqueues and kernel threads is important to ensure that modifications
      are not made on-disk after the hibernation image has been created.
      Otherwise, the in-memory state can become inconsistent with what is on
      disk and eventually lead to filesystem corruption. We have reports of
      free space btree corruptions that occur immediately after restore from
      hibernate that suggest the xfs-eofblocks workqueue could be causing
      such problems if it races with hibernation.
      
      Mark all of the internal XFS workqueues as freezable to ensure nothing
      changes on-disk once the freezer infrastructure freezes kernel threads
      and creates the hibernation image.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8018ec08
  3. 25 Aug, 2014 1 commit
    • aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise committed
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generated, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
      
      A test program from Dan and modified by me for verifying this bug is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
      Reported-by: Dan Aloni <dan@kernelim.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Acked-by: Dan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d856f32a
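The accounting rule from d856f32a (only hand back request slots for completions that are no longer sitting in the event ring) can be modelled in a few lines. This is a simplified userspace model: `toy_ctx` and `toy_refill_reqs_available` are hypothetical, and locking and any batching the kernel does are left out.

```c
#include <assert.h>

struct toy_ctx {
	unsigned nr_events;         /* ring size */
	unsigned head, tail;        /* consumer / producer indices */
	unsigned completed_events;  /* completions not yet credited back */
	unsigned reqs_available;    /* free request slots */
};

/* Credit back only the completions that have already left the ring. */
static void toy_refill_reqs_available(struct toy_ctx *ctx)
{
	unsigned events_in_ring, completed;

	/* Number of events still waiting in the ring, allowing for wrap. */
	if (ctx->head <= ctx->tail)
		events_in_ring = ctx->tail - ctx->head;
	else
		events_in_ring = ctx->nr_events - (ctx->head - ctx->tail);

	completed = ctx->completed_events;
	if (events_in_ring < completed)
		completed -= events_in_ring;
	else
		completed = 0;

	ctx->completed_events -= completed;
	ctx->reqs_available += completed;
}
```

As long as the ring holds at least as many events as there are uncredited completions, nothing is returned, which matches the condition stated in the commit message.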
  4. 23 Aug, 2014 8 commits
  5. 20 Aug, 2014 3 commits
  6. 18 Aug, 2014 3 commits
  7. 17 Aug, 2014 3 commits
  8. 16 Aug, 2014 3 commits
  9. 15 Aug, 2014 5 commits
    • btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason committed
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: Chris Mason <clm@fb.com>
      8d875f95
    • Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Filipe Manana committed
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
        | [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will return us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrectly using slot 0 makes us set next_offset to 39239680 and we jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      27b9a812
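The arithmetic in 27b9a812 can be reproduced directly. The sketch below uses the numbers from the commit message; `csum_count`, `min_u64` and the two constants are hypothetical helpers written for this illustration, not btrfs functions.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKSIZE_BITS 12   /* 4 KiB blocks, as in the example */
#define CSUM_SIZE      4    /* bytes per checksum, as in the example */

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Number of checksums that would be placed in the newly inserted item. */
static uint64_t csum_count(uint64_t sums_len, uint64_t total_bytes,
			   uint64_t next_offset, uint64_t key_offset)
{
	/* Unsigned subtraction wraps when next_offset < key_offset. */
	return min_u64((sums_len - total_bytes) >> BLOCKSIZE_BITS,
		       (next_offset - key_offset) >> BLOCKSIZE_BITS);
}
```

With the key from slot 0, `csum_count(16384, 0, 39239680, 40157184)` evaluates to 4, giving the 16-byte item described above. With the key that path->slots[0] actually points at (40161280), the same call evaluates to 1, so under this model the new item would stop right before the existing one instead of overlapping it.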
    • Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Takashi Iwai committed
      We've got bug reports that btrfs crashes when quota is enabled on a
      32-bit kernel, typically with an Oops like the one below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in the prefs_delayed list.  Further
      investigation and bisection showed that the call to ulist_add_merge()
      causes the corruption.
      
      ulist_add_merge() takes a u64 as aux and writes a 64-bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer to a pointer as old_aux.  That is, the function writes a
      64-bit value over a 32-bit pointer.  This put a NULL into the adjacent
      variable, in this case prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of u64.  There are still ugly void ** cast remaining
      in the callers because void ** cannot be taken implicitly.  But, it's
      safer than explicit cast to u64, anyway.
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      4eb1f66d
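The idea behind the pointer-flavoured wrapper in 4eb1f66d can be sketched as follows: route the value through a real 64-bit temporary so the 64-bit write never lands on a pointer-sized variable. The `toy_*` names and the one-entry "list" are hypothetical, and unlike a kernel build this sketch uses the temporary on every architecture for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-entry "list" standing in for a ulist node. */
struct toy_node {
	uint64_t aux;
	int used;
};

/* Like the u64 interface: always writes a full 64 bits to *old_aux. */
static int toy_add_merge(struct toy_node *n, uint64_t aux, uint64_t *old_aux)
{
	if (n->used) {
		if (old_aux)
			*old_aux = n->aux;   /* existing entry: report its aux */
		return 0;
	}
	n->aux = aux;
	n->used = 1;
	return 1;
}

/* Pointer flavour: the 64-bit store targets old64, never *old_aux. */
static int toy_add_merge_ptr(struct toy_node *n, void *aux, void **old_aux)
{
	uint64_t old64 = (uintptr_t)*old_aux;
	int ret = toy_add_merge(n, (uintptr_t)aux, &old64);

	*old_aux = (void *)(uintptr_t)old64;
	return ret;
}
```

Casting `void **` to `uint64_t *` and writing through it is exactly the pattern that clobbers the neighbouring variable when pointers are 32 bits wide; the temporary avoids that regardless of pointer size.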
    • Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo committed
      When we fail to allocate space for the whole compressed extent, we
      fall back to uncompressed IO, but we've forgotten to redirty the pages
      that belong to this compressed extent. These 'clean' pages then
      skip the 'submit' part and go to endio directly, so we end up with data
      corruption because nothing is written.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Tested-By: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      ce62003f
    • btrfs: correctly handle return from ulist_add · f90e579c
      Mark Fasheh committed
      ulist_add() can return '1' on success, which qgroup_subtree_accounting()
      doesn't take into account. As a result, that value can be bubbled up to
      callers, causing an error to be printed. Fix this by only returning the
      value of ulist_add() when it indicates an error.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      f90e579c
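The fix pattern in f90e579c is small enough to show in full. In this sketch `toy_ulist_add` is a hypothetical stand-in that simply returns the outcome it is given, following the convention that negative values are errors and both 0 and 1 are forms of success.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: <0 on error, 0 if already present, 1 if inserted. */
static int toy_ulist_add(int outcome)
{
	return outcome;
}

/* Propagate only real errors; 0 and 1 both mean success to callers. */
static int toy_accounting_step(int outcome)
{
	int ret = toy_ulist_add(outcome);

	if (ret < 0)
		return ret;
	return 0;
}
```

Callers that treat any non-zero return as a failure would otherwise print an error for a successful insertion.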