1. 10 Nov, 2016 3 commits
    • md: remove md_super_wait() call after bitmap_flush() · 6119e679
      NeilBrown committed
      bitmap_flush() finishes with bitmap_update_sb(), and that finishes
      with write_page(..., 1), so write_page() will wait for all writes
      to complete.  So there is no point calling md_super_wait()
      immediately afterwards.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: define mddev flags, recovery flags and r1bio state bits using enums · be306c29
      NeilBrown committed
      This is less error prone than using individual #defines.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
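
      A minimal sketch of the enum style that the "define ... using enums"
      commit describes. All names here (demo_flags, demo_set_bit,
      demo_test_bit) are hypothetical and are not the actual md definitions;
      the point is only that an enum assigns consecutive bit numbers
      automatically, so a newly added flag cannot collide with an existing one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits: the compiler numbers them 0, 1, 2, ... */
enum demo_flags {
	DEMO_CHANGE_DEVS,	/* bit 0 */
	DEMO_CHANGE_CLEAN,	/* bit 1 */
	DEMO_CHANGE_PENDING,	/* bit 2 */
};

/* Simplified, non-atomic stand-ins for bit helpers. */
static inline void demo_set_bit(int nr, unsigned long *word)
{
	*word |= 1UL << nr;
}

static inline bool demo_test_bit(int nr, const unsigned long *word)
{
	return (*word >> nr) & 1UL;
}

/* Small self-check showing the intended use. */
static void demo_flags_check(void)
{
	unsigned long flags = 0;

	demo_set_bit(DEMO_CHANGE_PENDING, &flags);
	assert(demo_test_bit(DEMO_CHANGE_PENDING, &flags));
	assert(!demo_test_bit(DEMO_CHANGE_CLEAN, &flags));
}
```

      With individual #defines, each bit number has to be chosen by hand;
      inserting a new entry in the enum needs no renumbering.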
    • md/raid1: fix: IO can block resync indefinitely · f2c771a6
      NeilBrown committed
      While performing a resync/recovery, raid1 divides the
      array space into three regions:
       - before the resync
       - at or shortly after the resync point
       - much further ahead of the resync point.
      
      Write requests to the first or third do not need to wait.  Write
      requests to the middle region do need to wait if resync requests are
      pending.
      
      If there are any active write requests in the middle region, resync
      will wait for them.
      
      Due to an accounting error, there is a small range of addresses,
      between conf->next_resync and conf->start_next_window, where write
      requests will *not* be blocked, but *will* be counted in the middle
      region.  This can effectively block resync indefinitely if filesystem
      writes happen repeatedly to this region.
      
      As ->next_window_requests is incremented when the sector is after
        conf->start_next_window + NEXT_NORMALIO_DISTANCE
      the same boundary should be used for determining when write requests
      should wait.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
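
      The three regions described in the raid1 resync commit above can be
      sketched as follows. This is a simplified model, not the raid1 code:
      struct demo_conf, its fields, and DEMO_NORMALIO_DISTANCE are
      hypothetical stand-ins. It illustrates the idea behind the fix, namely
      that the decision to wait and the accounting both use the same boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_sector_t;

/* Hypothetical distance; stands in for NEXT_NORMALIO_DISTANCE. */
#define DEMO_NORMALIO_DISTANCE 1024u

struct demo_conf {
	demo_sector_t resync_completed;   /* sectors below this are in sync */
	demo_sector_t start_next_window;  /* start of the window ahead of resync */
};

enum demo_region {
	DEMO_REGION_BEFORE,	/* before the resync point */
	DEMO_REGION_MIDDLE,	/* at or shortly after the resync point */
	DEMO_REGION_AHEAD,	/* much further ahead */
};

/* Classify a write covering sectors [start, end). */
static enum demo_region demo_classify(const struct demo_conf *conf,
				      demo_sector_t start, demo_sector_t end)
{
	assert(start <= end);
	if (end <= conf->resync_completed)
		return DEMO_REGION_BEFORE;
	if (start >= conf->start_next_window + DEMO_NORMALIO_DISTANCE)
		return DEMO_REGION_AHEAD;
	return DEMO_REGION_MIDDLE;
}

/* Only middle-region writes wait, and only while resync is pending. */
static bool demo_write_must_wait(const struct demo_conf *conf,
				 bool resync_pending,
				 demo_sector_t start, demo_sector_t end)
{
	return resync_pending &&
	       demo_classify(conf, start, end) == DEMO_REGION_MIDDLE;
}

/* Accounting uses the same classification, so no write is counted in
 * the middle region without also being subject to the wait rule. */
static bool demo_counts_in_middle(const struct demo_conf *conf,
				  demo_sector_t start, demo_sector_t end)
{
	return demo_classify(conf, start, end) == DEMO_REGION_MIDDLE;
}
```

      Because both helpers go through demo_classify(), there is no address
      range where a write is counted as "middle" but allowed to proceed while
      resync is pending.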
  2. 08 Nov, 2016 19 commits
  3. 29 Oct, 2016 3 commits
    • md: be careful not to leak internal curr_resync value into metadata. -- (all) · 1217e1d1
      NeilBrown committed
      mddev->curr_resync usually records where the current resync is up to,
      but during the starting phase it has some "magic" values.
      
       1 - means that the array is trying to start a resync, but has yielded
           to another array which shares physical devices, and also needs to
           start a resync
       2 - means the array is trying to start resync, but has found another
           array which shares physical devices and has already started resync.
      
       3 - means that resync has commenced, but it is possible that nothing
           has actually been resynced yet.
      
      It is important that this value not be visible to user-space and
      particularly that it doesn't get written to the metadata, as the
      resync or recovery checkpoint.  In part, this is because it may be
      slightly higher than the correct value, though this is very rare.
      In part, because it is not a multiple of 4K, and some devices only
      support 4K aligned accesses.
      
      There are two places where this value is propagated into either
      ->curr_resync_completed or ->recovery_cp or ->recovery_offset.
      These currently avoid the propagation of values 1 and 2, but will
      allow 3 to leak through.
      
      Change them to only propagate the value if it is > 3.
      
      As this can cause an array to fail, the patch is suitable for -stable.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: Viswesh <viswesh.vichu@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
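
      The rule stated in the curr_resync commit above ("only propagate the
      value if it is > 3") can be sketched like this. demo_checkpoint() and
      DEMO_RESYNC_MAGIC_MAX are hypothetical names for illustration; the
      actual md code updates several fields in place rather than returning
      a value.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_sector_t;

/* Values 1..3 are the "magic" start-up states from the commit message. */
#define DEMO_RESYNC_MAGIC_MAX 3

/*
 * Return the checkpoint that should be recorded: the current resync
 * position if it is a real position, otherwise the previous checkpoint.
 */
static demo_sector_t demo_checkpoint(demo_sector_t curr_resync,
				     demo_sector_t old_checkpoint)
{
	if (curr_resync > DEMO_RESYNC_MAGIC_MAX)
		return curr_resync;
	return old_checkpoint;
}

/* Small self-check: magic values never replace the old checkpoint. */
static void demo_checkpoint_check(void)
{
	assert(demo_checkpoint(1, 64) == 64);
	assert(demo_checkpoint(2, 64) == 64);
	assert(demo_checkpoint(3, 64) == 64);
	assert(demo_checkpoint(128, 64) == 128);
}
```

      A test of "> 2" in place of "> 3" would let the value 3 through, which
      is the leak the commit message describes.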
    • raid1: handle read error also in readonly mode · 7449f699
      Tomasz Majchrzak committed
      If a write is the first operation on a disk and it happens not to be
      aligned to the page size, the block layer sends a read request first.
      If the read operation fails, the disk is marked as failed, because no
      attempt is made to fix the error while the array is in auto-readonly
      mode. Similarly, the disk is marked as failed for a read-only array.
      
      Take the same approach as in raid10: don't fail the disk if the array
      is in readonly or auto-readonly mode. Try to redirect the request first
      and, if unsuccessful, return a read error.
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5-cache: correct condition for empty metadata write · 9a8b27fa
      Shaohua Li committed
      As long as we recover one metadata block, we should write the empty
      metadata block. The original code could corrupt recovery if only one
      metadata block is valid.
      Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 25 Oct, 2016 5 commits
    • md: report 'write_pending' state when array in sync · 16f88949
      Tomasz Majchrzak committed
      If there is a bad block on a disk and a recovery is performed from
      this disk, the same bad block is reported for the new disk. This
      involves setting the MD_CHANGE_PENDING flag in rdev_set_badblocks. For
      external metadata this flag is not cleared, because the array state is
      reported as 'clean'. A read request to the bad block in a RAID5 array
      gets stuck, as it waits for the flag to be cleared, as per commit
      c3cce6cd ("md/raid5: ensure device failure recorded before write
      request returns.").
      
      The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
      clarified in commit 070dc6dd ("md: resolve confusion of
      MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in
      personality error handlers since and it doesn't fully comply with
      initial purpose. It was supposed to notify that write request is about
      to start, however now it is also used to request metadata update.
      Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has
      been set and in_sync has been set to 0 at the same time. Error handlers
      just set the flag without modifying in_sync value. Sysfs array state is
      a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is
      set and in_sync is set to 1. Userspace has no idea it is expected to
      take some action.
      
      Swap the order that array state is checked so 'write_pending' is
      reported ahead of 'clean' ('write_pending' is a misleading name but it
      is too late to rename it now).
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
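
      The reordering described in the 'write_pending' commit above can be
      sketched as a small decision function. The enum and function names are
      hypothetical, and the real sysfs code distinguishes more states; this
      only shows why the order of the two checks matters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the array states reported to userspace. */
enum demo_array_state {
	DEMO_STATE_CLEAN,
	DEMO_STATE_ACTIVE,
	DEMO_STATE_WRITE_PENDING,
};

/*
 * Check the pending flag first. If in_sync were tested first, an array
 * with in_sync == 1 and a pending metadata update would report 'clean',
 * and userspace would not know that it has to act.
 */
static enum demo_array_state demo_array_state(bool change_pending, bool in_sync)
{
	if (change_pending)
		return DEMO_STATE_WRITE_PENDING;
	if (in_sync)
		return DEMO_STATE_CLEAN;
	return DEMO_STATE_ACTIVE;
}

/* Small self-check for the case the commit message is about. */
static void demo_array_state_check(void)
{
	assert(demo_array_state(true, true) == DEMO_STATE_WRITE_PENDING);
}
```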
    • md/raid5: write an empty meta-block when creating log super-block · 56056c2e
      Zhengyuan Liu committed
      If the superblock points to an invalid meta block, r5l_load_log will
      set create_super to true and create a new superblock. This path is
      always taken if no write I/O has been issued to the array since it was
      created. Writing an empty meta block when the log superblock is first
      created avoids this unnecessary action.
      
      Another reason is the correctness of log recovery. Currently we have
      the code below to guarantee that log recovery is correct.
      
              if (ctx.seq > log->last_cp_seq + 1) {
                      int ret;
      
                      ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                      if (ret)
                              return ret;
                      log->seq = ctx.seq + 11;
                      log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                      r5l_write_super(log, ctx.pos);
              } else {
                      log->log_start = ctx.pos;
                      log->seq = ctx.seq;
              }
      
      If we have just created an array with a journal device, log->log_start
      and log->last_checkpoint should both be 0. Suppose we then write three
      meta blocks, all valid except the middle one, and a crash happens. The
      ctx.seq would equal log->last_cp_seq + 1 and, after recovery,
      log->log_start would be set to the position of the invalid middle meta
      block. This leads to problems which could be avoided with this patch.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: initialize next_checkpoint field before use · 28cd88e2
      Zhengyuan Liu committed
      This field was not initialized when the log was loaded or
      recovered; it was assigned only when I/O to the raid disks
      finished. So r5l_quiesce may use a wrong next_checkpoint
      to reclaim log space, which would confuse the reclaimable
      space calculation.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • RAID10: ignore discard error · 579ed34f
      Shaohua Li committed
      This is the counterpart of the raid1 fix. If a write error occurs,
      raid10 will try to rewrite the bio in smaller chunks. If the rewrite
      fails, raid10 will record the error as a bad block. narrow_write_error
      always uses WRITE for the bio, but it could actually be a discard.
      Since a discard bio has no payload, writing the bio will cause various
      issues. But a discard error isn't fatal, so we can safely ignore it.
      This is what this patch does.
      
      This issue has presumably existed since discard was added, but was only
      exposed by the recent arbitrary bio size feature.
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
    • RAID1: ignore discard error · e3f948cd
      Shaohua Li committed
      If a write error occurs, raid1 will try to rewrite the bio in smaller
      chunks. If the rewrite fails, raid1 will record the error as a bad
      block. narrow_write_error always uses WRITE for the bio, but it could
      actually be a discard. Since a discard bio has no payload, writing the
      bio will cause various issues. But a discard error isn't fatal, so we
      can safely ignore it. This is what this patch does.
      
      This issue has presumably existed since discard was added, but was only
      exposed by the recent arbitrary bio size feature.
      Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
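
      The policy described in the two "ignore discard error" commits above
      can be sketched as a small decision function. The enums and the
      function name are hypothetical; the real handlers operate on bios and
      per-device state rather than returning an action code.

```c
#include <assert.h>

/* Hypothetical operation types and follow-up actions. */
enum demo_op {
	DEMO_OP_WRITE,
	DEMO_OP_DISCARD,
};

enum demo_action {
	DEMO_ACTION_DONE,		/* no error, nothing to do */
	DEMO_ACTION_IGNORE_ERROR,	/* discard failed: not fatal, skip retry */
	DEMO_ACTION_NARROW_AND_RETRY,	/* write failed: rewrite in small chunks */
};

/*
 * Decide what to do when a write-type request completes. A failed
 * discard has no payload to rewrite, so it must not take the
 * narrow-and-retry path that is meant for ordinary writes.
 */
static enum demo_action demo_on_write_end(enum demo_op op, int error)
{
	if (!error)
		return DEMO_ACTION_DONE;
	if (op == DEMO_OP_DISCARD)
		return DEMO_ACTION_IGNORE_ERROR;
	return DEMO_ACTION_NARROW_AND_RETRY;
}

/* Small self-check for the case the commits address. */
static void demo_on_write_end_check(void)
{
	assert(demo_on_write_end(DEMO_OP_DISCARD, -5) == DEMO_ACTION_IGNORE_ERROR);
}
```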
  5. 24 Oct, 2016 1 commit
  6. 19 Oct, 2016 2 commits
  7. 18 Oct, 2016 1 commit
  8. 14 Oct, 2016 2 commits
  9. 12 Oct, 2016 2 commits
    • kthread: kthread worker API cleanup · 3989144f
      Petr Mladek committed
      A good practice is to prefix the names of functions by the name
      of the subsystem.
      
      The kthread worker API is a mix of classic kthreads and workqueues.  Each
      worker has a dedicated kthread.  It runs a generic function that processes
      queued works.  It is implemented as part of the kthread subsystem.
      
      This patch renames the existing kthread worker API to use
      the corresponding name from the workqueues API prefixed by
      kthread_:
      
      __init_kthread_worker()		-> __kthread_init_worker()
      init_kthread_worker()		-> kthread_init_worker()
      init_kthread_work()		-> kthread_init_work()
      insert_kthread_work()		-> kthread_insert_work()
      queue_kthread_work()		-> kthread_queue_work()
      flush_kthread_work()		-> kthread_flush_work()
      flush_kthread_worker()		-> kthread_flush_worker()
      
      Note that the names of DEFINE_KTHREAD_WORK*() macros stay
      as they are. It is common that the "DEFINE_" prefix has
      precedence over the subsystem names.
      
      Note that INIT() macros and init() functions use different
      naming scheme. There is no good solution. There are several
      reasons for this solution:
      
        + "init" in the function names stands for the verb "initialize"
          aka "initialize worker". While "INIT" in the macro names
          stands for the noun "INITIALIZER" aka "worker initializer".
      
        + INIT() macros are used only in DEFINE() macros
      
        + init() functions are used close to the other kthread()
          functions. It looks much better if all the functions
          use the same scheme.
      
        + There will be also kthread_destroy_worker() that will
          be used close to kthread_cancel_work(). It is related
          to the init() function. Again it looks better if all
          functions use the same naming scheme.
      
        + there are several precedents for such init() function
          names, e.g. amd_iommu_init_device(), free_area_init_node(),
          jump_label_init_type(),  regmap_init_mmio_clk(),
      
        + It is not an argument but it was inconsistent even before.
      
      [arnd@arndb.de: fix linux-next merge conflict]
       Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dm raid: fix compat_features validation · 5c33677c
      Andy Whitcroft committed
      In ecbfb9f1 ("dm raid: add raid level takeover support") a new
      compatible feature flag was added.  Validation for these compat_features
      was added, but it only passes for new raid mappings that have this
      feature flag.  This causes previously created raid mappings to fail at
      import.
      
      Check compat_features for the only valid combination.
      
      Fixes: ecbfb9f1 ("dm raid: add raid level takeover support")
      Cc: stable@vger.kernel.org # v4.8
      Signed-off-by: Andy Whitcroft <apw@canonical.com>
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
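
      A minimal sketch of the validation idea in the dm raid commit above,
      assuming that the acceptable values are "no compat flags at all" (older
      mappings) and "exactly the one new flag". The flag name and its value
      are hypothetical placeholders, not the dm-raid definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit standing in for the flag added by ecbfb9f1. */
#define DEMO_FEATURE_NEW_FLAG 0x1u

/*
 * Accept superblocks written before the flag existed (value 0) as well
 * as new ones carrying exactly the known flag; reject anything else.
 */
static bool demo_compat_features_valid(uint32_t compat_features)
{
	return compat_features == 0 ||
	       compat_features == DEMO_FEATURE_NEW_FLAG;
}

/* Small self-check: an old mapping without the flag must still import. */
static void demo_compat_features_check(void)
{
	assert(demo_compat_features_valid(0));
}
```

      A check that required the flag to be present would reject every
      mapping created before the flag existed, which is the regression the
      commit message describes.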
  10. 04 Oct, 2016 1 commit
  11. 29 Sep, 2016 1 commit
    • dm mpath: always return reservation conflict without failing over · 8ff232c1
      Hannes Reinecke committed
      If dm-mpath encounters a reservation conflict it should not fail the
      path (as communication with the target is not affected) but should
      rather retry on another path.  However, in doing so we might be inducing
      a ping-pong between paths, with no guarantee of any forward progress.
      And arguably a reservation conflict is an unexpected error, so we should
      be passing it upwards to allow the application to take appropriate
      steps.
      
      This change resolves a show-stopper problem seen with the pNFS SCSI
      layout because it is trivial to hit reservation conflict based failover
      loops without it.
      
      Doubts were raised about the implications of this change relative to
      products like IBM's SVC.  But there is little point withholding a fix
      for Linux because a proprietary product may or may not have some issues
      in its implementation of how it interfaces with Linux.  In the future,
      if there is glaring evidence that this change is certainly problematic
      we can revisit it.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
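
      The behaviour described in the dm mpath commit above can be sketched
      as a completion-time decision. The status and decision enums are
      hypothetical; the real multipath target works with block-layer error
      codes and path groups.

```c
#include <assert.h>

/* Hypothetical completion statuses seen by a multipath layer. */
enum demo_status {
	DEMO_STATUS_OK,
	DEMO_STATUS_PATH_ERROR,			/* transport to the target failed */
	DEMO_STATUS_RESERVATION_CONFLICT,	/* target reachable, I/O refused */
};

enum demo_decision {
	DEMO_COMPLETE,			/* finish the request successfully */
	DEMO_FAIL_PATH_AND_RETRY,	/* mark the path failed, try another */
	DEMO_RETURN_ERROR_TO_CALLER,	/* pass the error upwards, no failover */
};

/*
 * A reservation conflict says nothing about path health, so the path is
 * left alone and the error is returned; retrying on another path could
 * ping-pong between paths without making forward progress.
 */
static enum demo_decision demo_end_io(enum demo_status status)
{
	switch (status) {
	case DEMO_STATUS_OK:
		return DEMO_COMPLETE;
	case DEMO_STATUS_RESERVATION_CONFLICT:
		return DEMO_RETURN_ERROR_TO_CALLER;
	case DEMO_STATUS_PATH_ERROR:
	default:
		return DEMO_FAIL_PATH_AND_RETRY;
	}
}

/* Small self-check for the case the commit changes. */
static void demo_end_io_check(void)
{
	assert(demo_end_io(DEMO_STATUS_RESERVATION_CONFLICT) ==
	       DEMO_RETURN_ERROR_TO_CALLER);
}
```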