1. 24 Oct, 2015 11 commits
    • raid5: add basic stripe log · f6bed0ef
      Shaohua Li committed
      This introduces a simple log for raid5. Data/parity destined for the
      raid array is first written to the log, then written to the raid array
      disks. If a crash happens, we can recover data from the log. This can
      speed up raid resync and fix the write hole issue.
      
      The log structure is pretty simple. Data/meta data is stored in block
      units, generally 4k. There is only one type of meta data block. The
      meta data block can track 3 types of data: stripe data, stripe parity
      and flush block. The MD superblock will point to the last valid meta
      data block. Each meta data block has a checksum and sequence number, so
      recovery can scan the log correctly. We store a checksum of the stripe
      data/parity in the meta data block, so meta data and stripe data/parity
      can be written to the log disk together. Otherwise, the meta data write
      would have to wait until the stripe data/parity write finished.
      
      For stripe data, the meta data block will record the stripe data sector
      and size. Currently the size is always 4k. This meta data record could
      be made simpler if we only fixed the write hole (e.g. we could record
      data of a stripe's different disks together), but this format can be
      extended to support caching in the future, which must record data
      address/size.
      
      For stripe parity, the meta data block will record the stripe sector.
      Its size should be 4k (for raid5) or 8k (for raid6). We always store P
      parity first. This format should work for caching too.
      
      A flush block indicates that a stripe is on the raid array disks. Fixing
      the write hole doesn't need this type of meta data; it's for the caching
      extension.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
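The log layout described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the kernel's on-disk format: the names `log_payload_type`, `log_meta_header`, `parity_payload_bytes` and `meta_block_valid` are hypothetical, and the rule that recovery accepts a block only when both the checksum and the expected sequence number match is an assumption drawn from the "checksum/seq number" sentence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_BLOCK_SIZE 4096u /* "4k generally", per the commit message */

/* The three record types a meta data block can track. */
enum log_payload_type {
	LOG_PAYLOAD_DATA,   /* stripe data: records data sector and size */
	LOG_PAYLOAD_PARITY, /* stripe parity: records the stripe sector */
	LOG_PAYLOAD_FLUSH,  /* stripe has reached the raid array disks */
};

/* Hypothetical header; field names are illustrative, not the kernel's. */
struct log_meta_header {
	uint32_t checksum; /* checksum of the meta data block itself */
	uint64_t seq;      /* sequence number used while scanning the log */
};

/* Parity payload size: P only for raid5 (4k), P then Q for raid6 (8k). */
static size_t parity_payload_bytes(int raid_level)
{
	assert(raid_level == 5 || raid_level == 6);
	return raid_level == 6 ? 2 * LOG_BLOCK_SIZE : LOG_BLOCK_SIZE;
}

/* Assumed recovery rule: checksum and expected sequence must both match. */
static int meta_block_valid(const struct log_meta_header *h,
			    uint32_t computed_checksum, uint64_t expected_seq)
{
	return h->checksum == computed_checksum && h->seq == expected_seq;
}
```

A recovery scan under these assumptions would walk forward from the block the superblock points to and stop at the first block for which `meta_block_valid()` returns 0.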
    • raid5: add a new state for stripe log handling · b70abcb2
      Shaohua Li committed
      When a stripe finishes construction, we normally write the stripe to
      raid in ops_run_io. With the log, we do a number of other operations
      before the stripe is written to raid, mainly writing the stripe to the
      log disk, flushing the disk cache and so on. The operations are still
      driven by raid5d and run in the stripe state machine. We introduce a new
      state for such a stripe (trapped into log). The stripe is in this state
      from the time it first enters ops_run_io (finishes construction) to the
      time it is written to raid. Since we know the state is only for the log,
      we bypass the other checks/operations in handle_stripe.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • raid5: export some functions · 6d036f7d
      Shaohua Li committed
      The next several patches use some raid5 functions. Rename them with a
      raid5 prefix and export them.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: override md superblock recovery_offset for journal device · 3069aa8d
      Shaohua Li committed
      The journal device stores data in a log structure. We need to record the
      log start. Here we override the md superblock recovery_offset for this
      purpose. This field is otherwise meaningless for a journal device.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
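Reusing one field for two meanings, as this commit message describes, can be modelled with a union. The sketch below is only an illustration under stated assumptions: `dev_info`, `is_journal`, `journal_tail` and `log_start` are hypothetical names, `sector_t` is a stand-in typedef, and the anonymous union requires C11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t; /* stand-in for the kernel type */

/* Hypothetical per-device record; names are illustrative. */
struct dev_info {
	int is_journal;
	union {
		sector_t recovery_offset; /* array member: recovery progress */
		sector_t journal_tail;    /* journal device: where the log starts */
	};
};

/* Return the log start for a journal device, 0 for any other device. */
static sector_t log_start(const struct dev_info *d)
{
	assert(d != NULL);
	return d->is_journal ? d->journal_tail : 0;
}
```

Both members occupy the same storage, so no extra space is needed; the `is_journal` flag decides which interpretation applies.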
    • MD: add a new disk role to present write journal device · bac624f3
      Song Liu committed
      The next patches will use a disk as a raid5/6 journal. We need a new
      disk role to represent the journal device, and we add MD_FEATURE_JOURNAL
      to feature_map for backward compatibility.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: replace special disk roles with macros · c4d4c91b
      Song Liu committed
      Add the following two macros for special roles: spare and faulty
      
      MD_DISK_ROLE_SPARE	0xffff
      MD_DISK_ROLE_FAULTY	0xfffe
      
      Add MD_DISK_ROLE_MAX	0xff00 as the maximal possible regular role,
      and minimal value of special role.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
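A role value can be classified against the three macros from this commit message roughly as shown below. The macro values are taken from the message; the `role_kind` enum and `classify_role` helper are hypothetical. The message calls 0xff00 both the maximal regular role and the minimal special role, so treating 0xff00 itself as special here is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define MD_DISK_ROLE_SPARE  0xffff
#define MD_DISK_ROLE_FAULTY 0xfffe
#define MD_DISK_ROLE_MAX    0xff00

/* Both special roles must sit above the regular range. */
static_assert(MD_DISK_ROLE_FAULTY > MD_DISK_ROLE_MAX,
	      "special roles sit above the regular range");

/* Hypothetical classification, for illustration only. */
enum role_kind { ROLE_REGULAR, ROLE_SPARE, ROLE_FAULTY, ROLE_OTHER_SPECIAL };

static enum role_kind classify_role(uint16_t role)
{
	if (role == MD_DISK_ROLE_SPARE)
		return ROLE_SPARE;
	if (role == MD_DISK_ROLE_FAULTY)
		return ROLE_FAULTY;
	if (role >= MD_DISK_ROLE_MAX) /* assumed: boundary counts as special */
		return ROLE_OTHER_SPECIAL;
	return ROLE_REGULAR; /* an ordinary slot number in the array */
}
```

Named macros make checks like these readable where the code would otherwise compare against bare 0xffff and 0xfffe.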
    • md-cluster: Call update_raid_disks() if another node --grow's raid_disks · 28c1b9fd
      Goldwyn Rodrigues committed
      To incorporate a --grow executed on one node, other nodes need to
      acknowledge the change in the number of disks. Call update_raid_disks()
      to update internal data structures.
      
      This leads to the call chain check_reshape() -> md_allow_write() ->
      md_update_sb(), which results in a deadlock. md_allow_write() is called
      so that memory can be allocated safely (allocation might trigger
      writeback, which might write to raid1). This is not required for md
      with a bitmap.
      
      In the clustered case, we don't perform md_update_sb() in md_allow_write(),
      but in do_md_run(). Also we disable safemode for clustered mode.
      
      mddev->recovery_cp need not be set in check_sb_changes() because this
      is required only when a node reads another node's bitmap. mddev->recovery_cp
      (which is read from sb->resync_offset), is set only if mddev is in_sync.
      Since we disabled safemode, in_sync is set to zero.
      In a clustered environment, the MD may not be in sync because another
      node could be writing to it. So make sure that in_sync is not set in
      case of clustered node in __md_stop_writes().
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: remove mddev arg from add_resync_info() · 30661b49
      NeilBrown committed
      The arg isn't used, so its presence is only confusing.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: don't cast void pointers when assigning them. · 2e2a7cd9
      NeilBrown committed
      It is common practice in the kernel to leave out this cast.
      It isn't needed and adds little if any value.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: discard unused sb_mutex. · 82381523
      NeilBrown committed
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__ · cf97a348
      Guoqing Jiang committed
      This patch fixes sparse warnings such as "incorrect type in assignment
      (different base types)" and "cast to restricted __le64".
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
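The kind of mistake these sparse warnings catch can be imitated in plain C by giving little-endian values their own type, so they cannot be mixed with native integers without an explicit conversion. This is only an analogy: the kernel uses annotated integer types checked by sparse, whereas `le64_t`, `to_le64` and `from_le64` below are hypothetical names for a portable stand-in.

```c
#include <assert.h>
#include <stdint.h>

/* Distinct type: assigning a uint64_t to it directly will not compile. */
typedef struct { uint8_t b[8]; } le64_t;

static_assert(sizeof(le64_t) == 8, "le64_t must be exactly 8 bytes");

/* Store a native value least significant byte first. */
static le64_t to_le64(uint64_t v)
{
	le64_t r;
	for (int i = 0; i < 8; i++)
		r.b[i] = (uint8_t)(v >> (8 * i));
	return r;
}

/* Rebuild the native value from little-endian bytes. */
static uint64_t from_le64(le64_t x)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v |= (uint64_t)x.b[i] << (8 * i);
	return v;
}
```

Because the conversions work byte by byte, they give the same result on big-endian and little-endian hosts.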
  2. 16 Oct, 2015 1 commit
  3. 14 Oct, 2015 1 commit
    • Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next · c2a06c38
      NeilBrown committed
      md-cluster: A better way for METADATA_UPDATED processing
      
      The processing of METADATA_UPDATED message is too simple and prone to
      errors. Besides, it would not update the internal data structures as
      required.
      
      This set of patches reads the superblock from one of the devices of the
      MD and checks for changes against the in-memory data structures. If there
      is a change, it performs the necessary actions to keep the internal data
      structures as they would be on the primary node.
      
      An example is when a device turns faulty. The algorithm is:
      
      1. The initiator node marks the device as faulty and updates the superblock
      2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes.
      3. The receiving node on receiving the METADATA_UPDATED message
        3.1 Reads the superblock
        3.2 Detects a device has failed by comparing with memory structure
        3.3 Calls the necessary functions to record the failure and get the device out of the active array.
        3.4 Acknowledges the message.
      
      The patch series also fixes adding the disk which was impacted because of
      the changes.
      
      Patches can also be found at
      https://github.com/goldwynr/linux branch md-next
      
      Changes since V2:
       - Fix status synchronization after --add and --re-add operations
       - Included Guoqing's patches on endian correctness, zeroing cmsg etc
       - Restructure add_new_disk() and cancel()
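Steps 3.1 to 3.3 of the algorithm in this merge message can be sketched as a comparison between roles read from the superblock and the in-memory state. This is a simplified model, not the md-cluster code: `mem_dev`, `apply_sb_changes` and the flat role array are hypothetical, and `ROLE_FAULTY` reuses the 0xfffe value listed for MD_DISK_ROLE_FAULTY earlier in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROLE_FAULTY 0xfffe /* same value as MD_DISK_ROLE_FAULTY */

/* Hypothetical in-memory view of one device on the receiving node. */
struct mem_dev {
	bool faulty;
};

/*
 * Compare roles read from the on-disk superblock with the in-memory
 * state, record every newly failed device, and return how many changed.
 */
static size_t apply_sb_changes(const uint16_t *sb_roles,
			       struct mem_dev *devs, size_t n)
{
	size_t newly_failed = 0;

	assert(sb_roles != NULL && devs != NULL);
	for (size_t i = 0; i < n; i++) {
		if (sb_roles[i] == ROLE_FAULTY && !devs[i].faulty) {
			devs[i].faulty = true; /* stands in for step 3.3 */
			newly_failed++;
		}
	}
	return newly_failed;
}
```

Processing the same superblock twice changes nothing the second time, since the in-memory state already matches what is on disk.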
  4. 13 Oct, 2015 8 commits
  5. 12 Oct, 2015 17 commits
  6. 11 Oct, 2015 2 commits