1. 19 11月, 2016 9 次提交
    • S
      md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu 提交于
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we priotize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In current implementation, reclaim happens when:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
         if there is no reclaim in the past 5 seconds.
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or cached stripes is enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 stripes are cached,
      or there is empty inactive lists), flush all full stripe. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some paritial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically include pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a39f7afd
    • S
      md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu 提交于
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calcualte parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multipile IO due to read-modify-write of
      parity) and provide more opportunities of full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at anytime. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more than necessary data
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1e6d690b
    • S
      md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu 提交于
      This patch adds state machine for raid5-cache. With log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaviors as a stripe in write through
      mode R5C_MODE_WRITE_THROUGH.
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With rhe RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2ded3703
    • S
      md/r5cache: move some code to raid5.h · 937621c3
      Song Liu 提交于
      Move some define and inline functions to raid5.h, so they can be
      used in raid5-cache.c
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      937621c3
    • S
      md/r5cache: Check array size in r5l_init_log · c757ec95
      Song Liu 提交于
      Currently, r5l_write_stripe checks meta size for each stripe write,
      which is not necessary.
      
      With this patch, r5l_init_log checks maximal meta size of the array,
      which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
      If this is too big to fit in one page, r5l_init_log aborts.
      
      With current meta data, r5l_log support raid_disks up to 203.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c757ec95
    • S
      md: add blktrace event for writes to superblock · 504634f6
      Shaohua Li 提交于
      superblock write is an expensive operation. With raid5-cache, it can be called
      regularly. Tracing to help performance debug.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Cc: NeilBrown <neilb@suse.com>
      504634f6
    • N
      md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      NeilBrown 提交于
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messsages so we can see when that is happening when
      looking for performance artefacts.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      578b54ad
    • N
      md/bitmap: add blktrace event for writes to the bitmap · 581dbd94
      NeilBrown 提交于
      We trace wheneven bitmap_unplug() finds that it needs to write
      to the bitmap, or when bitmap_daemon_work() find there is work
      to do.
      
      This makes it easier to correlate bitmap updates with data writes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      581dbd94
    • N
      md: add block tracing for bio_remapping · 109e3765
      NeilBrown 提交于
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the originial bio,
      or a secondary mapping for the bio that errors.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      109e3765
  2. 18 11月, 2016 1 次提交
  3. 10 11月, 2016 3 次提交
    • N
      md: remove md_super_wait() call after bitmap_flush() · 6119e679
      NeilBrown 提交于
      bitmap_flush() finishes with bitmap_update_sb(), and that finishes
      with write_page(..., 1), so write_page() will wait for all writes
      to complete.  So there is no point calling md_super_wait()
      immediately afterwards.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6119e679
    • N
      md: define mddev flags, recovery flags and r1bio state bits using enums · be306c29
      NeilBrown 提交于
      This is less error prone than using individual #defines.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      be306c29
    • N
      md/raid1: fix: IO can block resync indefinitely · f2c771a6
      NeilBrown 提交于
      While performing a resync/recovery, raid1 divides the
      array space into three regions:
       - before the resync
       - at or shortly after the resync point
       - much further ahead of the resync point.
      
      Write requests to the first or third do not need to wait.  Write
      requests to the middle region do need to wait if resync requests are
      pending.
      
      If there are any active write requests in the middle region, resync
      will wait for them.
      
      Due to an accounting error, there is a small range of addresses,
      between conf->next_resync and conf->start_next_window, where write
      requests will *not* be blocked, but *will* be counted in the middle
      region.  This can effectively block resync indefinitely if filesystem
      writes happen repeatedly to this region.
      
      As ->next_window_requests is incremented when the sector is after
        conf->start_next_window + NEXT_NORMALIO_DISTANCE
      the same boundary should be used for determining when write requests
      should wait.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f2c771a6
  4. 08 11月, 2016 19 次提交
  5. 05 11月, 2016 7 次提交
    • O
      HID: sensor: fix attributes in HID sensor interface · 4c4480aa
      Ooi, Joyce 提交于
      User is unable to access to input-X-yyy and feature-X-yyy where
      X is a hex value and more than 9 (e.g. input-a-yyy, feature-b-yyy) in HID
      sensor custom sysfs interface.
      This is because when creating the attribute, the attribute index is
      written to using %x (hex). However, when reading and writing values into
      the attribute, the attribute index is scanned using %d (decimal). Hence,
      user is unable to access to attributes with index in hex values
      (e.g. 'a', 'b', 'c') but able to access to attributes with index in
      decimal values (e.g. 1, 2, 3,..).
      This fix will change input-%d-%x-%s and feature-%d-%x-%s to input-%x-%x-%s
      and feature-%x-%x-%s in show_values() and store_values() accordingly.
      Signed-off-by: NOoi, Joyce <joyce.ooi@intel.com>
      Reviewed-by: NBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      4c4480aa
    • S
      HID: intel-ish-hid: request_irq failure · 021afd55
      Srinivas Pandruvada 提交于
      On some platforms ISH interrupt is shared, which causes request_irq to
      fail. This requires IRQF_SHARED irq flag.
      
      But IRQF_NO_SUSPEND and IRQF_SHARED should not be used together, so
      removed IRQF_NO_SUSPEND flag. Anyway this driver doesn't require
      IRQF_NO_SUSPEND, as this interrupt is not required during "noirq" phases
      of suspending and resuming devices as well as during the time when
      nonboot CPUs are taken offline and brought back online.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      021afd55
    • E
      HID: intel-ish-hid: Fix driver reinit failure · 2a1e3b93
      Even Xu 提交于
      When built as a module, modprobe followed by rmmod can fail because
      DMA was still active. So to fix this, DMA needs to be disabled during
      module exit.
      
      This change disables DMA during modules exit and change the ISH PCI
      device status to D3.
      Signed-off-by: NEven Xu <even.xu@intel.com>
      Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      2a1e3b93
    • E
      HID: intel-ish-hid: Move DMA disable code to new function · 8b2979fe
      Even Xu 提交于
      Add a new function ish_disable_dma() and move DMA disable operations
      here, so that this functionality can be reused.
      Signed-off-by: NEven Xu <even.xu@intel.com>
      Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      8b2979fe
    • E
      HID: intel-ish-hid: consolidate ish wake up operation · c2ed83f5
      Even Xu 提交于
      Same operations are done in ish_hw_start() and _ish_hw_reset() to
      wakeup ISH device. Consolidate them by introducing a new function
      ish_wakeup() and move the code there.
      Signed-off-by: NEven Xu <even.xu@intel.com>
      Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      c2ed83f5
    • N
      PCI: designware: Check for iATU unroll support after initializing host · 416379f9
      Niklas Cassel 提交于
      dw_pcie_iatu_unroll_enabled() reads a dbi_base register.  Reading any
      dbi_base register before pp->ops->host_init has been called causes
      "imprecise external abort" on platforms like ARTPEC-6, where the PCIe
      module is disabled at boot and first enabled in pp->ops->host_init.  Move
      dw_pcie_iatu_unroll_enabled() to dw_pcie_setup_rc(), since it is after
      pp->ops->host_init, but before pp->iatu_unroll_enabled is actually used.
      
      Fixes: a0601a47 ("PCI: designware: Add iATU Unroll feature")
      Tested-by: NJames Le Cuirot <chewi@gentoo.org>
      Signed-off-by: NNiklas Cassel <niklas.cassel@axis.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NJoao Pinto <jpinto@synopsys.com>
      Acked-by: NOlof Johansson <olof@lixom.net>
      416379f9
    • V
      i2c: core: fix NULL pointer dereference under race condition · 147b36d5
      Vladimir Zapolskiy 提交于
      Race condition between registering an I2C device driver and
      deregistering an I2C adapter device which is assumed to manage that
      I2C device may lead to a NULL pointer dereference due to the
      uninitialized list head of driver clients.
      
      The root cause of the issue is that the I2C bus may know about the
      registered device driver and thus it is matched by bus_for_each_drv(),
      but the list of clients is not initialized and commonly it is NULL,
      because I2C device drivers define struct i2c_driver as static and
      clients field is expected to be initialized by I2C core:
      
        i2c_register_driver()             i2c_del_adapter()
          driver_register()                 ...
            bus_add_driver()                ...
              ...                           bus_for_each_drv(..., __process_removed_adapter)
            ...                               i2c_do_del_adapter()
          ...                                   list_for_each_entry_safe(..., &driver->clients, ...)
          INIT_LIST_HEAD(&driver->clients);
      
      To solve the problem it is sufficient to do clients list head
      initialization before calling driver_register().
      
      The problem was found while using an I2C device driver with a sluggish
      registration routine on a bus provided by a physically detachable I2C
      master controller, but practically the oops may be reproduced under
      the race between arbitraty I2C device driver registration and managing
      I2C bus device removal e.g. by unbinding the latter over sysfs:
      
      % echo 21a4000.i2c > /sys/bus/platform/drivers/imx-i2c/unbind
        Unable to handle kernel NULL pointer dereference at virtual address 00000000
        Internal error: Oops: 17 [#1] SMP ARM
        CPU: 2 PID: 533 Comm: sh Not tainted 4.9.0-rc3+ #61
        Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
        task: e5ada400 task.stack: e4936000
        PC is at i2c_do_del_adapter+0x20/0xcc
        LR is at __process_removed_adapter+0x14/0x1c
        Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
        Control: 10c5387d  Table: 35bd004a  DAC: 00000051
        Process sh (pid: 533, stack limit = 0xe4936210)
        Stack: (0xe4937d28 to 0xe4938000)
        Backtrace:
        [<c0667be0>] (i2c_do_del_adapter) from [<c0667cc0>] (__process_removed_adapter+0x14/0x1c)
        [<c0667cac>] (__process_removed_adapter) from [<c0516998>] (bus_for_each_drv+0x6c/0xa0)
        [<c051692c>] (bus_for_each_drv) from [<c06685ec>] (i2c_del_adapter+0xbc/0x284)
        [<c0668530>] (i2c_del_adapter) from [<bf0110ec>] (i2c_imx_remove+0x44/0x164 [i2c_imx])
        [<bf0110a8>] (i2c_imx_remove [i2c_imx]) from [<c051a838>] (platform_drv_remove+0x2c/0x44)
        [<c051a80c>] (platform_drv_remove) from [<c05183d8>] (__device_release_driver+0x90/0x12c)
        [<c0518348>] (__device_release_driver) from [<c051849c>] (device_release_driver+0x28/0x34)
        [<c0518474>] (device_release_driver) from [<c0517150>] (unbind_store+0x80/0x104)
        [<c05170d0>] (unbind_store) from [<c0516520>] (drv_attr_store+0x28/0x34)
        [<c05164f8>] (drv_attr_store) from [<c0298acc>] (sysfs_kf_write+0x50/0x54)
        [<c0298a7c>] (sysfs_kf_write) from [<c029801c>] (kernfs_fop_write+0x100/0x214)
        [<c0297f1c>] (kernfs_fop_write) from [<c0220130>] (__vfs_write+0x34/0x120)
        [<c02200fc>] (__vfs_write) from [<c0221088>] (vfs_write+0xa8/0x170)
        [<c0220fe0>] (vfs_write) from [<c0221e74>] (SyS_write+0x4c/0xa8)
        [<c0221e28>] (SyS_write) from [<c0108a20>] (ret_fast_syscall+0x0/0x1c)
      Signed-off-by: NVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
      Signed-off-by: NWolfram Sang <wsa@the-dreams.de>
      Cc: stable@kernel.org
      147b36d5
  6. 04 11月, 2016 1 次提交