1. 18 3月, 2020 1 次提交
    • A
      md/raid10: Fix raid10 replace hang when new added disk faulty · 52c7cc01
      Alex Wu 提交于
      commit ee37d7314a32ab6809eacc3389bad0406c69a81f upstream.
      
      [Symptom]
      
      Resync thread hang when new added disk faulty during replacing.
      
      [Root Cause]
      
      In raid10_sync_request(), we expect to issue a bio with callback
      end_sync_read(), and a bio with callback end_sync_write().
      
      In normal situation, we will add resyncing sectors into
      mddev->recovery_active when raid10_sync_request() returned, and sub
      resynced sectors from mddev->recovery_active when end_sync_write()
      calls end_sync_request().
      
      If new added disk, which are replacing the old disk, is set faulty,
      there is a race condition:
          1. In the first rcu protected section, resync thread did not detect
             that mreplace is set faulty and pass the condition.
          2. In the second rcu protected section, mreplace is set faulty.
          3. But, resync thread will prepare the read object first, and then
             check the write condition.
          4. It will find that mreplace is set faulty and do not have to
             prepare write object.
      This cause we add resync sectors but never sub it.
      
      [How to Reproduce]
      
      This issue can be easily reproduced by the following steps:
          mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
          mdadm /dev/md0 -a /dev/sde
          mdadm /dev/md0 --replace /dev/sdd
          sleep 1
          mdadm /dev/md0 -f /dev/sde
      
      [How to Fix]
      
      This issue can be fixed by using local variables to record the result
      of test conditions. Once the conditions are satisfied, we can make sure
      that we need to issue a bio for read and a bio for write.
      
      Previous 'commit 24afd80d ("md/raid10: handle recovery of
      replacement devices.")' will also check whether bio is NULL, but leave
      the comment saying that it is a pointless test. So we remove this dummy
      check.
      Reported-by: NAlex Chen <alexchen@synology.com>
      Reviewed-by: NAllen Peng <allenpeng@synology.com>
      Reviewed-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NAlex Wu <alexwu@synology.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      52c7cc01
  2. 18 12月, 2019 1 次提交
    • D
      md: improve handling of bio with REQ_PREFLUSH in md_flush_request() · a12c768d
      David Jeffery 提交于
      commit 775d78319f1ceb32be8eb3b1202ccdc60e9cb7f1 upstream.
      
      If pers->make_request fails in md_flush_request(), the bio is lost. To
      fix this, pass back a bool to indicate if the original make_request call
      should continue to handle the I/O and instead of assuming the flush logic
      will push it to completion.
      
      Convert md_flush_request to return a bool and no longer calls the raid
      driver's make_request function.  If the return is true, then the md flush
      logic has or will complete the bio and the md make_request call is done.
      If false, then the md make_request function needs to keep processing like
      it is a normal bio. Let the original call to md_handle_request handle any
      need to retry sending the bio to the raid driver's make_request function
      should it be needed.
      
      Also mark md_flush_request and the make_request function pointer as
      __must_check to issue warnings should these critical return values be
      ignored.
      
      Fixes: 2bc13b83e629 ("md: batch flush requests.")
      Cc: stable@vger.kernel.org # # v4.19+
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
      Reviewed-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a12c768d
  3. 01 12月, 2019 1 次提交
  4. 24 3月, 2019 1 次提交
  5. 19 3月, 2019 1 次提交
  6. 13 2月, 2019 1 次提交
    • G
      md: fix raid10 hang issue caused by barrier · 54541a75
      Guoqing Jiang 提交于
      [ Upstream commit e820d55cb99dd93ac2dc949cf486bb187e5cd70d ]
      
      When both regular IO and resync IO happen at the same time,
      and if we also need to split regular. Then we can see tasks
      hang due to barrier.
      
      1. resync thread
      [ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
      [ 1463.757207]       Not tainted 4.19.5-1-default #1
      [ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1463.757212] md1_resync      D    0  5215      2 0x80000000
      [ 1463.757216] Call Trace:
      [ 1463.757223]  ? __schedule+0x29a/0x880
      [ 1463.757231]  ? raise_barrier+0x8d/0x140 [raid10]
      [ 1463.757236]  schedule+0x78/0x110
      [ 1463.757243]  raise_barrier+0x8d/0x140 [raid10]
      [ 1463.757248]  ? wait_woken+0x80/0x80
      [ 1463.757257]  raid10_sync_request+0x1f6/0x1e30 [raid10]
      [ 1463.757265]  ? _raw_spin_unlock_irq+0x22/0x40
      [ 1463.757284]  ? is_mddev_idle+0x125/0x137 [md_mod]
      [ 1463.757302]  md_do_sync.cold.78+0x404/0x969 [md_mod]
      [ 1463.757311]  ? wait_woken+0x80/0x80
      [ 1463.757336]  ? md_rdev_init+0xb0/0xb0 [md_mod]
      [ 1463.757351]  md_thread+0xe9/0x140 [md_mod]
      [ 1463.757358]  ? _raw_spin_unlock_irqrestore+0x2e/0x60
      [ 1463.757364]  ? __kthread_parkme+0x4c/0x70
      [ 1463.757369]  kthread+0x112/0x130
      [ 1463.757374]  ? kthread_create_worker_on_cpu+0x40/0x40
      [ 1463.757380]  ret_from_fork+0x3a/0x50
      
      2. regular IO
      [ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
      [ 1463.760683]       Not tainted 4.19.5-1-default #1
      [ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1463.760687] kworker/0:8     D    0  5367      2 0x80000000
      [ 1463.760718] Workqueue: md submit_flushes [md_mod]
      [ 1463.760721] Call Trace:
      [ 1463.760731]  ? __schedule+0x29a/0x880
      [ 1463.760741]  ? wait_barrier+0xdd/0x170 [raid10]
      [ 1463.760746]  schedule+0x78/0x110
      [ 1463.760753]  wait_barrier+0xdd/0x170 [raid10]
      [ 1463.760761]  ? wait_woken+0x80/0x80
      [ 1463.760768]  raid10_write_request+0xf2/0x900 [raid10]
      [ 1463.760774]  ? wait_woken+0x80/0x80
      [ 1463.760778]  ? mempool_alloc+0x55/0x160
      [ 1463.760795]  ? md_write_start+0xa9/0x270 [md_mod]
      [ 1463.760801]  ? try_to_wake_up+0x44/0x470
      [ 1463.760810]  raid10_make_request+0xc1/0x120 [raid10]
      [ 1463.760816]  ? wait_woken+0x80/0x80
      [ 1463.760831]  md_handle_request+0x121/0x190 [md_mod]
      [ 1463.760851]  md_make_request+0x78/0x190 [md_mod]
      [ 1463.760860]  generic_make_request+0x1c6/0x470
      [ 1463.760870]  raid10_write_request+0x77a/0x900 [raid10]
      [ 1463.760875]  ? wait_woken+0x80/0x80
      [ 1463.760879]  ? mempool_alloc+0x55/0x160
      [ 1463.760895]  ? md_write_start+0xa9/0x270 [md_mod]
      [ 1463.760904]  raid10_make_request+0xc1/0x120 [raid10]
      [ 1463.760910]  ? wait_woken+0x80/0x80
      [ 1463.760926]  md_handle_request+0x121/0x190 [md_mod]
      [ 1463.760931]  ? _raw_spin_unlock_irq+0x22/0x40
      [ 1463.760936]  ? finish_task_switch+0x74/0x260
      [ 1463.760954]  submit_flushes+0x21/0x40 [md_mod]
      
      So resync io is waiting for regular write io to complete to
      decrease nr_pending (conf->barrier++ is called before waiting).
      The regular write io splits another bio after call wait_barrier
      which call nr_pending++, then the splitted bio would continue
      with raid10_write_request -> wait_barrier, so the splitted bio
      has to wait for barrier to be zero, then deadlock happens as
      follows.
      
      	resync io		regular io
      
      	raise_barrier
      				wait_barrier
      				generic_make_request
      				wait_barrier
      
      To resolve the issue, we need to call allow_barrier to decrease
      nr_pending before generic_make_request since regular IO is not
      issued to underlying devices, and wait_barrier is called again
      to ensure no internal IO happening.
      
      Fixes: fc9977dd ("md/raid10: simplify the splitting of requests.")
      Reported-and-tested-by: NSiniša Bandin <sinisa@4net.rs>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      54541a75
  7. 14 11月, 2018 1 次提交
  8. 01 9月, 2018 1 次提交
    • X
      RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0 · 1d0ffd26
      Xiao Ni 提交于
      In raid10 reshape_request it gets max_sectors in read_balance. If the underlayer disks
      have bad blocks, the max_sectors is less than last. It will call goto read_more many
      times. It calls raise_barrier(conf, sectors_done != 0) every time. In this condition
      sectors_done is not 0. So the value passed to the argument force of raise_barrier is
      true.
      
      In raise_barrier it checks conf->barrier when force is true. If force is true and
      conf->barrier is 0, it panic. In this case reshape_request submits bio to under layer
      disks. And in the callback function of the bio it calls lower_barrier. If the bio
      finishes before calling raise_barrier again, it can trigger the BUG_ON.
      
      Add one pair of raise_barrier/lower_barrier to fix this bug.
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Suggested-by: NNeil Brown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1d0ffd26
  9. 02 8月, 2018 1 次提交
  10. 29 6月, 2018 1 次提交
  11. 13 6月, 2018 2 次提交
    • K
      treewide: kzalloc() -> kcalloc() · 6396bb22
      Kees Cook 提交于
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a * b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kzalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6396bb22
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  12. 31 5月, 2018 1 次提交
  13. 02 5月, 2018 2 次提交
    • G
      raid10: check bio in r10buf_pool_free to void NULL pointer dereference · eb81b328
      Guoqing Jiang 提交于
      For recovery case, r10buf_pool_alloc only allocates 2 bios,
      so we can't access more than 2 bios in r10buf_pool_free.
      Otherwise, we can see NULL pointer dereference as follows:
      
      [   98.347009] BUG: unable to handle kernel NULL pointer dereference
      at 0000000000000050
      [   98.355783] IP: r10buf_pool_free+0x38/0xe0 [raid10]
      [...]
      [   98.543734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   98.550161] CR2: 0000000000000050 CR3: 000000089500a001 CR4: 00000000001606f0
      [   98.558145] Call Trace:
      [   98.560881]  <IRQ>
      [   98.563136]  put_buf+0x19/0x20 [raid10]
      [   98.567426]  end_sync_request+0x6b/0x70 [raid10]
      [   98.572591]  end_sync_write+0x9b/0x160 [raid10]
      [   98.577662]  blk_update_request+0x78/0x2c0
      [   98.582254]  scsi_end_request+0x2c/0x1e0 [scsi_mod]
      [   98.587719]  scsi_io_completion+0x22f/0x610 [scsi_mod]
      [   98.593472]  blk_done_softirq+0x8e/0xc0
      [   98.597767]  __do_softirq+0xde/0x2b3
      [   98.601770]  irq_exit+0xae/0xb0
      [   98.605285]  do_IRQ+0x81/0xd0
      [   98.608606]  common_interrupt+0x7d/0x7d
      [   98.612898]  </IRQ>
      
      So we need to check the bio is valid or not before the bio is
      used in r10buf_pool_free. Another workable way is to free 2 bios
      for recovery case just like r10buf_pool_alloc.
      
      Fixes: f0250618 ("md: raid10: don't use bio's vec table to manage resync pages")
      Reported-by: NAlexis Castilla <pencerval@gmail.com>
      Tested-by: NAlexis Castilla <pencerval@gmail.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      eb81b328
    • Y
      md: fix an error code format and remove unsed bio_sector · 13db16d7
      Yufen Yu 提交于
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      13db16d7
  14. 09 3月, 2018 1 次提交
  15. 26 2月, 2018 2 次提交
    • B
      md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      BingJing Chang 提交于
      There is a potential deadlock if mount/umount happens when
      raid5_finish_reshape() tries to grow the size of emulated disk.
      
      How the deadlock happens?
      1) The raid5 resync thread finished reshape (expanding array).
      2) The mount or umount thread holds VFS sb->s_umount lock and tries to
         write through critical data into raid5 emulated block device. So it
         waits for raid5 kernel thread handling stripes in order to finish it
         I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
      The raid5 kernel thread is an emulated block device. It is responible to
      handle I/Os (stripes) from upper layers. The emulated block device
      should not request any I/Os on itself. That is, it should not call VFS
      layer functions. (If it did, it will try to acquire VFS locks to
      guarantee the I/Os sequence.) So we have the resync thread to send
      resync I/O requests and to wait for the results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that there is the same deadlock issue in raid10. It's
      reproducible and can be fixed by this patch. For raid10.c, we can remove
      the similar code to prevent deadlock as well since they has been called
      before.
      Reported-by: NAlex Wu <alexwu@synology.com>
      Reviewed-by: NAlex Wu <alexwu@synology.com>
      Reviewed-by: NChung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      8876391e
    • L
      md-cluster: choose correct label when clustered layout is not supported · 43a52123
      Lidong Zhong 提交于
      r10conf is already successfully allocated before checking the layout
      Signed-off-by: NLidong Zhong <lzhong@suse.com>
      Reviewed-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      43a52123
  16. 20 2月, 2018 1 次提交
    • Y
      md raid10: fix NULL deference in handle_write_completed() · 01a69cab
      Yufen Yu 提交于
      In the case of 'recover', an r10bio with R10BIO_WriteError &
      R10BIO_IsRecover will be progressed by handle_write_completed().
      This function traverses all r10bio->devs[copies].
      If devs[m].repl_bio != NULL, it thinks conf->mirrors[dev].replacement
      is also not NULL. However, this is not always true.
      
      When there is an rdev of raid10 has replacement, then each r10bio
      ->devs[m].repl_bio != NULL in conf->r10buf_pool. However, in 'recover',
      even if corresponded replacement is NULL, it doesn't clear r10bio
      ->devs[m].repl_bio, resulting in replacement NULL deference.
      
      This bug was introduced when replacement support for raid10 was
      added in Linux 3.3.
      
      As NeilBrown suggested:
      	Elsewhere the determination of "is this device part of the
      	resync/recovery" is made by resting bio->bi_end_io.
      	If this is end_sync_write, then we tried to write here.
      	If it is NULL, then we didn't try to write.
      
      Fixes: 9ad1aefc ("md/raid10:  Handle replacement devices during resync.")
      Cc: stable (V3.3+)
      Suggested-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>
      01a69cab
  17. 18 2月, 2018 1 次提交
  18. 12 12月, 2017 1 次提交
    • N
      md/raid1,raid10: silence warning about wait-within-wait · 474beb57
      NeilBrown 提交于
      If you prepare_to_wait() after a previous prepare_to_wait(),
      but before calling schedule(), you get warning:
      
        do not call blocking ops when !TASK_RUNNING; state=2
      
      This is appropriate as it is often a bug.  The event that the
      first prepare_to_wait() expects might wake up the schedule following
      the second prepare_to_wait(), which could be confusing.
      
      However if both prepare_to_wait()s are part of simple wait_event()
      loops, and if the inner one is rarely called, then there is
      no problem.  The inner loop is too simple to get confused by
      a stray wakeup, and the outer loop won't spin unduly because the
      inner doesnt affect it often.
      
      This pattern occurs in both raid1.c and raid10.c in the use of
      flush_pending_writes().
      
      The warning can be silenced by setting current->state to TASK_RUNNING.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      474beb57
  19. 02 12月, 2017 1 次提交
  20. 02 11月, 2017 4 次提交
    • G
      md-cluster: Use a small window for raid10 resync · 8db87912
      Guoqing Jiang 提交于
      Suspending the entire device for resync could take
      too long. Resync in small chunks.
      
      cluster's resync window is maintained in r10conf as
      cluster_sync_low and cluster_sync_high, and processed
      in raid10's sync_request(). If the current resync is
      outside the cluster resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Set cluster_sync_high to cluster_sync_low + stripe
         size.
      3. Send a message to all nodes so they may add it in
         their suspension list.
      
      Note:
      We only support "near" raid10 so far, resync a far or
      offset raid10 array could have trouble. So raid10_run
      checks the layout of clustered raid10, it will refuse
      to run if the layout is not correct.
      
      With the "near" layout we process one stripe at a time
      progressing monotonically through the address space.
      So we can have a sliding window of whole-stripes which
      moves through the array suspending IO on other nodes,
      and both resync which uses array addresses and recovery
      which uses device addresses can stay within this window.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      8db87912
    • G
      md-cluster: Suspend writes in RAID10 if within range · cb8a7a7e
      Guoqing Jiang 提交于
      If there is a resync going on, all nodes must suspend
      writes to the range. This is recorded in suspend_info
      and suspend_list.
      
      If there is an I/O within the ranges of any of the
      suspend_info, area_resyncing will return 1.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cb8a7a7e
    • G
      md-cluster/raid10: set "do_balance = 0" if area is resyncing · d4098c72
      Guoqing Jiang 提交于
      Just like clustered raid1, it is impossible for cluster raid10
      to choose the best device for read balance when the area of
      array is resyncing. Because we cannot trust the data to be the
      same on all devices at that time, so we choose just the first
      one to use, so set do_balance to 0.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d4098c72
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
  21. 17 10月, 2017 3 次提交
  22. 26 8月, 2017 1 次提交
  23. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  24. 22 7月, 2017 4 次提交
  25. 19 6月, 2017 1 次提交
  26. 17 6月, 2017 1 次提交
    • G
      md/raid10: fix FailFast test for wrong device · 1cdd1257
      Guoqing Jiang 提交于
      We need to test FailFast flag for replacement device here
      since the set up for writing is for the replacement, so we
      need fix it like:
      
      - if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
      + if (test_bit(FailFast, &conf->mirrors[d].replacement->flags))
      
      Since commit f90145f3 ("md/raid10: add rcu protection
      to rdev access in raid10_sync_request.") had added the rcu
      protection for the part, so let's extend the range protected
      by rcu and use rdev directly.
      
      Fixes: 1919cbb2 ("md/raid10: add failfast handling for writes.")
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1cdd1257
  27. 14 6月, 2017 1 次提交
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  28. 09 6月, 2017 1 次提交
  29. 06 6月, 2017 1 次提交