1. 27 12月, 2019 3 次提交
    • M
      dm snapshot: rework COW throttling to fix deadlock · 358f9553
      Mikulas Patocka 提交于
      [ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ]
      
      Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
      workqueue stalls") introduced a semaphore to limit the maximum number of
      in-flight kcopyd (COW) jobs.
      
      The implementation of this throttling mechanism is prone to a deadlock:
      
      1. One or more threads write to the origin device causing COW, which is
         performed by kcopyd.
      
      2. At some point some of these threads might reach the s->cow_count
         semaphore limit and block in down(&s->cow_count), holding a read lock
         on _origins_lock.
      
      3. Someone tries to acquire a write lock on _origins_lock, e.g.,
         snapshot_ctr(), which blocks because the threads at step (2) already
         hold a read lock on it.
      
      4. A COW operation completes and kcopyd runs dm-snapshot's completion
         callback, which ends up calling pending_complete().
         pending_complete() tries to resubmit any deferred origin bios. This
         requires acquiring a read lock on _origins_lock, which blocks.
      
         This happens because the read-write semaphore implementation gives
         priority to writers, meaning that as soon as a writer tries to enter
         the critical section, no readers will be allowed in, until all
         writers have completed their work.
      
         So, pending_complete() waits for the writer at step (3) to acquire
         and release the lock. This writer waits for the readers at step (2)
         to release the read lock and those readers wait for
         pending_complete() (the kcopyd thread) to signal the s->cow_count
         semaphore: DEADLOCK.
      
      The above was thoroughly analyzed and documented by Nikos Tsironis as
      part of his initial proposal for fixing this deadlock, see:
      https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html
      
      Fix this deadlock by reworking COW throttling so that it waits without
      holding any locks. Add a variable 'in_progress' that counts how many
      kcopyd jobs are running. A function wait_for_in_progress() will sleep if
      'in_progress' is over the limit. It drops _origins_lock in order to
      avoid the deadlock.
      Reported-by: NGuruswamy Basavaiah <guru2018@gmail.com>
      Reported-by: NNikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: NNikos Tsironis <ntsironis@arrikto.com>
      Tested-by: NNikos Tsironis <ntsironis@arrikto.com>
      Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
      Cc: stable@vger.kernel.org # v5.0+
      Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      358f9553
    • M
      dm snapshot: introduce account_start_copy() and account_end_copy() · 1452343e
      Mikulas Patocka 提交于
      [ Upstream commit a2f83e8b0c82c9500421a26c49eb198b25fcdea3 ]
      
      This simple refactoring moves code for modifying the semaphore cow_count
      into separate functions to prepare for changes that will extend these
      methods to provide for a more sophisticated mechanism for COW
      throttling.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NNikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      1452343e
    • N
      dm snapshot: Fix excessive memory usage and workqueue stalls · 6d63b552
      Nikos Tsironis 提交于
      [ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]
      
      kcopyd has no upper limit to the number of jobs one can allocate and
      issue. Under certain workloads this can lead to excessive memory usage
      and workqueue stalls. For example, when creating multiple dm-snapshot
      targets with a 4K chunk size and then writing to the origin through the
      page cache. Syncing the page cache causes a large number of BIOs to be
      issued to the dm-snapshot origin target, which itself issues an even
      larger (because of the BIO splitting taking place) number of kcopyd
      jobs.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
      
      , with 8 active snapshots, results in the kcopyd job slab cache growing
      to 10G. Depending on the available system RAM this can lead to the OOM
      killer killing user processes:
      
      [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                    nodemask=(null), order=1, oom_score_adj=0
      [463.492894] kthreadd cpuset=/ mems_allowed=0
      [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
      [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [463.492952] Call Trace:
      [463.492964]  dump_stack+0x7d/0xbb
      [463.492973]  dump_header+0x6b/0x2fc
      [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
      [463.493012]  oom_kill_process+0x302/0x370
      [463.493021]  out_of_memory+0x113/0x560
      [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
      [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
      [463.493067]  cache_grow_begin+0x81/0x8b0
      [463.493072]  ? cache_grow_begin+0x874/0x8b0
      [463.493078]  fallback_alloc+0x1e4/0x280
      [463.493092]  kmem_cache_alloc_node+0xd6/0x370
      [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
      [463.493105]  copy_process.part.31+0x1c5/0x20d0
      [463.493115]  ? __lock_acquire+0x3cc/0x1550
      [463.493121]  ? __switch_to_asm+0x34/0x70
      [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
      [463.493135]  ? finish_task_switch+0x90/0x280
      [463.493165]  _do_fork+0xe0/0x6d0
      [463.493191]  ? kthreadd+0x19f/0x220
      [463.493233]  kernel_thread+0x25/0x30
      [463.493235]  kthreadd+0x1bf/0x220
      [463.493242]  ? kthread_create_on_cpu+0x90/0x90
      [463.493248]  ret_from_fork+0x3a/0x50
      [463.493279] Mem-Info:
      [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
      [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
      [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
      [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
      [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
      [463.493285]  free:33623 free_pcp:2392 free_cma:0
      ...
      [463.493489] Unreclaimable slab info:
      [463.493513] Name                      Used          Total
      [463.493522] bio-6                   1028KB       1028KB
      [463.493525] bio-5                   1028KB       1028KB
      [463.493528] dm_snap_pending_exception     236783KB     243789KB
      [463.493531] dm_exception              41KB         42KB
      [463.493534] bio-4                   1216KB       1216KB
      [463.493537] bio-3                 439396KB     439396KB
      [463.493539] kcopyd_job           6973427KB    6973427KB
      ...
      [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
      [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
      [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Moreover, issuing a large number of kcopyd jobs results in kcopyd
      hogging the CPU, while processing them. As a result, processing of work
      items, queued for execution on the same CPU as the currently running
      kcopyd thread, is stalled for long periods of time, hurting performance.
      Running the aforementioned test we get, in dmesg, messages like the
      following:
      
      [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
      [67501.195586] Showing busy workqueues and worker pools:
      [67501.195591] workqueue events: flags=0x0
      [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195611]     pending: cache_reap
      [67501.195641] workqueue mm_percpu_wq: flags=0x8
      [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195656]     pending: vmstat_update
      [67501.195682] workqueue kblockd: flags=0x18
      [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
      [67501.195698]     pending: blk_timeout_work
      [67501.195753] workqueue kcopyd: flags=0x8
      [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195768]     pending: do_work [dm_mod]
      [67501.195802] workqueue kcopyd: flags=0x8
      [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195817]     pending: do_work [dm_mod]
      [67501.195834] workqueue kcopyd: flags=0x8
      [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195848]     pending: do_work [dm_mod]
      [67501.195881] workqueue kcopyd: flags=0x8
      [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195896]     pending: do_work [dm_mod]
      [67501.195920] workqueue kcopyd: flags=0x8
      [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
      [67501.195935]     in-flight: 67:do_work [dm_mod]
      [67501.195945]     pending: do_work [dm_mod]
      [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
      
      The root cause for these issues is the way dm-snapshot uses kcopyd. In
      particular, the lack of an explicit or implicit limit to the maximum
      number of in-flight COW jobs. The merging path is not affected because
      it implicitly limits the in-flight kcopyd jobs to one.
      
      Fix these issues by using a semaphore to limit the maximum number of
      in-flight kcopyd jobs. We grab the semaphore before allocating a new
      kcopyd job in start_copy() and start_full_bio() and release it after the
      job finishes in copy_callback().
      
      The initial semaphore value is configurable through a module parameter,
      to allow fine tuning the maximum number of in-flight COW jobs. Setting
      this parameter to zero initializes the semaphore to INT_MAX.
      
      A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
      value was decided experimentally as a trade-off between memory
      consumption, stalling the kernel's workqueues and maintaining a high
      enough throughput.
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * kcopyd's job slab cache uses a maximum of 130MB
        * The time taken by the test to write to the snapshot-origin target is
          reduced from 05m20.48s to 03m26.38s
      
      [1] https://github.com/jthornber/device-mapper-test-suiteSigned-off-by: NNikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: NIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6d63b552
  2. 09 8月, 2018 1 次提交
  3. 08 8月, 2018 1 次提交
  4. 13 6月, 2018 1 次提交
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  5. 05 6月, 2018 1 次提交
  6. 31 5月, 2018 1 次提交
  7. 17 1月, 2018 1 次提交
  8. 04 12月, 2017 1 次提交
    • M
      dm: fix various targets to dm_register_target after module __init resources created · 7e6358d2
      monty_pavel@sina.com 提交于
      A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
      processes race to load the dm-thin-pool module:
      
       PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
        #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
        0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
        0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
        0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
        0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
        0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
        0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
        0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
        0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
           [exception RIP: kmem_cache_alloc+108]
           RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
           RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
           RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
           RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
           R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
           R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
           ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
        0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
       0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
       0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
       0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
       0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
       0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
       0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]
      
      The race results in a NULL pointer because:
      
      Process A (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target not registered;
       	c. modprobe dm_thin_pool and wait until end.
      
      Process B (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target registered;
       	c. table_load->dm_table_add_target->pool_ctr;
       	d. _new_mapping_cache is NULL and panic.
      Note:
       	1. process A and process B are two concurrent processes.
       	2. pool_target can be detected by process B but
       	_new_mapping_cache initialization has not ended.
      
      To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
      with the same problem, simply dm_register_target() after all resources
      created during module init (as labelled with __init) are finished.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nmonty <monty_pavel@sina.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      7e6358d2
  9. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  10. 09 6月, 2017 3 次提交
  11. 26 4月, 2017 1 次提交
  12. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  13. 21 7月, 2016 2 次提交
    • T
      dm snap: add fake origin_direct_access · f6e629bd
      Toshi Kani 提交于
      dax-capable mapped-device is marked as DM_TYPE_DAX_BIO_BASED,
      which supports both dax and bio-based operations.  dm-snap
      needs to work with dax-capable device when bio-based operation
      is used.
      
      Add fake origin_direct_access() to origin device so that its
      origin device is also marked as DM_TYPE_DAX_BIO_BASED for
      dax-capable device.  This allows to extend target's DM table.
      dm-snap works normally when bio-based operation is used.
      
      dm-snap does not support dax operation, and mount with dax
      option to a target device or snapshot device fails.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      f6e629bd
    • C
      block: get rid of bio_rw and READA · 70246286
      Christoph Hellwig 提交于
      These two are confusing leftover of the old world order, combining
      values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
      special case we mostly just replace bi_rw with bio_data_dir or
      op_is_write, except for the few cases where a switch over the REQ_OP_
      values makes more sense.  Any check for READA is replaced with an
      explicit check for REQ_RAHEAD.  Also remove the READA alias for
      REQ_RAHEAD.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      70246286
  14. 08 6月, 2016 1 次提交
  15. 11 3月, 2016 1 次提交
    • D
      dm snapshot: disallow the COW and origin devices from being identical · 4df2bf46
      DingXiang 提交于
      Otherwise loading a "snapshot" table using the same device for the
      origin and COW devices, e.g.:
      
      echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap
      
      will trigger:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      [ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
      [ 1958.989655] PGD 0
      [ 1958.991903] Oops: 0000 [#1] SMP
      ...
      [ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G          IO    4.5.0-rc5.snitm+ #150
      ...
      [ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000
      [ 1959.091865] RIP: 0010:[<ffffffffa040efba>]  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
      [ 1959.104295] RSP: 0018:ffff88032a957b30  EFLAGS: 00010246
      [ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
      [ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00
      [ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001
      [ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80
      [ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80
      [ 1959.150021] FS:  00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000
      [ 1959.159047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0
      [ 1959.173415] Stack:
      [ 1959.175656]  ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc
      [ 1959.183946]  ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000
      [ 1959.192233]  ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf
      [ 1959.200521] Call Trace:
      [ 1959.203248]  [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot]
      [ 1959.211986]  [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot]
      [ 1959.219469]  [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod]
      [ 1959.226656]  [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod]
      [ 1959.234328]  [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod]
      [ 1959.241129]  [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod]
      [ 1959.248607]  [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod]
      [ 1959.255307]  [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20
      [ 1959.261816]  [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
      [ 1959.268615]  [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0
      [ 1959.274637]  [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100
      [ 1959.281726]  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
      [ 1959.288814]  [<ffffffff81216449>] SyS_ioctl+0x79/0x90
      [ 1959.294450]  [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      ...
      [ 1959.323277] RIP  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
      [ 1959.333090]  RSP <ffff88032a957b30>
      [ 1959.336978] CR2: 0000000000000098
      [ 1959.344121] ---[ end trace b049991ccad1169e ]---
      
      Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899
      Cc: stable@vger.kernel.org
      Signed-off-by: NDing Xiang <dingxiang@huawei.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      4df2bf46
  16. 23 2月, 2016 1 次提交
  17. 09 1月, 2016 1 次提交
    • M
      dm snapshot: fix hung bios when copy error occurs · 385277bf
      Mikulas Patocka 提交于
      When there is an error copying a chunk dm-snapshot can incorrectly hold
      associated bios indefinitely, resulting in hung IO.
      
      The function copy_callback sets pe->error if there was error copying the
      chunk, and then calls complete_exception.  complete_exception calls
      pending_complete on error, otherwise it calls commit_exception with
      commit_callback (and commit_callback calls complete_exception).
      
      The persistent exception store (dm-snap-persistent.c) assumes that calls
      to prepare_exception and commit_exception are paired.
      persistent_prepare_exception increases ps->pending_count and
      persistent_commit_exception decreases it.
      
      If there is a copy error, persistent_prepare_exception is called but
      persistent_commit_exception is not.  This results in the variable
      ps->pending_count never returning to zero and that causes some pending
      exceptions (and their associated bios) to be held forever.
      
      Fix this by unconditionally calling commit_exception regardless of
      whether the copy was successful.  A new "valid" parameter is added to
      commit_exception -- when the copy fails this parameter is set to zero so
      that the chunk that failed to copy (and all following chunks) is not
      recorded in the snapshot store.  Also, remove commit_callback now that
      it is merely a wrapper around pending_complete.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      385277bf
  18. 10 12月, 2015 1 次提交
  19. 10 10月, 2015 1 次提交
  20. 14 8月, 2015 1 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
  21. 13 8月, 2015 1 次提交
    • M
      dm snapshot: don't invalidate on-disk image on snapshot write overflow · 76c44f6d
      Mikulas Patocka 提交于
      When the snapshot overflows because of a write to the origin, the on-disk
      image has to be invalidated.  However, when the snapshot overflows because
      of a write to the snapshot, the on-disk image doesn't have to be
      invalidated.  Change the behavior so that the on-disk image is not
      invalidated in this case.
      
      When the snapshot overflows, the variable snapshot_overflowed is set.
      All writes to the snapshot are disallowed to minimize filesystem
      corruption - this condition is cleared when the snapshot is deactivated
      and activated.
      
      The user can extend the overflowed snapshot, deactivate and activate it
      again, run fsck (if journaling filesystem is not used) mount it and
      recover the data.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      76c44f6d
  22. 29 7月, 2015 1 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  23. 22 5月, 2015 1 次提交
    • M
      block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb
      Mike Snitzer 提交于
      Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
      non-chains") regressed all existing callers that followed this pattern:
       1) saving a bio's original bi_end_io
       2) wiring up an intermediate bi_end_io
       3) restoring the original bi_end_io from intermediate bi_end_io
       4) calling bio_endio() to execute the restored original bi_end_io
      
      The regression was due to BIO_CHAIN only ever getting set if
      bio_inc_remaining() is called.  For the above pattern it isn't set until
      step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
      the first bio_endio(), in step 2 above, never decremented __bi_remaining
      before calling the intermediate bi_end_io -- leaving __bi_remaining with
      the value 1 instead of 0.  When bio_inc_remaining() occurred during step
      3 it brought it to a value of 2.  When the second bio_endio() was
      called, in step 4 above, it should've called the original bi_end_io but
      it didn't because there was an extra reference that wasn't dropped (due
      to atomic operations being optimized away since BIO_CHAIN wasn't set
      upfront).
      
      Fix this issue by removing the __bi_remaining management complexity for
      all callers that use the above pattern -- bio_chain() is the only
      interface that _needs_ to be concerned with __bi_remaining.  For the
      above pattern callers just expect the bi_end_io they set to get called!
      Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
      that aren't associated with the bio_chain() interface.
      
      Also, the bio_inc_remaining() interface has been moved local to bio.c.
      
      Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      326e1dbb
  24. 06 5月, 2015 1 次提交
  25. 28 2月, 2015 2 次提交
    • M
      dm snapshot: suspend merging snapshot when doing exception handover · 09ee96b2
      Mikulas Patocka 提交于
      The "dm snapshot: suspend origin when doing exception handover" commit
      fixed a exception store handover bug associated with pending exceptions
      to the "snapshot-origin" target.
      
      However, a similar problem exists in snapshot merging.  When snapshot
      merging is in progress, we use the target "snapshot-merge" instead of
      "snapshot-origin".  Consequently, during exception store handover, we
      must find the snapshot-merge target and suspend its associated
      mapped_device.
      
      To avoid lockdep warnings, the target must be suspended and resumed
      without holding _origins_lock.
      
      Introduce a dm_hold() function that grabs a reference on a
      mapped_device, but unlike dm_get(), it doesn't crash if the device has
      the DMF_FREEING flag set, it returns an error in this case.
      
      In snapshot_resume() we grab the reference to the origin device using
      dm_hold() while holding _origins_lock (_origins_lock guarantees that the
      device won't disappear).  Then we release _origins_lock, suspend the
      device and grab _origins_lock again.
      
      NOTE to stable@ people:
      When backporting to kernels 3.18 and older, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      09ee96b2
    • M
      dm snapshot: suspend origin when doing exception handover · b735fede
      Mikulas Patocka 提交于
      In the function snapshot_resume we perform exception store handover.  If
      there is another active snapshot target, the exception store is moved
      from this target to the target that is being resumed.
      
      The problem is that if there is some pending exception, it will point to
      an incorrect exception store after that handover, causing a crash due to
      dm-snap-persistent.c:get_exception()'s BUG_ON.
      
      This bug can be triggered by repeatedly changing snapshot permissions
      with "lvchange -p r" and "lvchange -p rw" while there are writes on the
      associated origin device.
      
      To fix this bug, we must suspend the origin device when doing the
      exception store handover to make sure that there are no pending
      exceptions:
      - introduce _origin_hash that keeps track of dm_origin structures.
      - introduce functions __lookup_dm_origin, __insert_dm_origin and
        __remove_dm_origin that manipulate the origin hash.
      - modify snapshot_resume so that it calls dm_internal_suspend_fast() and
        dm_internal_resume_fast() on the origin device.
      
      NOTE to stable@ people:
      
      When backporting to kernels 3.12-3.18, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      
      When backporting to kernels older than 3.12, you need to pick functions
      dm_internal_suspend and dm_internal_resume from the commit
      fd2ed4d2.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      b735fede
  26. 18 2月, 2015 1 次提交
    • M
      dm snapshot: fix a possible invalid memory access on unload · 22aa66a3
      Mikulas Patocka 提交于
      When the snapshot target is unloaded, snapshot_dtr() waits until
      pending_exceptions_count drops to zero.  Then, it destroys the snapshot.
      Therefore, the function that decrements pending_exceptions_count
      should not touch the snapshot structure after the decrement.
      
      pending_complete() calls free_pending_exception(), which decrements
      pending_exceptions_count, and then it performs up_write(&s->lock) and it
      calls retry_origin_bios() which dereferences  s->origin.  These two
      memory accesses to the fields of the snapshot may touch the dm_snapshot
      struture after it is freed.
      
      This patch moves the call to free_pending_exception() to the end of
      pending_complete(), so that the snapshot will not be destroyed while
      pending_complete() is in progress.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      22aa66a3
  27. 16 7月, 2014 1 次提交
    • N
      sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown 提交于
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps'. (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since first version of this patch (against 3.15) two new action
      functions appeared, on in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brownSigned-off-by: NIngo Molnar <mingo@kernel.org>
      74316201
  28. 04 6月, 2014 2 次提交
  29. 18 4月, 2014 1 次提交
  30. 15 1月, 2014 1 次提交
  31. 11 12月, 2013 1 次提交
    • M
      dm snapshot: avoid snapshot space leak on crash · 230c83af
      Mikulas Patocka 提交于
      There is a possible leak of snapshot space in case of crash.
      
      The reason for space leaking is that chunks in the snapshot device are
      allocated sequentially, but they are finished (and stored in the metadata)
      out of order, depending on the order in which copying finished.
      
      For example, supposed that the metadata contains the following records
      SUPERBLOCK
      METADATA (blocks 0 ... 250)
      DATA 0
      DATA 1
      DATA 2
      ...
      DATA 250
      
      Now suppose that you allocate 10 new data blocks 251-260. Suppose that
      copying of these blocks finish out of order (block 260 finished first
      and the block 251 finished last). Now, the snapshot device looks like
      this:
      SUPERBLOCK
      METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
      DATA 0
      DATA 1
      DATA 2
      ...
      DATA 250
      DATA 251
      DATA 252
      DATA 253
      DATA 254
      DATA 255
      METADATA (blocks 255, 254, 253, 252, 251)
      DATA 256
      DATA 257
      DATA 258
      DATA 259
      DATA 260
      
      Now, if the machine crashes after writing the first metadata block but
      before writing the second metadata block, the space for areas DATA 250-255
      is leaked, it contains no valid data and it will never be used in the
      future.
      
      This patch makes dm-snapshot complete exceptions in the same order they
      were allocated, thus fixing this bug.
      
      Note: when backporting this patch to the stable kernel, change the version
      field in the following way:
      * if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
      * if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
        to {1, 10, 2}
      Userspace reads the version to determine if the bug was fixed, so the
      version change is needed.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      230c83af
  32. 24 11月, 2013 2 次提交
    • K
      block: Generic bio chaining · 196d38bc
      Kent Overstreet 提交于
      This adds a generic mechanism for chaining bio completions. This is
      going to be used for a bio_split() replacement, and it turns out to be
      very useful in a fair amount of driver code - a fair number of drivers
      were implementing this in their own roundabout ways, often painfully.
      
      Note that this means it's no longer to call bio_endio() more than once
      on the same bio! This can cause problems for drivers that save/restore
      bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
      - in all but the simplest cases they'd be better off just cloning the
      bio, and immutable biovecs is making bio cloning cheaper. But for now,
      we add a bio_endio_nodec() for these cases.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      196d38bc
    • K
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet 提交于
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
      4f024f37