1. 18 December 2014 (3 commits)
    • dm thin: fix crash by initializing thin device's refcount and completion earlier · 2b94e896
      Marc Dionne authored
      Commit 80e96c54 ("dm thin: do not allow thin device activation
      while pool is suspended") delayed the initialization of a new thin
      device's refcount and completion until after the new thin was added
      to the pool's active_thins list and the pool lock was released.  This
      opened a race with a worker thread that walks the list and calls
      thin_get()/thin_put(): the worker sees the uninitialized refcount
      reach 0 and calls complete() on the uninitialized completion,
      freezing up the system and giving the oops below:
      
       kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
       kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90
      
       kernel: Call Trace:
       kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
       kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
       kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
       kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
       kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
       kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
       kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
       kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
       kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
       kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
       kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
       kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
      
      Set the thin device's initial refcount and initialize the completion
      before adding it to the pool's active_thins list in thin_ctr().
      Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
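      A minimal sketch of the corrected ordering in thin_ctr(), assuming an
      atomic refcount plus a completion as described above (the field name
      can_destroy is illustrative; the final patch may differ in detail):

       /*
        * Initialize the refcount and completion BEFORE publishing the new
        * thin_c on the rcu-protected active_thins list, so a concurrent
        * worker calling thin_get()/thin_put() never sees them
        * uninitialized.
        */
       atomic_set(&tc->refcount, 1);
       init_completion(&tc->can_destroy);

       spin_lock_irqsave(&tc->pool->lock, flags);
       list_add_tail_rcu(&tc->list, &tc->pool->active_thins);
       spin_unlock_irqrestore(&tc->pool->lock, flags);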
    • dm thin: fix missing out-of-data-space to write mode transition if blocks are released · 2c43fd26
      Joe Thornber authored
      Discard bios and thin device deletion have the potential to release data
      blocks.  If the thin-pool is in out-of-data-space mode, and blocks were
      released, transition the thin-pool back to full write mode.
      
      The correct time to do this is just after the thin-pool metadata commit.
      It cannot be done before the commit because the space maps will not
      allow immediate reuse of the data blocks in case there's a rollback
      following power failure.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
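      A hedged sketch of the shape of this fix: after a successful metadata
      commit, check whether the pool can leave out-of-data-space mode
      (helper and call-site names here are illustrative, not necessarily
      those of the final patch):

       static void check_for_space(struct pool *pool)
       {
               dm_block_t nr_free;

               if (get_pool_mode(pool) != PM_OUT_OF_DATA_SPACE)
                       return;

               /* Only now, post-commit, may freed blocks be safely reused. */
               if (!dm_pool_get_free_block_count(pool->pmd, &nr_free) && nr_free)
                       set_pool_mode(pool, PM_WRITE);
       }

       /* Called from commit(), just after dm_pool_commit_metadata(). */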
    • dm thin: fix inability to discard blocks when in out-of-data-space mode · 45ec9bd0
      Joe Thornber authored
      When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
      function pointer was incorrectly being set to
      process_prepared_discard_passdown rather than process_prepared_discard.
      
      This incorrect function pointer meant the discard was being passed
      down but not affecting the mapping.  As such, any discard issued in
      an attempt to reclaim blocks would not successfully free data space.
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
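      The fix amounts to wiring up the right handler when entering this
      mode; roughly (a sketch based on the description above):

       case PM_OUT_OF_DATA_SPACE:
               ...
               /* Update the mapping and free the block; do not merely
                * pass the discard down to the data device. */
               pool->process_prepared_discard = process_prepared_discard;
               break;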
  2. 22 November 2014 (1 commit)
    • dm thin: fix pool_io_hints to avoid looking at max_hw_sectors · d200c30e
      Mike Snitzer authored
      Simplify the pool_io_hints code that works to establish a max_sectors
      value that is a power-of-2 factor of the thin-pool's blocksize.  The
      biggest associated improvement is that the DM thin-pool is no longer
      concerning itself with the data device's max_hw_sectors when adjusting
      max_sectors.
      
      This fixes the relative fragility of the original "dm thin: adjust
      max_sectors_kb based on thinp blocksize" commit, which only became
      apparent when testing was performed using a DM thin-pool on top of a
      virtio_blk device.  One proposed upstream patch detailed the problems
      inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611
      
      So even though virtio_blk incorrectly set its max_hw_sectors, it
      actually helped make it clear that we need DM thinp to be tolerant of
      any future Linux driver that incorrectly sets max_hw_sectors.
      
      We only need to be concerned with modifying the thin-pool device's
      max_sectors limit if it is smaller than the thin-pool's blocksize.  In
      this case the value of max_sectors does become a limiting factor when
      upper layers (e.g. filesystems) construct their bios.  But if the
      hardware can support IOs larger than the thin-pool's blocksize the user
      is encouraged to adjust the thin-pool's data device's max_sectors
      accordingly -- doing so will enable the thin-pool to inherit the
      established user-defined max_sectors.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
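      A sketch of the simplified logic, assuming a small is_factor()
      divisibility helper (illustrative; the shipped code may differ in
      detail):

       static bool is_factor(sector_t block_size, uint32_t n)
       {
               return !n || (block_size % n) == 0;
       }

       /* In pool_io_hints(): only act when max_sectors is smaller than
        * the thin-pool blocksize; never consult max_hw_sectors. */
       if (limits->max_sectors < pool->sectors_per_block) {
               /* Shrink max_sectors to the largest power-of-2 factor
                * of the blocksize. */
               while (!is_factor(pool->sectors_per_block, limits->max_sectors)) {
                       if (is_power_of_2(limits->max_sectors))
                               limits->max_sectors--;
                       limits->max_sectors =
                               rounddown_pow_of_two(limits->max_sectors);
               }
       }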
  3. 20 November 2014 (2 commits)
  4. 13 November 2014 (2 commits)
  5. 11 November 2014 (14 commits)
  6. 05 November 2014 (1 commit)
  7. 02 August 2014 (3 commits)
  8. 12 June 2014 (1 commit)
    • dm thin: update discard_granularity to reflect the thin-pool blocksize · 09869de5
      Lukas Czerner authored
      DM thinp already checks whether the discard_granularity of the data
      device is a factor of the thin-pool block size.  But when using the
      dm-thin-pool's discard passdown support, DM thinp was not selecting the
      max of the underlying data device's discard_granularity and the
      thin-pool's block size.
      
      Update set_discard_limits() to set discard_granularity to the max of
      these values.  This enables blkdev_issue_discard() to properly align the
      discards that are sent to the DM thin device on a full block boundary.
      As such each discard will now cover an entire DM thin-pool block and the
      block will be reclaimed.
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
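      Sketched against the description above (the queue-limit plumbing is
      illustrative):

       /* In set_discard_limits(), when discard passdown is enabled: */
       struct queue_limits *data_limits =
               &bdev_get_queue(pt->data_dev->bdev)->limits;

       limits->discard_granularity =
               max(data_limits->discard_granularity,
                   pool->sectors_per_block << SECTOR_SHIFT);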
  9. 04 June 2014 (2 commits)
  10. 21 May 2014 (1 commit)
    • dm thin: add 'no_space_timeout' dm-thin-pool module param · 80c57893
      Mike Snitzer authored
      Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
      holding IO forever") introduced a fixed 60 second timeout.  Users may
      want to either disable or modify this timeout.
      
      Allow the out-of-data-space timeout to be configured using the
      'no_space_timeout' dm-thin-pool module param.  Setting it to 0 will
      disable the timeout, resulting in IO being queued until more data space
      is added to the thin-pool.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
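      A sketch of the knob as a standard module parameter (variable name
      assumed from the description):

       static unsigned no_space_timeout_secs = 60;   /* 0 == disabled */
       module_param_named(no_space_timeout, no_space_timeout_secs,
                          uint, S_IRUGO | S_IWUSR);
       MODULE_PARM_DESC(no_space_timeout,
                        "Out of data space queue IO timeout in seconds");

      For example, "modprobe dm_thin_pool no_space_timeout=0" disables the
      timeout; if the parameter is created writable as sketched, writing 0
      to /sys/module/dm_thin_pool/parameters/no_space_timeout does the same
      on a loaded module.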
  11. 15 May 2014 (2 commits)
  12. 29 April 2014 (1 commit)
  13. 08 April 2014 (2 commits)
    • dm thin: fix rcu_read_lock being held in code that can sleep · b10ebd34
      Joe Thornber authored
      Commit c140e1c4 ("dm thin: use per thin device deferred bio lists")
      introduced the use of an rculist for all active thin devices.  The use
      of rcu_read_lock() in process_deferred_bios() can result in a BUG if a
      dm_bio_prison_cell must be allocated as a side-effect of bio_detain():
      
       BUG: sleeping function called from invalid context at mm/mempool.c:203
       in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
       3 locks held by kworker/u8:0/6:
         #0:  ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
         #1:  ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
         #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0
      
      We can't process deferred bios with the rcu lock held, since
      dm_bio_prison_cell allocation may block if the bio-prison's cell mempool
      is exhausted.
      
      To fix:
      
      - Introduce a refcount and completion field to each thin_c
      
      - Add thin_get/put methods for adjusting the refcount.  If the refcount
        hits zero then the completion is triggered.
      
      - Initialise refcount to 1 when creating thin_c
      
      - When iterating the active_thins list we thin_get() whilst the rcu
        lock is held.
      
      - After the rcu lock is dropped we process the deferred bios for that
        thin.
      
      - When destroying a thin_c we thin_put() and then wait for the
        completion -- to avoid a race between the worker thread iterating
        from that thin_c and destroying the thin_c (see the sketch after
        this list).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
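      A condensed sketch of that pattern (names follow the list above
      where given; get_first_thin() and the can_destroy field are assumed
      for illustration):

       static void thin_get(struct thin_c *tc)
       {
               atomic_inc(&tc->refcount);
       }

       static void thin_put(struct thin_c *tc)
       {
               if (atomic_dec_and_test(&tc->refcount))
                       complete(&tc->can_destroy);
       }

       /* Take a reference under the rcu lock, then drop the lock before
        * doing anything that can sleep. */
       static struct thin_c *get_first_thin(struct pool *pool)
       {
               struct thin_c *tc = NULL;

               rcu_read_lock();
               if (!list_empty(&pool->active_thins)) {
                       tc = list_entry_rcu(pool->active_thins.next,
                                           struct thin_c, list);
                       thin_get(tc);
               }
               rcu_read_unlock();

               return tc;
       }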
    • dm thin: irqsave must always be used with the pool->lock spinlock · 5e3283e2
      Joe Thornber authored
      Commit c140e1c4 ("dm thin: use per thin device deferred bio lists")
      incorrectly stopped disabling irqs when taking the pool's spinlock.
      
      Irqs must be disabled when taking the pool's spinlock; otherwise a
      thread could spin_lock(), then get interrupted to service
      thin_endio() in interrupt context, which would then deadlock in
      spin_lock_irqsave().
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
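      The pattern, for reference (the list name is taken from the per-thin
      deferred list commit below; purely illustrative):

       unsigned long flags;

       spin_lock_irqsave(&tc->pool->lock, flags);   /* disables local irqs */
       bio_list_add(&tc->deferred_bio_list, bio);
       spin_unlock_irqrestore(&tc->pool->lock, flags);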
  14. 05 April 2014 (1 commit)
    • dm thin: sort the per thin deferred bios using an rb_tree · 67324ea1
      Mike Snitzer authored
      A thin-pool will allocate blocks using FIFO order for all thin devices
      which share the thin-pool.  Because of this simplistic allocation the
      thin-pool's space can become fragmented quite easily; especially when
      multiple threads are requesting blocks in parallel.
      
      Sort each thin device's deferred_bio_list based on logical sector to
      help reduce fragmentation of the thin-pool's ondisk layout.
      
      The following tables illustrate the gains realized by sorting each
      thin device's deferred_bio_list: an "io size"-sized random read of
      the device results in "seeks/io" fragments being read, with an
      average "distance/seek" between fragments.
      
      Data was written to a single thin device using multiple threads via
      iozone (8 threads, 64K for both the block_size and io_size).
      
      unsorted:
      
           io size   seeks/io distance/seek
        --------------------------------------
                4k    0.000   0b
               16k    0.013   11m
               64k    0.065   11m
              256k    0.274   10m
                1m    1.109   10m
                4m    4.411   10m
               16m    17.097  11m
               64m    60.055  13m
              256m    148.798 25m
                1g    809.929 21m
      
      sorted:
      
           io size   seeks/io distance/seek
        --------------------------------------
                4k    0.000   0b
               16k    0.000   1g
               64k    0.001   1g
              256k    0.003   1g
                1m    0.011   1g
                4m    0.045   1g
               16m    0.181   1g
               64m    0.747   1011m
              256m    3.299   1g
                1g    14.373  1g
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
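      A hedged sketch of the sort, assuming hypothetical bio_to_node() and
      node_to_bio() helpers to reach the rb_node that the real code embeds
      in the per-bio hook:

       /* Insert a deferred bio into an rb_tree keyed on its start sector;
        * an in-order walk then yields bios sorted by logical sector. */
       static void sort_insert(struct rb_root *root, struct bio *bio)
       {
               struct rb_node **p = &root->rb_node, *parent = NULL;

               while (*p) {
                       parent = *p;
                       if (bio->bi_iter.bi_sector <
                           node_to_bio(parent)->bi_iter.bi_sector)
                               p = &(*p)->rb_left;
                       else
                               p = &(*p)->rb_right;
               }

               rb_link_node(bio_to_node(bio), parent, p);
               rb_insert_color(bio_to_node(bio), root);
       }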
  15. 01 April 2014 (1 commit)
    • dm thin: use per thin device deferred bio lists · c140e1c4
      Mike Snitzer authored
      The thin-pool previously had only a single deferred_bios list that
      collected bios for all thin devices in the pool.  Split this per-pool
      deferred_bios list out into a per-thin deferred_bios_list -- doing so
      enables increased parallelism when processing deferred bios.  And now
      that each thin device has its own deferred_bios_list we can sort all
      bios in the list by logical sector.  The requeue code in the
      error-handling path is also cleaner as a side-effect.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
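      Sketch of the data-structure move (field names assumed from the
      surrounding commits; the lists stay guarded by pool->lock per the
      irqsave fix above):

       struct thin_c {
               struct list_head list;      /* entry on pool->active_thins */

               /* Previously a single list on struct pool. */
               struct bio_list deferred_bio_list;
               struct bio_list retry_on_resume_list;
               /* ... */
       };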
  16. 31 March 2014 (1 commit)
  17. 29 March 2014 (1 commit)
  18. 06 March 2014 (1 commit)
    • dm thin: fix noflush suspend IO queueing · 738211f7
      Joe Thornber authored
      i) By the time DM core calls the postsuspend hook, the dm_noflush
      flag has been cleared, so the old thin_postsuspend did nothing; we
      need to use the presuspend hook instead.
      
      ii) There was a race between bios leaving DM core and arriving in the
      deferred queue.
      
      thin_presuspend now sets a 'requeue' flag causing all bios destined for
      that thin to be requeued back to DM core.  Then it requeues all held IO,
      and all IO on the deferred queue (destined for that thin).  Finally
      postsuspend clears the 'requeue' flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
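      A sketch of the hook split (noflush_work(), do_noflush_start and
      do_noflush_stop are assumed worker names; the flag handling is as
      described above):

       static void thin_presuspend(struct dm_target *ti)
       {
               struct thin_c *tc = ti->private;

               /* Set the 'requeue' flag, then requeue held IO and
                * everything on this thin's deferred queue to DM core. */
               if (dm_noflush_suspending(ti))
                       noflush_work(tc, do_noflush_start);
       }

       static void thin_postsuspend(struct dm_target *ti)
       {
               /* dm_noflush is already cleared here; just drop the
                * 'requeue' flag. */
               noflush_work(ti->private, do_noflush_stop);
       }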