1. 15 Apr, 2016 1 commit
    • block: loop: fix filesystem corruption in case of aio/dio · a7297a6a
      Committed by Ming Lei
      Starting from commit e36f6204(block: split bios to max possible length),
      block core starts to split bio in the middle of bvec.
      
      Unfortunately loop dio/aio doesn't consider this situation, and
      always treats 'iter.iov_offset' as zero. As a result, filesystem
      corruption is observed.

      This patch figures out the offset of the base bvec via
      'bio->bi_iter.bi_bvec_done' and fixes the issue by passing the offset
      to the iov iterator.
      
      Fixes: e36f6204 (block: split bios to max possible length)
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org (4.5)
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a7297a6a
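      The fix described above can be illustrated with a small userspace model. This is a sketch only, not the kernel patch: `struct seg`, `struct seg_iter` and `copy_from_segs()` are hypothetical stand-ins for the bvec array, the bio iterator and the iov-iterator copy, and `start_offset` plays the role of 'iter.iov_offset'.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Hypothetical stand-in for one bvec: a buffer and its length. */
      struct seg {
          const char *base;
          size_t len;
      };

      /* Hypothetical stand-in for the bio iterator after a split. */
      struct seg_iter {
          size_t idx;   /* index of the first segment covered */
          size_t done;  /* bytes of that segment already consumed (cf. bi_bvec_done) */
          size_t size;  /* bytes covered by this iterator */
      };

      /*
       * Copy it->size bytes into out, skipping start_offset bytes of the
       * first segment. The caller must ensure the segments hold enough data.
       */
      static size_t copy_from_segs(const struct seg *segs, const struct seg_iter *it,
                                   size_t start_offset, char *out)
      {
          size_t copied = 0, idx = it->idx, off = start_offset;

          assert(off <= segs[idx].len);
          while (copied < it->size) {
              size_t n = segs[idx].len - off;

              if (n > it->size - copied)
                  n = it->size - copied;
              memcpy(out + copied, segs[idx].base + off, n);
              copied += n;
              idx++;
              off = 0; /* only the first segment starts mid-way */
          }
          return copied;
      }
      ```

      For an iterator that begins 3 bytes into its first segment, passing `it->done` as `start_offset` yields the bytes that half of the split really covers. Passing zero instead re-reads the first 3 bytes and shifts everything after them, which is the kind of mismatch that surfaces as filesystem corruption.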
  2. 13 Apr, 2016 1 commit
  3. 01 Oct, 2015 1 commit
  4. 24 Sep, 2015 5 commits
    • block: loop: support DIO & AIO · bc07c10a
      Committed by Ming Lei
      There are at least three advantages to using direct I/O and AIO to
      read and write the loop device's backing file:

      1) double caching is avoided, so memory usage decreases
      considerably

      2) unlike user-space direct I/O, there is no cost of
      pinning pages

      3) context switches are avoided while still obtaining good throughput
      - with buffered file reads, top random I/O throughput is often obtained
      only if requests are submitted concurrently from lots of tasks; but
      sequential I/O is served from the page cache most of the time, so
      concurrent submission often introduces unnecessary context switches
      and can't improve throughput much. There was a discussion[1]
      about using non-blocking I/O to improve this for applications.
      - with direct I/O and AIO, concurrent submission can be
      avoided without affecting random read throughput

      xfstests (-g auto, ext4) basically passes when run with
      direct I/O (aio); the one exception is generic/232, but it fails with
      loop buffered I/O (4.2-rc6-next-20150814) too.
      
      The fio performance test results follow:
      	4 jobs fio test inside ext4 file system over loop block
      
      1) How to run
      	- KVM: 4 VCPUs, 2G RAM
      	- linux kernel: 4.2-rc6-next-20150814(base) with the patchset
      	- the loop block is over one image on SSD.
      	- linux psync, 4 jobs, size 1500M, ext4 over loop block
      	- test result: IOPS from fio output
      
      2) Throughput (IOPS) becomes a bit better with direct I/O (aio)
              -------------------------------------------------------------
              test cases          |randread   |read   |randwrite  |write  |
              -------------------------------------------------------------
              base                |8015       |113811 |67442      |106978 |
              -------------------------------------------------------------
              base+loop aio       |8136       |125040 |67811      |111376 |
              -------------------------------------------------------------

      - this is probably because more page cache is available to the
      application, or because one extra page copy is avoided with direct I/O
      
      3) context switch
              - context switch decreased by ~50% with loop direct I/O(aio)
      	compared with loop buffered I/O(4.2-rc6-next-20150814)
      
      4) memory usage from /proc/meminfo
              -------------------------------------------------------------
                                         | Buffers       | Cached
              -------------------------------------------------------------
              base                       | > 760MB       | ~950MB
              -------------------------------------------------------------
              base+loop direct I/O(aio)  | < 5MB         | ~1.6GB
              -------------------------------------------------------------
      
      - so much more page cache is available to applications with
      direct I/O
      
      [1] https://lwn.net/Articles/612483/

      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bc07c10a
    • block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO · ab1cb278
      Committed by Ming Lei
      If a loop block device is mounted via 'mount -o loop', it isn't easy
      to pass a file descriptor opened with O_DIRECT, so this patch
      introduces a new ioctl command to support direct I/O in this case.
      
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ab1cb278
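      From userspace, the command introduced above is issued on an open loop device with ioctl(2). The following is a minimal sketch, assuming a Linux system whose <linux/loop.h> defines LOOP_SET_DIRECT_IO; the helper name `loop_set_direct_io()` is made up for illustration and is not part of any library.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <sys/ioctl.h>
      #include <linux/loop.h>

      /*
       * Ask the loop driver to turn direct I/O on (1) or off (0) for the
       * loop device behind loop_fd. Returns 0 on success, or -1 with errno
       * set, exactly as ioctl(2) reports it.
       */
      static int loop_set_direct_io(int loop_fd, int enable)
      {
          assert(enable == 0 || enable == 1);
          return ioctl(loop_fd, LOOP_SET_DIRECT_IO, (unsigned long)enable);
      }
      ```

      A caller would pass a descriptor obtained by opening a loop device node that already has a backing file attached. The call is expected to fail when the descriptor is invalid or when the running kernel predates this command, so the return value should always be checked.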
    • block: loop: prepare for supporting direct IO · 2e5ab5f3
      Committed by Ming Lei
      This patch provides one interface for enabling direct IO
      from user space:

      	- userspace (such as losetup) can pass a 'file' which is
      	opened, or set via fcntl, as O_DIRECT

      Also __loop_update_dio() is introduced to check whether direct I/O
      can be used with the current loop settings.

      The last big change is the introduction of the LO_FLAGS_DIRECT_IO flag,
      which lets userspace know whether direct IO is used to access the
      backing file.
      
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2e5ab5f3
    • block: loop: use kthread_work · e03a3d7a
      Committed by Ming Lei
      The following patch will use dio/aio to submit IO to the backing file,
      so it no longer needs to schedule IO concurrently from work items;
      use kthread_work instead to greatly reduce context switch cost.

      For the non-AIO case, a single thread had been used for a long time,
      and it was only converted to work items in v4.0, which has already
      caused a performance regression for Fedora live booting. In discussion[1],
      even though submitting I/O concurrently via work items can improve
      random read IO throughput, it might also hurt sequential read IO
      performance, so it is better to restore the single-thread behaviour.

      For the following AIO support, multiple hw queues with a per-hwq
      kthread would be better than the current work approach, supposing
      there is such a high performance requirement for loop.
      
      [1] http://marc.info/?t=143082678400002&r=1&w=2

      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e03a3d7a
    • block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop · 5b5e20f4
      Committed by Ming Lei
      It doesn't make sense to enable merging because the I/O
      submitted to the backing file is handled page by page.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5b5e20f4
  5. 17 Jul, 2015 1 commit
  6. 24 Jun, 2015 1 commit
  7. 20 May, 2015 3 commits
    • loop: remove (now) unused 'out' label · 6a927007
      Committed by Jens Axboe
      gcc, rightfully, complains:
      
      drivers/block/loop.c:1369:1: warning: label 'out' defined but not used [-Wunused-label]
      
      Kill it.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6a927007
    • block: loop: fix another reread part failure · 06f0e9e6
      Committed by Ming Lei
      loop_clr_fd() can be run piggyback with lo_release(), and
      in this situation rereading the partition table may always fail
      because bd_mutex is already held.

      This patch detects the situation via the reference count, and
      calls __blkdev_reread_part() to avoid acquiring the lock again.

      In the meantime, this patch switches to the new kernel APIs
      blkdev_reread_part() and __blkdev_reread_part().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      06f0e9e6
    • block: loop: don't hold lo_ctl_mutex in lo_open · f8933667
      Committed by Ming Lei
      The lo_ctl_mutex is held while running all ioctl handlers, and
      in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called to
      reread partitions, which requires bd_mutex.

      So it is easy to cause a failure, because trylock(bd_mutex) may
      fail inside blkdev_reread_part(). The lock context is as follows:
      
      blkid or other application:
      	->open()
      		->mutex_lock(bd_mutex)
      		->lo_open()
      			->mutex_lock(lo_ctl_mutex)
      
      losetup(set fd ioctl):
      	->mutex_lock(lo_ctl_mutex)
      	->ioctl_by_bdev(BLKRRPART)
      		->trylock(bd_mutex)
      
      This patch tries to eliminate the ABBA lock dependency by removing
      lo_ctl_mutex in lo_open() with the following approach:

      1) make lo_refcnt an atomic_t and avoid acquiring lo_ctl_mutex in lo_open():
      	- for open vs. add/del loop, there is no problem because of loop_index_mutex
      	- freeze the request queue during clr_fd, so I/O can't come until
      	  clearing the fd is completed, like the effect of holding lo_ctl_mutex
      	  in lo_open
      	- both open() and release() are already serialized by bd_mutex

      2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in
      lo_release(), so that lo_ctl_mutex is only required for the last release.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f8933667
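      The refcount approach in the commit above can be modelled in userspace with C11 atomics and a pthread mutex. This is only a sketch of the idea, not the driver code: `struct loop_model`, `model_open()` and `model_release()` are hypothetical names, and the mutex stands in for lo_ctl_mutex.

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdatomic.h>

      /* Hypothetical model of the loop device state relevant to open/release. */
      struct loop_model {
          atomic_int refcnt;          /* cf. lo_refcnt as atomic_t */
          pthread_mutex_t ctl_mutex;  /* cf. lo_ctl_mutex */
          int last_releases;          /* how often the last-release path ran */
      };

      static void model_init(struct loop_model *lo)
      {
          atomic_init(&lo->refcnt, 0);
          pthread_mutex_init(&lo->ctl_mutex, NULL);
          lo->last_releases = 0;
      }

      /* Open path: bump the counter without touching the control mutex. */
      static void model_open(struct loop_model *lo)
      {
          atomic_fetch_add(&lo->refcnt, 1);
      }

      /* Release path: only the final release takes the control mutex. */
      static void model_release(struct loop_model *lo)
      {
          int prev = atomic_fetch_sub(&lo->refcnt, 1);

          assert(prev >= 1);
          if (prev != 1)
              return; /* other openers remain, no lock needed */

          pthread_mutex_lock(&lo->ctl_mutex);
          lo->last_releases++; /* placeholder for the teardown work */
          pthread_mutex_unlock(&lo->ctl_mutex);
      }
      ```

      Because the open path never acquires the control mutex, a thread that already holds the outer lock (bd_mutex in the commit) can no longer block on it, which removes one side of the ABBA ordering.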
  8. 06 May, 2015 2 commits
    • block: loop: avoiding too many pending per work I/O · 4d4e41ae
      Committed by Ming Lei
      If there is too much pending per-work I/O, too many
      high priority worker threads can be generated, and
      system performance can be affected.

      This patch limits the max_active parameter of the workqueue to 16.

      This patch fixes the Fedora 22 live booting performance
      regression when it is booted from squashfs over dm
      based on loop. The following factors appear to be
      related to the problem:

      - unlike other filesystems (such as ext4), squashfs
      is a bit special: I observed that increasing the number of I/O jobs
      accessing a file in squashfs improves I/O performance only a
      little, whereas it can make a big difference for ext4

      - nested loop: both squashfs.img and ext3fs.img are mounted
      as loop block devices, and ext3fs.img is inside the squashfs

      - during booting, lots of tasks may run concurrently
      Fixes: b5dd2f60
      Cc: stable@vger.kernel.org (v4.0)
      Cc: Justin M. Forbes <jforbes@fedoraproject.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4d4e41ae
    • block: loop: convert to per-device workqueue · f4aa4c7b
      Committed by Ming Lei
      Documentation/workqueue.txt:
      	If there is dependency among multiple work items used
      	during memory reclaim, they should be queued to separate
      	wq each with WQ_MEM_RECLAIM.
      
      Loop devices can be stacked, so we have to convert to per-device
      workqueue. One example is Fedora live CD.
      
      Fixes: b5dd2f60
      Cc: stable@vger.kernel.org (v4.0)
      Cc: Justin M. Forbes <jforbes@fedoraproject.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f4aa4c7b
  9. 28 Apr, 2015 1 commit
    • block: destroy bdi before blockdev is unregistered. · 6cd18e71
      Committed by NeilBrown
      Because of the peculiar way that md devices are created (automatically
      when the device node is opened), a new device can be created and
      registered immediately after the
      	blk_unregister_region(disk_devt(disk), disk->minors);
      call in del_gendisk().
      
      Therefore it is important that all visible artifacts of the previous
      device are removed before this call.  In particular, the 'bdi'.
      
      Since:
      commit c4db59d3
      Author: Christoph Hellwig <hch@lst.de>
          fs: don't reassign dirty inodes to default_backing_dev_info
      
      moved the
         device_unregister(bdi->dev);
      call from bdi_unregister() to bdi_destroy() it has been quite easy to
      lose a race and have a new (e.g.) "md127" be created after the
      blk_unregister_region() call and before bdi_destroy() is ultimately
      called by the final 'put_disk', which must come after del_gendisk().
      
      The new device finds that the bdi name is already registered in sysfs
      and complains
      
      > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
      > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
      
      We can fix this by moving the bdi_destroy() call out of
      blk_release_queue() (which can happen very late when a refcount
      reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
      device driver calls it.
      
      Then it is only necessary for md to call blk_cleanup_queue() before
      del_gendisk().  As loop.c devices are also created on demand by
      opening the device node, we make the same change there.
      
      Fixes: c4db59d3
      Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org (v4.0)
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6cd18e71
  10. 16 Apr, 2015 1 commit
  11. 12 Apr, 2015 1 commit
  12. 03 Jan, 2015 5 commits
    • loop: add blk-mq.h include · 78e367a3
      Committed by Jens Axboe
      Looks like we pull it in through other ways on x86, but we fail
      on sparc:
      
      In file included from drivers/block/cryptoloop.c:30:0:
      drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type
      struct blk_mq_tag_set tag_set;
      
      Add the include to loop.h, kill it from loop.c.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      78e367a3
    • block: loop: don't handle REQ_FUA explicitly · af65aa8e
      Committed by Ming Lei
      The block core handles REQ_FUA in its flush state machine, so
      loop doesn't need to handle it explicitly.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      af65aa8e
    • block: loop: introduce lo_discard() and lo_req_flush() · cf655d95
      Committed by Ming Lei
      No behaviour change; just move the handling of REQ_DISCARD
      and REQ_FLUSH into these two functions.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      cf655d95
    • block: loop: say goodbye to bio · 30112013
      Committed by Ming Lei
      Switch to block request completely.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      30112013
    • block: loop: improve performance via blk-mq · b5dd2f60
      Committed by Ming Lei
      The conversion is fairly straightforward: a work queue is used to
      dispatch requests of the loop block device. One big change is that
      requests are submitted to the backend file/device concurrently via the
      work queue, so throughput may improve a lot. Since write requests to
      the same file are often run exclusively, they are not handled
      concurrently, which avoids extra context switch cost, possible lock
      contention and work scheduling cost. Also, with blk-mq there is an
      opportunity to get loop I/O merged before it is submitted to the
      backend file/device.
      
      In the following test:
      	- base: v3.19-rc2-2041231
      	- loop over file in ext4 file system on SSD disk
      	- bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1
      	- throughput: IOPS
      
      	------------------------------------------------------
      	|            | base      | base with loop-mq | delta |
      	------------------------------------------------------
      	| randread   | 1740      | 25318             | +1355%|
      	------------------------------------------------------
      	| read       | 42196     | 51771             | +22.6%|
      	------------------------------------------------------
      	| randwrite  | 35709     | 34624             | -3%   |
      	------------------------------------------------------
      	| write      | 39137     | 40326             | +3%   |
      	------------------------------------------------------
      
      So loop-mq can improve throughput for both read and randread, while
      the performance of write and randwrite is basically unaffected.

      Another benefit is that the loop driver code is much simpler
      after the blk-mq conversion, so the patch can be regarded as a
      cleanup too.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b5dd2f60
  13. 18 Apr, 2014 1 commit
  14. 09 Apr, 2014 1 commit
    • drivers/block/loop.c: ratelimit error messages · 44bd70c3
      Committed by Mike Galbraith
      Metric tons of high speed spew is not helpful when things go pear shaped.
      systemd lost its mind, forgot how to stop services it insists on being
      sole manager of, massive printk() flood ensued, box eventually died.
      
      [16206.684000] loop: Write error at byte offset 11412291584, length 4096.
      [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost.
      [16206.684000] loop: Write error at byte offset 13155434496, length 4096.
      [16206.684000] loop: Write error at byte offset 13155438592, length 4096.
      [16206.684000] loop: Write error at byte offset 13155442688, length 4096.
      [16206.684000] loop: Write error at byte offset 13960736768, length 4096.
      [16206.684000] loop: Write error at byte offset 14229172224, length 4096.
      [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost.
      [16206.684000] loop: Write error at byte offset 14766043136, length 4096.
      [16206.684000] loop: Write error at byte offset 15034478592, length 4096.
      [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost.
      Signed-off-by: Mike Galbraith <bitbucket@online.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      44bd70c3
  15. 22 Jan, 2014 1 commit
  16. 24 Nov, 2013 2 commits
    • block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Committed by Kent Overstreet
      More prep work for immutable biovecs - with immutable bvecs drivers
      won't be able to use the biovec directly, they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      
      This updates callers for the new usage without changing the
      implementation yet.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: support@lsi.com
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: drbd-user@lists.linbit.com
      Cc: nbd-general@lists.sourceforge.net
      Cc: cbe-oss-dev@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: DL-MPTFusionLinux@lsi.com
      Cc: linux-scsi@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: cluster-devel@redhat.com
      Cc: linux-mm@kvack.org
      Acked-by: Geoff Levand <geoff@infradead.org>
      7988613b
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      4f024f37
  17. 09 Nov, 2013 1 commit
    • loop: fix crash when using unassigned loop device · ef7e7c82
      Committed by Mikulas Patocka
      When the loop module is loaded, it creates 8 loop devices /dev/loop[0-7].
      The devices have no request routine and thus, when they are used without
      being assigned, a crash happens.
      
      For example, these commands cause crash (assuming there are no used loop
      devices):
      
      Kernel Fault: Code=26 regs=000000007f420980 (Addr=0000000000000010)
      CPU: 1 PID: 50 Comm: kworker/1:1 Not tainted 3.11.0 #1
      Workqueue: ksnaphd do_metadata [dm_snapshot]
      task: 000000007fcf4078 ti: 000000007f420000 task.ti: 000000007f420000
      [  116.319988]
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111100001111 Not tainted
      r00-03  000000ff0804ff0f 00000000408bf5d0 00000000402d8204 000000007b7ff6c0
      r04-07  00000000408a95d0 000000007f420950 000000007b7ff6c0 000000007d06c930
      r08-11  000000007f4205c0 0000000000000001 000000007f4205c0 000000007f4204b8
      r12-15  0000000000000010 0000000000000000 0000000000000000 0000000000000000
      r16-19  000000001108dd48 000000004061cd7c 000000007d859800 000000000800000f
      r20-23  0000000000000000 0000000000000008 0000000000000000 0000000000000000
      r24-27  00000000ffffffff 000000007b7ff6c0 000000007d859800 00000000408a95d0
      r28-31  0000000000000000 000000007f420950 000000007f420980 000000007f4208e8
      sr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000303000
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [  117.549988]
      IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d82fc 00000000402d8300
       IIR: 53820020    ISR: 0000000000000000  IOR: 0000000000000010
       CPU:        1   CR30: 000000007f420000 CR31: ffffffffffffffff
       ORIG_R28: 0000000000000001
       IAOQ[0]: generic_make_request+0x11c/0x1a0
       IAOQ[1]: generic_make_request+0x120/0x1a0
       RP(r2): generic_make_request+0x24/0x1a0
      Backtrace:
       [<00000000402d83f0>] submit_bio+0x70/0x140
       [<0000000011087c4c>] dispatch_io+0x234/0x478 [dm_mod]
       [<0000000011087f44>] sync_io+0xb4/0x190 [dm_mod]
       [<00000000110883bc>] dm_io+0x2c4/0x310 [dm_mod]
       [<00000000110bfcd0>] do_metadata+0x28/0xb0 [dm_snapshot]
       [<00000000401591d8>] process_one_work+0x160/0x460
       [<0000000040159bc0>] worker_thread+0x300/0x478
       [<0000000040161a70>] kthread+0x118/0x128
       [<0000000040104020>] end_fault_vector+0x20/0x28
       [<0000000040177220>] task_tick_fair+0x420/0x4d0
       [<00000000401aa048>] invoke_rcu_core+0x50/0x60
       [<00000000401ad5b8>] rcu_check_callbacks+0x210/0x8d8
       [<000000004014aaa0>] update_process_times+0xa8/0xc0
       [<00000000401ab86c>] rcu_process_callbacks+0x4b4/0x598
       [<0000000040142408>] __do_softirq+0x250/0x2c0
       [<00000000401789d0>] find_busiest_group+0x3c0/0xc70
      [  119.379988]
      Kernel panic - not syncing: Kernel Fault
      Rebooting in 1 seconds..
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ef7e7c82
  18. 08 Nov, 2013 2 commits
    • block: fix a probe argument to blk_register_region · a207f593
      Committed by Mikulas Patocka
      The probe function is supposed to return NULL on failure (as we can see in
      kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;).

      However, in loop and brd, it returns a negative error via ERR_PTR.
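
      The mismatch described above can be reproduced with a small userspace model. This is a sketch under stated assumptions, not kernel code: `model_err_ptr()` imitates ERR_PTR(), `lookup_treats_as_valid()` imitates the non-NULL check quoted from kobj_lookup, and both probe functions and MODEL_ENOMEM are hypothetical.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>

      #define MODEL_ENOMEM 12 /* illustrative error number */

      typedef void *(*probe_fn)(int index);

      /* Model of ERR_PTR(): encode a negative error code in a pointer value. */
      static void *model_err_ptr(long err)
      {
          assert(err < 0);
          return (void *)(intptr_t)err;
      }

      /* Failing probe that reports the error as an encoded pointer. */
      static void *probe_returning_err_ptr(int index)
      {
          (void)index;
          return model_err_ptr(-MODEL_ENOMEM);
      }

      /* Failing probe that follows the convention the caller expects. */
      static void *probe_returning_null(int index)
      {
          (void)index;
          return NULL;
      }

      /* Model of the caller's check: any non-NULL result counts as an object. */
      static int lookup_treats_as_valid(probe_fn probe, int index)
      {
          return probe(index) != NULL;
      }
      ```

      With the encoded error pointer the check passes, so the caller goes on to use a value that is not a real object; with NULL the failure is recognised as such.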
      
      This causes a crash if we simulate disk allocation failure and run
      less -f /dev/loop0 because the negative number is interpreted as a pointer:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
      IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
      PGD 23c677067 PUD 23d6d1067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
      CPU: 1 PID: 6831 Comm: less Tainted: P        W  O 3.10.15-devel #18
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
      task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
      RIP: 0010:[<ffffffff8118b188>]  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
      RSP: 0018:ffff88023e47dbd8  EFLAGS: 00010286
      RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
      RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
      R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
      FS:  00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
       ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
       ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
      Call Trace:
       [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
       [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
       [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
       [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
       [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
       [<ffffffff8118c365>] blkdev_open+0x65/0x80
       [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
       [<ffffffff8114d214>] finish_open+0x34/0x50
       [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
       [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
       [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
       [<ffffffff8115f4f0>] do_filp_open+0x40/0x90
       [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
       [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
       [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
       [<ffffffff8114e559>] SyS_open+0x19/0x20
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5 a4 07 00 44 89
      RIP  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
       RSP <ffff88023e47dbd8>
      CR2: 00000000000002b4
      ---[ end trace bb7f32dbf02398dc ]---
      
      The brd change should be backported to stable kernels starting with 2.6.25.
      The loop change should be backported to stable kernels starting with 2.6.22.
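      
      A minimal userspace sketch of the contract described above. It is not the
      kernel code: ERR_PTR() and IS_ERR() are simplified stand-ins, and
      probe_buggy(), probe_fixed() and caller_accepts() are hypothetical names
      used only for illustration. It shows why a caller that only tests
      "if (kobj)" treats an error pointer as a valid object.
      
      ```c
      #include <stddef.h>
      #include <stdint.h>
      #include <errno.h>
      
      /* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
      #define MAX_ERRNO 4095
      
      static inline void *ERR_PTR(long error)
      {
          return (void *)error;
      }
      
      static inline int IS_ERR(const void *ptr)
      {
          /* Error pointers occupy the top MAX_ERRNO values of the address space. */
          return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
      }
      
      struct kobject { int dummy; };
      
      /* Hypothetical probe that reports failure as an error pointer (the bug). */
      static struct kobject *probe_buggy(int fail)
      {
          static struct kobject ok;
          return fail ? ERR_PTR(-ENOMEM) : &ok;
      }
      
      /* Hypothetical probe that honours the contract: NULL on failure. */
      static struct kobject *probe_fixed(int fail)
      {
          struct kobject *kobj = probe_buggy(fail);
          return IS_ERR(kobj) ? NULL : kobj;
      }
      
      /* Mirrors the caller's test quoted above: "if (kobj) return kobj;". */
      static int caller_accepts(struct kobject *kobj)
      {
          return kobj != NULL;
      }
      ```
      
      With these definitions, caller_accepts(probe_buggy(1)) is true even though
      the probe failed, while caller_accepts(probe_fixed(1)) is false.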
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 2.6.22+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a207f593
    • M
      loop: fix crash if blk_alloc_queue fails · 3ec981e3
      Mikulas Patocka committed
      If blk_alloc_queue fails, loop_add cleans up, but it doesn't release the
      identifier allocated with idr_alloc. That causes a crash on module unload in
      idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); where we attempt to
      remove a nonexistent device with that id.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000380
      IP: [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
      PGD 43d399067 PUD 43d0ad067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: loop(-) dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_powersave spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc lm85 hwmon_vid snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq ohci_hcd freq_table tg3 ehci_pci mperf ehci_hcd kvm_amd kvm sata_svw serverworks libphy libata ide_core k10temp usbcore hwmon microcode ptp pcspkr pps_core e100 skge mii usb_common i2c_piix4 floppy evdev rtc_cmos i2c_core processor button unix
      CPU: 7 PID: 2735 Comm: rmmod Tainted: G        W    3.10.15-devel #15
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
      task: ffff88043d38e780 ti: ffff88043d21e000 task.ti: ffff88043d21e000
      RIP: 0010:[<ffffffff812057c9>]  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
      RSP: 0018:ffff88043d21fe10  EFLAGS: 00010282
      RAX: ffffffffa05102e0 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88043ea82800 RDI: 0000000000000000
      RBP: ffff88043d21fe48 R08: 0000000000000000 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000ff
      R13: 0000000000000080 R14: 0000000000000000 R15: ffff88043ea82800
      FS:  00007ff646534700(0000) GS:ffff880447000000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000380 CR3: 000000043e9bf000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffffffff8100aba4 0000000000000092 ffff88043d21fe48 ffff88043ea82800
       00000000000000ff ffff88043d21fe98 0000000000000000 ffff88043d21fe60
       ffffffffa05102b4 0000000000000000 ffff88043d21fe70 ffffffffa05102ec
      Call Trace:
       [<ffffffff8100aba4>] ? native_sched_clock+0x24/0x80
       [<ffffffffa05102b4>] loop_remove+0x14/0x40 [loop]
       [<ffffffffa05102ec>] loop_exit_cb+0xc/0x10 [loop]
       [<ffffffff81217b74>] idr_for_each+0x104/0x190
       [<ffffffffa05102e0>] ? loop_remove+0x40/0x40 [loop]
       [<ffffffff8109adc5>] ? trace_hardirqs_on_caller+0x105/0x1d0
       [<ffffffffa05135dc>] loop_exit+0x34/0xa58 [loop]
       [<ffffffff810a98ea>] SyS_delete_module+0x13a/0x260
       [<ffffffff81221d5e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      Code: f0 4c 8b 6d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 4c 8d af 80 00 00 00 41 54 53 48 89 fb 48 83 ec 18 <48> 83 bf 80 03 00 00 00 74 4d e8 98 fe ff ff 31 f6 48 c7 c7 20
      RIP  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
       RSP <ffff88043d21fe10>
      CR2: 0000000000000380
      ---[ end trace 64ec069ec70f1309 ]---
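      
      The failure mode can be sketched in userspace as follows. This is an
      illustration under stated assumptions, not the loop driver: the fixed-size
      table is a toy stand-in for loop_index_idr, and toy_add(), toy_id_alloc(),
      toy_id_remove() and toy_count_ids() are hypothetical names. The point is
      that an error path taken after the id was allocated must also release that
      id, otherwise a later walk over all ids finds an entry with no device
      behind it.
      
      ```c
      #include <stddef.h>
      #include <stdlib.h>
      
      #define MAX_IDS 8
      
      struct toy_dev { int has_disk; };
      
      /* Toy stand-in for the id registry: slot i is allocated when id_used[i] != 0. */
      static struct toy_dev *id_table[MAX_IDS];
      static int id_used[MAX_IDS];
      
      static int toy_id_alloc(struct toy_dev *dev)
      {
          for (int i = 0; i < MAX_IDS; i++) {
              if (!id_used[i]) {
                  id_used[i] = 1;
                  id_table[i] = dev;
                  return i;
              }
          }
          return -1;
      }
      
      static void toy_id_remove(int id)
      {
          id_used[id] = 0;
          id_table[id] = NULL;
      }
      
      static int toy_count_ids(void)
      {
          int n = 0;
          for (int i = 0; i < MAX_IDS; i++)
              n += id_used[i];
          return n;
      }
      
      /*
       * Hypothetical add path. queue_fails simulates the queue allocation
       * failing; release_id_on_error selects the fixed (1) or buggy (0) cleanup.
       * Returns the id on success, -1 on failure.
       */
      static int toy_add(int queue_fails, int release_id_on_error)
      {
          struct toy_dev *dev = calloc(1, sizeof(*dev));
          if (!dev)
              return -1;
      
          int id = toy_id_alloc(dev);
          if (id < 0)
              goto out_free_dev;
      
          if (queue_fails)
              goto out_free_id;
      
          dev->has_disk = 1;
          return id;
      
      out_free_id:
          if (release_id_on_error)
              toy_id_remove(id);
          /* Buggy variant: the id stays registered although dev is freed below. */
      out_free_dev:
          free(dev);
          return -1;
      }
      ```
      
      After toy_add(1, 0) fails, toy_count_ids() still reports one registered id;
      after toy_add(1, 1) fails, the registry is empty again.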
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 3.1+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3ec981e3
  19. 29 Jun, 2013 1 commit
  20. 07 May, 2013 1 commit
  21. 10 Apr, 2013 1 commit
  22. 08 Apr, 2013 1 commit
    • J
      Revert "loop: cleanup partitions when detaching loop device" · c2fccc1c
      Jens Axboe committed
      This reverts commit 8761a3dc.
      
      There are situations where the destruction path is called
      with the bdev->bd_mutex already held, which then deadlocks in
      loop_clr_fd(). The normal partition cleanup does a trylock()
      on the mutex, but it'd be nice to have a more bulletproof
      method in loop. So punt this more involved fix to the next
      merge window, and just back out this buggy fix for now.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c2fccc1c
  23. 02 Apr, 2013 1 commit
    • A
      loop: prevent bdev freeing while device in use · c1681bf8
      Anatol Pomozov committed
      The lifecycle of struct block_device is defined by its inode (see fs/block_dev.c):
      the block_device is allocated the first time we access /dev/loopXX and deallocated in
      bdev_destroy_inode. When we create the device with "losetup /dev/loopXX afile",
      we want that block_device to stay alive until we destroy the loop device
      with "losetup -d".
      
      But because we do not hold the /dev/loopXX inode, its reference count drops to 0 and
      the inode/bdev can be destroyed at any moment. This usually happens under memory
      pressure or when the user drops the inode cache (as in the test below). When
      loop_clr_fd() later tries to use the bdev, we get a use-after-free error with the
      following stack:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
        bd_set_size+0x10/0xa0
        loop_clr_fd+0x1f8/0x420 [loop]
        lo_ioctl+0x200/0x7e0 [loop]
        lo_compat_ioctl+0x47/0xe0 [loop]
        compat_blkdev_ioctl+0x341/0x1290
        do_filp_open+0x42/0xa0
        compat_sys_ioctl+0xc1/0xf20
        do_sys_open+0x16e/0x1d0
        sysenter_dispatch+0x7/0x1a
      
      To prevent use-after-free we need to grab the device in loop_set_fd()
      and put it later in loop_clr_fd().
      
      The issue is reproducible on current Linus head and v3.3. Here is the test:
      
        dd if=/dev/zero of=loop.file bs=1M count=1
        while [ true ]; do
          losetup /dev/loop0 loop.file
          echo 2 > /proc/sys/vm/drop_caches
          losetup -d /dev/loop0
        done
      
      [ Doing bdgrab/bdput in loop_set_fd/loop_clr_fd is safe, because every
        time we call loop_set_fd() we check that loop_device->lo_state is
        Lo_unbound and set it to Lo_bound. If somebody tries to set_fd again,
        they will get EBUSY.  And if we try to loop_clr_fd() on an unbound loop
        device we'll get ENXIO.
      
        loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
        loop_device->lo_ctl_mutex. ]
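      
      The pairing argued for in the note above can be sketched in userspace like
      this. It is a simplified model, not the driver: struct toy_bdev, struct
      toy_loop, toy_set_fd() and toy_clr_fd() are hypothetical, the reference
      count is a plain int standing in for bdgrab()/bdput(), only the two states
      named in the note are modelled, and locking is omitted because the note
      says the ioctls already run under lo_ctl_mutex.
      
      ```c
      #include <stddef.h>
      #include <errno.h>
      
      enum { Lo_unbound, Lo_bound };
      
      struct toy_bdev { int refcount; };
      
      struct toy_loop {
          int lo_state;
          struct toy_bdev *bdev;
      };
      
      /* Bind: allowed only from Lo_unbound, takes one reference on the bdev. */
      static int toy_set_fd(struct toy_loop *lo, struct toy_bdev *bdev)
      {
          if (lo->lo_state != Lo_unbound)
              return -EBUSY;
      
          bdev->refcount++;          /* stand-in for bdgrab() */
          lo->bdev = bdev;
          lo->lo_state = Lo_bound;
          return 0;
      }
      
      /* Unbind: allowed only from Lo_bound, drops the reference taken above. */
      static int toy_clr_fd(struct toy_loop *lo)
      {
          if (lo->lo_state != Lo_bound)
              return -ENXIO;
      
          lo->bdev->refcount--;      /* stand-in for bdput() */
          lo->bdev = NULL;
          lo->lo_state = Lo_unbound;
          return 0;
      }
      ```
      
      Because the state check rejects a second toy_set_fd() and a toy_clr_fd() on
      an unbound device, each grab is matched by exactly one put and the count
      cannot go negative or leak in this model.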
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1681bf8
  24. 23 Mar, 2013 1 commit
    • P
      loop: cleanup partitions when detaching loop device · 8761a3dc
      Phillip Susi committed
      Any partitions added by user space to the loop device were being
      left in place after detaching the loop device.  This was because
      the detach path issued a BLKRRPART to clean up partitions if
      LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
      scanned on attach.  Replace this BLKRRPART with code that
      unconditionally cleans up partitions on detach instead.
      Signed-off-by: Phillip Susi <psusi@ubuntu.com>
      
      Modified by Jens to export delete_partition().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8761a3dc
  25. 22 Mar, 2013 1 commit
  26. 28 Feb, 2013 2 commits