1. 30 January 2015 · 1 commit
    • NVMe: avoid kmalloc/kfree for smaller IO · ac3dd5bd
      Authored by Jens Axboe
      Currently we allocate an nvme_iod for each IO, which holds the
      sg list, PRPs, and other IO-related info. Set a threshold of
      2 pages and/or 8KB of data, below which we can simply embed this
      in the per-command PDU in blk-mq. For any IO at or below
      NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and a kfree.
      
      For higher IOPS, this saves up to 1% of CPU time.
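      
      The shape of the optimization, as a hedged sketch (the helper name,
      the pdu struct, and the embedded field are illustrative, not the
      driver's exact code):
      
          /* Sketch only: use the iod embedded in the per-command PDU for
           * small IOs; fall back to kmalloc above the thresholds. */
          static struct nvme_iod *nvme_get_iod(struct nvme_cmd_pdu *pdu,
                                               unsigned int nbytes,
                                               unsigned int npages)
          {
                  if (nbytes <= NVME_INT_BYTES && npages <= NVME_INT_PAGES)
                          return &pdu->int_iod;   /* embedded: no allocation */
                  return kmalloc(sizeof(struct nvme_iod), GFP_ATOMIC);
          }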
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
  2. 27 January 2015 · 1 commit
  3. 25 January 2015 · 1 commit
  4. 24 January 2015 · 3 commits
    • libata: use blk tagging · 12cb5ce1
      Authored by Shaohua Li
      libata uses its own tag management, which is duplicated effort, and
      the implementation is poor. And if we switch to blk-mq, tagging is
      built in. It's time to switch to generic tagging.
      
      The SAS driver has its own tag management, and it looks like we can't
      directly map the host controller tag to a SATA tag, so I just
      bypassed the SAS case.
      
      I changed the code/variable names for libata's tag management to
      make it self-contained; only SAS will use it. Later, if libsas
      implements its own tag management, the tag management code in libata
      can be deleted easily.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: add tag allocation policy · 24391c0d
      Authored by Shaohua Li
      This is the blk-mq part of tag allocation policy support. The
      default allocation policy is unchanged (though it is not a strict
      FIFO). The new policy, used by libata, is round-robin. It is a
      best-effort implementation, though: if multiple tasks are competing,
      the returned tags will be interleaved, which is unavoidable even
      without blk-mq, since requests from different tasks can be mixed in
      the queue. A toy sketch of the two policies follows below.
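      
      To make the policies concrete, here is a toy user-space sketch of
      FIFO-style versus round-robin allocation over a small bitmap; it is
      purely illustrative and not the kernel's implementation:
      
          struct tag_map {
                  unsigned long free_bits;  /* bit i set => tag i free */
                  unsigned int  next;       /* round-robin cursor */
                  unsigned int  depth;      /* <= bits in a long here */
          };
      
          /* FIFO-ish: always scan from tag 0, so low tags are reused first. */
          static int alloc_tag_fifo(struct tag_map *tm)
          {
                  for (unsigned int i = 0; i < tm->depth; i++) {
                          if (tm->free_bits & (1UL << i)) {
                                  tm->free_bits &= ~(1UL << i);
                                  return (int)i;
                          }
                  }
                  return -1;  /* no tag free */
          }
      
          /* Round-robin: resume scanning where the last allocation stopped. */
          static int alloc_tag_rr(struct tag_map *tm)
          {
                  for (unsigned int n = 0; n < tm->depth; n++) {
                          unsigned int i = (tm->next + n) % tm->depth;
                          if (tm->free_bits & (1UL << i)) {
                                  tm->free_bits &= ~(1UL << i);
                                  tm->next = i + 1;
                                  return (int)i;
                          }
                  }
                  return -1;
          }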
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: support different tag allocation policy · ee1b6f7a
      Authored by Shaohua Li
      libata's tag allocation uses a round-robin policy. The next patch
      will make libata use the block layer's generic tag allocation, so
      add a policy option to tag allocation.
      
      Currently there are two policies: FIFO (the default) and round-robin.
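      
      With this in place, a driver selects the policy when initializing
      its tag map. A minimal sketch (the enum and helper names follow my
      reading of this series; treat them as approximate):
      
          /* Pick the allocation policy when creating the tag map. */
          struct blk_queue_tag *tags;
      
          tags = blk_init_tags(depth, BLK_TAG_ALLOC_RR);  /* round-robin */
          if (!tags)
                  return -ENOMEM;
          /* BLK_TAG_ALLOC_FIFO keeps the old default behavior. */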
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 22 January 2015 · 2 commits
  6. 17 January 2015 · 1 commit
    • null_blk: suppress invalid partition info · 227290b4
      Authored by Jens Axboe
      null_blk is partitionable, but it doesn't store any of the info. When
      it is loaded, you would normally see:
      
      [1226739.343608]  nullb0: unknown partition table
      [1226739.343746]  nullb1: unknown partition table
      
      which can confuse some people. Add the appropriate gendisk flag
      to suppress this info.
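      
      The suppression itself is just a gendisk flag; a minimal sketch,
      assuming it is set where null_blk creates its disks:
      
          /* Keep the "unknown partition table" message out of the log. */
          disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;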
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 14 January 2015 · 3 commits
    • brd: Request from fdisk 4k alignment · c8fa3173
      Authored by Boaz Harrosh
      Because the direct_access() API returns a PFN, partitions had better
      start on a 4K boundary; otherwise offset zero of a partition will
      not be aligned, and blk_direct_access() will fail the call.
      
      By setting blk_queue_physical_block_size(PAGE_SIZE) we can
      communicate this to fdisk and friends.
      
      The call to blk_queue_physical_block_size() is harmless and does not
      affect kernel behavior in any way; it is only communication to
      user mode.
      
      Before this patch, running fdisk on a default-size brd of 4M, the
      first sector offered is 34 (bad); after this patch it is 40, i.e.
      aligned to 8 sectors. Also, when entering some random partition
      sizes, the next partition-start sector offered is 8-sector aligned
      after this patch. (Note that with fdisk the user can still enter bad
      values; only the offered default values are corrected.)
      
      Note that with a bdev size > 4M, fdisk will try to align on a 1M
      boundary in any case (the above first sector will then be 2048).
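      
      A minimal sketch of where the hint goes, assuming brd's queue setup
      path (the variable name is illustrative):
      
          /* Advertise a PAGE_SIZE physical block so fdisk and friends
           * align partitions to 4K; purely a hint to user space. */
          blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);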
      
      CC: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • brd: Fix all partitions BUGs · 937af5ec
      Authored by Boaz Harrosh
      This patch fixes up brd's partition scheme, now enjoying the best of
      all worlds.
      
      The MAIN fix here is that currently, if one fdisks some partitions,
      a bad bug makes all partitions point to the same start and end
      sectors, i.e. 0 through brd_size, and an mkfs of any partition will
      trash the partition table and the other partitions.
      
      Another fix is that "mount -U uuid" did not work if show_part was
      not specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag.
      We now always load without it and remove the show_part parameter.
      
      [We remove Dmitry's new module-param part_show; partitions are now
       always shown.]
      
      So NOW the logic goes like this:
      * max_part - just says how many minors to reserve between ramX
        devices. Either way, there can be as many partitions as requested.
        If the minors between devices run out, dynamic 259-major ids will
        be allocated on the fly.
        The default is now max_part=1, which means all partition devt(s)
        will come from the dynamic (259) major range.
        (If persistent partition minors are needed, use max_part=X.)
        For example, with /dev/sdX max_part is hard-coded to 16.
      
      * Creation of new devices on the fly still/always works:
          mknod /path/devnod b 1 X
          fdisk -l /path/devnod
        will create a new device if [X / max_part] was not already
        created before (just as before).
      
        Partitions on the dynamically created device work as well; the
        same minor logic applies as with the pre-created ones.
      
      TODO: dynamic growing of the device size, so each device can have
            its own size.
      
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Change direct_access calling convention · dd22f551
      Authored by Matthew Wilcox
      In order to support accesses to larger chunks of memory, pass in a
      'size' parameter (counted in bytes), and return the amount available at
      that address.
      
      Add a new helper function, bdev_direct_access(), to handle common
      functionality: partition handling, checking that the requested
      length is positive, checking that the sector is page-aligned, and
      checking that the length of the request does not run past the end
      of the partition.
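      
      As I read the change, the op and the new helper end up with roughly
      this shape (treat the exact parameter order as approximate):
      
          /* Revised convention (approximate): both return the number of
           * bytes available at *addr, or a negative errno. */
          long (*direct_access)(struct block_device *bdev, sector_t sector,
                                void **addr, unsigned long *pfn, long size);
      
          long bdev_direct_access(struct block_device *bdev, sector_t sector,
                                  void **addr, unsigned long *pfn, long size);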
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 03 January 2015 · 6 commits
    • loop: add blk-mq.h include · 78e367a3
      Authored by Jens Axboe
      It looks like we pull it in through other paths on x86, but we fail
      on sparc:
      
      In file included from drivers/block/cryptoloop.c:30:0:
      drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type
      struct blk_mq_tag_set tag_set;
      
      Add the include to loop.h, kill it from loop.c.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: loop: don't handle REQ_FUA explicitly · af65aa8e
      Authored by Ming Lei
      The block core handles REQ_FUA via its flush state machine, so there
      is no need to handle it explicitly in loop.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: loop: introduce lo_discard() and lo_req_flush() · cf655d95
      Authored by Ming Lei
      No behaviour change; just move the handling of REQ_DISCARD and
      REQ_FLUSH into these two functions.
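      
      Roughly, the two helpers boil down to the following (error handling
      elided; treat the details as approximate):
      
          static void lo_discard(struct loop_device *lo, struct request *rq)
          {
                  struct file *file = lo->lo_backing_file;
                  int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
      
                  /* Punch a hole in the backing file over the discarded range. */
                  file->f_op->fallocate(file, mode,
                                        (loff_t)blk_rq_pos(rq) << 9,
                                        blk_rq_bytes(rq));
          }
      
          static void lo_req_flush(struct loop_device *lo, struct request *rq)
          {
                  /* REQ_FLUSH maps to an fsync of the backing file. */
                  vfs_fsync(lo->lo_backing_file, 0);
          }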
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: loop: say goodbye to bio · 30112013
      Authored by Ming Lei
      Switch completely to block requests.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: loop: improve performance via blk-mq · b5dd2f60
      Authored by Ming Lei
      The conversion is fairly straightforward: use a workqueue to
      dispatch the loop block requests. One big change is that requests
      are now submitted to the backing file/device concurrently via the
      workqueue, so throughput may improve considerably. Since write
      requests over the same file often run exclusively, they are not
      handled concurrently, avoiding extra context-switch cost, possible
      lock contention, and work-scheduling cost. Also, with blk-mq there
      is an opportunity to get loop I/O merged before it is submitted to
      the backing file/device.
      
      In the following test:
      	- base: v3.19-rc2-2041231
      	- loop over file in ext4 file system on SSD disk
      	- bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1
      	- throughput: IOPS
      
      	------------------------------------------------------
      	|            | base      | base with loop-mq | delta |
      	------------------------------------------------------
      	| randread   | 1740      | 25318             | +1355%|
      	| read       | 42196     | 51771             | +22.6%|
      	| randwrite  | 35709     | 34624             | -3%   |
      	| write      | 39137     | 40326             | +3%   |
      	------------------------------------------------------
      
      So loop-mq improves throughput for both read and randread;
      meanwhile, write and randwrite performance is basically not hurt.
      
      Another benefit is that the loop driver code gets much simpler after
      the blk-mq conversion, so the patch can be viewed as a cleanup too.
      A hedged sketch of the queue_rq/workqueue shape follows below.
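      
      A hedged sketch of that shape (handler names and the per-command
      struct are illustrative, not necessarily the driver's exact code):
      
          /* queue_rq defers the real work to a workqueue so the backing
           * file/device is driven from process context. */
          static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
          {
                  struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
      
                  blk_mq_start_request(bd->rq);
                  queue_work(loop_wq, &cmd->work);  /* async dispatch */
                  return BLK_MQ_RQ_QUEUE_OK;
          }
      
          static void loop_handle_work(struct work_struct *work)
          {
                  struct loop_cmd *cmd = container_of(work, struct loop_cmd, work);
      
                  /* ... read/write the backing file, then complete: */
                  blk_mq_complete_request(cmd->rq);
          }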
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: fix checking return value of blk_mq_init_queue · 35b489d3
      Authored by Ming Lei
      Check IS_ERR_OR_NULL() on the return value instead of just testing
      the return value.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      
      Reduced to IS_ERR() by me; we never return NULL.
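      
      The resulting caller pattern, as a minimal sketch:
      
          /* blk_mq_init_queue() returns a queue pointer or an ERR_PTR(),
           * never NULL, so IS_ERR() suffices. */
          q = blk_mq_init_queue(&tag_set);
          if (IS_ERR(q))
                  return PTR_ERR(q);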
      Signed-off-by: Jens Axboe <axboe@fb.com>
  9. 27 December 2014 · 1 commit
  10. 24 December 2014 · 1 commit
  11. 23 December 2014 · 3 commits
  12. 22 December 2014 · 2 commits
  13. 20 December 2014 · 4 commits
  14. 19 December 2014 · 11 commits