1. 21 5月, 2014 1 次提交
    • J
      blk-mq: allow changing of queue depth through sysfs · e3a2b3f9
      Jens Axboe 提交于
      For request_fn based devices, the block layer exports a 'nr_requests'
      file through sysfs to allow adjusting of queue depth on the fly.
      Currently this returns -EINVAL for blk-mq, since it's not wired up.
      Wire this up for blk-mq, so that it now also always dynamic
      adjustments of the allowed queue depth for any given block device
      managed by blk-mq.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e3a2b3f9
  2. 20 5月, 2014 1 次提交
    • J
      blk-mq: switch ctx pending map to the sparser blk_align_bitmap · 1429d7c9
      Jens Axboe 提交于
      Each hardware queue has a bitmap of software queues with pending
      requests. When new IO is queued on a software queue, the bit is
      set, and when IO is pruned on a hardware queue run, the bit is
      cleared. This causes a lot of traffic. Switch this from the regular
      BITS_PER_LONG bitmap to a sparser layout, similarly to what was
      done for blk-mq tagging.
      
      20% performance increase was observed for single threaded IO, and
      about 15% performanc increase on multiple threads driving the
      same device.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1429d7c9
  3. 14 5月, 2014 1 次提交
    • J
      blk-mq: improve support for shared tags maps · 0d2602ca
      Jens Axboe 提交于
      This adds support for active queue tracking, meaning that the
      blk-mq tagging maintains a count of active users of a tag set.
      This allows us to maintain a notion of fairness between users,
      so that we can distribute the tag depth evenly without starving
      some users while allowing others to try unfair deep queues.
      
      If sharing of a tag set is detected, each hardware queue will
      track the depth of its own queue. And if this exceeds the total
      depth divided by the number of active queues, the user is actively
      throttled down.
      
      The active queue count is done lazily to avoid bouncing that data
      between submitter and completer. Each hardware queue gets marked
      active when it allocates its first tag, and gets marked inactive
      when 1) the last tag is cleared, and 2) the queue timeout grace
      period has passed.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0d2602ca
  4. 09 5月, 2014 2 次提交
    • J
      blk-mq: implement new and more efficient tagging scheme · 4bb659b1
      Jens Axboe 提交于
      blk-mq currently uses percpu_ida for tag allocation. But that only
      works well if the ratio between tag space and number of CPUs is
      sufficiently high. For most devices and systems, that is not the
      case. The end result if that we either only utilize the tag space
      partially, or we end up attempting to fully exhaust it and run
      into lots of lock contention with stealing between CPUs. This is
      not optimal.
      
      This new tagging scheme is a hybrid bitmap allocator. It uses
      two tricks to both be SMP friendly and allow full exhaustion
      of the space:
      
      1) We cache the last allocated (or freed) tag on a per blk-mq
         software context basis. This allows us to limit the space
         we have to search. The key element here is not caching it
         in the shared tag structure, otherwise we end up dirtying
         more shared cache lines on each allocate/free operation.
      
      2) The tag space is split into cache line sized groups, and
         each context will start off randomly in that space. Even up
         to full utilization of the space, this divides the tag users
         efficiently into cache line groups, avoiding dirtying the same
         one both between allocators and between allocator and freeer.
      
      This scheme shows drastically better behaviour, both on small
      tag spaces but on large ones as well. It has been tested extensively
      to show better performance for all the cases blk-mq cares about.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4bb659b1
    • C
      blk-mq: initialize struct request fields individually · af76e555
      Christoph Hellwig 提交于
      This allows us to avoid a non-atomic memset over ->atomic_flags as well
      as killing lots of duplicate initializations.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      af76e555
  5. 08 5月, 2014 1 次提交
  6. 25 4月, 2014 1 次提交
    • C
      blk-mq: respect rq_affinity · 38535201
      Christoph Hellwig 提交于
      The blk-mq code is using it's own version of the I/O completion affinity
      tunables, which causes a few issues:
      
       - the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
         still is present, thus breaking existing tuning setups.
       - the rq_affinity = 1 mode, which is the defauly for legacy request based
         drivers isn't implemented at all.
       - blk-mq drivers don't implement any completion affinity with the default
         flag settings.
      
      This patches removes the blk-mq ipi_redirect flag and sysfs file, as well
      as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that
      respects the queue-wide rq_affinity flags and also implements the
      rq_affinity = 1 mode.
      
      This means I/O completion affinity can now only be tuned block-queue wide
      instead of per context, which seems more sensible to me anyway.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      38535201
  7. 23 4月, 2014 1 次提交
  8. 17 4月, 2014 7 次提交
  9. 16 4月, 2014 4 次提交
    • C
      block: all blk-mq requests are tagged · fb3ccb5d
      Christoph Hellwig 提交于
      Instead of setting the REQ_QUEUED flag on each of them just take it into
      account in the only macro checking it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fb3ccb5d
    • C
      blk-mq: split out tag initialization, support shared tags · 24d2f903
      Christoph Hellwig 提交于
      Add a new blk_mq_tag_set structure that gets set up before we initialize
      the queue.  A single blk_mq_tag_set structure can be shared by multiple
      queues.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      
      Modular export of blk_mq_{alloc,free}_tagset added by me.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      24d2f903
    • C
      blk-mq: add ->init_request and ->exit_request methods · e9b267d9
      Christoph Hellwig 提交于
      The current blk_mq_init_commands/blk_mq_free_commands interface has a
      two problems:
      
       1) Because only the constructor is passed to blk_mq_init_commands there
          is no easy way to clean up when a comman initialization failed.  The
          current code simply leaks the allocations done in the constructor.
      
       2) There is no good place to call blk_mq_free_commands: before
          blk_cleanup_queue there is no guarantee that all outstanding
          commands have completed, so we can't free them yet.  After
          blk_cleanup_queue the queue has usually been freed.  This can be
          worked around by grabbing an unconditional reference before calling
          blk_cleanup_queue and dropping it after blk_mq_free_commands is
          done, although that's not exatly pretty and driver writers are
          guaranteed to get it wrong sooner or later.
      
      Both issues are easily fixed by making the request constructor and
      destructor normal blk_mq_ops methods.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e9b267d9
    • J
      block: remove struct request buffer member · b4f42e28
      Jens Axboe 提交于
      This was used in the olden days, back when onions were proper
      yellow. Basically it mapped to the current buffer to be
      transferred. With highmem being added more than a decade ago,
      most drivers map pages out of a bio, and rq->buffer isn't
      pointing at anything valid.
      
      Convert old style drivers to just use bio_data().
      
      For the discard payload use case, just reference the page
      in the bio.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b4f42e28
  10. 12 4月, 2014 1 次提交
    • D
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller 提交于
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      676d2369
  11. 11 4月, 2014 6 次提交
  12. 10 4月, 2014 9 次提交
    • J
      block: fix regression with block enabled tagging · 360f92c2
      Jens Axboe 提交于
      Martin reported that his test system would not boot with
      current git, it oopsed with this:
      
      BUG: unable to handle kernel paging request at ffff88046c6c9e80
      IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
      PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
      Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
      rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
      CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
      Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
      Workqueue: events_unbound async_run_entry_fn
      task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
      RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
      blk_queue_start_tag+0x90/0x150
      RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
      RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
      RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
      RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
      R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
      R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
      FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
      Stack:
       ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
       ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
       ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
      Call Trace:
       [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
       [<ffffffff81293167>] __blk_run_queue+0x37/0x50
       [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
       [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
       [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
       [<ffffffff813bba35>] scsi_execute+0xe5/0x180
       [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
       [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
       [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
       [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
       [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
       [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
       [<ffffffff8106a1c5>] process_one_work+0x175/0x410
       [<ffffffff8106b373>] worker_thread+0x123/0x400
       [<ffffffff8106b250>] ? manage_workers+0x160/0x160
       [<ffffffff8107104e>] kthread+0xce/0xf0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
      Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
      df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
      58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
      RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
       RSP <ffff880273d03a58>
      CR2: ffff88046c6c9e80
      
      Martin bisected and found this to be the problem patch;
      
      	commit 6d113398
      	Author: Jan Kara <jack@suse.cz>
      	Date:   Mon Feb 24 16:39:54 2014 +0100
      
      	    block: Stop abusing rq->csd.list in blk-softirq
      
      and the problem was immediately apparent. The patch states that
      it is safe to reuse queuelist at completion time, since it is
      no longer used. However, that is not true if a device is using
      block enabled tagging. If that is the case, then the queuelist
      is reused to keep track of busy tags. If a device also ended
      up using softirq completions, we'd reuse ->queuelist for the
      IPI handling while block tagging was still using it. Boom.
      
      Fix this by adding a new ipi_list list head, and share the
      memory used with the request hash table. The hash table is
      never used after the request is moved to the dispatch list,
      which happens long before any potential completion of the
      request. Add a new request bit for this, so we don't have
      cases that check rq->hash while it could potentially have
      been reused for the IPI completion.
      Reported-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Tested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      360f92c2
    • M
      scsi: Make sure cmd_flags are 64-bit · 2bfad21e
      Martin K. Petersen 提交于
      cmd_flags in struct request is now 64 bits wide but the scsi_execute
      functions truncated arguments passed to int leading to errors. Make sure
      the flags parameters are u64.
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Cc: Jens Axboe <axboe@fb.com>
      CC: Jan Kara <jack@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2bfad21e
    • M
      tracing: Fix anonymous unions in struct ftrace_event_call · abb43f69
      Mathieu Desnoyers 提交于
      gcc <= 4.5.x has significant limitations with respect to initialization
      of anonymous unions within structures. They need to be surrounded by
      brackets, _and_ they need to be initialized in the same order in which
      they appear in the structure declaration.
      
      Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
      Link: http://lkml.kernel.org/r/1397077568-3156-1-git-send-email-mathieu.desnoyers@efficios.comSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      abb43f69
    • B
      x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id" · c4586256
      Behan Webster 提交于
      Similar to the fix in 40413dcb
      
      MODULE_DEVICE_TABLE(x86cpu, ...) expects the struct to be called struct
      x86cpu_device_id, and not struct x86_cpu_id which is what is used in the rest
      of the kernel code.  Although gcc seems to ignore this error, clang fails
      without this define to fix the name.
      
      Code from drivers/thermal/x86_pkg_temp_thermal.c
      static const struct x86_cpu_id __initconst pkg_temp_thermal_ids[] = { ... };
      MODULE_DEVICE_TABLE(x86cpu, pkg_temp_thermal_ids);
      
      Error from clang:
      drivers/thermal/x86_pkg_temp_thermal.c:577:1: error: variable has
            incomplete type 'const struct x86cpu_device_id'
      MODULE_DEVICE_TABLE(x86cpu, pkg_temp_thermal_ids);
      ^
      include/linux/module.h:145:3: note: expanded from macro
            'MODULE_DEVICE_TABLE'
        MODULE_GENERIC_TABLE(type##_device, name)
        ^
      include/linux/module.h:87:32: note: expanded from macro
            'MODULE_GENERIC_TABLE'
      extern const struct gtype##_id __mod_##gtype##_table            \
                                     ^
      <scratch space>:143:1: note: expanded from here
      __mod_x86cpu_device_table
      ^
      drivers/thermal/x86_pkg_temp_thermal.c:577:1: note: forward declaration of
            'struct x86cpu_device_id'
      include/linux/module.h:145:3: note: expanded from macro
            'MODULE_DEVICE_TABLE'
        MODULE_GENERIC_TABLE(type##_device, name)
        ^
      include/linux/module.h:87:21: note: expanded from macro
            'MODULE_GENERIC_TABLE'
      extern const struct gtype##_id __mod_##gtype##_table            \
                          ^
      <scratch space>:141:1: note: expanded from here
      x86cpu_device_id
      ^
      1 error generated.
      Signed-off-by: NBehan Webster <behanw@converseincode.com>
      Signed-off-by: NJan-Simon Möller <dl9pf@gmx.de>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4586256
    • M
      LLVMLinux: Add support for clang to compiler.h and new compiler-clang.h · 565cbdc2
      Mark Charlebois 提交于
      Add a compiler-clang.h file to add specific macros needed for compiling the
      kernel with clang.
      
      Initially the only override required is the macro for silencing the
      compiler for a purposefully uninintialized variable.
      
      Author: Mark Charlebois <charlebm@gmail.com>
      Signed-off-by: NMark Charlebois <charlebm@gmail.com>
      Signed-off-by: NBehan Webster <behanw@converseincode.com>
      565cbdc2
    • B
      LLVMLinux: Remove warning about returning an uninitialized variable · aa93685a
      Behan Webster 提交于
      Fix uninitialized return code in default case in cmpxchg-local.h
      
      This patch fixes the code to prevent an uninitialized return value that is detected
      when compiling with clang. The bug produces numerous warnings when compiling the
      Linux kernel with clang.
      Signed-off-by: NBehan Webster <behanw@converseincode.com>
      Signed-off-by: NMark Charlebois <charlebm@gmail.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      aa93685a
    • J
      blk-mq: ensure that hardware queues are always run on the mapped CPUs · e4043dcf
      Jens Axboe 提交于
      Instead of providing soft mappings with no guarantees on hardware
      queues always being run on the right CPU, switch to a hard mapping
      guarantee that ensure that we always run the hardware queue on
      (one of, if more) the mapped CPU.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e4043dcf
    • J
      block: add kblockd_schedule_delayed_work_on() · 8ab14595
      Jens Axboe 提交于
      Same function as kblockd_schedule_delayed_work(), but allow the
      caller to pass in a CPU that the work should be executed on. This
      just directly extends and maps into the workqueue API, and will
      be used to make the blk-mq mappings more strict.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ab14595
    • J
      block: remove 'q' parameter from kblockd_schedule_*_work() · 59c3d45e
      Jens Axboe 提交于
      The queue parameter is never used, just get rid of it.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      59c3d45e
  13. 09 4月, 2014 5 次提交