1. 22 Sep, 2018 (3 commits)
    • blkcg: convert blkg_lookup_create to find closest blkg · 07b05bcc
      Committed by Dennis Zhou (Facebook)
      There are several scenarios where blkg_lookup_create can fail. Examples
      include the blkcg dying, request_queue is dying, or simply being OOM. At
      the end of the day, most handle this by simply falling back to the
      q->root_blkg and calling it a day.
      
      This patch implements the notion of closest blkg. During
      blkg_lookup_create, if it fails to create, return the closest blkg
      found or the q->root_blkg. blkg_try_get_closest is introduced and used
      during association so a bio is always attached to a blkg.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
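The "closest blkg" fallback described in this commit can be sketched in userspace C. This is a simplified model, not the kernel code: `struct cg_node`, its fields, and `lookup_create_closest` are hypothetical names, and creation is reduced to setting a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup node; not the kernel's struct blkcg. */
struct cg_node {
    struct cg_node *parent;  /* NULL for the root cgroup */
    bool has_blkg;           /* a blkg already exists for this queue */
    bool can_create;         /* false models a dying cgroup or OOM */
};

/*
 * Return the node whose blkg should be used: the node itself if its blkg
 * exists or can be created, otherwise the nearest ancestor that already
 * has one, otherwise the root (the q->root_blkg fallback).
 */
static struct cg_node *lookup_create_closest(struct cg_node *cg,
                                             struct cg_node *root)
{
    if (cg->has_blkg)
        return cg;
    if (cg->can_create) {
        cg->has_blkg = true;  /* creation succeeded */
        return cg;
    }
    /* Creation failed: walk up until some ancestor has a blkg. */
    for (struct cg_node *pos = cg->parent; pos; pos = pos->parent)
        if (pos->has_blkg)
            return pos;
    return root;
}
```

Because the function never returns NULL, a caller modelled on bio association always ends up with some blkg to attach to.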
    • blkcg: update blkg_lookup_create to do locking · 49f4c2dc
      Committed by Dennis Zhou (Facebook)
      To know when to create a blkg, the general pattern is to do a
      blkg_lookup and if that fails, lock and then do a lookup again and if
      that fails finally create. It doesn't make much sense for everyone who
      wants to do creation to write this themselves.
      
      This changes blkg_lookup_create to do locking and implement this
      pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create.
      If a call site wants to do its own error handling or already owns the
      queue lock, they can use __blkg_lookup_create. This will be used in
      upcoming patches.
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
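The lookup, lock, lookup again, create pattern that this commit centralises might look like the following userspace sketch. It assumes a pthread mutex in place of the queue lock and C11 atomics in place of RCU; `struct table` and the `tbl_*` functions are hypothetical, with `tbl_lookup_create_locked` playing the role of `__blkg_lookup_create`.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_IDS 16

/* Hypothetical table standing in for a queue's blkg lookup structure. */
struct table {
    pthread_mutex_t lock;
    _Atomic(int *) slots[MAX_IDS];
    int storage[MAX_IDS];
    int creations;            /* counts how often an entry was created */
};

/* Lockless fast path (the kernel would rely on RCU here). */
static int *tbl_lookup(struct table *t, int id)
{
    return atomic_load_explicit(&t->slots[id], memory_order_acquire);
}

/* Caller must hold t->lock: look up again, create only if still missing. */
static int *tbl_lookup_create_locked(struct table *t, int id)
{
    int *e = tbl_lookup(t, id);

    if (e)
        return e;
    t->storage[id] = id;
    t->creations++;
    atomic_store_explicit(&t->slots[id], &t->storage[id],
                          memory_order_release);
    return &t->storage[id];
}

/* Convenience wrapper that does the locking for the caller. */
static int *tbl_lookup_create(struct table *t, int id)
{
    int *e = tbl_lookup(t, id);

    if (e)
        return e;
    pthread_mutex_lock(&t->lock);
    e = tbl_lookup_create_locked(t, id);
    pthread_mutex_unlock(&t->lock);
    return e;
}
```

A call site that already holds the lock, or wants its own error handling, would call the `_locked` variant directly, mirroring the split the commit message describes.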
    • blkcg: fix ref count issue with bio_blkcg using task_css · 27e6fa99
      Committed by Dennis Zhou (Facebook)
      The accessor function bio_blkcg either returns the blkcg associated with
      the bio or finds one in the current context. This can cause an issue
      when trying to associate a bio with a blkcg. Particularly, it's the
      third case that is problematic:
      
      	return css_to_blkcg(task_css(current, io_cgrp_id));
      
      As the above may race against task migration and the cgroup exiting, it
      is not always ok to take a reference on the blkcg returned from
      bio_blkcg.
      
      This patch adds association ahead of calling bio_blkcg rather than
      after. This makes association a required and explicit step along the
      code paths for calling bio_blkcg. blk_get_rl is modified as well to get
      a reference to the blkcg it may use and blk_put_rl will always put the
      reference back. Association is also moved above the bio_blkcg call to
      ensure it will not return NULL in blk-iolatency.
      
      BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
      to address this in this series. I've created a private version of the
      function with notes not to use it describing the flaw. Hopefully soon,
      that code can be cleaned up.
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
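To illustrate why taking a reference on an object that may be exiting is unsafe, here is a minimal sketch of an "increment unless zero" reference grab using C11 atomics. It is a generic illustration of the hazard, not the kernel's css reference API; `struct obj`, `obj_tryget`, and `obj_put` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; a count of 0 means it is being torn down. */
struct obj {
    atomic_int refs;
};

/* Take a reference only if the object is still live (count > 0). */
static bool obj_tryget(struct obj *o)
{
    int old = atomic_load(&o->refs);

    while (old > 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return true;
    }
    return false;  /* lost the race against teardown */
}

/* Drop a reference. */
static void obj_put(struct obj *o)
{
    atomic_fetch_sub(&o->refs, 1);
}
```

An unconditional increment on an object found through the current task could revive a count that has already reached zero, which is the kind of race the commit avoids by making association an explicit earlier step.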
  2. 21 Sep, 2018 (1 commit)
  3. 20 Sep, 2018 (1 commit)
  4. 15 Sep, 2018 (3 commits)
    • block, bfq: do not plug I/O if all queues are weight-raised · c8765de0
      Committed by Paolo Valente
      To reduce latency for interactive and soft real-time applications, bfq
      privileges the bfq_queues containing the I/O of these
      applications. These privileged queues, referred-to as weight-raised
      queues, get a much higher share of the device throughput
      w.r.t. non-privileged queues. To preserve this higher share, the I/O
      of any non-weight-raised queue must be plugged whenever a sync
      weight-raised queue, while being served, remains temporarily empty. To
      attain this goal, bfq simply plugs any I/O (from any queue), if a sync
      weight-raised queue remains empty while in service.
      
      Unfortunately, this plugging typically lowers throughput with random
      I/O, on devices with internal queueing (because it reduces the filling
      level of the internal queues of the device).
      
      This commit addresses this issue by restricting the cases where
      plugging is performed: if a sync weight-raised queue remains empty
      while in service, then I/O plugging is performed only if some of the
      active bfq_queues are *not* weight-raised (which is actually the only
      circumstance where plugging is needed to preserve the higher share of
      the throughput of weight-raised queues). This restriction proved able
      to boost throughput in many use cases that need only maximum
      throughput.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
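The restricted plugging rule can be written as a small predicate. This is a sketch under assumptions: `struct sched_state`, its counters, and `should_plug` are illustrative names, and the function models only the weight-raising reason for plugging described above.

```c
#include <stdbool.h>

/* Hypothetical summary of scheduler state; field names are illustrative. */
struct sched_state {
    int busy_queues;     /* queues that currently have pending I/O */
    int wr_busy_queues;  /* how many of those are weight-raised */
};

/*
 * Plug I/O only when the in-service queue is a sync weight-raised queue
 * that has gone empty AND at least one busy queue is not weight-raised.
 */
static bool should_plug(bool in_service_is_sync_wr, bool in_service_empty,
                        const struct sched_state *s)
{
    if (!in_service_is_sync_wr || !in_service_empty)
        return false;
    return s->wr_busy_queues < s->busy_queues;
}
```

When every busy queue is weight-raised the two counters are equal, so the predicate is false and the device's internal queues stay full.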
    • block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash · d0edc247
      Committed by Paolo Valente
      The Achilles' heel of BFQ is its failing to reach a high throughput
      with sync random I/O on flash storage with internal queueing, in case
      the processes doing I/O have differentiated weights.
      
      The cause of this failure is as follows. If at least two processes do
      sync I/O, and have a different weight from each other, then BFQ plugs
      I/O dispatching every time one of these processes, while it is being
      served, remains temporarily without pending I/O requests. This
      plugging is necessary to guarantee that every process enjoys a
      bandwidth proportional to its weight; but it empties the internal
      queue(s) of the drive. And this kills throughput with random I/O. So,
      if some processes have differentiated weights and do both sync and
      random I/O, the end result is a throughput collapse.
      
      This commit tries to counter this problem by injecting the service of
      other processes, in a controlled way, while the process in service
      happens to have no I/O. This injection is performed only if the medium
      is non rotational and performs internal queueing, and the process in
      service does random I/O (service injection might be beneficial for
      sequential I/O too, we'll work on that).
      
      As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
      SSD, and with five processes having differentiated weights and doing
      sync random 4KB I/O, this commit makes the throughput with bfq grow
      fourfold, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
      that reached with none. As some less random I/O is added to the mix,
      the throughput becomes equal to or higher than that with none.
      
      This commit is a very first attempt to recover throughput without
      losing control, and certainly has many limitations. One is, e.g., that
      the processes whose service is injected are not chosen so as to
      distribute the extra bandwidth they receive in accordance to their
      weights. Thus there might be loss of weighted fairness in some
      cases. Anyway, this loss concerns extra service, which would not have
      been received at all without this commit. Other limitations and issues
      will probably show up with usage.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
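The conditions under which injection is attempted can be summarised as a predicate. This is only a sketch of the conditions listed in the message; `struct device_traits` and `may_inject` are hypothetical names, and the "controlled way" in which the amount of injected service is limited is not modelled.

```c
#include <stdbool.h>

/* Hypothetical device description; field names are illustrative. */
struct device_traits {
    bool rotational;         /* true for spinning disks */
    bool internal_queueing;  /* e.g. NCQ-capable drive */
};

/*
 * Service of other processes may be injected only on non-rotational
 * media with internal queueing, while the in-service process has no
 * pending I/O and has been doing random (seeky) I/O.
 */
static bool may_inject(const struct device_traits *d,
                       bool in_service_has_io, bool in_service_is_seeky)
{
    return !d->rotational && d->internal_queueing &&
           !in_service_has_io && in_service_is_seeky;
}
```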
    • block, bfq: correctly charge and reset entity service in all cases · cbeb869a
      Committed by Paolo Valente
      BFQ schedules entities (which represent either per-process queues or
      groups of queues) as a function of their timestamps. In particular, as
      a function of their (virtual) finish times. The finish time of an
      entity is computed as a function of the budget assigned to the entity,
      assuming, tentatively, that the entity, once in service, will receive
      an amount of service equal to its budget. Then, when the entity is
      expired because it has finished being served, this finish time is updated
      as a function of the actual service received by the entity. This
      allows the entity to be correctly charged with only the service
      received, and then to be correctly re-scheduled.
      
      Yet an entity may receive service also while not being the entity in
      service (in the scheduling environment of its parent entity), for
      several reasons. If the entity remains with no backlog while receiving
      this 'unofficial' service, then it is expired. Also on such an
      expiration, the finish time of the entity should be updated to account
      for only the service actually received by the entity. Unfortunately,
      such an update is not performed for an entity expiring without being
      the entity in service.
      
      In a similar vein, the service counter of the entity in service is
      reset when the entity is expired, to be ready to be used for next
      service cycle. This reset too should be performed also in case an
      entity is expired because it remains empty after receiving service
      while not being the entity in service. But in this case the reset is
      not performed.
      
      This commit performs the above update of the finish time and reset of
      the service received, also for an entity expiring while not being the
      entity in service.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
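The timestamp bookkeeping described here follows the usual weighted fair queueing form, finish = start + service / weight. The sketch below assumes that form; `struct entity`, `VT_SHIFT`, and the function names are hypothetical, and the fixed-point shift is an arbitrary choice for illustration.

```c
#include <stdint.h>

#define VT_SHIFT 10  /* illustrative fixed-point precision */

/* Hypothetical scheduling entity; not the kernel's struct bfq_entity. */
struct entity {
    uint64_t start;    /* virtual start time */
    uint64_t finish;   /* virtual finish time */
    uint32_t weight;   /* must be non-zero */
    uint32_t budget;   /* service the entity is expected to receive */
    uint32_t service;  /* service actually received so far */
};

/* finish = start + amount / weight, in fixed-point virtual time. */
static void calc_finish(struct entity *e, uint32_t amount)
{
    e->finish = e->start + (((uint64_t)amount << VT_SHIFT) / e->weight);
}

/* On activation, tentatively assume the whole budget will be consumed. */
static void entity_activate(struct entity *e)
{
    calc_finish(e, e->budget);
}

/*
 * On expiration, charge only the service actually received and reset the
 * counter. The commit's point is that this must happen for every expiring
 * entity, whether or not it was the entity in service.
 */
static void entity_expire(struct entity *e)
{
    calc_finish(e, e->service);
    e->service = 0;
}
```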
  5. 14 Sep, 2018 (1 commit)
  6. 12 Sep, 2018 (1 commit)
  7. 08 Sep, 2018 (1 commit)
  8. 07 Sep, 2018 (7 commits)
  9. 06 Sep, 2018 (5 commits)
    • printk/tracing: Do not trace printk_nmi_enter() · d1c392c9
      Committed by Steven Rostedt (VMware)
      I hit the following splat in my tests:
      
      ------------[ cut here ]------------
      IRQs not enabled as expected
      WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c
      Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      EIP: tick_nohz_idle_enter+0x44/0x8c
      Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00
      75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0
      e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b
      EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007
      ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38
      DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296
      CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0
      Call Trace:
       do_idle+0x33/0x202
       cpu_startup_entry+0x61/0x63
       start_secondary+0x18e/0x1ed
       startup_32_smp+0x164/0x168
      irq event stamp: 18773830
      hardirqs last  enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10
      hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10
      softirqs last  enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf
      softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b
      ---[ end trace b7c64aa79e17954a ]---
      
      After a bit of debugging, I found what was happening. This would trigger
      when performing "perf" with a high NMI interrupt rate, while enabling and
      disabling function tracer. Ftrace uses breakpoints to convert the nops at
      the start of functions to calls to the function trampolines. The breakpoint
      traps disable interrupts and this makes calls into lockdep via the
      trace_hardirqs_off_thunk in the entry.S code. What happens is the following:
      
        do_idle {
      
          [interrupts enabled]
      
          <interrupt> [interrupts disabled]
      	TRACE_IRQS_OFF [lockdep says irqs off]
      	[...]
      	TRACE_IRQS_IRET
      	    test if pt_regs say return to interrupts enabled [yes]
      	    TRACE_IRQS_ON [lockdep says irqs are on]
      
      	    <nmi>
      		nmi_enter() {
      		    printk_nmi_enter() [traced by ftrace]
      		    [ hit ftrace breakpoint ]
      		    <breakpoint exception>
      			TRACE_IRQS_OFF [lockdep says irqs off]
      			[...]
      			TRACE_IRQS_IRET [return from breakpoint]
      			   test if pt_regs say interrupts enabled [no]
      			   [iret back to interrupt]
      	   [iret back to code]
      
          tick_nohz_idle_enter() {
      
      	lockdep_assert_irqs_enabled() [lockdep say no!]
      
      Although interrupts are indeed enabled, lockdep thinks they are not, and since
      we now do asserts via lockdep, it gives a false warning. The issue here is
      that printk_nmi_enter() is called before lockdep_off(), which disables
      lockdep (for this reason) in NMIs. By simply not allowing ftrace to see
      printk_nmi_enter() (via notrace annotation) we keep lockdep from getting
      confused.
      
      Cc: stable@vger.kernel.org
      Fixes: 42a0bb3f ("printk/nmi: generic solution for safe printk in NMI")
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • block: don't warn when doing fsync on read-only devices · 8b2ded1c
      Committed by Mikulas Patocka
      It is possible to call fsync on a read-only handle (for example, fsck.ext2
      does it when doing read-only check), and this call results in kernel
      warning.
      
      The patch b089cfd9 ("block: don't warn for flush on read-only device")
      attempted to disable the warning, but it is buggy and fails to do so
      (op_is_flush tests flags, but bio_op strips off the flags).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 721c7fc7 ("block: fail op_is_write() requests to read-only partitions")
      Cc: stable@vger.kernel.org	# 4.18
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
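The bug noted in parentheses, a flag test applied after the flags have been masked off, can be reproduced in a few lines. The bit layout below is an assumption chosen for illustration and the names are hypothetical, not the kernel's `REQ_*` definitions.

```c
#include <stdbool.h>

/* Illustrative layout: low bits hold the operation, higher bits hold flags. */
#define OP_BITS        8
#define OP_MASK        ((1u << OP_BITS) - 1)
#define OP_WRITE       1u
#define FLAG_FUA       (1u << (OP_BITS + 0))
#define FLAG_PREFLUSH  (1u << (OP_BITS + 1))

/* Models bio_op(): keeps the operation and strips every flag bit. */
static unsigned int get_op(unsigned int opf)
{
    return opf & OP_MASK;
}

/* Models op_is_flush(): tests flag bits, so it needs the unmasked value. */
static bool is_flush(unsigned int opf)
{
    return (opf & (FLAG_FUA | FLAG_PREFLUSH)) != 0;
}
```

With this layout, `is_flush(get_op(opf))` is false for every `opf`, so a check written that way can never recognise a flush request; the test has to be applied to the full value.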
    • Merge tag 'gpio-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · b36fdc68
      Committed by Linus Torvalds
      Pull GPIO fixes from Linus Walleij:
       "Some GPIO fixes. The ACPI stuff is probably the most annoying for
        users that get fixed this time.
      
         - Atomic contexts, cansleep* calls and such fastpath/slowpath
           things.
      
         - Defer ACPI event handler registration to late_initcall() so IRQs do
           not fire in our face before other drivers have a chance to register
           handlers.
      
         - Race condition if a consumer requests a GPIO after
           gpiochip_add_data_with_key() but before of_gpiochip_add()
      
         - Probe errorpath in the dwapb driver"
      
      * tag 'gpio-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        gpio: Fix crash due to registration race
        gpio: dwapb: Fix error handling in dwapb_gpio_probe()
        gpiolib-acpi: Register GpioInt ACPI event handlers from a late_initcall
        gpiolib: acpi: Switch to cansleep version of GPIO library call
        gpio: adp5588: Fix sleep-in-atomic-context bug
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · f4697d9a
      Committed by Linus Torvalds
      Pull SCSI fixes from James Bottomley:
       "A set of very minor fixes and a couple of reverts to fix a major
        problem (the attempt to change the busy count causes a hang when
        attempting to change the drive cache type)"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: aacraid: fix a signedness bug
        Revert "scsi: core: avoid host-wide host_busy counter for scsi_mq"
        Revert "scsi: core: fix scsi_host_queue_ready"
        scsi: libata: Add missing newline at end of file
        scsi: target: iscsi: cxgbit: use pr_debug() instead of pr_info()
        scsi: hpsa: limit transfer length to 1MB, not 512kB
        scsi: lpfc: Correct MDS diag and nvmet configuration
        scsi: lpfc: Default fdmi_on to on
        scsi: csiostor: fix incorrect port capabilities
        scsi: csiostor: add a check for NULL pointer after kmalloc()
        scsi: documentation: add scsi_mod.use_blk_mq to scsi-parameters
        scsi: core: Update SCSI_MQ_DEFAULT help text to match default
    • Merge tag 'nds32-for-linus-4.19-tag1' of... · d0c1db1d
      Committed by Linus Torvalds
      Merge tag 'nds32-for-linus-4.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux
      
      Pull nds32 updates from Greentime Hu:
       "Contained in here are the bug fixes, building error fixes and ftrace
        support for nds32"
      
      * tag 'nds32-for-linus-4.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux:
        nds32: linker script: GCOV kernel may refers data in __exit
        nds32: fix build error because of wrong semicolon
        nds32: Fix a kernel panic issue because of wrong frame pointer access.
        nds32: Only print one page of stack when die to prevent printing too much information.
        nds32: Add macro definition for offset of lp register on stack
        nds32: Remove the deprecated ABI implementation
        nds32/stack: Get real return address by using ftrace_graph_ret_addr
        nds32/ftrace: Support dynamic function graph tracer
        nds32/ftrace: Support dynamic function tracer
        nds32/ftrace: Add RECORD_MCOUNT support
        nds32/ftrace: Support static function graph tracer
        nds32/ftrace: Support static function tracer
        nds32: Extract the checking and getting pointer to a macro
        nds32: Clean up the coding style
        nds32: Fix get_user/put_user macro expand pointer problem
        nds32: Fix empty call trace
        nds32: add NULL entry to the end of_device_id array
        nds32: fix logic for module
  10. 05 Sep, 2018 (17 commits)