1. 09 7月, 2018 12 次提交
    • J
      blkcg: add generic throttling mechanism · d09d8df3
      Josef Bacik 提交于
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling in a per-blkg
      basis, and if we require throttling flag the task that it needs to check
      before it returns to user space and possibly sleep there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d09d8df3
    • T
      swap,blkcg: issue swap io with the appropriate context · 0d3bd88d
      Tejun Heo 提交于
      For backcharging we need to know who the page belongs to when swapping
      it out.  We don't worry about things that do ->rw_page (zram etc) at the
      moment, we're only worried about pages that actually go to a block
      device.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0d3bd88d
    • J
      blk: introduce REQ_SWAP · 0d1e0c7c
      Josef Bacik 提交于
      Just like REQ_META, it's important to know the IO coming down is swap
      in order to guard against potential IO priority inversion issues with
      cgroups.  Add REQ_SWAP and use it for all swap IO, and add it to our
      bio_issue_as_root_blkg helper.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0d1e0c7c
    • J
      blk-cgroup: allow controllers to output their own stats · 903d23f0
      Josef Bacik 提交于
      blk-iolatency has a few stats that it would like to print out, and
      instead of adding a bunch of crap to the generic code just provide a
      helper so that controllers can add stuff to the stat line if they want
      to.
      
      Hide it behind a boot option since it changes the output of io.stat from
      normal, and these stats are only interesting to developers.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      903d23f0
    • J
      block: introduce bio_issue_as_root_blkg · c7c98fd3
      Josef Bacik 提交于
      Instead of forcing all file systems to get the right context on their
      bio's, simply check for REQ_META to see if we need to issue as the root
      blkg.  We don't want to force all bio's to have the root blkg associated
      with them if REQ_META is set, as some controllers (blk-iolatency) need
      to know who the originating cgroup is so it can backcharge them for the
      work they are doing.  This helper will make sure that the controllers do
      the proper thing wrt the IO priority and backcharging.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c7c98fd3
    • J
      block: add bi_blkg to the bio for cgroups · 08e18eab
      Josef Bacik 提交于
      Currently io.low uses a bi_cg_private to stash its private data for the
      blkg, however other blkcg policies may want to use this as well.  Since
      we can get the private data out of the blkg, move this to bi_blkg in the
      bio and make it generic, then we can use bio_associate_blkg() to attach
      the blkg to the bio.
      
      Theoretically we could simply replace the bi_css with this since we can
      get to all the same information from the blkg, however you have to
      lookup the blkg, so for example wbc_init_bio() would have to lookup and
      possibly allocate the blkg for the css it was trying to attach to the
      bio.  This could be problematic and result in us either not attaching
      the css at all to the bio, or falling back to the root blkcg if we are
      unable to allocate the corresponding blkg.
      
      So for now do this, and in the future if possible we could just replace
      the bi_css with bi_blkg and update the helpers to do the correct
      translation.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      08e18eab
    • M
      blk-mq: dequeue request one by one from sw queue if hctx is busy · 6e768717
      Ming Lei 提交于
      It won't be efficient to dequeue request one by one from sw queue,
      but we have to do that when queue is busy for better merge performance.
      
      This patch takes the Exponential Weighted Moving Average(EWMA) to figure
      out if queue is busy, then only dequeue request one by one from sw queue
      when queue is busy.
      
      Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Reported-by: NKashyap Desai <kashyap.desai@broadcom.com>
      Tested-by: NKashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6e768717
    • M
      blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set() · 97889f9a
      Ming Lei 提交于
      We have to remove synchronize_rcu() from blk_queue_cleanup(),
      otherwise long delay can be caused during lun probe. For removing
      it, we have to avoid to iterate the set->tag_list in IO path, eg,
      blk_mq_sched_restart().
      
      This patch reverts 5b79413946d (Revert "blk-mq: don't handle
      TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
      and there isn't any reason to restart all queues in one tags any more,
      see the following reasons:
      
      1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
      which can wake up queues waiting for driver tag.
      
      2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
      target or host is ready, but SCSI built-in restart can cover all these well,
      see scsi_end_request(), queue will be rerun after any request initiated from
      this host/target is completed.
      
      In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
      when running I/O on these 8 luns concurrently.
      
      Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: linux-scsi@vger.kernel.org
      Reported-by: NAndrew Jones <drjones@redhat.com>
      Tested-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      97889f9a
    • M
      blk-mq: introduce new lock for protecting hctx->dispatch_wait · 5815839b
      Ming Lei 提交于
      Now hctx->lock is only acquired when adding hctx->dispatch_wait to
      one wait queue, but not held when removing it from the wait queue.
      
      IO hang can be observed easily if SCHED RESTART is disabled, that means
      now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().
      
      This patch fixes the issue by introducing hctx->dispatch_wait_lock and
      holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
      since we need to avoid acquiring hctx->lock in irq context.
      
      Fixes: eb619fdb ("blk-mq: fix issue with shared tag queue re-running")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5815839b
    • B
      block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n · 6a5ac984
      Bart Van Assche 提交于
      Exclude zoned block device members from struct request_queue for
      CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
      the code that uses these struct request_queue members if
      CONFIG_BLK_DEV_ZONED != n.
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6a5ac984
    • B
      block: Inline blk_queue_nr_zones() · 7c8542b7
      Bart Van Assche 提交于
      Since the implementation of blk_queue_nr_zones() is trivial and since
      it only has a single caller, inline this function.
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7c8542b7
    • B
      block: Remove bdev_nr_zones() · 6b1d83d2
      Bart Van Assche 提交于
      Remove this function since it has no callers. This function was
      introduced in commit 6cc77e9c ("block: introduce zoned block
      devices zone write locking").
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6b1d83d2
  2. 04 7月, 2018 1 次提交
  3. 03 7月, 2018 2 次提交
    • N
      compiler-gcc.h: Add __attribute__((gnu_inline)) to all inline declarations · d03db2bc
      Nick Desaulniers 提交于
      Functions marked extern inline do not emit an externally visible
      function when the gnu89 C standard is used. Some KBUILD Makefiles
      overwrite KBUILD_CFLAGS. This is an issue for GCC 5.1+ users as without
      an explicit C standard specified, the default is gnu11. Since c99, the
      semantics of extern inline have changed such that an externally visible
      function is always emitted. This can lead to multiple definition errors
      of extern inline functions at link time of compilation units whose build
      files have removed an explicit C standard compiler flag for users of GCC
      5.1+ or Clang.
      Suggested-by: NArnd Bergmann <arnd@arndb.de>
      Suggested-by: NH. Peter Anvin <hpa@zytor.com>
      Suggested-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NNick Desaulniers <ndesaulniers@google.com>
      Acked-by: NJuergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@redhat.com
      Cc: akataria@vmware.com
      Cc: akpm@linux-foundation.org
      Cc: andrea.parri@amarulasolutions.com
      Cc: ard.biesheuvel@linaro.org
      Cc: aryabinin@virtuozzo.com
      Cc: astrachan@google.com
      Cc: boris.ostrovsky@oracle.com
      Cc: brijesh.singh@amd.com
      Cc: caoj.fnst@cn.fujitsu.com
      Cc: geert@linux-m68k.org
      Cc: ghackmann@google.com
      Cc: gregkh@linuxfoundation.org
      Cc: jan.kiszka@siemens.com
      Cc: jarkko.sakkinen@linux.intel.com
      Cc: jpoimboe@redhat.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: kstewart@linuxfoundation.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-kbuild@vger.kernel.org
      Cc: manojgupta@google.com
      Cc: mawilcox@microsoft.com
      Cc: michal.lkml@markovi.net
      Cc: mjg59@google.com
      Cc: mka@chromium.org
      Cc: pombredanne@nexb.com
      Cc: rientjes@google.com
      Cc: rostedt@goodmis.org
      Cc: sedat.dilek@gmail.com
      Cc: thomas.lendacky@amd.com
      Cc: tstellar@redhat.com
      Cc: tweek@google.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: will.deacon@arm.com
      Cc: yamada.masahiro@socionext.com
      Link: http://lkml.kernel.org/r/20180621162324.36656-2-ndesaulniers@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d03db2bc
    • P
      kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      Peter Zijlstra 提交于
      Gaurav reports that commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      isn't working for him. Because of the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
      handling is buggered because the TASK_DEAD thing is done with
      preemption disabled, the current code can still complete early on
      preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit. Promote TASK_PARKED to special
      state to avoid the store-store issue on task->state leading to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      But in addition, add wait_task_inactive() to kthread_park() to ensure
      the task really is PARKED when we return from kthread_park(). This
      avoids the whole kthread still gets migrated nonsense -- although it
      would be really good to get this done differently.
      Reported-by: NGaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1cef1150
  4. 02 7月, 2018 1 次提交
    • S
      net: fix use-after-free in GRO with ESP · 603d4cf8
      Sabrina Dubroca 提交于
      Since the addition of GRO for ESP, gro_receive can consume the skb and
      return -EINPROGRESS. In that case, the lower layer GRO handler cannot
      touch the skb anymore.
      
      Commit 5f114163 ("net: Add a skb_gro_flush_final helper.") converted
      some of the gro_receive handlers that can lead to ESP's gro_receive so
      that they wouldn't access the skb when -EINPROGRESS is returned, but
      missed other spots, mainly in tunneling protocols.
      
      This patch finishes the conversion to using skb_gro_flush_final(), and
      adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
      GUE.
      
      Fixes: 5f114163 ("net: Add a skb_gro_flush_final helper.")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      603d4cf8
  5. 30 6月, 2018 1 次提交
    • D
      bpf: undo prog rejection on read-only lock failure · 85782e03
      Daniel Borkmann 提交于
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to the __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in the alloc_pages() with the artifically triggered
      allocation error which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      from debugging a probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping succeeded to get changed. Therefore the subsequent
      hdr->locked = 0 reset triggered the panic as it was performed on read-only
      memory, so call-site assumptions were infact wrong to assume that it would
      either succeed /or/ not succeed at all since there's no such rollback in
      set_memory_*() calls from partial change of mappings, in other words, we're
      left in a state that is "half done". A later undo via set_memory_rw() is
      succeeding though due to matching permissions on that part (aka due to the
      try_preserve_large_page() succeeding). While reproducing locally with
      explicitly triggering this error, the initial splitting only happens on
      rare occasions and in real world it would additionally need oom conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. Shouldn't have gone the extra mile
      since no other user in tree today infact checks for any set_memory_*()
      errors, e.g. neither module_enable_ro() / module_disable_ro() for module
      RO/NX handling which is mostly default these days nor kprobes core with
      alloc_insn_page() / free_insn_page() as examples that could be invoked long
      after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
      against spraying attacks") did neither when it got first introduced to BPF
      so "improving" with bailing out was clearly not right when set_memory_*()
      cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth to move forward in future
      but would at the same time require that set_memory_*() calls from supporting
      archs are guaranteed to be "atomic" in that they provide rollback if part
      of the range fails, once that happened, the transition from RW -> RO could
      be made more robust that way, while subsequent RO -> RW transition /must/
      continue guaranteeing to always succeed the undo part.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      85782e03
  6. 29 6月, 2018 4 次提交
    • J
      sg: remove ->sg_magic member · 9544bc53
      Jens Axboe 提交于
      This was introduced more than a decade ago when sg chaining was
      added, but we never really caught anything with it. The scatterlist
      entry size can be critical, since drivers allocate it, so remove
      the magic member. Recently it's been triggering allocation stalls
      and failures in NVMe.
      Tested-by: NJordan Glover <Golden_Miller83@protonmail.ch>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9544bc53
    • S
      include/linux/dax.h: dax_iomap_fault() returns vm_fault_t · f77bc3a8
      Souptick Joarder 提交于
      Commit 1c8f4220 ("mm: change return type to vm_fault_t") missed a
      conversion.  It's not a big problem at present because mainline is still
      using
      
      	typedef int vm_fault_t;
      
      Fixes: 1c8f4220 ("mm: change return type to vm_fault_t")
      Link: http://lkml.kernel.org/r/20180620172046.GA27894@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f77bc3a8
    • M
      slub: fix failure when we delete and create a slab cache · d50d82fa
      Mikulas Patocka 提交于
      In kernel 4.17 I removed some code from dm-bufio that did slab cache
      merging (commit 21bb1327: "dm bufio: remove code that merges slab
      caches") - both slab and slub support merging caches with identical
      attributes, so dm-bufio now just calls kmem_cache_create and relies on
      implicit merging.
      
      This uncovered a bug in the slub subsystem - if we delete a cache and
      immediatelly create another cache with the same attributes, it fails
      because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
      offloads freeing the cache to a workqueue - and if we create the new
      cache before the workqueue runs, it complains because of duplicate
      filename in sysfs.
      
      This patch fixes the bug by moving the call of kobject_del from
      sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
      while we hold slab_mutex - so that the sysfs entry is deleted before a
      cache with the same attributes could be created.
      
      Running device-mapper-test-suite with:
      
        dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
      
      triggered:
      
        Buffer I/O error on dev dm-0, logical block 1572848, async page read
        device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
        device-mapper: thin: 253:1: aborting current metadata transaction
        sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
        CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
        Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
        Workqueue: dm-thin do_worker [dm_thin_pool]
        Call Trace:
         dump_stack+0x5a/0x73
         sysfs_warn_dup+0x58/0x70
         sysfs_create_dir_ns+0x77/0x80
         kobject_add_internal+0xba/0x2e0
         kobject_init_and_add+0x70/0xb0
         sysfs_slab_add+0xb1/0x250
         __kmem_cache_create+0x116/0x150
         create_cache+0xd9/0x1f0
         kmem_cache_create_usercopy+0x1c1/0x250
         kmem_cache_create+0x18/0x20
         dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
         dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
         __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
         dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
         metadata_operation_failed+0x59/0x100 [dm_thin_pool]
         alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
         process_cell+0x2a3/0x550 [dm_thin_pool]
         do_worker+0x28d/0x8f0 [dm_thin_pool]
         process_one_work+0x171/0x370
         worker_thread+0x49/0x3f0
         kthread+0xf8/0x130
         ret_from_fork+0x35/0x40
        kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
        kmem_cache_create(dm_bufio_buffer-16) failed with error -17
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.comSigned-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reported-by: NMike Snitzer <snitzer@redhat.com>
      Tested-by: NMike Snitzer <snitzer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d50d82fa
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  7. 27 6月, 2018 2 次提交
  8. 26 6月, 2018 1 次提交
    • S
      bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF · fdb5c453
      Sean Young 提交于
      If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
      possible to attach, detach or query IR BPF programs to /dev/lircN devices,
      making them impossible to use. For embedded devices, it should be possible
      to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
      
      This change requires some refactoring, since bpf_prog_{attach,detach,query}
      functions are now always compiled, but their code paths for cgroups need
      moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
      moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
      require #ifdefs since that is already conditionally compiled.
      
      Fixes: f4364dcf ("media: rc: introduce BPF_PROG_LIRC_MODE2")
      Signed-off-by: NSean Young <sean@mess.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      fdb5c453
  9. 25 6月, 2018 5 次提交
    • A
      disable -Wattribute-alias warning for SYSCALL_DEFINEx() · bee20031
      Arnd Bergmann 提交于
      gcc-8 warns for every single definition of a system call entry
      point, e.g.:
      
      include/linux/compat.h:56:18: error: 'compat_sys_rt_sigprocmask' alias between functions of incompatible types 'long int(int,  compat_sigset_t *, compat_sigset_t *, compat_size_t)' {aka 'long int(int,  struct <anonymous> *, struct <anonymous> *, unsigned int)'} and 'long int(long int,  long int,  long int,  long int)' [-Werror=attribute-alias]
        asmlinkage long compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))\
                        ^~~~~~~~~~
      include/linux/compat.h:45:2: note: in expansion of macro 'COMPAT_SYSCALL_DEFINEx'
        COMPAT_SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
        ^~~~~~~~~~~~~~~~~~~~~~
      kernel/signal.c:2601:1: note: in expansion of macro 'COMPAT_SYSCALL_DEFINE4'
       COMPAT_SYSCALL_DEFINE4(rt_sigprocmask, int, how, compat_sigset_t __user *, nset,
       ^~~~~~~~~~~~~~~~~~~~~~
      include/linux/compat.h:60:18: note: aliased declaration here
        asmlinkage long compat_SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))\
                        ^~~~~~~~~~
      
      The new warning seems reasonable in principle, but it doesn't
      help us here, since we rely on the type mismatch to sanitize the
      system call arguments. After I reported this as GCC PR82435, a new
      -Wno-attribute-alias option was added that could be used to turn the
      warning off globally on the command line, but I'd prefer to do it a
      little more fine-grained.
      
      Interestingly, turning a warning off and on again inside of
      a single macro doesn't always work, in this case I had to add
      an extra statement inbetween and decided to copy the __SC_TEST
      one from the native syscall to the compat syscall macro.  See
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83256 for more details
      about this.
      
      [paul.burton@mips.com:
        - Rebase atop current master.
        - Split GCC & version arguments to __diag_ignore() in order to match
          changes to the preceding patch.
        - Add the comment argument to match the preceding patch.]
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82435Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Burton <paul.burton@mips.com>
      Tested-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: NStafford Horne <shorne@gmail.com>
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      bee20031
    • A
      kbuild: add macro for controlling warnings to linux/compiler.h · 8793bb7f
      Arnd Bergmann 提交于
      I have occasionally run into a situation where it would make sense to
      control a compiler warning from a source file rather than doing so from
      a Makefile using the $(cc-disable-warning, ...) or $(cc-option, ...)
      helpers.
      
      The approach here is similar to what glibc uses, using __diag() and
      related macros to encapsulate a _Pragma("GCC diagnostic ...") statement
      that gets turned into the respective "#pragma GCC diagnostic ..." by
      the preprocessor when the macro gets expanded.
      
      Like glibc, I also have an argument to pass the affected compiler
      version, but decided to actually evaluate that one. For now, this
      supports GCC_4_6, GCC_4_7, GCC_4_8, GCC_4_9, GCC_5, GCC_6, GCC_7,
      GCC_8 and GCC_9. Adding support for CLANG_5 and other interesting
      versions is straightforward here. GNU compilers starting with gcc-4.2
      could support it in principle, but "#pragma GCC diagnostic push"
      was only added in gcc-4.6, so it seems simpler to not deal with those
      at all. The same versions show a large number of warnings already,
      so it seems easier to just leave it at that and not do a more
      fine-grained control for them.
      
      The use cases I found so far include:
      
      - turning off the gcc-8 -Wattribute-alias warning inside of the
        SYSCALL_DEFINEx() macro without having to do it globally.
      
      - Reducing the build time for a simple re-make after a change,
        once we move the warnings from ./Makefile and
        ./scripts/Makefile.extrawarn into linux/compiler.h
      
      - More control over the warnings based on other configurations,
        using preprocessor syntax instead of Makefile syntax. This should make
        it easier for the average developer to understand and change things.
      
      - Adding an easy way to turn the W=1 option on unconditionally
        for a subdirectory or a specific file. This has been requested
        by several developers in the past that want to have their subsystems
        W=1 clean.
      
      - Integrating clang better into the build systems. Clang supports
        more warnings than GCC, and we probably want to classify them
        as default, W=1, W=2 etc, but there are cases in which the
        warnings should be classified differently due to excessive false
        positives from one or the other compiler.
      
      - Adding a way to turn the default warnings into errors (e.g. using
        a new "make E=0" tag) while not also turning the W=1 warnings into
        errors.
      
      This patch for now just adds the minimal infrastructure in order to
      do the first of the list above. As the #pragma GCC diagnostic
      takes precedence over command line options, the next step would be
      to convert a lot of the individual Makefiles that set nonstandard
      options to use __diag() instead.
      
      [paul.burton@mips.com:
        - Rebase atop current master.
        - Add __diag_GCC, or more generally __diag_<compiler>, abstraction to
          avoid code outside of linux/compiler-gcc.h needing to duplicate
          knowledge about different GCC versions.
        - Add a comment argument to __diag_{ignore,warn,error} which isn't
          used in the expansion of the macros but serves to push people to
          document the reason for using them - per feedback from Kees Cook.
        - Translate severity to GCC-specific pragmas in linux/compiler-gcc.h
          rather than using GCC-specific in linux/compiler_types.h.
        - Drop all but GCC 8 macros, since we only need to define macros for
          versions that we need to introduce pragmas for, and as of this
          series that's just GCC 8.
        - Capitalize comments in linux/compiler-gcc.h to match the style of
          the rest of the file.
        - Line up macro definitions with tabs in linux/compiler-gcc.h.]
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Burton <paul.burton@mips.com>
      Tested-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: NStafford Horne <shorne@gmail.com>
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      8793bb7f
    • H
      acpi: Add helper for deactivating memory region · d2d2e3c4
      Heikki Krogerus 提交于
      Sometimes memory resource may be overlapping with
      SystemMemory Operation Region by design, for example if the
      memory region is used as a mailbox for communication with a
      firmware in the system. One occasion of such mailboxes is
      USB Type-C Connector System Software Interface (UCSI).
      
      With regions like that, it is important that the driver is
      able to map the memory with the requirements it has. For
      example, the driver should be allowed to map the memory as
      non-cached memory. However, if the operation region has been
      accessed before the driver has mapped the memory, the memory
      has been marked as write-back by the time the driver is
      loaded. That means the driver will fail to map the memory
      if it expects non-cached memory.
      
      To work around the problem, introducing helper that the
      drivers can use to temporarily deactivate (unmap)
      SystemMemory Operation Regions that overlap with their
      IO memory.
      
      Fixes: 8243edf4 ("usb: typec: ucsi: Add ACPI driver")
      Cc: stable@vger.kernel.org
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2d2e3c4
    • V
      PM / Domains: Rename opp_node to np · ad6384ba
      Viresh Kumar 提交于
      The DT node passed here isn't necessarily an OPP node, as this routine
      can also be used for cases where the "required-opps" property is present
      directly in the device's node. Rename it.
      
      This also removes a stale comment.
      Acked-by: NUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      ad6384ba
    • V
      PM / Domains: Fix return value of of_genpd_opp_to_performance_state() · 5e03aa61
      Viresh Kumar 提交于
      of_genpd_opp_to_performance_state() should return 0 for errors, but the
      dummy routine isn't doing that. Fix it.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Acked-by: NUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5e03aa61
  10. 24 6月, 2018 1 次提交
  11. 23 6月, 2018 2 次提交
    • J
      bdi: Fix another oops in wb_workfn() · 3ee7e869
      Jan Kara 提交于
      syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
      wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
      WB_shutting_down after wb->bdi->dev became NULL. This indicates that
      unregister_bdi() failed to call wb_shutdown() on one of wb objects.
      
      The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
      drops bdi's reference to wb structures before going through the list of
      wbs again and calling wb_shutdown() on each of them. This way the loop
      iterating through all wbs can easily miss a wb if that wb has already
      passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
      from cgwb_release_workfn() and as a result fully shutdown bdi although
      wb_workfn() for this wb structure is still running. In fact there are
      also other ways cgwb_bdi_unregister() can race with
      cgwb_release_workfn() leading e.g. to use-after-free issues:
      
      CPU1                            CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);
      
      cgwb_release()
        queue_work(cgwb_release_wq, &wb->release_work);
      cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
        wb_shutdown(wb);
        ...
        kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free
      
      We solve these issues by synchronizing writeback structure shutdown from
      cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
      way we also no longer need synchronization using WB_shutting_down as the
      mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
      CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
      bdi_unregister().
      Reported-by: Nsyzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3ee7e869
    • W
      rseq: Avoid infinite recursion when delivering SIGSEGV · 784e0300
      Will Deacon 提交于
      When delivering a signal to a task that is using rseq, we call into
      __rseq_handle_notify_resume() so that the registers pushed in the
      sigframe are updated to reflect the state of the restartable sequence
      (for example, ensuring that the signal returns to the abort handler if
      necessary).
      
      However, if the rseq management fails due to an unrecoverable fault when
      accessing userspace or certain combinations of RSEQ_CS_* flags, then we
      will attempt to deliver a SIGSEGV. This has the potential for infinite
      recursion if the rseq code continuously fails on signal delivery.
      
      Avoid this problem by using force_sigsegv() instead of force_sig(), which
      is explicitly designed to reset the SEGV handler to SIG_DFL in the case
      of a recursive fault. In doing so, remove rseq_signal_deliver() from the
      internal rseq API and have an optional struct ksignal * parameter to
      rseq_handle_notify_resume() instead.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: peterz@infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Cc: boqun.feng@gmail.com
      Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
      784e0300
  12. 22 6月, 2018 2 次提交
  13. 21 6月, 2018 4 次提交
    • W
      kernel.h: Fix a typo in comment · 8730662d
      Wei Wang 提交于
      Signed-off-by: NWei Wang <wvw@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Crt Mori <cmo@melexis.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: gregkh@linuxfoundation.org
      Cc: wei.vince.wang@gmail.com
      Link: https://lkml.kernel.org/lkml/20180424212241.16013-1-wvw@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8730662d
    • M
      x86/platform/UV: Add adjustable set memory block size function · f642fb58
      mike.travis@hpe.com 提交于
      Add a new function to "adjust" the current fixed UV memory block size
      of 2GB so it can be changed to a different physical boundary.  This is
      out of necessity so arch dependent code can accommodate specific BIOS
      requirements which can align these new PMEM modules at less than the
      default boundaries.
      
      A "set order" type of function was used to insure that the memory block
      size will be a power of two value without requiring a validity check.
      64GB was chosen as the upper limit for memory block size values to
      accommodate upcoming 4PB systems which have 6 more bits of physical
      address space (46 becoming 52).
      Signed-off-by: NMike Travis <mike.travis@hpe.com>
      Reviewed-by: NAndrew Banman <andrew.banman@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russ Anderson <russ.anderson@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dan.j.williams@intel.com
      Cc: jgross@suse.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: mhocko@suse.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/lkml/20180524201711.609546602@stormcage.americas.sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f642fb58
    • M
      rseq/cleanup: Do not abort rseq c.s. in child on fork() · 9a789fcf
      Mathieu Desnoyers 提交于
      Considering that we explicitly forbid system calls in rseq critical
      sections, it is not valid to issue a fork or clone system call within a
      rseq critical section, so rseq_fork() is not required to restart an
      active rseq c.s. in the child process.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Link: https://lore.kernel.org/lkml/20180619133230.4087-4-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9a789fcf
    • E
      bpf: enforce correct alignment for instructions · 92624782
      Eric Dumazet 提交于
      After commit 9facc336 ("bpf: reject any prog that failed read-only lock")
      offsetof(struct bpf_binary_header, image) became 3 instead of 4,
      breaking powerpc BPF badly, since instructions need to be word aligned.
      
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92624782
  14. 19 6月, 2018 1 次提交
  15. 17 6月, 2018 1 次提交