1. 10 Nov, 2019 1 commit
    • net: fix sk_page_frag() recursion from memory reclaim · 1d5cb12a
      Tejun Heo committed
      [ Upstream commit 20eb4f29b60286e0d6dc01d9c260b4bd383c58fb ]
      
      sk_page_frag() optimizes skb_frag allocations by using per-task
      skb_frag cache when it knows it's the only user.  The condition is
      determined by seeing whether the socket allocation mask allows
      blocking - if the allocation may block, it obviously owns the task's
      context and ergo exclusively owns current->task_frag.
      
      Unfortunately, this misses recursion through memory reclaim path.
      Please take a look at the following backtrace.
      
       [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
           ...
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           sock_xmit.isra.24+0xa1/0x170 [nbd]
           nbd_send_cmd+0x1d2/0x690 [nbd]
           nbd_queue_rq+0x1b5/0x3b0 [nbd]
           __blk_mq_try_issue_directly+0x108/0x1b0
           blk_mq_request_issue_directly+0xbd/0xe0
           blk_mq_try_issue_list_directly+0x41/0xb0
           blk_mq_sched_insert_requests+0xa2/0xe0
           blk_mq_flush_plug_list+0x205/0x2a0
           blk_flush_plug_list+0xc3/0xf0
       [1] blk_finish_plug+0x21/0x2e
           _xfs_buf_ioapply+0x313/0x460
           __xfs_buf_submit+0x67/0x220
           xfs_buf_read_map+0x113/0x1a0
           xfs_trans_read_buf_map+0xbf/0x330
           xfs_btree_read_buf_block.constprop.42+0x95/0xd0
           xfs_btree_lookup_get_block+0x95/0x170
           xfs_btree_lookup+0xcc/0x470
           xfs_bmap_del_extent_real+0x254/0x9a0
           __xfs_bunmapi+0x45c/0xab0
           xfs_bunmapi+0x15/0x30
           xfs_itruncate_extents_flags+0xca/0x250
           xfs_free_eofblocks+0x181/0x1e0
           xfs_fs_destroy_inode+0xa8/0x1b0
           destroy_inode+0x38/0x70
           dispose_list+0x35/0x50
           prune_icache_sb+0x52/0x70
           super_cache_scan+0x120/0x1a0
           do_shrink_slab+0x120/0x290
           shrink_slab+0x216/0x2b0
           shrink_node+0x1b6/0x4a0
           do_try_to_free_pages+0xc6/0x370
           try_to_free_mem_cgroup_pages+0xe3/0x1e0
           try_charge+0x29e/0x790
           mem_cgroup_charge_skmem+0x6a/0x100
           __sk_mem_raise_allocated+0x18e/0x390
           __sk_mem_schedule+0x2a/0x40
       [0] tcp_sendmsg_locked+0x8eb/0xe10
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           ___sys_sendmsg+0x26d/0x2b0
           __sys_sendmsg+0x57/0xa0
           do_syscall_64+0x42/0x100
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      In [0], tcp_sendmsg_locked() was using current->task_frag when it
      called sk_wmem_schedule().  It had already calculated how many bytes
      fit into current->task_frag.  Due to memory pressure,
      sk_wmem_schedule() called into memory reclaim path which called into
      xfs and then IO issue path.  Because the filesystem in question is
      backed by nbd, the control goes back into the tcp layer - back into
      tcp_sendmsg_locked().
      
      nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
      sense - it's in the process of freeing memory and wants to be able to,
      e.g., drop clean pages to make forward progress.  However, this
      confused sk_page_frag() called from [2].  Because it only tests
      whether the allocation allows blocking, which it does, it now thinks
      current->task_frag can be used again although it was already being
      used in [0].
      
      After [2] used current->task_frag, the offset would be increased by
      the used amount.  When control returns to [0],
      current->task_frag's offset has increased and the previously calculated
      number of bytes may now overrun the end of the allocated memory, leading
      to silent memory corruption.
      
      Fix it by adding gfpflags_normal_context() which tests sleepable &&
      !reclaim and use it to determine whether to use current->task_frag.
      
      v2: Eric didn't like gfp flags being tested twice.  Introduce a new
          helper gfpflags_normal_context() and combine the two tests.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
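The "sleepable && !reclaim" test described in the commit above can be modelled in userspace. This is only a sketch: the `MODEL_GFP_*` bit values and the `*_model` function names are hypothetical stand-ins, not the kernel's real gfp bit assignments, and the atomic mask is simplified to "no direct reclaim".

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; real gfp bits differ and vary by kernel version. */
typedef unsigned int gfp_model_t;
#define MODEL_GFP_IO             0x01u
#define MODEL_GFP_FS             0x02u
#define MODEL_GFP_DIRECT_RECLAIM 0x04u /* caller may sleep in direct reclaim */
#define MODEL_GFP_KSWAPD_RECLAIM 0x08u
#define MODEL_GFP_MEMALLOC       0x10u /* caller is itself part of reclaim */

#define MODEL_GFP_NOIO   (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_KERNEL (MODEL_GFP_NOIO | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_ATOMIC (MODEL_GFP_KSWAPD_RECLAIM) /* simplified: cannot block */

/* Old test: only asks whether the allocation may block. */
static bool allows_blocking_model(gfp_model_t flags)
{
	return (flags & MODEL_GFP_DIRECT_RECLAIM) != 0;
}

/* New test: may block AND is not running on behalf of memory reclaim. */
static bool gfpflags_normal_context_model(gfp_model_t flags)
{
	return (flags & (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_MEMALLOC)) ==
	       MODEL_GFP_DIRECT_RECLAIM;
}
```

With nbd's mask (the NOIO-style mask plus the memalloc bit), the old test reports "may block" and would pick the per-task frag, while the new test rejects it, which is the case the backtrace shows.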
  2. 18 Oct, 2019 2 commits
  3. 12 Oct, 2019 3 commits
  4. 05 Oct, 2019 4 commits
    • quota: fix wrong condition in is_quota_modification() · 06098609
      Chao Yu committed
      commit 6565c182094f69e4ffdece337d395eb7ec760efc upstream.
      
      Quoted from
      commit 3da40c7b ("ext4: only call ext4_truncate when size <= isize")
      
      " At LSF we decided that if we truncate up from isize we shouldn't trim
        fallocated blocks that were fallocated with KEEP_SIZE and are past the
       new i_size.  This patch fixes ext4 to do this. "
      
      generic/092 of fstests has covered this case for a long time; however,
      is_quota_modification() was never adjusted to follow that rule, so in
      the following scenario we lose the quota block change:
      - fallocate blocks beyond EOF
      - remount
      - truncate(file_path, file_size)
      
      Fix it.
      
      Link: https://lore.kernel.org/r/20190911093650.35329-1-yuchao0@huawei.com
      Fixes: 3da40c7b ("ext4: only call ext4_truncate when size <= isize")
      CC: stable@vger.kernel.org
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
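The scenario listed in the commit above (fallocate beyond EOF, remount, truncate to the current size) can be illustrated with a small model. It assumes the old condition ignored a size request whose value equals the current i_size and that the fix treats any size request as a quota modification; the struct layouts, `MODEL_ATTR_*` values and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_ATTR_UID  0x2u
#define MODEL_ATTR_GID  0x4u
#define MODEL_ATTR_SIZE 0x8u

struct model_inode { long long i_size; unsigned int uid, gid; };
struct model_iattr { unsigned int ia_valid; long long ia_size; unsigned int uid, gid; };

/* Assumed old behaviour: a size request counts only when the size differs. */
static bool quota_mod_old(const struct model_inode *in, const struct model_iattr *ia)
{
	return ((ia->ia_valid & MODEL_ATTR_SIZE) && ia->ia_size != in->i_size) ||
	       ((ia->ia_valid & MODEL_ATTR_UID) && ia->uid != in->uid) ||
	       ((ia->ia_valid & MODEL_ATTR_GID) && ia->gid != in->gid);
}

/* Assumed fixed behaviour: any size request may free blocks past EOF. */
static bool quota_mod_new(const struct model_inode *in, const struct model_iattr *ia)
{
	return (ia->ia_valid & MODEL_ATTR_SIZE) ||
	       ((ia->ia_valid & MODEL_ATTR_UID) && ia->uid != in->uid) ||
	       ((ia->ia_valid & MODEL_ATTR_GID) && ia->gid != in->gid);
}
```

A truncate to the unchanged file size is exactly the call where the two predicates disagree.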
    • blk-mq: add callback of .cleanup_rq · 4ec3ca27
      Ming Lei committed
      [ Upstream commit 226b4fc75c78f9c497c5182d939101b260cfb9f3 ]
      
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mmc: core: Add helper function to indicate if SDIO IRQs is enabled · a0dd3d95
      Ulf Hansson committed
      [ Upstream commit bd880b00697befb73eff7220ee20bdae4fdd487b ]
      
      To spare each host driver that supports SDIO IRQs from tracking
      internally whether SDIO IRQs have been claimed, let's introduce a common
      helper function, sdio_irq_claimed().
      
      The function returns true if SDIO IRQs are claimed, based on the
      number of claimed IRQs. This is safe, even without
      any locks, as long as the helper function is called only from
      runtime/system suspend callbacks of the host driver.
      Tested-by: Matthias Kaehlcke <mka@chromium.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Reviewed-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
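The helper described above boils down to a test on a count of claimed IRQs. A minimal sketch, assuming a hypothetical `model_mmc_host` structure with a `sdio_irqs` counter (the field name is an assumption for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the host structure; only the counter matters here. */
struct model_mmc_host {
	unsigned int sdio_irqs; /* number of currently claimed SDIO IRQs */
};

/* True when at least one SDIO IRQ is claimed. No locking, as the commit
 * says this is only safe from the host driver's suspend callbacks. */
static bool sdio_irq_claimed_model(const struct model_mmc_host *host)
{
	return host->sdio_irqs > 0;
}
```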
    • kprobes: Prohibit probing on BUG() and WARN() address · fad90d4b
      Masami Hiramatsu committed
      [ Upstream commit e336b4027775cb458dc713745e526fa1a1996b2a ]
      
      Since BUG() and WARN() may use a trap (e.g. UD2 on x86) to
      get the address where the BUG() has occurred, kprobes cannot
      single-step that instruction out of line. So prohibit
      probing on such addresses.
      
      Without this fix, if someone puts a kprobe on WARN(), the
      kernel will crash with an invalid opcode error instead of
      outputting a warning message, because the kernel cannot find
      the correct bug address.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N . Rao <naveen.n.rao@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/156750890133.19112.3393666300746167111.stgit@devnote2
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 16 Sep, 2019 2 commits
    • gpio: don't WARN() on NULL descs if gpiolib is disabled · c9c90711
      Bartosz Golaszewski committed
      [ Upstream commit ffe0bbabb0cffceceae07484fde1ec2a63b1537c ]
      
      If gpiolib is disabled, we use the inline stubs from gpio/consumer.h
      instead of regular definitions of GPIO API. The stubs for 'optional'
      variants of gpiod_get routines return NULL in this case as if the
      relevant GPIO wasn't found. This is correct so far.
      
      Calling other (non-gpio_get) stubs from this header triggers a warning
      because the GPIO descriptor couldn't have been requested. The warning
      however is unconditional (WARN_ON(1)) and is emitted even if the passed
      descriptor pointer is NULL.
      
      We don't want to force the users of 'optional' gpio_get to check the
      returned pointer before calling e.g. gpiod_set_value() so let's only
      WARN on non-NULL descriptors.
      
      Cc: stable@vger.kernel.org
      Reported-by: Claus H. Stovgaard <cst@phaseone.com>
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
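The change described above, warning only when a non-NULL descriptor reaches a stub, can be sketched with a counting stand-in for the warning macro. `MODEL_WARN_ON`, `model_warn_count` and the stub name are hypothetical; the real stubs live in gpio/consumer.h.

```c
#include <assert.h>
#include <stddef.h>

struct model_gpio_desc { int unused; };

static int model_warn_count; /* counts how often the model "warned" */

#define MODEL_WARN_ON(cond) do { if (cond) model_warn_count++; } while (0)

/* Stub used when gpiolib is disabled: it cannot drive anything. A NULL
 * descriptor is what the 'optional' getters return, so it stays silent;
 * a non-NULL descriptor could never have been requested, so it warns. */
static void gpiod_set_value_stub_model(struct model_gpio_desc *desc, int value)
{
	(void)value;
	MODEL_WARN_ON(desc != NULL);
}
```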
    • dm mpath: fix missing call of path selector type->end_io · 69409854
      Yufen Yu committed
      [ Upstream commit 5de719e3d01b4abe0de0d7b857148a880ff2a90b ]
      
      After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
      blk_insert_cloned_request feedback"), map_request() will requeue the tio
      when the issued clone request returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
      
      Thus, if the device driver status is an error, a tio may be requeued multiple
      times until the return value is not DM_MAPIO_REQUEUE.  That means
      type->start_io may be called multiple times, while type->end_io is only
      called when the IO completes.
      
      In fact, even without commit 396eaf21, setup_clone() failure can
      also cause tio requeue and associated missed call to type->end_io.
      
      The service-time path selector selects path based on in_flight_size,
      which is increased by st_start_io() and decreased by st_end_io().
      Missed calls to st_end_io() can lead to in_flight_size count error and
      will cause the selector to make the wrong choice.  In addition,
      queue-length path selector will also be affected.
      
      To fix the problem, call type->end_io in ->release_clone_rq before the tio
      is requeued.  map_info is passed to ->release_clone_rq() for the map_request()
      error paths that result in a requeue.
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Cc: stable@vger.kernl.org
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
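The accounting drift described above can be reproduced with a toy counter. The sketch below assumes one start_io call per dispatch attempt and one end_io call per completion; the `model_*` names and the `end_io_on_requeue` switch are hypothetical and only contrast the behaviour before and after the fix.

```c
#include <assert.h>
#include <stdbool.h>

struct model_path { long in_flight_size; };

static void st_start_io_model(struct model_path *p, long nr_bytes)
{
	p->in_flight_size += nr_bytes;
}

static void st_end_io_model(struct model_path *p, long nr_bytes)
{
	p->in_flight_size -= nr_bytes;
}

/* Dispatch one request that is requeued `requeues` times before it finally
 * completes. Returns the in-flight size left on the path afterwards. */
static long model_dispatch(struct model_path *p, long nr_bytes, int requeues,
			   bool end_io_on_requeue)
{
	for (int i = 0; i < requeues; i++) {
		st_start_io_model(p, nr_bytes);       /* attempt is accounted */
		if (end_io_on_requeue)
			st_end_io_model(p, nr_bytes); /* fixed: undo before requeue */
	}
	st_start_io_model(p, nr_bytes); /* final, successful attempt */
	st_end_io_model(p, nr_bytes);   /* completion */
	return p->in_flight_size;
}
```

Without the end_io call on requeue, every retry leaves stale bytes on the path, so a selector that prefers the smallest in-flight size would keep avoiding a path that is actually idle.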
  6. 10 Sep, 2019 2 commits
  7. 06 Sep, 2019 3 commits
  8. 16 Aug, 2019 2 commits
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 2bc73d91
      Wanpeng Li committed
      commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream.
      
      After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
      on one 80-pCPU Skylake server produces many rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() stays true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that the condition kvm_arch_vcpu_runnable(vcpu) is checked and can be
      true. When APICv is enabled, the yield-candidate vCPU's VMCS RVI field
      leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the vCPU
      that is spinning on a taken lock.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: ccp - Add support for valid authsize values less than 16 · 30692ede
      Gary R Hook committed
      commit 9f00baf74e4b6f79a3a3dfab44fb7bb2e797b551 upstream.
      
      AES GCM encryption allows for authsize values of 4, 8, and 12-16 bytes.
      Validate the requested authsize, and retain it to save in the request
      context.
      
      Fixes: 36cf515b ("crypto: ccp - Enable support for AES GCM on v5 CCPs")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Gary R Hook <gary.hook@amd.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
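The rule stated above (authsize may be 4, 8, or 12 through 16 bytes) translates directly into a validation routine. A minimal sketch with a hypothetical function name, returning 0 for an accepted size and -EINVAL otherwise:

```c
#include <assert.h>
#include <errno.h>

/* Accepts only the tag lengths the commit lists for AES GCM. */
static int gcm_check_authsize_model(unsigned int authsize)
{
	switch (authsize) {
	case 16:
	case 15:
	case 14:
	case 13:
	case 12:
	case 8:
	case 4:
		return 0;
	default:
		return -EINVAL;
	}
}
```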
  9. 09 Aug, 2019 6 commits
    • cgroup: Include dying leaders with live threads in PROCS iterations · 4340d175
      Tejun Heo committed
      commit c03cd7738a83b13739f00546166969342c8ff014 upstream.
      
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cgroup: Implement css_task_iter_skip() · 370b9e63
      Tejun Heo committed
      commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.
      
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates out skipping from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      it and the following advancing will recognize that the cursor has
      already been moved forward and do the rest of advancing.  This ensures
      that when a task moves from one list to another in its cset, as long
      as it moves in the right direction, it's always visible to iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · e6e9bcef
      Arnd Bergmann committed
      [ Upstream commit 055d88242a6046a1ceac3167290f054c72571cd9 ]
      
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but was always broken
      since only the structure is compatible, but the command number is not,
      due to the size being sizeof(size_t), or at first sizeof(sizeof(struct
      sockaddr_pppox)), which is different on 64-bit architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
      Acked-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
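Why the command number differs between 32-bit and 64-bit callers can be seen by encoding it by hand. The sketch below assumes the common asm-generic ioctl layout (direction in bits 30-31, size in bits 16-29, type in bits 8-15, number in bits 0-7) and assumes PPPOEIOCSFWD is built as a write ioctl with type 0xB1 and number 0 whose size field is sizeof(size_t); the helper names are hypothetical and the translation function only mirrors the idea of the fix.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_IOC_WRITE 1u

/* Encode an ioctl command number using the assumed asm-generic layout. */
static uint32_t model_ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
	return (dir << 30) | (size << 16) | (type << 8) | nr;
}

/* The command number as seen by a caller whose size_t is `sizeof_size_t` bytes. */
static uint32_t pppoeiocsfwd_for(uint32_t sizeof_size_t)
{
	return model_ioc(MODEL_IOC_WRITE, 0xB1, 0, sizeof_size_t);
}

/* Compat handling in miniature: map the 32-bit number to the native one,
 * pass every other command through unchanged. */
static uint32_t model_compat_translate(uint32_t cmd)
{
	if (cmd == pppoeiocsfwd_for(4))
		return pppoeiocsfwd_for(8);
	return cmd;
}
```

Only the size field changes between the two encodings, which is why the payload layout can be compatible while the command number is not.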
    • net/mlx5e: Prevent encap flow counter update async to user query · 0ccf4726
      Ariel Levkovich committed
      [ Upstream commit 90bb769291161cf25a818d69cf608c181654473e ]
      
      This patch prevents a race between user invoked cached counters
      query and a neighbor last usage updater.
      
      The cached flow counter stats can be queried by calling
      "mlx5_fc_query_cached" which provides the number of bytes and
      packets that passed via this flow since the last time this counter
      was queried.
      It does so by subtracting the last saved stats from the current, cached
      stats and then updating the last saved stats with the cached stats.
      It also provides the lastuse value for that flow.
      
      Since "mlx5e_tc_update_neigh_used_value" needs to retrieve the
      last usage time of encapsulation flows, it calls the flow counter
      query method periodically and async to user queries of the flow counter
      using cls_flower.
      This call is causing the driver to update the last reported bytes and
      packets from the cache and therefore, future user queries of the flow
      stats will return lower than expected number for bytes and packets
      since the last saved stats in the driver was updated async to the last
      saved stats in cls_flower.
      
      This causes wrong stats presentation of encapsulation flows to user.
      
      Since the neighbor usage updater only needs the lastuse stats from the
      cached counter, the fix is to use a dedicated lastuse query call that
      returns the lastuse value without synching between the cached stats and
      the last saved stats.
      
      Fixes: f6dfb4c3 ("net/mlx5e: Update neighbour 'used' state using HW flow rules counters")
      Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
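The race described above comes from a query that has a side effect. The following sketch models a cached counter with a "delta since last query" call and a read-only lastuse call; the structure and the `model_fc_*` names are hypothetical and only illustrate why the neighbour updater should use the side-effect-free variant.

```c
#include <assert.h>
#include <stdint.h>

struct model_fc {
	uint64_t cache_bytes, cache_packets, cache_lastuse; /* current cached stats */
	uint64_t lastbytes, lastpackets;                    /* stats at the last query */
};

/* Returns the delta since the previous call and advances the saved stats. */
static void model_fc_query_cached(struct model_fc *c, uint64_t *bytes,
				  uint64_t *packets, uint64_t *lastuse)
{
	*bytes = c->cache_bytes - c->lastbytes;
	*packets = c->cache_packets - c->lastpackets;
	*lastuse = c->cache_lastuse;
	c->lastbytes = c->cache_bytes;
	c->lastpackets = c->cache_packets;
}

/* Read-only: reports lastuse without touching the saved stats. */
static uint64_t model_fc_query_lastuse(const struct model_fc *c)
{
	return c->cache_lastuse;
}
```

If a periodic caller uses the first function, it consumes the delta that the next user query was supposed to see; with the second function the user's delta stays intact.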
    • net/mlx5: Fix modify_cq_in alignment · cd84a107
      Edward Srouji committed
      [ Upstream commit 7a32f2962c56d9d8a836b4469855caeee8766bd4 ]
      
      Fix modify_cq_in alignment to match the device specification.
      After this fix the 'cq_umem_valid' field will be in the right offset.
      
      Cc: <stable@vger.kernel.org> # 4.19
      Fixes: bd37197554eb ("net/mlx5: Update mlx5_ifc with DEVX UID bits")
      Signed-off-by: Edward Srouji <edwards@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drivers/base: Introduce kill_device() · c23106d4
      Dan Williams committed
      commit 00289cd87676e14913d2d8492d1ce05c4baafdae upstream.
      
      The libnvdimm subsystem arranges for devices to be destroyed as a result
      of a sysfs operation. Since device_unregister() cannot be called from
      an actively running sysfs attribute of the same device libnvdimm
      arranges for device_unregister() to be performed in an out-of-line async
      context.
      
      The driver core maintains a 'dead' state for coordinating its own racing
      async registration / de-registration requests. Rather than add local
      'dead' state tracking infrastructure to libnvdimm device objects, export
      the existing state tracking via a new kill_device() helper.
      
      The kill_device() helper simply marks the device as dead, i.e. that it
      is on its way to device_del(), or returns that the device was already
      dead. This can be used in advance of calling device_unregister() for
      subsystems like libnvdimm that might need to handle multiple user
      threads racing to delete a device.
      
      This refactoring does not change any behavior, but it is a pre-requisite
      for follow-on fixes and therefore marked for -stable.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Fixes: 4d88a97a ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
      Cc: <stable@vger.kernel.org>
      Tested-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/156341207332.292348.14959761496009347574.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
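The "mark dead, or report that it was already dead" contract described above can be sketched as a one-shot flag. The model assumes the return value is true only for the caller that performed the transition, and it omits the device lock that a real implementation would require; `model_device` and `kill_device_model` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

struct model_device { bool dead; };

/* Returns true if this call marked the device dead, false if it already was.
 * Locking is omitted in this single-threaded sketch. */
static bool kill_device_model(struct model_device *dev)
{
	if (dev->dead)
		return false;
	dev->dead = true;
	return true;
}
```

With racing deleters, only the caller that gets true would go on to unregister the device.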
  10. 07 Aug, 2019 3 commits
  11. 04 Aug, 2019 5 commits
    • block, scsi: Change the preempt-only flag into a counter · c58a6507
      Bart Van Assche committed
      commit cd84a62e0078dce09f4ed349bec84f86c9d54b30 upstream.
      
      The RQF_PREEMPT flag is used for three purposes:
      - In the SCSI core, for making sure that power management requests
        are executed even if a device is in the "quiesced" state.
      - For domain validation by SCSI drivers that use the parallel port.
      - In the IDE driver, for IDE preempt requests.
      
      Rename "preempt-only" into "pm-only" because the primary purpose of
      this mode is power management. Since the power management core may
      but does not have to resume a runtime suspended device before
      performing system-wide suspend and since a later patch will set
      "pm-only" mode as long as a block device is runtime suspended, make
      it possible to set "pm-only" mode from more than one context. Since
      with this change scsi_device_quiesce() is no longer idempotent, make
      that function return early if it is called for a quiesced queue.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
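Turning a flag into a counter, as the commit above does, lets several contexts request "pm-only" mode independently. A minimal single-threaded sketch with hypothetical names (kernel code would need an atomic counter):

```c
#include <assert.h>
#include <stdbool.h>

struct model_queue { int pm_only; /* number of contexts requesting pm-only */ };

static void model_set_pm_only(struct model_queue *q)
{
	q->pm_only++;
}

static void model_clear_pm_only(struct model_queue *q)
{
	if (q->pm_only > 0)
		q->pm_only--;
}

/* The queue stays in pm-only mode until every requester has cleared it. */
static bool model_queue_pm_only(const struct model_queue *q)
{
	return q->pm_only > 0;
}
```

Because set is no longer idempotent, a caller that may run twice for the same reason has to guard against double counting, which matches the early return the commit adds to scsi_device_quiesce().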
    • sched/fair: Use RCU accessors consistently for ->numa_group · a5a3915f
      Jann Horn committed
      commit cb361d8cdef69990f6b4504dc1fd9a594d983c97 upstream.
      
      The old code used RCU annotations and accessors inconsistently for
      ->numa_group, which can lead to use-after-frees and NULL dereferences.
      
      Let all accesses to ->numa_group use proper RCU helpers to prevent such
      issues.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 8c8a743c ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
      Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/fair: Don't free p->numa_faults with concurrent readers · 48046e09
      Jann Horn committed
      commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream.
      
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      
      Another way to fix this would be to make ->numa_faults RCU-managed or add
      extra locking, but it seems easier to wipe the NUMA fault statistics on
      execve.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/iova: Fix compilation error with !CONFIG_IOMMU_IOVA · 3a0c22cb
      Joerg Roedel committed
      commit 201c1db90cd643282185a00770f12f95da330eca upstream.
      
      The stub function for !CONFIG_IOMMU_IOVA needs to be
      'static inline'.
      
      Fixes: effa467870c76 ('iommu/vt-d: Don't queue_iova() if there is no flush queue')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/vt-d: Don't queue_iova() if there is no flush queue · 4fd0eb60
      Dmitry Safonov committed
      commit effa467870c7612012885df4e246bdb8ffd8e44c upstream.
      
      Intel VT-d driver was reworked to use common deferred flushing
      implementation. Previously there was one global per-cpu flush queue,
      afterwards - one per domain.
      
      Before deferring a flush, the queue should be allocated and initialized.
      
      Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
      queue. It's probably worth initializing it for static or unmanaged domains
      too, but that is arguable; I'm leaving it to the iommu folks.
      
      Prevent queuing an iova flush if the domain doesn't have a queue.
      The defensive check seems worth keeping even if the queue were
      initialized for all kinds of domains, and it is easy to backport.
      
      On the 4.19.43 stable kernel it has a user-visible effect: previously,
      devices in the si domain caused crashes. On SATA devices:
      
       BUG: spinlock bad magic on CPU#6, swapper/0/1
        lock: 0xffff88844f582008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
       CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1
       Call Trace:
        <IRQ>
        dump_stack+0x61/0x7e
        spin_bug+0x9d/0xa3
        do_raw_spin_lock+0x22/0x8e
        _raw_spin_lock_irqsave+0x32/0x3a
        queue_iova+0x45/0x115
        intel_unmap+0x107/0x113
        intel_unmap_sg+0x6b/0x76
        __ata_qc_complete+0x7f/0x103
        ata_qc_complete+0x9b/0x26a
        ata_qc_complete_multiple+0xd0/0xe3
        ahci_handle_port_interrupt+0x3ee/0x48a
        ahci_handle_port_intr+0x73/0xa9
        ahci_single_level_irq_intr+0x40/0x60
        __handle_irq_event_percpu+0x7f/0x19a
        handle_irq_event_percpu+0x32/0x72
        handle_irq_event+0x38/0x56
        handle_edge_irq+0x102/0x121
        handle_irq+0x147/0x15c
        do_IRQ+0x66/0xf2
        common_interrupt+0xf/0xf
       RIP: 0010:__do_softirq+0x8c/0x2df
      
      The same for usb devices that use ehci-pci:
       BUG: spinlock bad magic on CPU#0, swapper/0/1
        lock: 0xffff88844f402008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4
       Call Trace:
        <IRQ>
        dump_stack+0x61/0x7e
        spin_bug+0x9d/0xa3
        do_raw_spin_lock+0x22/0x8e
        _raw_spin_lock_irqsave+0x32/0x3a
        queue_iova+0x77/0x145
        intel_unmap+0x107/0x113
        intel_unmap_page+0xe/0x10
        usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d
        usb_hcd_unmap_urb_for_dma+0x17/0x100
        unmap_urb_for_dma+0x22/0x24
        __usb_hcd_giveback_urb+0x51/0xc3
        usb_giveback_urb_bh+0x97/0xde
        tasklet_action_common.isra.4+0x5f/0xa1
        tasklet_action+0x2d/0x30
        __do_softirq+0x138/0x2df
        irq_exit+0x7d/0x8b
        smp_apic_timer_interrupt+0x10f/0x151
        apic_timer_interrupt+0xf/0x20
        </IRQ>
       RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39
      
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Lu Baolu <baolu.lu@linux.intel.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: <stable@vger.kernel.org> # 4.14+
      Fixes: 13cf0174 ("iommu/vt-d: Make use of iova deferred flushing")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      [v4.14-port notes:
      o minor conflict with untrusted IOMMU devices check under if-condition]
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fd0eb60
  12. 31 Jul, 2019 2 commits
    • L
      access: avoid the RCU grace period for the temporary subjective credentials · 408af823
      Linus Torvalds committed
      commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.
      
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves a RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of secondary damage: instead of having
      nice alloc/free patterns that hit in hot per-CPU slab caches, you have
      all those delayed free's, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
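
      The flag-in-a-union idea described above can be sketched in plain
      userspace C. This is only an illustration under stated assumptions:
      the names cred_sketch, rcu_head_sketch, cred_alloc(),
      cred_get_long_term() and cred_put() are hypothetical stand-ins, not
      the kernel's API, and the "deferred" path merely counts the call
      instead of waiting for an RCU grace period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's RCU callback list head. */
struct rcu_head_sketch {
    struct rcu_head_sketch *next;
    void (*func)(struct rcu_head_sketch *head);
};

struct cred_sketch {
    int usage;                        /* reference count */
    union {
        int non_rcu;                  /* meaningful while the cred is live */
        struct rcu_head_sketch rcu;   /* meaningful only once freeing starts */
    };                                /* shared storage: the flag costs nothing */
};

/* Counters that let a caller observe which free path was taken. */
static int sync_frees;
static int deferred_frees;

static struct cred_sketch *cred_alloc(bool non_rcu)
{
    struct cred_sketch *c = calloc(1, sizeof(*c));

    if (!c)
        return NULL;
    c->usage = 1;
    c->non_rcu = non_rcu;
    return c;
}

/* Taking a long-term reference clears the flag, so the final put
 * falls back to the deferred path. */
static struct cred_sketch *cred_get_long_term(struct cred_sketch *c)
{
    c->non_rcu = 0;
    c->usage++;
    return c;
}

static void cred_put(struct cred_sketch *c)
{
    assert(c->usage > 0);
    if (--c->usage)
        return;
    if (c->non_rcu)
        sync_frees++;        /* free right away, no grace period */
    else
        deferred_frees++;    /* a kernel would queue an RCU callback here */
    free(c);                 /* the sketch frees immediately in both cases */
}
```

      A temporary cred allocated with cred_alloc(true) and dropped with a
      single cred_put() takes the synchronous path; one that went through
      cred_get_long_term() first takes the deferred path on its final put.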
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      408af823
    • T
      gpu: host1x: Increase maximum DMA segment size · 4d14323a
      Thierry Reding committed
      [ Upstream commit 1e390478cfb527e34c9ab89ba57212cb05c33c51 ]
      
      Recent versions of the DMA API debug code have started to warn about
      violations of the maximum DMA segment size. This is because the segment
      size defaults to 64 KiB, which can easily be exceeded in large buffer
      allocations such as used in DRM/KMS for framebuffers.
      
      Technically the Tegra SMMU and ARM SMMU don't have a maximum segment
      size (they map individual pages irrespective of whether they are
      contiguous or not), so the choice of 4 MiB is a bit arbitrary here. The
      maximum segment size is a 32-bit unsigned integer, though, so we can't
      set it to the correct maximum size, which would be the size of the
      aperture.
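
      The 32-bit limit mentioned above can be checked with a few lines of
      userspace C. This is a sketch only: SZ_4M_SKETCH and
      fits_in_seg_size() are hypothetical names, and the 4 GiB aperture
      used in the usage note is an assumed example value, not a figure
      taken from this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 4 MiB, the segment size chosen in the text above. */
#define SZ_4M_SKETCH (4u * 1024u * 1024u)

/* The chosen value itself must fit in a 32-bit unsigned field. */
static_assert(SZ_4M_SKETCH <= UINT32_MAX, "4 MiB fits in 32 bits");

/* Returns true if a size in bytes can be stored in a 32-bit
 * unsigned maximum-segment-size field without truncation. */
static bool fits_in_seg_size(uint64_t bytes)
{
    return bytes <= UINT32_MAX;
}
```

      With an assumed aperture of 4 GiB, fits_in_seg_size(1ULL << 32)
      is false, while fits_in_seg_size(SZ_4M_SKETCH) is true.
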
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4d14323a
  13. 28 Jul, 2019 3 commits
    • R
      jbd2: introduce jbd2_inode dirty range scoping · af3812b6
      Ross Zwisler committed
      commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
      
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is to just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jbd2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
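
      The range scoping described above amounts to keeping a running
      minimum start and maximum end per inode and per transaction. A
      minimal userspace sketch, assuming inclusive byte offsets: struct
      dirty_range, dirty_range_add() and dirty_range_take() are
      hypothetical names, and resetting the range at commit time is an
      assumption of this sketch rather than a description of the jbd2
      code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Dirty byte range recorded for one inode in one transaction. */
struct dirty_range {
    bool    valid;   /* false until the first range is added */
    int64_t start;   /* first dirty byte */
    int64_t end;     /* last dirty byte, inclusive */
};

/* Widen the recorded range so that it also covers [start, end]. */
static void dirty_range_add(struct dirty_range *r, int64_t start, int64_t end)
{
    assert(start >= 0 && start <= end);
    if (!r->valid) {
        r->start = start;
        r->end = end;
        r->valid = true;
        return;
    }
    if (start < r->start)
        r->start = start;
    if (end > r->end)
        r->end = end;
}

/* Snapshot the range for writeback and wait, then reset it so that
 * later appends are accounted to the next transaction (assumption). */
static struct dirty_range dirty_range_take(struct dirty_range *r)
{
    struct dirty_range snap = *r;

    r->valid = false;
    r->start = 0;
    r->end = 0;
    return snap;
}
```

      A committer would write back and wait only on the snapshot returned
      by dirty_range_take(), so pages appended after the snapshot do not
      extend the wait.
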
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      af3812b6
    • R
      mm: add filemap_fdatawait_range_keep_errors() · 4becd6c1
      Ross Zwisler committed
      commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.
      
      In the spirit of filemap_fdatawait_range() and
      filemap_fdatawait_keep_errors(), introduce
      filemap_fdatawait_range_keep_errors() which both takes a range upon
      which to wait and does not clear errors from the address space.
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4becd6c1
    • A
      perf/core: Fix exclusive events' grouping · 75100ec5
      Alexander Shishkin committed
      commit 8a58ddae23796c733c5dfbd717538d89d036c5bd upstream.
      
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75100ec5
  14. 26 Jul, 2019 2 commits
    • M
      clocksource/drivers/exynos_mct: Increase priority over ARM arch timer · e69fac59
      Marek Szyprowski committed
      [ Upstream commit 6282edb72bed5324352522d732080d4c1b9dfed6 ]
      
      Exynos SoCs based on CA7/CA15 have 2 timer interfaces: custom Exynos MCT
      (Multi Core Timer) and standard ARM Architected Timers.
      
      There are use cases where both timer interfaces are used simultaneously.
      One of such examples is using Exynos MCT for the main system timer and
      ARM Architected Timers for the KVM and virtualized guests (KVM requires
      arch timers).
      
      Exynos Multi-Core Timer driver (exynos_mct) must be however started
      before ARM Architected Timers (arch_timer), because they both share some
      common hardware blocks (global system counter) and turning on MCT is
      needed to get ARM Architected Timer working properly.
      
      To ensure selecting Exynos MCT as the main system timer, increase MCT
      timer rating. To ensure proper starting order of both timers during
      suspend/resume cycle, increase MCT hotplug priority over ARM Architected
      Timers.
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
      Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e69fac59
    • W
      rcu: Force inlining of rcu_read_lock() · 366ae49e
      Waiman Long committed
      [ Upstream commit 6da9f775175e516fc7229ceaa9b54f8f56aa7924 ]
      
      When debugging options are turned on, the rcu_read_lock() function
      might not be inlined. This results in lockdep's print_lock() function
      printing "rcu_read_lock+0x0/0x70" instead of rcu_read_lock()'s caller.
      For example:
      
      [   10.579995] =============================
      [   10.584033] WARNING: suspicious RCU usage
      [   10.588074] 4.18.0.memcg_v2+ #1 Not tainted
      [   10.593162] -----------------------------
      [   10.597203] include/linux/rcupdate.h:281 Illegal context switch in
      RCU read-side critical section!
      [   10.606220]
      [   10.606220] other info that might help us debug this:
      [   10.606220]
      [   10.614280]
      [   10.614280] rcu_scheduler_active = 2, debug_locks = 1
      [   10.620853] 3 locks held by systemd/1:
      [   10.624632]  #0: (____ptrval____) (&type->i_mutex_dir_key#5){.+.+}, at: lookup_slow+0x42/0x70
      [   10.633232]  #1: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
      [   10.640954]  #2: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
      
      These "rcu_read_lock+0x0/0x70" strings are not providing any useful
      information.  This commit therefore forces inlining of the rcu_read_lock()
      function so that rcu_read_lock()'s caller is instead shown.
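
      Forced inlining of a small lock helper can be sketched with the
      GCC/Clang always_inline attribute. This is an illustration only:
      always_inline_sketch, read_lock_sketch(), read_unlock_sketch() and
      reader() are hypothetical names, and the depth counter merely
      stands in for real lock tracking.

```c
#include <assert.h>

/* Combines the inline keyword with the always_inline attribute
 * (GCC and Clang), so the compiler inlines even without optimization. */
#define always_inline_sketch inline __attribute__((__always_inline__))

/* Nesting depth of the toy read-side critical section. */
static int lock_depth;

static always_inline_sketch void read_lock_sketch(void)
{
    lock_depth++;
}

static always_inline_sketch void read_unlock_sketch(void)
{
    assert(lock_depth > 0);
    lock_depth--;
}

/* Because the helpers are always inlined, their bodies are emitted
 * inside reader(), so any code address captured inside them would
 * belong to reader() rather than to a separate helper symbol. */
static int reader(void)
{
    int depth;

    read_lock_sketch();
    depth = lock_depth;
    read_unlock_sketch();
    return depth;
}
```

      With plain "static inline" a debug build may still emit
      read_lock_sketch() as a separate function; the attribute removes
      that choice from the compiler.
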
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      366ae49e