1. 25 Aug 2019: 2 commits
    • KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block · 8c7053d1
      Committed by Marc Zyngier
      commit 5eeaf10eec394b28fad2c58f1f5c3a5da0e87d1c upstream.
      
      Since commit 328e5664 ("KVM: arm/arm64: vgic: Defer
      touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
      its GICv2 equivalent) loaded as long as we can, only syncing it
      back when we're scheduled out.
      
      There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
      which is indirectly called from kvm_vcpu_check_block(), needs to
      evaluate the guest's view of ICC_PMR_EL1. At the point where we
      call kvm_vcpu_check_block(), the vcpu is still loaded, and any
      changes to PMR are not visible in memory until we do a vcpu_put().
      
      Things go really south if the guest does the following:
      
      	mov x0, #0	// or any small value masking interrupts
      	msr ICC_PMR_EL1, x0
      
      	[vcpu preempted, then rescheduled, VMCR sampled]
      
      	mov x0, #0xff	// allow all interrupts
      	msr ICC_PMR_EL1, x0
      	wfi		// traps to EL2, so sampling of VMCR
      
      	[interrupt arrives just after WFI]
      
      Here, the hypervisor's view of PMR is zero, while the guest has enabled
      its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
      interrupts are pending (despite an interrupt being received) and we'll
      block for no reason. If the guest doesn't have a periodic interrupt
      firing once it has blocked, it will stay there forever.
      
      To avoid this unfortunate situation, let's resync VMCR from
      kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
      will observe the latest value of PMR.
      
      This has been found by booting an arm64 Linux guest with the pseudo NMI
      feature, and thus using interrupt priorities to mask interrupts instead
      of the usual PSTATE masking.
      
      Cc: stable@vger.kernel.org # 4.12
      Fixes: 328e5664 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put")
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      8c7053d1
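
      The shape of the fix, as a minimal sketch (kvm_vgic_vmcr_sync() is the
      helper the patch introduces for this; details may differ by kernel
      version):

      	void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
      	{
      		/*
      		 * If we're about to block (most likely because we've just
      		 * hit a WFI), sync back the state of the GIC CPU interface
      		 * so that a following kvm_vgic_vcpu_pending_irq() reads the
      		 * latest PMR value from memory.
      		 */
      		preempt_disable();
      		kvm_vgic_vmcr_sync(vcpu);
      		preempt_enable();
      	}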
    • asm-generic: fix -Wtype-limits compiler warnings · 0755b6b1
      Committed by Qian Cai
      [ Upstream commit cbedfe11347fe418621bd188d58a206beb676218 ]
      
      Commit d66acc39 ("bitops: Optimise get_order()") introduced a
      compilation warning because "rx_frag_size" is a "ushort" while
      PAGE_SHIFT here is 16.
      
      The commit changed get_order() into a multi-line macro, and compilers
      insist on checking all statements in the macro even when
      __builtin_constant_p(rx_frag_size) will return false, as "rx_frag_size"
      is a module parameter.
      
      In file included from ./arch/powerpc/include/asm/page_64.h:107,
                       from ./arch/powerpc/include/asm/page.h:242,
                       from ./arch/powerpc/include/asm/mmu.h:132,
                       from ./arch/powerpc/include/asm/lppaca.h:47,
                       from ./arch/powerpc/include/asm/paca.h:17,
                       from ./arch/powerpc/include/asm/current.h:13,
                       from ./include/linux/thread_info.h:21,
                       from ./arch/powerpc/include/asm/processor.h:39,
                       from ./include/linux/prefetch.h:15,
                       from drivers/net/ethernet/emulex/benet/be_main.c:14:
      drivers/net/ethernet/emulex/benet/be_main.c: In function 'be_rx_cqs_create':
      ./include/asm-generic/getorder.h:54:9: warning: comparison is always
      true due to limited range of data type [-Wtype-limits]
         (((n) < (1UL << PAGE_SHIFT)) ? 0 :  \
               ^
      drivers/net/ethernet/emulex/benet/be_main.c:3138:33: note: in expansion
      of macro 'get_order'
        adapter->big_page_size = (1 << get_order(rx_frag_size)) * PAGE_SIZE;
                                       ^~~~~~~~~
      
      Fix it by moving all of this multi-line macro into a proper function,
      and killing __get_order() off.
      
      [akpm@linux-foundation.org: remove __get_order() altogether]
      [cai@lca.pw: v2]
        Link: http://lkml.kernel.org/r/1564000166-31428-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/1563914986-26502-1-git-send-email-cai@lca.pw
      Fixes: d66acc39 ("bitops: Optimise get_order()")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Bill Wendling <morbo@google.com>
      Cc: James Y Knight <jyknight@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0755b6b1
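
      The reworked helper then reads roughly as follows, keeping the old
      macro's constant-folding arm inside the function (a sketch based on the
      description above):

      	static inline __attribute_const__ int get_order(unsigned long size)
      	{
      		if (__builtin_constant_p(size)) {
      			if (!size)
      				return BITS_PER_LONG - PAGE_SHIFT;
      			if (size < (1UL << PAGE_SHIFT))
      				return 0;
      			return ilog2((size) - 1) - PAGE_SHIFT + 1;
      		}
      		size--;
      		size >>= PAGE_SHIFT;
      	#if BITS_PER_LONG == 32
      		return fls(size);
      	#else
      		return fls64(size);
      	#endif
      	}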
  2. 16 Aug 2019: 4 commits
  3. 09 Aug 2019: 7 commits
    • cgroup: Include dying leaders with live threads in PROCS iterations · 4340d175
      Committed by Tejun Heo
      commit c03cd7738a83b13739f00546166969342c8ff014 upstream.
      
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4340d175
    • cgroup: Implement css_task_iter_skip() · 370b9e63
      Committed by Tejun Heo
      commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.
      
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates out skipping from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      them, and the following advance will recognize that the cursor has
      already been moved forward and do the rest of the advancing.  This ensures
      that when a task moves from one list to another in its cset, as long
      as it moves in the right direction, it's always visible to iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      370b9e63
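
      A sketch of the skip primitive this describes (flag and field names
      assumed from context):

      	static void css_task_iter_skip(struct css_task_iter *it,
      				       struct task_struct *task)
      	{
      		lockdep_assert_held(&css_set_lock);

      		if (it->task_pos == &task->cg_list) {
      			/* nudge the cursor only; the next advance completes it */
      			it->task_pos = it->task_pos->next;
      			it->flags |= CSS_TASK_ITER_SKIPPED;
      		}
      	}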
    • compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · e6e9bcef
      Committed by Arnd Bergmann
      [ Upstream commit 055d88242a6046a1ceac3167290f054c72571cd9 ]
      
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but was always broken
      since only the structure is compatible, but the command number is not,
      due to the size being sizeof(size_t), or at first
      sizeof(sizeof(struct sockaddr_pppox)), which is different on 64-bit
      architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
      Acked-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6e9bcef
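
      A sketch of the translation described above: the 32-bit command number
      encodes compat_size_t, so it is rewritten before dispatch (macro and
      function names are assumptions):

      	#define PPPOEIOCSFWD32 _IOW(0xB1, 0, compat_size_t)

      	int pppox_compat_ioctl(struct socket *sock, unsigned int cmd,
      			       unsigned long arg)
      	{
      		if (cmd == PPPOEIOCSFWD32)
      			cmd = PPPOEIOCSFWD;

      		return pppox_ioctl(sock, cmd, (unsigned long)compat_ptr(arg));
      	}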
    • net/mlx5e: Prevent encap flow counter update async to user query · 0ccf4726
      Committed by Ariel Levkovich
      [ Upstream commit 90bb769291161cf25a818d69cf608c181654473e ]
      
      This patch prevents a race between user invoked cached counters
      query and a neighbor last usage updater.
      
      The cached flow counter stats can be queried by calling
      "mlx5_fc_query_cached" which provides the number of bytes and
      packets that passed via this flow since the last time this counter
      was queried.
      It does so by subtracting the last saved stats from the current, cached
      stats and then updating the last saved stats with the cached stats.
      It also provides the lastuse value for that flow.

      Since "mlx5e_tc_update_neigh_used_value" needs to retrieve the
      last usage time of encapsulation flows, it calls the flow counter
      query method periodically and asynchronously to user queries of the
      flow counter using cls_flower.
      This call causes the driver to update the last reported bytes and
      packets from the cache, and therefore future user queries of the flow
      stats will return lower-than-expected numbers for bytes and packets,
      since the last saved stats in the driver were updated asynchronously
      to the last saved stats in cls_flower.

      This causes wrong stats to be presented to the user for encapsulation flows.
      
      Since the neighbor usage updater only needs the lastuse stats from the
      cached counter, the fix is to use a dedicated lastuse query call that
      returns the lastuse value without synching between the cached stats and
      the last saved stats.
      
      Fixes: f6dfb4c3 ("net/mlx5e: Update neighbour 'used' state using HW flow rules counters")
      Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ccf4726
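
      The dedicated lastuse query amounts to returning the cached timestamp
      without folding the cached stats into the saved stats (sketch):

      	u64 mlx5_fc_query_lastuse(struct mlx5_fc *counter)
      	{
      		return counter->cache.lastuse;
      	}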
    • net/mlx5: Fix modify_cq_in alignment · cd84a107
      Committed by Edward Srouji
      [ Upstream commit 7a32f2962c56d9d8a836b4469855caeee8766bd4 ]
      
      Fix modify_cq_in alignment to match the device specification.
      After this fix the 'cq_umem_valid' field will be in the right offset.
      
      Cc: <stable@vger.kernel.org> # 4.19
      Fixes: bd37197554eb ("net/mlx5: Update mlx5_ifc with DEVX UID bits")
      Signed-off-by: Edward Srouji <edwards@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd84a107
    • drivers/base: Introduce kill_device() · c23106d4
      Committed by Dan Williams
      commit 00289cd87676e14913d2d8492d1ce05c4baafdae upstream.
      
      The libnvdimm subsystem arranges for devices to be destroyed as a result
      of a sysfs operation. Since device_unregister() cannot be called from
      an actively running sysfs attribute of the same device libnvdimm
      arranges for device_unregister() to be performed in an out-of-line async
      context.
      
      The driver core maintains a 'dead' state for coordinating its own racing
      async registration / de-registration requests. Rather than add local
      'dead' state tracking infrastructure to libnvdimm device objects, export
      the existing state tracking via a new kill_device() helper.
      
      The kill_device() helper simply marks the device as dead, i.e. that it
      is on its way to device_del(), or returns that the device was already
      dead. This can be used in advance of calling device_unregister() for
      subsystems like libnvdimm that might need to handle multiple user
      threads racing to delete a device.
      
      This refactoring does not change any behavior, but it is a pre-requisite
      for follow-on fixes and therefore marked for -stable.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Fixes: 4d88a97a ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
      Cc: <stable@vger.kernel.org>
      Tested-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/156341207332.292348.14959761496009347574.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c23106d4
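
      A sketch of the helper as described; the locking requirement and the
      boolean return convention are assumptions based on the text:

      	bool kill_device(struct device *dev)
      	{
      		/* racing unregister attempts must see a consistent flag */
      		lockdep_assert_held(&dev->mutex);

      		if (dev->p->dead)
      			return false;	/* teardown already started elsewhere */
      		dev->p->dead = true;	/* device is on its way to device_del() */
      		return true;
      	}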
    • scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure · 93d6f084
      Committed by Hannes Reinecke
      commit 023358b136d490ca91735ac6490db3741af5a8bd upstream.
      
      Gcc-9 complains about a memset across pointer boundaries, which happens as
      the code tries to allocate a flexible array on the stack.  Turns out we
      cannot do this without relying on gcc-isms, so with this patch we'll embed
      the fc_rport_priv structure into fcoe_rport, use the normal
      'container_of' cast, and only have to do a memset over one
      structure.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      93d6f084
  4. 07 Aug 2019: 3 commits
  5. 04 Aug 2019: 5 commits
    • block, scsi: Change the preempt-only flag into a counter · c58a6507
      Committed by Bart Van Assche
      commit cd84a62e0078dce09f4ed349bec84f86c9d54b30 upstream.
      
      The RQF_PREEMPT flag is used for three purposes:
      - In the SCSI core, for making sure that power management requests
        are executed even if a device is in the "quiesced" state.
      - For domain validation by SCSI drivers that use the parallel port.
      - In the IDE driver, for IDE preempt requests.
      Rename "preempt-only" into "pm-only" because the primary purpose of
      this mode is power management. Since the power management core may
      but does not have to resume a runtime suspended device before
      performing system-wide suspend and since a later patch will set
      "pm-only" mode as long as a block device is runtime suspended, make
      it possible to set "pm-only" mode from more than one context. Since
      with this change scsi_device_quiesce() is no longer idempotent, make
      that function return early if it is called for a quiesced queue.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c58a6507
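
      The counter semantics, sketched from the description (the wakeup on the
      final clear is an assumption tied to the queue-freezing machinery):

      	void blk_set_pm_only(struct request_queue *q)
      	{
      		atomic_inc(&q->pm_only);
      	}

      	void blk_clear_pm_only(struct request_queue *q)
      	{
      		int pm_only;

      		pm_only = atomic_dec_return(&q->pm_only);
      		WARN_ON_ONCE(pm_only < 0);
      		if (pm_only == 0)
      			wake_up_all(&q->mq_freeze_wq);
      	}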
    • sched/fair: Use RCU accessors consistently for ->numa_group · a5a3915f
      Committed by Jann Horn
      commit cb361d8cdef69990f6b4504dc1fd9a594d983c97 upstream.
      
      The old code used RCU annotations and accessors inconsistently for
      ->numa_group, which can lead to use-after-frees and NULL dereferences.
      
      Let all accesses to ->numa_group use proper RCU helpers to prevent such
      issues.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 8c8a743c ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
      Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a5a3915f
    • sched/fair: Don't free p->numa_faults with concurrent readers · 48046e09
      Committed by Jann Horn
      commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream.
      
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      
      Another way to fix this would be to make ->numa_faults RCU-managed or add
      extra locking, but it seems easier to wipe the NUMA fault statistics on
      execve.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      48046e09
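
      The execve-path behavior, sketched with an assumed 'final' flag that
      distinguishes exit from execve; numa_faults_size() is a hypothetical
      stand-in for the real size computation:

      	void task_numa_free(struct task_struct *p, bool final)
      	{
      		unsigned long *numa_faults = p->numa_faults;

      		if (!numa_faults)
      			return;

      		/* ...numa_group accounting elided... */

      		if (final) {
      			/* task is exiting: really free the allocation */
      			p->numa_faults = NULL;
      			kfree(numa_faults);
      		} else {
      			/* execve: wipe in place, concurrent readers stay safe */
      			p->total_numa_faults = 0;
      			memset(numa_faults, 0, numa_faults_size(p));
      		}
      	}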
    • iommu/iova: Fix compilation error with !CONFIG_IOMMU_IOVA · 3a0c22cb
      Committed by Joerg Roedel
      commit 201c1db90cd643282185a00770f12f95da330eca upstream.
      
      The stub function for !CONFIG_IOMMU_IOVA needs to be
      'static inline'.
      
      Fixes: effa467870c76 ('iommu/vt-d: Don't queue_iova() if there is no flush queue')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a0c22cb
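
      The point in miniature: a stub defined in a header must be 'static
      inline', otherwise every translation unit including it emits its own
      definition and the build breaks (signature per the iova API):

      	#ifndef CONFIG_IOMMU_IOVA
      	static inline void queue_iova(struct iova_domain *iovad,
      				      unsigned long pfn, unsigned long pages,
      				      unsigned long data)
      	{
      	}
      	#endif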
    • iommu/vt-d: Don't queue_iova() if there is no flush queue · 4fd0eb60
      Committed by Dmitry Safonov
      commit effa467870c7612012885df4e246bdb8ffd8e44c upstream.
      
      Intel VT-d driver was reworked to use common deferred flushing
      implementation. Previously there was one global per-cpu flush queue,
      afterwards - one per domain.
      
      Before deferring a flush, the queue should be allocated and initialized.
      
      Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
      queue. It's probably worth initializing it for static or unmanaged domains
      too, but that may be arguable - I'm leaving it to the iommu folks.

      Prevent queuing an iova flush if the domain doesn't have a queue.
      The defensive check seems worth keeping even if the queue were
      initialized for all kinds of domains, and it is easily backportable.
      
      On 4.19.43 stable kernel it has a user-visible effect: previously for
      devices in si domain there were crashes, on sata devices:
      
       BUG: spinlock bad magic on CPU#6, swapper/0/1
        lock: 0xffff88844f582008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
       CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1
       Call Trace:
        <IRQ>
        dump_stack+0x61/0x7e
        spin_bug+0x9d/0xa3
        do_raw_spin_lock+0x22/0x8e
        _raw_spin_lock_irqsave+0x32/0x3a
        queue_iova+0x45/0x115
        intel_unmap+0x107/0x113
        intel_unmap_sg+0x6b/0x76
        __ata_qc_complete+0x7f/0x103
        ata_qc_complete+0x9b/0x26a
        ata_qc_complete_multiple+0xd0/0xe3
        ahci_handle_port_interrupt+0x3ee/0x48a
        ahci_handle_port_intr+0x73/0xa9
        ahci_single_level_irq_intr+0x40/0x60
        __handle_irq_event_percpu+0x7f/0x19a
        handle_irq_event_percpu+0x32/0x72
        handle_irq_event+0x38/0x56
        handle_edge_irq+0x102/0x121
        handle_irq+0x147/0x15c
        do_IRQ+0x66/0xf2
        common_interrupt+0xf/0xf
       RIP: 0010:__do_softirq+0x8c/0x2df
      
      The same for usb devices that use ehci-pci:
       BUG: spinlock bad magic on CPU#0, swapper/0/1
        lock: 0xffff88844f402008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4
       Call Trace:
        <IRQ>
        dump_stack+0x61/0x7e
        spin_bug+0x9d/0xa3
        do_raw_spin_lock+0x22/0x8e
        _raw_spin_lock_irqsave+0x32/0x3a
        queue_iova+0x77/0x145
        intel_unmap+0x107/0x113
        intel_unmap_page+0xe/0x10
        usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d
        usb_hcd_unmap_urb_for_dma+0x17/0x100
        unmap_urb_for_dma+0x22/0x24
        __usb_hcd_giveback_urb+0x51/0xc3
        usb_giveback_urb_bh+0x97/0xde
        tasklet_action_common.isra.4+0x5f/0xa1
        tasklet_action+0x2d/0x30
        __do_softirq+0x138/0x2df
        irq_exit+0x7d/0x8b
        smp_apic_timer_interrupt+0x10f/0x151
        apic_timer_interrupt+0xf/0x20
        </IRQ>
       RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39
      
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Lu Baolu <baolu.lu@linux.intel.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: <stable@vger.kernel.org> # 4.14+
      Fixes: 13cf0174 ("iommu/vt-d: Make use of iova deferred flushing")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      [v4.14-port notes:
      o minor conflict with untrusted IOMMU devices check under if-condition]
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fd0eb60
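
      The defensive check itself is tiny (sketch; helper name per the patch):

      	bool has_iova_flush_queue(struct iova_domain *iovad)
      	{
      		return !!iovad->fq;
      	}

      The unmap path then calls queue_iova() only when this returns true, and
      otherwise flushes the IOTLB and frees the IOVA immediately.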
  6. 31 Jul 2019: 2 commits
    • access: avoid the RCU grace period for the temporary subjective credentials · 408af823
      Committed by Linus Torvalds
      commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.
      
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves an RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of secondary damage: instead of having
      nice alloc/free patterns that hit in hot per-CPU slab caches, you have
      all those delayed frees, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      408af823
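
      The shape of the change, per the description above: a union with the RCU
      head, plus a synchronous-free fast path in __put_cred() (sketch):

      	struct cred {
      		/* ... */
      		union {
      			int non_rcu;		/* can we skip RCU deletion? */
      			struct rcu_head	rcu;	/* RCU deletion hook */
      		};
      	};

      	void __put_cred(struct cred *cred)
      	{
      		/* ...sanity checks elided... */
      		if (cred->non_rcu)
      			put_cred_rcu(&cred->rcu);	/* free synchronously */
      		else
      			call_rcu(&cred->rcu, put_cred_rcu);
      	}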
    • gpu: host1x: Increase maximum DMA segment size · 4d14323a
      Committed by Thierry Reding
      [ Upstream commit 1e390478cfb527e34c9ab89ba57212cb05c33c51 ]
      
      Recent versions of the DMA API debug code have started to warn about
      violations of the maximum DMA segment size. This is because the segment
      size defaults to 64 KiB, which can easily be exceeded in large buffer
      allocations such as used in DRM/KMS for framebuffers.
      
      Technically the Tegra SMMU and ARM SMMU don't have a maximum segment
      size (they map individual pages irrespective of whether they are
      contiguous or not), so the choice of 4 MiB is a bit arbitrary here. The
      maximum segment size is a 32-bit unsigned integer, though, so we can't
      set it to the correct maximum size, which would be the size of the
      aperture.
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4d14323a
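
      The probe-time setup is presumably along these lines (sketch; the host1x
      field names are assumptions):

      	/* dma_parms must point at storage before the cap can be set */
      	host->dev->dma_parms = &host->dma_parms;
      	dma_set_max_seg_size(host->dev, SZ_4M);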
  7. 28 Jul 2019: 7 commits
    • jbd2: introduce jbd2_inode dirty range scoping · af3812b6
      Committed by Ross Zwisler
      commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
      
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is to just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jbd2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      af3812b6
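
      The scoping state lives in the jbd2_inode, roughly (sketch; field names
      assumed from the description):

      	struct jbd2_inode {
      		/* ... */
      		loff_t i_dirty_start;	/* first byte dirtied in this transaction */
      		loff_t i_dirty_end;	/* last byte dirtied in this transaction */
      	};

      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() then restrict their writeback and
      waiting to this range instead of the whole mapping.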
    • mm: add filemap_fdatawait_range_keep_errors() · 4becd6c1
      Committed by Ross Zwisler
      commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.
      
      In the spirit of filemap_fdatawait_range() and
      filemap_fdatawait_keep_errors(), introduce
      filemap_fdatawait_range_keep_errors() which both takes a range upon
      which to wait and does not clear errors from the address space.
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4becd6c1
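
      Mirroring its two namesakes, the new helper is presumably just (sketch):

      	int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
      			loff_t start_byte, loff_t end_byte)
      	{
      		__filemap_fdatawait_range(mapping, start_byte, end_byte);
      		return filemap_check_and_keep_errors(mapping);
      	}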
    • perf/core: Fix exclusive events' grouping · 75100ec5
      Committed by Alexander Shishkin
      commit 8a58ddae23796c733c5dfbd717538d89d036c5bd upstream.
      
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75100ec5
    • net/tls: make sure offload also gets the keys wiped · fde351ae
      Committed by Jakub Kicinski
      [ Upstream commit acd3e96d53a24d219f720ed4012b62723ae05da1 ]
      
      Commit 86029d10 ("tls: zero the crypto information from tls_context
      before freeing") added memzero_explicit() calls to clear the key material
      before freeing struct tls_context, but it missed that tls_device.c has its
      own way of freeing this structure. Fix the free path it missed.
      
      Fixes: 86029d10 ("tls: zero the crypto information from tls_context before freeing")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fde351ae
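
      The common wipe-and-free from commit 86029d10, which tls_device.c must
      call instead of a bare kfree() (sketch):

      	void tls_ctx_free(struct tls_context *ctx)
      	{
      		if (!ctx)
      			return;

      		memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
      		memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
      		kfree(ctx);
      	}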
    • tcp: fix tcp_set_congestion_control() use from bpf hook · c60f57df
      Committed by Eric Dumazet
      [ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ]
      
      Neal reported incorrect use of ns_capable() from bpf hook.
      
      bpf_setsockopt(...TCP_CONGESTION...)
        -> tcp_set_congestion_control()
         -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
          -> ns_capable_common()
           -> current_cred()
            -> rcu_dereference_protected(current->cred, 1)
      
      Accessing 'current' in bpf context makes no sense, since packets
      are processed from softirq context.
      
      As Neal stated : The capability check in tcp_set_congestion_control()
      was written assuming a system call context, and then was reused from
      a BPF call site.
      
      The fix is to add a new parameter to tcp_set_congestion_control(),
      so that the ns_capable() call is only performed under the right
      context.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Reported-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c60f57df
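
      The idea, sketched: the capability decision is made where 'current' is
      meaningful and passed down as a flag (parameter order assumed):

      	/* setsockopt path, process context: check here */
      	err = tcp_set_congestion_control(sk, name, true, true,
      			ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN));

      	/* bpf_setsockopt path, softirq context: privilege was already
      	 * established when the program was attached */
      	err = tcp_set_congestion_control(sk, name, false, reinit, true);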
    • tcp: be more careful in tcp_fragment() · 6323c238
      Committed by Eric Dumazet
      [ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]
      
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15.
      
      Note for < 4.15 backports :
       tcp_rtx_queue_tail() will probably look like :
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6323c238
    • net: make skb_dst_force return true when dst is refcounted · 832d0ea7
      Committed by Florian Westphal
      [ Upstream commit b60a77386b1d4868f72f6353d35dabe5fbe981f2 ]
      
      netfilter did not expect that skb_dst_force() can cause skb to lose its
      dst entry.
      
      I got a bug report with a skb->dst NULL dereference in netfilter
      output path.  The backtrace contains nf_reinject(), so the dst might have
      been cleared when skb got queued to userspace.
      
      Other users were fixed via
      if (skb_dst(skb)) {
      	skb_dst_force(skb);
      	if (!skb_dst(skb))
      		goto handle_err;
      }
      
      But I think it's preferable to make the 'dst might be cleared' part
      of the function explicit.
      
      In netfilter case, skb with a null dst is expected when queueing in
      prerouting hook, so drop skb for the other hooks.
      
      v2:
       v1 of this patch returned true in case skb had no dst entry.
       Eric said:
         Say if we have two skb_dst_force() calls for some reason
         on the same skb, only the first one will return false.
      
       This now returns false even when skb had no dst, as per Eric's
       suggestion, so callers might need to check skb_dst() first before
       skb_dst_force().
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      832d0ea7
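
      The reworked helper would look roughly like this (sketch; a failed
      dst_hold_safe() is what clears the entry):

      	static inline bool skb_dst_force(struct sk_buff *skb)
      	{
      		if (skb_dst_is_noref(skb)) {
      			struct dst_entry *dst = skb_dst(skb);

      			WARN_ON(!rcu_read_lock_held());
      			if (!dst_hold_safe(dst))
      				dst = NULL;	/* dst was already being released */

      			skb->_skb_refdst = (unsigned long)dst;
      		}

      		return skb->_skb_refdst != 0UL;
      	}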
  8. 26 Jul 2019: 9 commits
    • include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures · 41f64437
      Committed by Drew Davenport
      commit 6b15f678fb7d5ef54e089e6ace72f007fe6e9895 upstream.
      
      For architectures using __WARN_TAINT, the WARN_ON macro did not print
      out the "cut here" string.  The other WARN_XXX macros would print "cut
      here" inside __warn_printk, which is not called for WARN_ON since it
      doesn't have a message to print.
      
      Link: http://lkml.kernel.org/r/20190624154831.163888-1-ddavenport@chromium.org
      Fixes: a7bed27a ("bug: fix "cut here" location for __WARN_TAINT architectures")
      Signed-off-by: Drew Davenport <ddavenport@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      41f64437
    • IB/mlx5: Report correctly tag matching rendezvous capability · cb4c2b94
      Committed by Danit Goldberg
      commit 89705e92700170888236555fe91b45e4c1bb0985 upstream.
      
      Userspace expects the IB_TM_CAP_RC bit to indicate that the device
      supports RC transport tag matching with rendezvous offload. However the
      firmware splits this into two capabilities for eager and rendezvous tag
      matching.
      
      Only if the FW supports both modes should userspace be told the tag
      matching capability is available.
      
      Cc: <stable@vger.kernel.org> # 4.13
      Fixes: eb761894 ("IB/mlx5: Fill XRQ capabilities")
      Signed-off-by: Danit Goldberg <danitg@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb4c2b94
    • drm/edid: parse CEA blocks embedded in DisplayID · 66a13b5e
      Committed by Andres Rodriguez
      commit e28ad544f462231d3fd081a7316339359efbb481 upstream.
      
      DisplayID blocks allow embedding of CEA blocks. The payloads are
      identical to traditional top level CEA extension blocks, but the header
      is slightly different.
      
      This change allows the CEA parser to find a CEA block inside a DisplayID
      block. Additionally, it adds support for parsing the embedded CTA
      header. No further changes are necessary due to payload parity.
      
      This change fixes audio support for the Valve Index HMD.
      Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
      Reviewed-by: Dave Airlie <airlied@redhat.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v4.15
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190619180901.17901-1-andresx7@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      66a13b5e
    • xen/events: fix binding user event channels to cpus · 007e5aaf
      Committed by Juergen Gross
      commit bce5963bcb4f9934faa52be323994511d59fd13c upstream.
      
      When binding an interdomain event channel to a vcpu via
      IOCTL_EVTCHN_BIND_INTERDOMAIN not only the event channel needs to be
      bound, but the affinity of the associated IRQ must be changed, too.
      Otherwise the IRQ and the event channel won't be moved to another vcpu
      in case the original vcpu they were bound to is going offline.
      
      Cc: <stable@vger.kernel.org> # 4.13
      Fixes: c48f64ab ("xen-evtchn: Bind dyn evtchn:qemu-dm interrupt to next online VCPU")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      007e5aaf
    • rxrpc: Fix oops in tracepoint · 0f2f2ceb
      Committed by David Howells
      [ Upstream commit 99f0eae653b2db64917d0b58099eb51e300b311d ]
      
      If the rxrpc_eproto tracepoint is enabled, an oops will be caused by the
      trace line that rxrpc_extract_header() tries to emit when a protocol error
      occurs (typically because the packet is short), because the call argument
      is NULL.
      
      Fix this by using ?: to assume 0 as the debug_id if call is NULL.
      
      This can then be induced by:
      
      	echo -e '\0\0\0\0\0\0\0\0' | ncat -4u --send-only <addr> 20001
      
      where addr has the following program running on it:
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <unistd.h>
      	#include <sys/socket.h>
      	#include <arpa/inet.h>
      	#include <linux/rxrpc.h>
      	int main(void)
      	{
      		struct sockaddr_rxrpc srx;
      		int fd;
      		memset(&srx, 0, sizeof(srx));
      		srx.srx_family			= AF_RXRPC;
      		srx.srx_service			= 0;
      		srx.transport_type		= AF_INET;
      		srx.transport_len		= sizeof(srx.transport.sin);
      		srx.transport.sin.sin_family	= AF_INET;
      		srx.transport.sin.sin_port	= htons(0x4e21);
      		fd = socket(AF_RXRPC, SOCK_DGRAM, AF_INET6);
      		bind(fd, (struct sockaddr *)&srx, sizeof(srx));
      		sleep(20);
      		return 0;
      	}
      
      It results in the following oops.
      
      	BUG: kernel NULL pointer dereference, address: 0000000000000340
      	#PF: supervisor read access in kernel mode
      	#PF: error_code(0x0000) - not-present page
      	...
      	RIP: 0010:trace_event_raw_event_rxrpc_rx_eproto+0x47/0xac
      	...
      	Call Trace:
      	 <IRQ>
      	 rxrpc_extract_header+0x86/0x171
      	 ? rcu_read_lock_sched_held+0x5d/0x63
      	 ? rxrpc_new_skb+0xd4/0x109
      	 rxrpc_input_packet+0xef/0x14fc
      	 ? rxrpc_input_data+0x986/0x986
      	 udp_queue_rcv_one_skb+0xbf/0x3d0
      	 udp_unicast_rcv_skb.isra.8+0x64/0x71
      	 ip_protocol_deliver_rcu+0xe4/0x1b4
      	 ip_local_deliver+0xf0/0x154
      	 __netif_receive_skb_one_core+0x50/0x6c
      	 netif_receive_skb_internal+0x26b/0x2e9
      	 napi_gro_receive+0xf8/0x1da
      	 rtl8169_poll+0x303/0x4c4
      	 net_rx_action+0x10e/0x333
      	 __do_softirq+0x1a5/0x38f
      	 irq_exit+0x54/0xc4
      	 do_IRQ+0xda/0xf8
      	 common_interrupt+0xf/0xf
      	 </IRQ>
      	 ...
      	 ? cpuidle_enter_state+0x23c/0x34d
      	 cpuidle_enter+0x2a/0x36
      	 do_idle+0x163/0x1ea
      	 cpu_startup_entry+0x1d/0x1f
      	 start_secondary+0x157/0x172
      	 secondary_startup_64+0xa4/0xb0
      
      Fixes: a25e21f0 ("rxrpc, afs: Use debug_ids rather than pointers in traces")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0f2f2ceb
    • bpf: fix uapi bpf_prog_info fields alignment · 7343178c
      Committed by Baruch Siach
      [ Upstream commit 0472301a28f6cf53a6bc5783e48a2d0bbff4682f ]
      
      Merge commit 1c8c5a9d ("Merge
      git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next") undid the
      fix from commit 36f9814a ("bpf: fix uapi hole for 32 bit compat
      applications") by taking the gpl_compatible 1-bit field definition from
      commit b85fab0e ("bpf: Add gpl_compatible flag to struct
      bpf_prog_info") as is. That breaks architectures with 16-bit alignment
      like m68k. Add 31-bit pad after gpl_compatible to restore alignment of
      following fields.
      
      Thanks to Dmitry V. Levin for his analysis of this bug's history.
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7343178c
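
      The fix is a one-line unnamed bitfield restoring alignment after the
      flag (sketch of the uapi struct tail):

      	struct bpf_prog_info {
      		/* ... */
      		__u32 gpl_compatible:1;
      		__u32 :31;		/* alignment pad */
      		__u64 netns_dev;
      		__u64 netns_ino;
      		/* ... */
      	} __attribute__((aligned(8)));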
    • clocksource/drivers/exynos_mct: Increase priority over ARM arch timer · e69fac59
      Committed by Marek Szyprowski
      [ Upstream commit 6282edb72bed5324352522d732080d4c1b9dfed6 ]
      
      Exynos SoCs based on CA7/CA15 have 2 timer interfaces: custom Exynos MCT
      (Multi Core Timer) and standard ARM Architected Timers.
      
      There are use cases where both timer interfaces are used simultaneously.
      One of such examples is using Exynos MCT for the main system timer and
      ARM Architected Timers for the KVM and virtualized guests (KVM requires
      arch timers).
      
      The Exynos Multi-Core Timer driver (exynos_mct) must, however, be started
      before the ARM Architected Timers (arch_timer), because they both share some
      common hardware blocks (the global system counter) and turning on the MCT is
      needed to get the ARM Architected Timer working properly.
      
      To ensure selecting Exynos MCT as the main system timer, increase MCT
      timer rating. To ensure proper starting order of both timers during
      suspend/resume cycle, increase MCT hotplug priority over ARM Architected
      Timers.
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
      Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e69fac59
    • ipvs: fix tinfo memory leak in start_sync_thread · fe2ceeb4
      Committed by Julian Anastasov
      [ Upstream commit 5db7c8b9f9fc2aeec671ae3ca6375752c162e0e7 ]
      
      syzkaller reports a memory leak in start_sync_thread [1].

      As Eric points out, a kthread may start and stop before the
      threadfn function is called, so there is no chance for the
      data (tinfo in our case) to be released in the thread.

      Fix this by releasing tinfo in the controlling code instead.
      
      [1]
      BUG: memory leak
      unreferenced object 0xffff8881206bf700 (size 32):
       comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s)
       hex dump (first 32 bytes):
         00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff  .@|......E.!....
         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
       backtrace:
         [<0000000057619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
         [<0000000057619e23>] slab_post_alloc_hook mm/slab.h:439 [inline]
         [<0000000057619e23>] slab_alloc mm/slab.c:3326 [inline]
         [<0000000057619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
         [<0000000086ce5479>] kmalloc include/linux/slab.h:547 [inline]
         [<0000000086ce5479>] start_sync_thread+0x5d2/0xe10 net/netfilter/ipvs/ip_vs_sync.c:1862
         [<000000001a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 net/netfilter/ipvs/ip_vs_ctl.c:2402
         [<00000000ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
         [<00000000ece457c8>] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115
         [<00000000942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline]
         [<00000000942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238
         [<00000000a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
         [<00000000fa895401>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
         [<0000000095eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
         [<000000009747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline]
         [<000000009747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline]
         [<000000009747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
         [<00000000ded8ba80>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
         [<00000000893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported-by: syzbot+7e2e50c8adfccd2e5041@syzkaller.appspotmail.com
      Suggested-by: Eric Biggers <ebiggers@kernel.org>
      Fixes: 998e7a76 ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()")
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fe2ceeb4
    • rcu: Force inlining of rcu_read_lock() · 366ae49e
      Committed by Waiman Long
      [ Upstream commit 6da9f775175e516fc7229ceaa9b54f8f56aa7924 ]
      
      When debugging options are turned on, the rcu_read_lock() function
      might not be inlined. This results in lockdep's print_lock() function
      printing "rcu_read_lock+0x0/0x70" instead of rcu_read_lock()'s caller.
      For example:
      
      [   10.579995] =============================
      [   10.584033] WARNING: suspicious RCU usage
      [   10.588074] 4.18.0.memcg_v2+ #1 Not tainted
      [   10.593162] -----------------------------
      [   10.597203] include/linux/rcupdate.h:281 Illegal context switch in
      RCU read-side critical section!
      [   10.606220]
      [   10.606220] other info that might help us debug this:
      [   10.606220]
      [   10.614280]
      [   10.614280] rcu_scheduler_active = 2, debug_locks = 1
      [   10.620853] 3 locks held by systemd/1:
      [   10.624632]  #0: (____ptrval____) (&type->i_mutex_dir_key#5){.+.+}, at: lookup_slow+0x42/0x70
      [   10.633232]  #1: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
      [   10.640954]  #2: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
      
      These "rcu_read_lock+0x0/0x70" strings are not providing any useful
      information.  This commit therefore forces inlining of the rcu_read_lock()
      function so that rcu_read_lock()'s caller is instead shown.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      366ae49e
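
      The change itself is just the inlining attribute; the body is shown for
      context (per mainline):

      	static __always_inline void rcu_read_lock(void)
      	{
      		__rcu_read_lock();
      		__acquire(RCU);
      		rcu_lock_acquire(&rcu_lock_map);
      		RCU_LOCKDEP_WARN(!rcu_is_watching(),
      				 "rcu_read_lock() used illegally while idle");
      	}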
  9. 21 Jul 2019: 1 commit