1. 09 Aug, 2019 4 commits
    • compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · e6e9bcef
      Arnd Bergmann committed
      [ Upstream commit 055d88242a6046a1ceac3167290f054c72571cd9 ]
      
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but was always broken
      since only the structure is compatible, but the command number is not,
      due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
      sockaddr_pppox)), which is different on 64-bit architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
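
To see why the command number differs, here is a minimal sketch that rebuilds the ioctl number by hand. It assumes the asm-generic field layout (8-bit nr, 8-bit type, 14-bit size, 2-bit direction) and assumes PPPOEIOCSFWD is defined as _IOW(0xB1, 0, size_t); `iow()` and `compat_translate()` are hypothetical helper names, not kernel functions.

```python
# Assumed asm-generic ioctl layout; some architectures use other widths.
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14
IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS
IOC_WRITE = 1


def iow(type_, nr, size):
    """Model of _IOW(type, nr, T), where size stands in for sizeof(T)."""
    return ((IOC_WRITE << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT)
            | (type_ << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


# sizeof(size_t) is 8 for a 64-bit kernel and 4 for a 32-bit caller.
PPPOEIOCSFWD_NATIVE = iow(0xB1, 0, 8)
PPPOEIOCSFWD_COMPAT = iow(0xB1, 0, 4)


def compat_translate(cmd):
    """Hypothetical compat step: rewrite the 32-bit number, pass others through."""
    return PPPOEIOCSFWD_NATIVE if cmd == PPPOEIOCSFWD_COMPAT else cmd


print(hex(PPPOEIOCSFWD_COMPAT), hex(PPPOEIOCSFWD_NATIVE))
```

The payload is the same in both cases; only the size field baked into the number changes, which is why translating the number is enough.
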
      Acked-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Prevent encap flow counter update async to user query · 0ccf4726
      Ariel Levkovich committed
      [ Upstream commit 90bb769291161cf25a818d69cf608c181654473e ]
      
      This patch prevents a race between user invoked cached counters
      query and a neighbor last usage updater.
      
      The cached flow counter stats can be queried by calling
      "mlx5_fc_query_cached" which provides the number of bytes and
      packets that passed via this flow since the last time this counter
      was queried.
      It does so by subtracting the last saved stats from the current, cached
      stats and then updating the last saved stats with the cached stats.
      It also provides the lastuse value for that flow.
      
      Since "mlx5e_tc_update_neigh_used_value" needs to retrieve the
      last usage time of encapsulation flows, it calls the flow counter
      query method periodically and async to user queries of the flow counter
      using cls_flower.
      This call is causing the driver to update the last reported bytes and
      packets from the cache and therefore, future user queries of the flow
      stats will return lower than expected number for bytes and packets
      since the last saved stats in the driver was updated async to the last
      saved stats in cls_flower.
      
      As a result, the stats of encapsulation flows are presented incorrectly
      to the user.
      
      Since the neighbor usage updater only needs the lastuse stats from the
      cached counter, the fix is to use a dedicated lastuse query call that
      returns the lastuse value without synching between the cached stats and
      the last saved stats.
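
The race can be sketched with a small user-space model. The class and method names below are illustrative stand-ins for the driver's counter and its two query paths, not the mlx5 API, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class FlowCounter:
    # Stats most recently read into the cache.
    cur_bytes: int = 0
    cur_packets: int = 0
    lastuse: int = 0
    # Snapshot taken by the previous delta query.
    last_bytes: int = 0
    last_packets: int = 0

    def query_cached(self):
        """Return the delta since the previous call and move the snapshot."""
        d_bytes = self.cur_bytes - self.last_bytes
        d_packets = self.cur_packets - self.last_packets
        self.last_bytes, self.last_packets = self.cur_bytes, self.cur_packets
        return d_bytes, d_packets, self.lastuse

    def query_lastuse(self):
        """Return only lastuse and leave the snapshot untouched."""
        return self.lastuse


def user_visible_bytes(neigh_uses_delta_query):
    fc = FlowCounter()
    fc.cur_bytes, fc.cur_packets, fc.lastuse = 1000, 10, 5
    # The periodic neighbour updater runs between two user queries.
    if neigh_uses_delta_query:
        fc.query_cached()      # old behaviour: consumes the delta
    else:
        fc.query_lastuse()     # fixed behaviour: read-only
    fc.cur_bytes, fc.cur_packets, fc.lastuse = 1500, 15, 9
    d_bytes, _, _ = fc.query_cached()   # the user's query
    return d_bytes


print(user_visible_bytes(True), user_visible_bytes(False))
```

With the delta query in the updater, the user sees only the bytes counted after the updater ran; with the read-only lastuse query, the user sees the full count.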
      
      Fixes: f6dfb4c3 ("net/mlx5e: Update neighbour 'used' state using HW flow rules counters")
      Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Fix modify_cq_in alignment · cd84a107
      Edward Srouji committed
      [ Upstream commit 7a32f2962c56d9d8a836b4469855caeee8766bd4 ]
      
      Fix modify_cq_in alignment to match the device specification.
      After this fix, the 'cq_umem_valid' field will be at the right offset.
      
      Cc: <stable@vger.kernel.org> # 4.19
      Fixes: bd37197554eb ("net/mlx5: Update mlx5_ifc with DEVX UID bits")
      Signed-off-by: Edward Srouji <edwards@mellanox.com>
      Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drivers/base: Introduce kill_device() · c23106d4
      Dan Williams committed
      commit 00289cd87676e14913d2d8492d1ce05c4baafdae upstream.
      
      The libnvdimm subsystem arranges for devices to be destroyed as a result
      of a sysfs operation. Since device_unregister() cannot be called from
      an actively running sysfs attribute of the same device, libnvdimm
      arranges for device_unregister() to be performed in an out-of-line async
      context.
      
      The driver core maintains a 'dead' state for coordinating its own racing
      async registration / de-registration requests. Rather than add local
      'dead' state tracking infrastructure to libnvdimm device objects, export
      the existing state tracking via a new kill_device() helper.
      
      The kill_device() helper simply marks the device as dead, i.e. that it
      is on its way to device_del(), or returns that the device was already
      dead. This can be used in advance of calling device_unregister() for
      subsystems like libnvdimm that might need to handle multiple user
      threads racing to delete a device.
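
A minimal user-space sketch of that contract follows. It is a model, not the driver-core code: `Device`, `try_delete()` and `race()` are hypothetical, and the model assumes callers hold the device lock around `kill_device()`.

```python
import threading


class Device:
    def __init__(self):
        self.lock = threading.Lock()
        self.dead = False


def kill_device(dev):
    """Mark dev dead; return True only for the caller that did the marking.

    The caller is assumed to hold dev.lock.
    """
    if dev.dead:
        return False
    dev.dead = True
    return True


def try_delete(dev, unregistered):
    with dev.lock:
        won = kill_device(dev)
    if won:
        # Only the winner would go on to unregister the device.
        unregistered.append(1)


def race(n_threads=8):
    dev = Device()
    unregistered = []
    threads = [threading.Thread(target=try_delete, args=(dev, unregistered))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(unregistered)


print(race())
```

However many threads race, exactly one of them sees True and proceeds to the unregister step.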
      
      This refactoring does not change any behavior, but it is a pre-requisite
      for follow-on fixes and therefore marked for -stable.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Fixes: 4d88a97a ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
      Cc: <stable@vger.kernel.org>
      Tested-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/156341207332.292348.14959761496009347574.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 07 Aug, 2019 3 commits
  3. 04 Aug, 2019 5 commits
    • block, scsi: Change the preempt-only flag into a counter · c58a6507
      Bart Van Assche committed
      commit cd84a62e0078dce09f4ed349bec84f86c9d54b30 upstream.
      
      The RQF_PREEMPT flag is used for three purposes:
      - In the SCSI core, for making sure that power management requests
        are executed even if a device is in the "quiesced" state.
      - For domain validation by SCSI drivers that use the parallel port.
      - In the IDE driver, for IDE preempt requests.
      Rename "preempt-only" into "pm-only" because the primary purpose of
      this mode is power management. Since the power management core may
      but does not have to resume a runtime suspended device before
      performing system-wide suspend and since a later patch will set
      "pm-only" mode as long as a block device is runtime suspended, make
      it possible to set "pm-only" mode from more than one context. Since
      with this change scsi_device_quiesce() is no longer idempotent, make
      that function return early if it is called for a quiesced queue.
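
Why a counter needs the early return can be sketched as follows. This is a single-threaded model with illustrative class and method names; the kernel uses an atomic counter and its own helpers.

```python
class Queue:
    def __init__(self):
        self.pm_only = 0  # a counter rather than a boolean flag

    def set_pm_only(self):
        self.pm_only += 1

    def clear_pm_only(self):
        if self.pm_only == 0:
            raise ValueError("unbalanced clear_pm_only()")
        self.pm_only -= 1

    def admits(self, is_pm_request):
        """Ordinary requests are admitted only when no context holds pm-only."""
        return is_pm_request or self.pm_only == 0


class ScsiDevice:
    def __init__(self):
        self.queue = Queue()
        self.quiesced = False

    def quiesce(self):
        if self.quiesced:
            return  # early return: a second call must not bump the counter
        self.queue.set_pm_only()
        self.quiesced = True

    def resume(self):
        if self.quiesced:
            self.quiesced = False
            self.queue.clear_pm_only()


dev = ScsiDevice()
dev.quiesce()
dev.quiesce()                 # no effect thanks to the early return
dev.queue.set_pm_only()       # a second context, e.g. runtime suspend
dev.resume()
print(dev.queue.pm_only, dev.queue.admits(is_pm_request=False))
```

After `resume()` the counter is still 1, so ordinary requests stay blocked until the second context clears its own reference.
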
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/fair: Use RCU accessors consistently for ->numa_group · a5a3915f
      Jann Horn committed
      commit cb361d8cdef69990f6b4504dc1fd9a594d983c97 upstream.
      
      The old code used RCU annotations and accessors inconsistently for
      ->numa_group, which can lead to use-after-frees and NULL dereferences.
      
      Let all accesses to ->numa_group use proper RCU helpers to prevent such
      issues.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 8c8a743c ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
      Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/fair: Don't free p->numa_faults with concurrent readers · 48046e09
      Jann Horn committed
      commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream.
      
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      
      Another way to fix this would be to make ->numa_faults RCU-managed or add
      extra locking, but it seems easier to wipe the NUMA fault statistics on
      execve.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/iova: Fix compilation error with !CONFIG_IOMMU_IOVA · 3a0c22cb
      Joerg Roedel committed
      commit 201c1db90cd643282185a00770f12f95da330eca upstream.
      
      The stub function for !CONFIG_IOMMU_IOVA needs to be
      'static inline'.
      
      Fixes: effa467870c76 ('iommu/vt-d: Don't queue_iova() if there is no flush queue')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/vt-d: Don't queue_iova() if there is no flush queue · 4fd0eb60
      Dmitry Safonov committed
      commit effa467870c7612012885df4e246bdb8ffd8e44c upstream.
      
      The Intel VT-d driver was reworked to use the common deferred flushing
      implementation. Previously there was one global per-cpu flush queue;
      afterwards there is one per domain.
      
      Before deferring a flush, the queue should be allocated and initialized.
      
      Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
      queue. It is probably worth initializing it for static or unmanaged
      domains too, but that is arguable, so I'm leaving it to the iommu folks.
      
      Prevent queuing an iova flush if the domain doesn't have a queue.
      The defensive check seems worth keeping even if the queue were
      initialized for all kinds of domains, and it is easy to backport.
      
      On 4.19.43 stable kernel it has a user-visible effect: previously for
      devices in si domain there were crashes, on sata devices:
      
       BUG: spinlock bad magic on CPU#6, swapper/0/1
        lock: 0xffff88844f582008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
       CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1
       Call Trace:
        <IRQ>
        dump_stack+0x61/0x7e
        spin_bug+0x9d/0xa3
        do_raw_spin_lock+0x22/0x8e
        _raw_spin_lock_irqsave+0x32/0x3a
        queue_iova+0x45/0x115
        intel_unmap+0x107/0x113
        intel_unmap_sg+0x6b/0x76
        __ata_qc_complete+0x7f/0x103
        ata_qc_complete+0x9b/0x26a
        ata_qc_complete_multiple+0xd0/0xe3
        ahci_handle_port_interrupt+0x3ee/0x48a
        ahci_handle_port_intr+0x73/0xa9
        ahci_single_level_irq_intr+0x40/0x60
        __handle_irq_event_percpu+0x7f/0x19a
        handle_irq_event_percpu+0x32/0x72
        handle_irq_event+0x38/0x56
        handle_edge_irq+0x102/0x121
        handle_irq+0x147/0x15c
        do_IRQ+0x66/0xf2
        common_interrupt+0xf/0xf
       RIP: 0010:__do_softirq+0x8c/0x2df
      
      The same for usb devices that use ehci-pci:
       BUG: spinlock bad magic on CPU#0, swapper/0/1
        lock: 0xffff88844f402008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4
       Call Trace:
        <IRQ>
        dump_stack+0x61/0x7e
        spin_bug+0x9d/0xa3
        do_raw_spin_lock+0x22/0x8e
        _raw_spin_lock_irqsave+0x32/0x3a
        queue_iova+0x77/0x145
        intel_unmap+0x107/0x113
        intel_unmap_page+0xe/0x10
        usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d
        usb_hcd_unmap_urb_for_dma+0x17/0x100
        unmap_urb_for_dma+0x22/0x24
        __usb_hcd_giveback_urb+0x51/0xc3
        usb_giveback_urb_bh+0x97/0xde
        tasklet_action_common.isra.4+0x5f/0xa1
        tasklet_action+0x2d/0x30
        __do_softirq+0x138/0x2df
        irq_exit+0x7d/0x8b
        smp_apic_timer_interrupt+0x10f/0x151
        apic_timer_interrupt+0xf/0x20
        </IRQ>
       RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39
      
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Lu Baolu <baolu.lu@linux.intel.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: <stable@vger.kernel.org> # 4.14+
      Fixes: 13cf0174 ("iommu/vt-d: Make use of iova deferred flushing")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      [v4.14-port notes:
      o minor conflict with untrusted IOMMU devices check under if-condition]
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 31 Jul, 2019 2 commits
    • access: avoid the RCU grace period for the temporary subjective credentials · 408af823
      Linus Torvalds committed
      commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.
      
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves an RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of secondary damage: instead of having a
      nice alloc/free pattern that hits in hot per-CPU slab caches, you have
      all those delayed frees, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gpu: host1x: Increase maximum DMA segment size · 4d14323a
      Thierry Reding committed
      [ Upstream commit 1e390478cfb527e34c9ab89ba57212cb05c33c51 ]
      
      Recent versions of the DMA API debug code have started to warn about
      violations of the maximum DMA segment size. This is because the segment
      size defaults to 64 KiB, which can easily be exceeded in large buffer
      allocations such as used in DRM/KMS for framebuffers.
      
      Technically the Tegra SMMU and ARM SMMU don't have a maximum segment
      size (they map individual pages irrespective of whether they are
      contiguous or not), so the choice of 4 MiB is a bit arbitrary here. The
      maximum segment size is a 32-bit unsigned integer, though, so we can't
      set it to the correct maximum size, which would be the size of the
      aperture.
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 28 Jul, 2019 3 commits
    • jbd2: introduce jbd2_inode dirty range scoping · af3812b6
      Ross Zwisler committed
      commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
      
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is to just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jbd2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
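
The scoping idea can be sketched as a per-transaction byte range that only grows to cover what was dirtied in that transaction. The class below is a hypothetical model with inclusive byte offsets, not the jbd2 data structure.

```python
class JInode:
    """Tracks the dirty byte range of one inode within one transaction."""

    def __init__(self):
        self.dirty_range = None  # (start, end) inclusive, or None if clean

    def add_dirty(self, start, end):
        """Widen the tracked range so that it covers [start, end]."""
        if self.dirty_range is None:
            self.dirty_range = (start, end)
        else:
            cur_start, cur_end = self.dirty_range
            self.dirty_range = (min(cur_start, start), max(cur_end, end))

    def commit(self):
        """Return the range to write back and wait on, then start afresh."""
        committed, self.dirty_range = self.dirty_range, None
        return committed


inode = JInode()
inode.add_dirty(4096, 8191)
inode.add_dirty(0, 511)
first = inode.commit()           # covers only what this transaction dirtied
inode.add_dirty(8192, 12287)     # a later append lands in the next transaction
print(first, inode.commit())
```

Because the wait is bounded by the committed range, pages appended afterwards no longer extend it.
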
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: add filemap_fdatawait_range_keep_errors() · 4becd6c1
      Ross Zwisler committed
      commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.
      
      In the spirit of filemap_fdatawait_range() and
      filemap_fdatawait_keep_errors(), introduce
      filemap_fdatawait_range_keep_errors() which both takes a range upon
      which to wait and does not clear errors from the address space.
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • perf/core: Fix exclusive events' grouping · 75100ec5
      Alexander Shishkin committed
      commit 8a58ddae23796c733c5dfbd717538d89d036c5bd upstream.
      
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 26 Jul, 2019 2 commits
    • clocksource/drivers/exynos_mct: Increase priority over ARM arch timer · e69fac59
      Marek Szyprowski committed
      [ Upstream commit 6282edb72bed5324352522d732080d4c1b9dfed6 ]
      
      Exynos SoCs based on CA7/CA15 have 2 timer interfaces: custom Exynos MCT
      (Multi Core Timer) and standard ARM Architected Timers.
      
      There are use cases where both timer interfaces are used simultaneously.
      One such example is using Exynos MCT for the main system timer and
      ARM Architected Timers for KVM and virtualized guests (KVM requires
      arch timers).
      
      The Exynos Multi-Core Timer driver (exynos_mct) must, however, be started
      before ARM Architected Timers (arch_timer), because they both share some
      common hardware blocks (global system counter) and turning on MCT is
      needed to get the ARM Architected Timer working properly.
      
      To ensure that Exynos MCT is selected as the main system timer, increase
      the MCT timer rating. To ensure the proper starting order of both timers
      during a suspend/resume cycle, increase the MCT hotplug priority over ARM
      Architected Timers.
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
      Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rcu: Force inlining of rcu_read_lock() · 366ae49e
      Waiman Long committed
      [ Upstream commit 6da9f775175e516fc7229ceaa9b54f8f56aa7924 ]
      
      When debugging options are turned on, the rcu_read_lock() function
      might not be inlined. This results in lockdep's print_lock() function
      printing "rcu_read_lock+0x0/0x70" instead of rcu_read_lock()'s caller.
      For example:
      
      [   10.579995] =============================
      [   10.584033] WARNING: suspicious RCU usage
      [   10.588074] 4.18.0.memcg_v2+ #1 Not tainted
      [   10.593162] -----------------------------
      [   10.597203] include/linux/rcupdate.h:281 Illegal context switch in
      RCU read-side critical section!
      [   10.606220]
      [   10.606220] other info that might help us debug this:
      [   10.606220]
      [   10.614280]
      [   10.614280] rcu_scheduler_active = 2, debug_locks = 1
      [   10.620853] 3 locks held by systemd/1:
      [   10.624632]  #0: (____ptrval____) (&type->i_mutex_dir_key#5){.+.+}, at: lookup_slow+0x42/0x70
      [   10.633232]  #1: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
      [   10.640954]  #2: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
      
      These "rcu_read_lock+0x0/0x70" strings are not providing any useful
      information.  This commit therefore forces inlining of the rcu_read_lock()
      function so that rcu_read_lock()'s caller is instead shown.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  7. 21 Jul, 2019 2 commits
  8. 14 Jul, 2019 1 commit
  9. 10 Jul, 2019 1 commit
    • bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K · 54e8cf41
      Daniel Borkmann committed
      [ Upstream commit fdadd04931c2d7cd294dc5b2b342863f94be53a3 ]
      
      Michael and Sandipan report:
      
        Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
        JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
        and is adjusted again at init time if MODULES_VADDR is defined.
      
        For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
        the compile-time default at boot-time, which is 0x9c400000 when
        using 64K page size. This overflows the signed 32-bit bpf_jit_limit
        value:
      
        root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
        -1673527296
      
        and can cause various unexpected failures throughout the network
        stack. In one case `strace dhclient eth0` reported:
      
        setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
                   16) = -1 ENOTSUPP (Unknown error 524)
      
        and similar failures can be seen with tools like tcpdump. This doesn't
        always reproduce however, and I'm not sure why. The more consistent
        failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
        host would time out on systemd/netplan configuring a virtio-net NIC
        with no noticeable errors in the logs.
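
The arithmetic in the report can be checked with a few lines of Python. This is only a numeric illustration; `default_limit()` and `as_s32()` are hypothetical helpers that mimic the PAGE_SIZE * 40000 default and storing it in a signed 32-bit integer.

```python
import struct


def default_limit(page_size):
    """Compile-time default described in the report: PAGE_SIZE * 40000."""
    return page_size * 40000


def as_s32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


for page_size in (4096, 65536):
    limit = default_limit(page_size)
    print(page_size, hex(limit), as_s32(limit))
```

With 4K pages the default stays well below 2^31, while with 64K pages it is 0x9c400000 and wraps to the negative value shown in the sysctl output above.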
      
      Given this and also given that in near future some architectures like
      arm64 will have a custom area for BPF JIT image allocations we should
      get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
      4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
      so therefore add another overridable bpf_jit_alloc_exec_limit() helper
      function which returns the possible size of the memory area for deriving
      the default heuristic in bpf_jit_charge_init().
      
      Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
      bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
      JIT memory provider, and therefore in case archs implement their custom
      module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
      vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
      
      Additionally, for archs supporting large page sizes, we should change
      the sysctl to be handled as long to not run into sysctl restrictions
      in future.
      
      Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
      Reported-by: Sandipan Das <sandipan@linux.ibm.com>
      Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  10. 03 Jul, 2019 2 commits
    • bpf: fix unconnected udp hooks · 613bc37f
      Daniel Borkmann committed
      commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparently to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
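The reverse translation described above can be modelled in user space. This is only a sketch under stated assumptions, not the BPF program from the patch: the table, `struct endpoint`, `rev_nat_record()` and `rev_nat_translate()` are hypothetical names, and a fixed-size array with round-robin eviction stands in for the LRU map.

```c
#include <stddef.h>
#include <stdint.h>

#define REV_NAT_SLOTS 64 /* hypothetical table size */

struct endpoint {
    uint32_t addr; /* IPv4 address kept as a plain integer for simplicity */
    uint16_t port;
};

struct rev_nat_entry {
    int in_use;
    uint64_t cookie;         /* socket cookie of the sending socket */
    struct endpoint backend; /* where the datagram was really sent */
    struct endpoint service; /* what the application asked for */
};

static struct rev_nat_entry rev_nat_table[REV_NAT_SLOTS];
static size_t rev_nat_next; /* round-robin eviction stands in for LRU */

static int endpoint_equal(struct endpoint a, struct endpoint b)
{
    return a.addr == b.addr && a.port == b.port;
}

/* sendmsg side: remember which service address was rewritten to which backend. */
static void rev_nat_record(uint64_t cookie, struct endpoint service,
                           struct endpoint backend)
{
    struct rev_nat_entry *slot = NULL;

    /* Reuse an existing entry for the same socket and backend, if any. */
    for (size_t i = 0; i < REV_NAT_SLOTS; i++) {
        struct rev_nat_entry *e = &rev_nat_table[i];
        if (e->in_use && e->cookie == cookie &&
            endpoint_equal(e->backend, backend)) {
            slot = e;
            break;
        }
    }
    if (!slot) {
        slot = &rev_nat_table[rev_nat_next];
        rev_nat_next = (rev_nat_next + 1) % REV_NAT_SLOTS;
    }
    slot->in_use = 1;
    slot->cookie = cookie;
    slot->backend = backend;
    slot->service = service;
}

/*
 * recvmsg side: if the source address is a backend recorded for this
 * socket, replace it with the original service address.
 * Returns 1 if the address was rewritten, 0 otherwise.
 */
static int rev_nat_translate(uint64_t cookie, struct endpoint *src)
{
    for (size_t i = 0; i < REV_NAT_SLOTS; i++) {
        const struct rev_nat_entry *e = &rev_nat_table[i];
        if (e->in_use && e->cookie == cookie &&
            endpoint_equal(e->backend, *src)) {
            *src = e->service;
            return 1;
        }
    }
    return 0;
}
```

With such a table, a reply arriving from 8.8.8.8:53 on a socket that originally targeted 147.75.207.207:53 would be reported to the application as coming from the service address, which is what the resolver's msg_name check expects.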
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      613bc37f
    • T
      SUNRPC: Clean up initialisation of the struct rpc_rqst · dd9f2fb5
      Trond Myklebust committed
      commit 9dc6edcf676fe188430e8b119f91280bbf285163 upstream.
      
      Move the initialisation back into xprt.c.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Yihao Wu <wuyihao@linux.alibaba.com>
      Cc: Caspar Zhang <caspar@linux.alibaba.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd9f2fb5
  11. 25 Jun, 2019 2 commits
    • D
      mmc: core: Add sdio_retune_hold_now() and sdio_retune_release() · 0349dbeb
      Douglas Anderson committed
      commit b4c9f938d542d5f88c501744d2d12fad4fd2915f upstream.
      
      We want SDIO drivers to be able to temporarily stop retuning when the
      driver knows that the SDIO card is not in a state where retuning will
      work (maybe because the card is asleep).  We'll move the relevant
      functions to a place where drivers can call them.
      
      Cc: stable@vger.kernel.org #v4.18+
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Acked-by: Adrian Hunter <adrian.hunter@intel.com>
      Acked-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0349dbeb
    • D
      mmc: core: API to temporarily disable retuning for SDIO CRC errors · 7ed49e1b
      Douglas Anderson committed
      commit 0a55f4ab9678413a01e740c86e9367ba0c612b36 upstream.
      
      Normally when the MMC core sees an "-EILSEQ" error returned by a host
      controller then it will trigger a retuning of the card.  This is
      generally a good idea.
      
      However, if a command is expected to sometimes cause transfer errors
      then these transfer errors shouldn't cause a re-tuning.  This
      re-tuning will be a needless waste of time.  One example case where a
      transfer is expected to cause errors is when transitioning between
      idle (sometimes referred to as "sleep" in Broadcom code) and active
      state on certain Broadcom WiFi SDIO cards.  Specifically if the card
      was already transitioning between states when the command was sent it
      could cause an error on the SDIO bus.
      
      Let's add an API that the SDIO function drivers can call that will
      temporarily disable the auto-tuning functionality.  Then we can add a
      call to this in the Broadcom WiFi driver and any other driver that
      might have similar needs.
      
      NOTE: this makes the assumption that the card is already tuned well
      enough that it's OK to disable the auto-retuning during one of these
      error-prone situations.  Presumably the driver code performing the
      error-prone transfer knows how to recover / retry from errors.  ...and
      after we can get back to a state where transfers are no longer
      error-prone then we can enable the auto-retuning again.  If we truly
      find ourselves in a case where the card needs to be retuned sometimes
      to handle one of these error-prone transfers then we can always try a
      few transfers first without auto-retuning and then re-try with
      auto-retuning if the first few fail.
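The behaviour described above can be illustrated with a small user-space model. It is a sketch only: `struct model_host` and the `model_*` functions are hypothetical and do not reproduce the MMC core's data structures or the names of the API added by this patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the host state the MMC core keeps. */
struct model_host {
    bool crc_retune_disabled; /* set while error-prone transfers are expected */
    int retune_requests;      /* how often a retune was triggered */
};

/* Called by a function driver before an error-prone transfer. */
static void model_retune_crc_disable(struct model_host *host)
{
    host->crc_retune_disabled = true;
}

/* Called once transfers are expected to be reliable again. */
static void model_retune_crc_enable(struct model_host *host)
{
    host->crc_retune_disabled = false;
}

/*
 * Completion path: a CRC-style error (-EILSEQ) normally requests a
 * retune, unless the driver said such errors are currently expected.
 */
static void model_request_done(struct model_host *host, int err)
{
    if (err == -EILSEQ && !host->crc_retune_disabled)
        host->retune_requests++;
}
```

A driver would bracket only the transfers that are known to fail occasionally, so that genuine CRC errors outside that window still lead to a retune.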
      
      Without this change on rk3288-veyron-minnie I periodically see this in
      the logs of a machine just sitting there idle:
        dwmmc_rockchip ff0d0000.dwmmc: Successfully tuned phase to XYZ
      
      Cc: stable@vger.kernel.org #v4.18+
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Acked-by: Adrian Hunter <adrian.hunter@intel.com>
      Acked-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ed49e1b
  12. 22 Jun, 2019 1 commit
    • A
      coredump: fix race condition between collapse_huge_page() and core dumping · 465ce9a5
      Andrea Arcangeli committed
      commit 59ea6d06cfa9247b586a695c21f94afa7183af74 upstream.
      
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      465ce9a5
  13. 19 Jun, 2019 2 commits
    • B
      x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback · ecec31ce
      Borislav Petkov committed
      commit 78f4e932f7760d965fb1569025d1576ab77557c5 upstream.
      
      Adric Blake reported the following warning during suspend-resume:
      
        Enabling non-boot CPUs ...
        x86: Booting SMP configuration:
        smpboot: Booting Node 0 Processor 1 APIC 0x2
        unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
         at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
        Call Trace:
         intel_set_tfa
         intel_pmu_cpu_starting
         ? x86_pmu_dead_cpu
         x86_pmu_starting_cpu
         cpuhp_invoke_callback
         ? _raw_spin_lock_irqsave
         notify_cpu_starting
         start_secondary
         secondary_startup_64
        microcode: sig=0x806ea, pf=0x80, revision=0x96
        microcode: updated to revision 0xb4, date = 2019-04-01
        CPU1 is up
      
      The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
      by microcode. The log above shows that the microcode loader callback
      happens after the PMU restoration, leading to the conjecture that
      because the microcode hasn't been updated yet, that MSR is not present
      yet, leading to the #GP.
      
      Add a microcode loader-specific hotplug vector which comes before
      the PERF vectors and thus executes earlier and makes sure the MSR is
      present.
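The ordering argument can be sketched as a tiny model in which hotplug callbacks run in enum order. All names here (`MODEL_CPUHP_*`, `struct model_cpu`, the callbacks) are hypothetical and only illustrate why placing the microcode step before the perf step avoids the faulting MSR write; they are not the kernel's actual state names.

```c
#include <stddef.h>

/* Hypothetical hotplug states; callbacks are invoked in ascending order. */
enum model_cpuhp_state {
    MODEL_CPUHP_MICROCODE_LOADER, /* new vector, placed before perf */
    MODEL_CPUHP_PERF_STARTING,
    MODEL_CPUHP_STATE_COUNT
};

struct model_cpu {
    int microcode_loaded;  /* the emulated MSR exists only after the update */
    int msr_write_faulted; /* set if perf touched the MSR too early */
};

typedef void (*model_cpuhp_fn)(struct model_cpu *cpu);

static void model_microcode_starting(struct model_cpu *cpu)
{
    cpu->microcode_loaded = 1;
}

static void model_perf_starting(struct model_cpu *cpu)
{
    /* Writing the microcode-emulated MSR before the update would fault. */
    if (!cpu->microcode_loaded)
        cpu->msr_write_faulted = 1;
}

static const model_cpuhp_fn model_callbacks[MODEL_CPUHP_STATE_COUNT] = {
    [MODEL_CPUHP_MICROCODE_LOADER] = model_microcode_starting,
    [MODEL_CPUHP_PERF_STARTING]    = model_perf_starting,
};

/* Bring a CPU up by running every callback in state order. */
static void model_bring_up(struct model_cpu *cpu)
{
    for (size_t s = 0; s < MODEL_CPUHP_STATE_COUNT; s++)
        model_callbacks[s](cpu);
}
```

Swapping the two enum entries in this model would make `model_perf_starting()` run first and record the fault, which mirrors the reported warning.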
      
      Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
      Reported-by: Adric Blake <promarbler14@gmail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: x86@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecec31ce
    • T
      cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() · c3b85bda
      Tejun Heo committed
      commit 18fa84a2db0e15b02baa5d94bdb5bd509175d2f6 upstream.
      
      A PF_EXITING task can stay associated with an offline css.  If such
      task calls task_get_css(), it can get stuck indefinitely.  This can be
      triggered by BSD process accounting which writes to a file with
      PF_EXITING set when racing against memcg disable as in the backtrace
      at the end.
      
      After this change, task_get_css() may return a css which was already
      offline when the function was called.  None of the existing users are
      affected by this change.
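The difference between the two tryget variants can be shown with a small model. It assumes that task_get_css() retries until the tryget succeeds, which is what the "stuck indefinitely" symptom suggests; the `model_*` names and the bounded retry count are hypothetical and exist only so the sketch terminates.

```c
#include <stdbool.h>

struct css_model {
    int refcnt;  /* > 0 while the css can still be pinned */
    bool online; /* cleared once the css has gone offline */
};

/* Models css_tryget(): succeeds while the reference count is non-zero. */
static bool model_tryget(struct css_model *css)
{
    if (css->refcnt <= 0)
        return false;
    css->refcnt++;
    return true;
}

/* Models css_tryget_online(): additionally refuses an offline css. */
static bool model_tryget_online(struct css_model *css)
{
    if (!css->online)
        return false;
    return model_tryget(css);
}

/*
 * Retry loop in the style of task_get_css(), bounded by max_attempts.
 * Returns the number of attempts used, or -1 if it never succeeded
 * (the real loop would keep spinning instead).
 */
static int model_task_get_css(struct css_model *css, bool require_online,
                              int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        bool ok = require_online ? model_tryget_online(css)
                                 : model_tryget(css);
        if (ok)
            return attempt;
    }
    return -1;
}
```

For a css that is offline but still referenced, the online variant never succeeds in this model, while the plain variant pins it on the first attempt.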
      
        INFO: rcu_sched self-detected stall on CPU
        INFO: rcu_sched detected stalls on CPUs/tasks:
        ...
        NMI backtrace for cpu 0
        ...
        Call Trace:
         <IRQ>
         dump_stack+0x46/0x68
         nmi_cpu_backtrace.cold.2+0x13/0x57
         nmi_trigger_cpumask_backtrace+0xba/0xca
         rcu_dump_cpu_stacks+0x9e/0xce
         rcu_check_callbacks.cold.74+0x2af/0x433
         update_process_times+0x28/0x60
         tick_sched_timer+0x34/0x70
         __hrtimer_run_queues+0xee/0x250
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x56/0x110
         apic_timer_interrupt+0xf/0x20
         </IRQ>
        RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
        ...
         btrfs_file_write_iter+0x31b/0x563
         __vfs_write+0xfa/0x140
         __kernel_write+0x4f/0x100
         do_acct_process+0x495/0x580
         acct_process+0xb9/0xdb
         do_exit+0x748/0xa00
         do_group_exit+0x3a/0xa0
         get_signal+0x254/0x560
         do_signal+0x23/0x5c0
         exit_to_usermode_loop+0x5d/0xa0
         prepare_exit_to_usermode+0x53/0x80
         retint_user+0x8/0x8
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.2+
      Fixes: ec438699 ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c3b85bda
  14. 18 Jun, 2019 1 commit
    • E
      tcp: limit payload size of sacked skbs · c09be314
      Eric Dumazet committed
      commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.
      
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertised the smallest
      MSS that Linux TCP accepts: 48.
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
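The overflow can be checked with a little arithmetic. The sketch below assumes that, with an MSS of 48 and 40 bytes taken up by TCP options, only 8 bytes of payload remain per segment; that figure is an assumption not stated in the text above, and the `model_*` helpers are hypothetical.

```c
#include <stdint.h>

#define MODEL_MAX_FRAGS 17u /* fragments per skb, from the text above */

/* Number of segments needed to carry a fully loaded skb. */
static uint32_t model_segment_count(uint32_t frag_bytes,
                                    uint32_t payload_per_segment)
{
    uint32_t total = MODEL_MAX_FRAGS * frag_bytes;

    /* Round up: a partial trailing segment still counts. */
    return (total + payload_per_segment - 1) / payload_per_segment;
}

/* What a 16-bit counter such as tcp_gso_segs would end up storing. */
static uint16_t model_store_u16(uint32_t segs)
{
    return (uint16_t)segs;
}
```

Under these assumptions, 17 fragments of 32KB hold 557056 bytes, which at 8 bytes per segment is 69632 segments. That exceeds 65535, so a 16-bit field would wrap to 4096.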
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c09be314
  15. 15 Jun, 2019 1 commit
    • P
      pwm: Fix deadlock warning when removing PWM device · 384642ff
      Phong Hoang committed
      [ Upstream commit 347ab9480313737c0f1aaa08e8f2e1a791235535 ]
      
      This patch fixes a deadlock warning seen when removing a PWM device
      while CONFIG_PROVE_LOCKING is enabled.
      
      This issue can be reproduced by the following steps on
      the R-Car H3 Salvator-X board if the backlight is disabled:
      
       # cd /sys/class/pwm/pwmchip0
       # echo 0 > export
       # ls
       device  export  npwm  power  pwm0  subsystem  uevent  unexport
       # cd device/driver
       # ls
       bind  e6e31000.pwm  uevent  unbind
       # echo e6e31000.pwm > unbind
      
      [   87.659974] ======================================================
      [   87.666149] WARNING: possible circular locking dependency detected
      [   87.672327] 5.0.0 #7 Not tainted
      [   87.675549] ------------------------------------------------------
      [   87.681723] bash/2986 is trying to acquire lock:
      [   87.686337] 000000005ea0e178 (kn->count#58){++++}, at: kernfs_remove_by_name_ns+0x50/0xa0
      [   87.694528]
      [   87.694528] but task is already holding lock:
      [   87.700353] 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c
      [   87.707405]
      [   87.707405] which lock already depends on the new lock.
      [   87.707405]
      [   87.715574]
      [   87.715574] the existing dependency chain (in reverse order) is:
      [   87.723048]
      [   87.723048] -> #1 (pwm_lock){+.+.}:
      [   87.728017]        __mutex_lock+0x70/0x7e4
      [   87.732108]        mutex_lock_nested+0x1c/0x24
      [   87.736547]        pwm_request_from_chip.part.6+0x34/0x74
      [   87.741940]        pwm_request_from_chip+0x20/0x40
      [   87.746725]        export_store+0x6c/0x1f4
      [   87.750820]        dev_attr_store+0x18/0x28
      [   87.754998]        sysfs_kf_write+0x54/0x64
      [   87.759175]        kernfs_fop_write+0xe4/0x1e8
      [   87.763615]        __vfs_write+0x40/0x184
      [   87.767619]        vfs_write+0xa8/0x19c
      [   87.771448]        ksys_write+0x58/0xbc
      [   87.775278]        __arm64_sys_write+0x18/0x20
      [   87.779721]        el0_svc_common+0xd0/0x124
      [   87.783986]        el0_svc_compat_handler+0x1c/0x24
      [   87.788858]        el0_svc_compat+0x8/0x18
      [   87.792947]
      [   87.792947] -> #0 (kn->count#58){++++}:
      [   87.798260]        lock_acquire+0xc4/0x22c
      [   87.802353]        __kernfs_remove+0x258/0x2c4
      [   87.806790]        kernfs_remove_by_name_ns+0x50/0xa0
      [   87.811836]        remove_files.isra.1+0x38/0x78
      [   87.816447]        sysfs_remove_group+0x48/0x98
      [   87.820971]        sysfs_remove_groups+0x34/0x4c
      [   87.825583]        device_remove_attrs+0x6c/0x7c
      [   87.830197]        device_del+0x11c/0x33c
      [   87.834201]        device_unregister+0x14/0x2c
      [   87.838638]        pwmchip_sysfs_unexport+0x40/0x4c
      [   87.843509]        pwmchip_remove+0xf4/0x13c
      [   87.847773]        rcar_pwm_remove+0x28/0x34
      [   87.852039]        platform_drv_remove+0x24/0x64
      [   87.856651]        device_release_driver_internal+0x18c/0x21c
      [   87.862391]        device_release_driver+0x14/0x1c
      [   87.867175]        unbind_store+0xe0/0x124
      [   87.871265]        drv_attr_store+0x20/0x30
      [   87.875442]        sysfs_kf_write+0x54/0x64
      [   87.879618]        kernfs_fop_write+0xe4/0x1e8
      [   87.884055]        __vfs_write+0x40/0x184
      [   87.888057]        vfs_write+0xa8/0x19c
      [   87.891887]        ksys_write+0x58/0xbc
      [   87.895716]        __arm64_sys_write+0x18/0x20
      [   87.900154]        el0_svc_common+0xd0/0x124
      [   87.904417]        el0_svc_compat_handler+0x1c/0x24
      [   87.909289]        el0_svc_compat+0x8/0x18
      [   87.913378]
      [   87.913378] other info that might help us debug this:
      [   87.913378]
      [   87.921374]  Possible unsafe locking scenario:
      [   87.921374]
      [   87.927286]        CPU0                    CPU1
      [   87.931808]        ----                    ----
      [   87.936331]   lock(pwm_lock);
      [   87.939293]                                lock(kn->count#58);
      [   87.945120]                                lock(pwm_lock);
      [   87.950599]   lock(kn->count#58);
      [   87.953908]
      [   87.953908]  *** DEADLOCK ***
      [   87.953908]
      [   87.959821] 4 locks held by bash/2986:
      [   87.963563]  #0: 00000000ace7bc30 (sb_writers#6){.+.+}, at: vfs_write+0x188/0x19c
      [   87.971044]  #1: 00000000287991b2 (&of->mutex){+.+.}, at: kernfs_fop_write+0xb4/0x1e8
      [   87.978872]  #2: 00000000f739d016 (&dev->mutex){....}, at: device_release_driver_internal+0x40/0x21c
      [   87.988001]  #3: 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c
      [   87.995481]
      [   87.995481] stack backtrace:
      [   87.999836] CPU: 0 PID: 2986 Comm: bash Not tainted 5.0.0 #7
      [   88.005489] Hardware name: Renesas Salvator-X board based on r8a7795 ES1.x (DT)
      [   88.012791] Call trace:
      [   88.015235]  dump_backtrace+0x0/0x190
      [   88.018891]  show_stack+0x14/0x1c
      [   88.022204]  dump_stack+0xb0/0xec
      [   88.025514]  print_circular_bug.isra.32+0x1d0/0x2e0
      [   88.030385]  __lock_acquire+0x1318/0x1864
      [   88.034388]  lock_acquire+0xc4/0x22c
      [   88.037958]  __kernfs_remove+0x258/0x2c4
      [   88.041874]  kernfs_remove_by_name_ns+0x50/0xa0
      [   88.046398]  remove_files.isra.1+0x38/0x78
      [   88.050487]  sysfs_remove_group+0x48/0x98
      [   88.054490]  sysfs_remove_groups+0x34/0x4c
      [   88.058580]  device_remove_attrs+0x6c/0x7c
      [   88.062671]  device_del+0x11c/0x33c
      [   88.066154]  device_unregister+0x14/0x2c
      [   88.070070]  pwmchip_sysfs_unexport+0x40/0x4c
      [   88.074421]  pwmchip_remove+0xf4/0x13c
      [   88.078163]  rcar_pwm_remove+0x28/0x34
      [   88.081906]  platform_drv_remove+0x24/0x64
      [   88.085996]  device_release_driver_internal+0x18c/0x21c
      [   88.091215]  device_release_driver+0x14/0x1c
      [   88.095478]  unbind_store+0xe0/0x124
      [   88.099048]  drv_attr_store+0x20/0x30
      [   88.102704]  sysfs_kf_write+0x54/0x64
      [   88.106359]  kernfs_fop_write+0xe4/0x1e8
      [   88.110275]  __vfs_write+0x40/0x184
      [   88.113757]  vfs_write+0xa8/0x19c
      [   88.117065]  ksys_write+0x58/0xbc
      [   88.120374]  __arm64_sys_write+0x18/0x20
      [   88.124291]  el0_svc_common+0xd0/0x124
      [   88.128034]  el0_svc_compat_handler+0x1c/0x24
      [   88.132384]  el0_svc_compat+0x8/0x18
      
      The sysfs unexport in pwmchip_remove() is completely asymmetric
      to what we do in pwmchip_add_with_polarity() and commit 0733424c
      ("pwm: Unexport children before chip removal") is a strong indication
      that this was wrong to begin with. We should just move
      pwmchip_sysfs_unexport() where it belongs, which is right after
      pwmchip_sysfs_unexport_children(). In that case, we do not need
      separate functions anymore either.
      
      We also really want to remove sysfs irrespective of whether or not
      the chip will be removed as a result of pwmchip_remove(). We can only
      assume that the driver will be gone after that, so we shouldn't leave
      any dangling sysfs files around.
      
      This warning disappears if we move pwmchip_sysfs_unexport() to
      the top of pwmchip_remove(), next to pwmchip_sysfs_unexport_children().
      That way it is also outside of the pwm_lock section, which indeed
      doesn't seem to be needed.
      
      Moving the pwmchip_sysfs_export() call outside of that section also
      seems fine and it'd be perfectly symmetric with pwmchip_remove() again.
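The reordering can be pictured with a minimal model that records whether the sysfs teardown ever runs while the lock is held. `struct model_chip` and the `model_*` functions are hypothetical; a plain flag stands in for pwm_lock and no real locking is performed.

```c
#include <stdbool.h>

struct model_chip {
    bool lock_held;           /* stands in for pwm_lock being held */
    bool sysfs_exported;      /* sysfs entries still present */
    bool unexport_under_lock; /* set if sysfs was torn down under the lock */
};

/* Removing sysfs files takes kernfs locks, so note if we hold ours too. */
static void model_sysfs_unexport(struct model_chip *chip)
{
    if (chip->lock_held)
        chip->unexport_under_lock = true;
    chip->sysfs_exported = false;
}

/* Ordering after the fix: unexport first, then take the lock. */
static void model_pwmchip_remove(struct model_chip *chip)
{
    model_sysfs_unexport(chip);

    chip->lock_held = true;
    /* ... chip teardown that really needs the lock would go here ... */
    chip->lock_held = false;
}
```

Because the unexport happens before the lock is taken, the two locks are never held at once on this path, so the circular dependency lockdep reported cannot form there.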
      
      So, this patch fixes them.
      Signed-off-by: Phong Hoang <phong.hoang.wz@renesas.com>
      [shimoda: revise the commit log and code]
      Fixes: 76abbdde ("pwm: Add sysfs interface")
      Fixes: 0733424c ("pwm: Unexport children before chip removal")
      Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Tested-by: Hoan Nguyen An <na-hoan@jinso.co.jp>
      Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
      Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      384642ff
  16. 11 Jun, 2019 3 commits
    • J
      x86/power: Fix 'nosmt' vs hibernation triple fault during resume · 4d166206
      Jiri Kosina committed
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command-line,
      all the HT siblings are as a result sitting in mwait or cpuidle after
      going through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
      Symmetrically, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d166206
    • K
      pstore: Convert buf_lock to semaphore · d4128a1b
      Kees Cook committed
      commit ea84b580b95521644429cc6748b6c2bf27c8b0f3 upstream.
      
      Instead of running with interrupts disabled, use a semaphore. This should
      make it easier for backends that may need to sleep (e.g. EFI) when
      performing a write:
      
      |BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
      |in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
      |Preemption disabled at:
      |[<ffffffff99d60512>] pstore_dump+0x72/0x330
      |CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G      D           4.20.0-rc3 #45
      |Call Trace:
      | dump_stack+0x4f/0x6a
      | ___might_sleep.cold.91+0xd3/0xe4
      | __might_sleep+0x50/0x90
      | wait_for_completion+0x32/0x130
      | virt_efi_query_variable_info+0x14e/0x160
      | efi_query_variable_store+0x51/0x1a0
      | efivar_entry_set_safe+0xa3/0x1b0
      | efi_pstore_write+0x109/0x140
      | pstore_dump+0x11c/0x330
      | kmsg_dump+0xa4/0xd0
      | oops_exit+0x22/0x30
      ...
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Fixes: 21b3ddd3 ("efi: Don't use spinlocks for efi vars")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4128a1b
    • L
      rcu: locking and unlocking need to always be at least barriers · 6726307d
      Linus Torvalds committed
      commit 66be4e66a7f422128748e3c3ef6ee72b20a6197b upstream.
      
      Herbert Xu pointed out that commit bb73c52b ("rcu: Don't disable
      preemption for Tiny and Tree RCU readers") was incorrect in making the
      preempt_disable/enable() be conditional on CONFIG_PREEMPT_COUNT.
      
      If CONFIG_PREEMPT_COUNT isn't enabled, the preemption enable/disable is
      a no-op, but still is a compiler barrier.
      
      And RCU locking still _needs_ that compiler barrier.
      
      It is simply fundamentally not true that RCU locking would be a complete
      no-op: we still need to guarantee (for example) that things that can
      trap and cause preemption cannot migrate into the RCU locked region.
      
      The way we do that is by making it a barrier.
      
      See for example commit 386afc91 ("spinlocks and preemption points
      need to be at least compiler barriers") from back in 2013 that had
      similar issues with spinlocks that become no-ops on UP: they must still
      constrain the compiler from moving other operations into the critical
      region.
      
      Now, it is true that a lot of RCU operations already use READ_ONCE() and
      WRITE_ONCE() (which in practice likely would never be re-ordered wrt
      anything remotely interesting), but it is also true that that is not
      globally the case, and that it's not even necessarily always possible
      (ie bitfields etc).
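What "still a compiler barrier" means can be shown with a short sketch. It assumes a GCC- or Clang-style compiler; `model_barrier()` and the `model_rcu_*` helpers are hypothetical stand-ins for the non-preemptible case and are not the kernel's definitions.

```c
/*
 * Compiler barrier: emits no instructions, but the "memory" clobber
 * forbids the compiler from moving memory accesses across it.
 */
#define model_barrier() __asm__ __volatile__("" ::: "memory")

/* With preemption counting compiled out, lock/unlock reduce to barriers. */
static inline void model_rcu_read_lock(void)
{
    model_barrier();
}

static inline void model_rcu_read_unlock(void)
{
    model_barrier();
}

static int model_shared_value;

/* The load of model_shared_value must stay between the two barriers. */
static int model_reader(void)
{
    int value;

    model_rcu_read_lock();
    value = model_shared_value;
    model_rcu_read_unlock();

    return value;
}
```

If the two helpers were empty functions instead, the compiler would be free to hoist or sink the load out of the intended critical region, which is the problem the commit message describes.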
      Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
      Fixes: bb73c52b ("rcu: Don't disable preemption for Tiny and Tree RCU readers")
      Cc: stable@kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6726307d
  17. 09 Jun, 2019 5 commits
    • F
      of: overlay: validate overlay properties #address-cells and #size-cells · 15151d00
      Frank Rowand committed
      commit 6f75118800acf77f8ad6afec61ca1b2349ade371 upstream.
      
      If overlay properties #address-cells or #size-cells are already in
      the live devicetree for any given node, then the values in the
      overlay must match the values in the live tree.
      
      If the properties are already in the live tree then there is no
      need to create a changeset entry to add them since they must
      have the same value.  This reduces the memory used by the
      changeset and eliminates a possible memory leak.
      Tested-by: Alan Tull <atull@kernel.org>
      Signed-off-by: Frank Rowand <frank.rowand@sony.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      15151d00
    • M
      include/linux/module.h: copy __init/__exit attrs to init/cleanup_module · 9468870f
      Miguel Ojeda committed
      commit a6e60d84989fa0e91db7f236eda40453b0e44afa upstream.
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers for all the init/cleanup_module
      aliases in the kernel (defined by the module_init/exit macros),
      ending up being very noisy.
      
      These aliases point to the __init/__exit functions of a module,
      which are defined as __cold (among other attributes). However,
      the aliases themselves do not have the __cold attribute.
      
      Since the compiler behaves differently when compiling a __cold
      function as well as when compiling paths leading to calls
      to __cold functions, the warning is trying to point out
      the possibly-forgotten attribute in the alias.
      
      In order to keep the warning enabled, we decided to silence
      this case. Ideally, we would mark the aliases directly
      as __init/__exit. However, there are currently around 132 modules
      in the kernel which are missing __init/__exit in their init/cleanup
      functions (either because they are missing, or for other reasons,
      e.g. the functions being called from somewhere else); and
      a section mismatch is a hard error.
      
      A conservative alternative was to mark the aliases as __cold only.
      However, since we would like to eventually enforce __init/__exit
      to be always marked, we chose to use the new __copy function
      attribute (introduced by GCC 9 as well to deal with this).
      With it, we copy the attributes used by the target functions
      into the aliases. This way, functions that were not marked
      as __init/__exit won't have their aliases marked either,
      and therefore there won't be a section mismatch.
      
      Note that the warning would go away marking either the extern
      declaration, the definition, or both. However, we only mark
      the definition of the alias, since we do not want callers
      (which only see the declaration) to be compiled as if the function
      was __cold (and therefore the paths leading to those calls
      would be assumed to be unlikely).
      
      Link: https://lore.kernel.org/lkml/20190123173707.GA16603@gmail.com/
      Link: https://lore.kernel.org/lkml/20190206175627.GA20399@gmail.com/
      Suggested-by: Martin Sebor <msebor@gcc.gnu.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9468870f
    • M
      Compiler Attributes: add support for __copy (gcc >= 9) · 2a0f719d
      Miguel Ojeda committed
      commit c0d9782f5b6d7157635ae2fd782a4b27d55a6013 upstream.
      
      From the GCC manual:
      
        copy
        copy(function)
      
          The copy attribute applies the set of attributes with which function
          has been declared to the declaration of the function to which
          the attribute is applied. The attribute is designed for libraries
          that define aliases or function resolvers that are expected
          to specify the same set of attributes as their targets. The copy
          attribute can be used with functions, variables, or types. However,
          the kind of symbol to which the attribute is applied (either
          function or variable) must match the kind of symbol to which
          the argument refers. The copy attribute copies only syntactic and
          semantic attributes but not attributes that affect a symbol’s
          linkage or visibility such as alias, visibility, or weak.
          The deprecated attribute is also not copied.
      
        https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target, e.g.:
      
          void __cold f(void) {}
          void __alias("f") g(void);
      
      diagnoses:
      
          warning: 'g' specifies less restrictive attribute than
          its target 'f': 'cold' [-Wmissing-attributes]
      
      Using __copy(f) we can copy the __cold attribute from f to g:
      
          void __cold f(void) {}
          void __copy(f) __alias("f") g(void);
      
      This attribute is most useful to deal with situations where an alias
      is declared but we don't know the exact attributes the target has.
      
      For instance, in the kernel, the widely used module_init/exit macros
      define the init/cleanup_module aliases, but those cannot be marked
      always as __init/__exit since some modules do not have their
      functions marked as such.
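      
      A self-contained sketch of the same idea outside the kernel, assuming
      an ELF target (aliases need one) and GCC or Clang.  The SKETCH_* macros
      are hypothetical stand-ins for the kernel's __copy, __cold and __alias;
      on compilers without the copy attribute SKETCH_COPY expands to nothing,
      so the alias still works but the attributes are not copied.

```c
/* Feature-test the attribute so older compilers still build this. */
#if defined(__has_attribute)
# if __has_attribute(__copy__)
#  define SKETCH_COPY(symbol) __attribute__((__copy__(symbol)))
# endif
#endif
#ifndef SKETCH_COPY
# define SKETCH_COPY(symbol)	/* attribute not available: no-op */
#endif

#define SKETCH_COLD       __attribute__((__cold__))
#define SKETCH_ALIAS(sym) __attribute__((__alias__(sym)))

static int calls;

/* The alias target; it carries the cold attribute. */
SKETCH_COLD void f(void)
{
	calls++;
}

/* g is a second name for f; SKETCH_COPY(f) gives it f's attributes
 * without listing them, so it stays in sync if f's attributes change. */
SKETCH_COPY(f) SKETCH_ALIAS("f") void g(void);

int call_count(void)
{
	return calls;
}
```

      Calling g() runs the body of f(), since both names resolve to the same
      function.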
      Suggested-by: Martin Sebor <msebor@gcc.gnu.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a0f719d
    • J
      memcg: make it work on sparse non-0-node systems · 8b057ad8
      Jiri Slaby committed
      commit 3e8589963773a5c23e2f1fe4bcad0e9a90b7f471 upstream.
      
      We have a single node system with node 0 disabled:
        Scanning NUMA topology in Northbridge 24
        Number of physical nodes 2
        Skipping disabled node 0
        Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
        NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]
      
      This causes crashes in memcg when system boots:
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
      ...
        RIP: 0010:list_lru_add+0x94/0x170
      ...
        Call Trace:
         d_lru_add+0x44/0x50
         dput.part.34+0xfc/0x110
         __fput+0x108/0x230
         task_work_run+0x9f/0xc0
         exit_to_usermode_loop+0xf5/0x100
      
      It is reproducible as far back as 4.12.  I did not try older kernels.
      You need a new enough systemd, e.g. 241 (the reason is unknown, it was
      not investigated).  It cannot be reproduced with systemd 234.
      
      The system crashes because the size of lru array is never updated in
      memcg_update_all_list_lrus and the reads are past the zero-sized array,
      causing dereferences of random memory.
      
      The root cause is the list_lru_memcg_aware checks in the list_lru code.
      The test in list_lru_memcg_aware is broken: it assumes node 0 is always
      present, which is not true on some systems, as can be seen above.
      
      So fix this by avoiding checks on node 0.  Remember the memcg-awareness by
      a bool flag in struct list_lru.
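      
      A simplified userspace model of the two checks, not the kernel code:
      every sketch_* name, the two-node limit and the dummy per-memcg pointer
      are assumptions made for illustration.  It only shows why a test that
      looks at node 0 gives the wrong answer when node 0 was never set up,
      while a flag recorded at init time does not.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 2

struct sketch_lru_node {
	void *memcg_lrus;	/* stays NULL for a node that was never set up */
};

struct sketch_list_lru {
	struct sketch_lru_node node[SKETCH_MAX_NODES];
	bool memcg_aware;	/* recorded once, independent of any node */
};

static int sketch_dummy_lrus;	/* placeholder for per-memcg data */

static void sketch_lru_init(struct sketch_list_lru *lru, bool memcg_aware,
			    const bool present[SKETCH_MAX_NODES])
{
	lru->memcg_aware = memcg_aware;
	for (int i = 0; i < SKETCH_MAX_NODES; i++)
		lru->node[i].memcg_lrus =
			(memcg_aware && present[i]) ? &sketch_dummy_lrus : NULL;
}

/* Node-0 based test: wrong whenever node 0 is disabled. */
static bool aware_via_node0(const struct sketch_list_lru *lru)
{
	return lru->node[0].memcg_lrus != NULL;
}

/* Flag based test: does not depend on which nodes exist. */
static bool aware_via_flag(const struct sketch_list_lru *lru)
{
	return lru->memcg_aware;
}
```

      With node 0 absent and node 1 present, aware_via_node0() reports false
      for a memcg-aware lru, while aware_via_flag() reports true.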
      
      Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b057ad8
    • R
      include/linux/bitops.h: sanitize rotate primitives · 759766bf
      Rasmus Villemoes committed
      commit ef4d6f6b275c498f8e5626c99dbeefdc5027f843 upstream.
      
      The ror32 implementation (word >> shift) | (word << (32 - shift)) has
      undefined behaviour if shift is outside the [1, 31] range.  Similarly
      for the 64 bit variants.  Most callers pass a compile-time constant
      (naturally in that range), but there's an UBSAN report that these may
      actually be called with a shift count of 0.
      
      Instead of special-casing that, we can make them DTRT for all values of
      shift while also avoiding UB.  For some reason, this was already partly
      done for rol32 (which was well-defined for [0, 31]).  gcc 8 recognizes
      these patterns as rotates, so for example
      
        __u32 rol32(__u32 word, unsigned int shift)
        {
      	return (word << (shift & 31)) | (word >> ((-shift) & 31));
        }
      
      compiles to
      
      0000000000000020 <rol32>:
        20:   89 f8                   mov    %edi,%eax
        22:   89 f1                   mov    %esi,%ecx
        24:   d3 c0                   rol    %cl,%eax
        26:   c3                      retq
      
      Older compilers unfortunately do not do as well, but this only affects
      the small minority of users that don't pass constants.
      
      Due to integer promotions, ro[lr]8 were already well-defined for shifts
      in [0, 8], and ro[lr]16 were mostly well-defined for shifts in [0, 16]
      (only mostly - u16 gets promoted to _signed_ int, so if bit 15 is set,
      word << 16 is undefined).  For consistency, update those as well.
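      
      A userspace sketch of the masked form for a few widths, using
      <stdint.h> types in place of the kernel's __u32/__u16.  It follows the
      rol32 pattern quoted above; the ror32 and rol16 bodies are written by
      analogy and are an assumption about how the other variants look.

```c
#include <stdint.h>

/* Both shift counts are masked to [0, 31], so no shift ever reaches the
 * type width, whatever value the caller passes. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

static inline uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* word is promoted to int before shifting; with the count masked to
 * [0, 15] the left shift cannot reach the sign bit of a 32-bit int. */
static inline uint16_t rol16(uint16_t word, unsigned int shift)
{
	return (uint16_t)((word << (shift & 15)) | (word >> ((-shift) & 15)));
}
```

      A shift of 0, or any multiple of the width, returns the word unchanged
      instead of invoking undefined behaviour.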
      
      Link: http://lkml.kernel.org/r/20190410211906.2190-1-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: Ido Schimmel <idosch@mellanox.com>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Cc: Vadim Pasternak <vadimp@mellanox.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      759766bf