1. 28 Aug 2019 (1 commit)
    • Add genphy_c45_config_aneg() function to phy-c45.c · 94acaeb5
      Authored by Marco Hartmann
      Commit 34786005 ("net: phy: prevent PHYs w/o Clause 22 regs from calling
      genphy_config_aneg") introduced a check that aborts phy_config_aneg()
      if the phy is a C45 phy.
      This causes phy_state_machine() to call phy_error() so that the phy
      ends up in PHY_HALTED state.
      
      Instead of returning -EOPNOTSUPP, call genphy_c45_config_aneg()
      (analogous to the C22 case) so that the state machine can run
      correctly.
      
      genphy_c45_config_aneg() closely resembles mv3310_config_aneg()
      in drivers/net/phy/marvell10g.c, excluding vendor specific
      configurations for 1000BaseT.
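
      The resulting dispatch looks roughly like this minimal sketch (the
      C45 detection condition is assumed from the check introduced by
      commit 34786005):

      	static int phy_config_aneg(struct phy_device *phydev)
      	{
      		if (phydev->drv->config_aneg)
      			return phydev->drv->config_aneg(phydev);

      		/* C45 PHYs without C22 registers now take the generic
      		 * C45 path instead of aborting with -EOPNOTSUPP */
      		if (phydev->is_c45 &&
      		    !(phydev->c45_ids.devices_in_package & BIT(0)))
      			return genphy_c45_config_aneg(phydev);

      		return genphy_config_aneg(phydev);
      	}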
      
      Fixes: 22b56e82 ("net: phy: replace genphy_10g_driver with genphy_c45_driver")
      Signed-off-by: Marco Hartmann <marco.hartmann@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94acaeb5
  2. 27 Aug 2019 (1 commit)
  3. 23 Aug 2019 (1 commit)
  4. 22 Aug 2019 (1 commit)
  5. 21 Aug 2019 (1 commit)
  6. 20 Aug 2019 (1 commit)
    • keys: Fix description size · 555df336
      Authored by David Howells
      The maximum key description size is 4095.  Commit f771fde8 ("keys:
      Simplify key description management") inadvertently reduced that to 255
      and made sizes between 256 and 4095 behave oddly; any size for which
      size & 255 == 0 would trip the assertion in __key_link_begin() at the
      following line:
      
      	BUG_ON(index_key->desc_len == 0);
      
      This can be fixed by simply increasing the size of desc_len in struct
      keyring_index_key to a u16.
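
      A sketch of the change (surrounding fields elided):

      	struct keyring_index_key {
      		/* ... */
      		u16	desc_len;	/* was u8: lengths were truncated mod 256 */
      		/* ... */
      	};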
      
      Note that the argument length test in keyutils only checked empty
      descriptions and descriptions with a size around the limit (i.e. 4095),
      not all the values in between, so it missed this.  This has been
      addressed, and
      
      	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/commit/?id=066bf56807c26cd3045a25f355b34c1d8a20a5aa
      
      now exhaustively tests all possible lengths of type, description and
      payload and then some.
      
      The assertion failure looks something like:
      
       kernel BUG at security/keys/keyring.c:1245!
       ...
       RIP: 0010:__key_link_begin+0x88/0xa0
       ...
       Call Trace:
        key_create_or_update+0x211/0x4b0
        __x64_sys_add_key+0x101/0x200
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It can be triggered by:
      
      	keyctl add user "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" a @s
      
      Fixes: f771fde8 ("keys: Simplify key description management")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      555df336
  7. 19 Aug 2019 (2 commits)
    • netfilter: add include guard to nf_conntrack_h323_types.h · 38a429c8
      Authored by Masahiro Yamada
      Add a header include guard just in case.
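
      The change is the standard guard pattern (guard macro name assumed):

      	#ifndef _NF_CONNTRACK_H323_TYPES_H
      	#define _NF_CONNTRACK_H323_TYPES_H

      	/* ... H.323 type definitions ... */

      	#endif /* _NF_CONNTRACK_H323_TYPES_H */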
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      38a429c8
    • signal: Allow cifs and drbd to receive their terminating signals · 33da8e7c
      Authored by Eric W. Biederman
      My recent change to only use force_sig for synchronous events
      wound up breaking signal reception in cifs and drbd.  I had overlooked
      the fact that by default kthreads start out with all signals set to
      SIG_IGN.  So a change I thought was safe turned out to have made it
      impossible for those kernel threads to catch their signals.
      
      Reverting the work on force_sig is a bad idea because what the code
      was doing was very much a misuse of force_sig: the way force_sig
      ultimately allowed the signal to happen was to change the signal
      handler to SIG_DFL, which after the first signal would allow userspace
      to send signals to these kernel threads.  At least for
      wake_ack_receiver in drbd that does not appear actively wrong.
      
      So correct this problem by adding allow_kernel_signal that will allow
      signals whose siginfo reports they were sent by the kernel through,
      but will not allow userspace generated signals, and update cifs and
      drbd to call allow_kernel_signal in an appropriate place so that their
      thread can receive this signal.
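
      As a usage sketch (hedged; SIGKILL stands in for whichever signal
      each thread actually expects):

      	/* in the kthread's setup path: accept SIGKILL when the siginfo
      	 * says the kernel sent it; userspace SIGKILL stays ignored */
      	allow_kernel_signal(SIGKILL);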
      
      Fixing things this way ensures that userspace won't be able to send
      signals and cause problems, that it is clear which signals the
      threads are expecting to receive, and it guarantees that nothing
      else in the system will be affected.
      
      This change was partly inspired by similar cifs and drbd patches that
      added allow_signal.
      Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
      Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Cc: Steve French <smfrench@gmail.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Fixes: 247bc947 ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
      Fixes: 72abe3bc ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
      Fixes: fee10990 ("signal/drbd: Use send_sig not force_sig")
      Fixes: 3cf5d076 ("signal: Remove task parameter from force_sig")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      33da8e7c
  8. 16 Aug 2019 (1 commit)
    • block: remove REQ_NOWAIT_INLINE · 7b6620d7
      Authored by Jens Axboe
      We had a few issues with this code, and there's still a problem around
      how we deal with error handling for chained/split bios. For now, just
      revert the code and we'll try again with a thorough solution. This
      reverts commits:
      
      e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
      0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7b6620d7
  9. 15 Aug 2019 (1 commit)
  10. 14 Aug 2019 (5 commits)
    • gpio: Fix build error of function redefinition · 68e03b85
      Authored by YueHaibing
      When doing randconfig builds, I got this error:
      
      In file included from drivers/hwmon/pmbus/ucd9000.c:19:0:
      ./include/linux/gpio/driver.h:576:1: error: redefinition of gpiochip_add_pin_range
       gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
       ^~~~~~~~~~~~~~~~~~~~~~
      In file included from drivers/hwmon/pmbus/ucd9000.c:18:0:
      ./include/linux/gpio.h:245:1: note: previous definition of gpiochip_add_pin_range was here
       gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
       ^~~~~~~~~~~~~~~~~~~~~~
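
      The root cause is that, under this randconfig's option combination,
      both headers ended up carrying the same static inline stub, roughly
      (a simplified sketch; the real stub bodies may differ):

      	/* include/linux/gpio.h and include/linux/gpio/driver.h both had: */
      	static inline int gpiochip_add_pin_range(struct gpio_chip *chip,
      			const char *pinctl_name, unsigned int gpio_offset,
      			unsigned int pin_offset, unsigned int npins)
      	{
      		return 0;
      	}

      	/* any file including both headers failed to build;
      	 * the fix keeps only one copy of the stub */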
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Fixes: 964cb341 ("gpio: move pincontrol calls to <linux/gpio/driver.h>")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Link: https://lore.kernel.org/r/20190731123814.46624-1-yuehaibing@huawei.com
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      68e03b85
    • Revert "mm, thp: restore node-local hugepage allocations" · a8282608
      Authored by Andrea Arcangeli
      This reverts commit 2f0799a0 ("mm, thp: restore node-local
      hugepage allocations").
      
      Commit 2f0799a0 was rightfully applied to avoid the risk of a
      severe regression that was reported by the kernel test robot at the end
      of the merge window.  We now understand the regression was a false
      positive, caused by a significant increase in fairness during a
      swap-thrashing benchmark.  So it's safe to re-apply the fix and continue
      improving the code from there.  The benchmark that reported the
      regression is very useful, but it provides a meaningful result only when
      there is no significant alteration in fairness during the workload.  The
      removal of __GFP_THISNODE increased fairness.
      
      __GFP_THISNODE cannot be used in the generic page faults path for new
      memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
      behavior significantly deviates from what the MPOL_DEFAULT semantics are
      supposed to be for THP and 4k allocations alike.
      
      Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
      set to "madvise") was never meant to provide an implicit MPOL_BIND on
      the "current" node the task is running on; it caused swap storms and
      provided a much more aggressive behavior than even zone_reclaim_mode =
      3.
      
      Any workload that could have benefited from __GFP_THISNODE now has to
      enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
      the zone_reclaim_mode behavior, but it only did so if THP was enabled:
      if THP was disabled, there would have been no chance to get any 4k page
      from the current node if the current node was full of pagecache, which
      further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
      MADV_HUGEPAGE was never intended to provide any zone_reclaim_mode
      semantics; in fact the two are orthogonal: zone_reclaim_mode = 1|2|3
      must work exactly the same with MADV_HUGEPAGE set or not.
      
      The performance characteristic of memory depends on the hardware
      details.  The numbers below are obtained on Naples/EPYC architecture and
      the N/A projection extends them to show what we should aim for in the
      future as a good THP NUMA locality default.  The benchmark used
      exercises random memory seeks (note: the cost of the page faults is not
      part of the measurement).
      
        D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
        0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
      
      D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
      intra socket memory), D2 means distance two (i.e.  inter socket memory),
      etc...
      
      For the guest physical memory allocated by qemu and for guest mode
      kernel the performance characteristic of RAM is more complex and an
      ideal default could be:
      
        D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
        0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
      
      NOTE: the N/A are projections and haven't been measured yet, the
      measurement in this case is done on a 1950x with only two NUMA nodes.
      The THP case here means THP was used both in the host and in the guest.
      
      After applying this commit the THP NUMA locality order that we'll get
      out of MADV_HUGEPAGE is this:
      
        D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Before this commit it was:
      
        D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Even if we ignore the breakage of large workloads that can't fit in a
      single node that the __GFP_THISNODE implicit "current node" mbind
      caused, the THP NUMA locality order provided by __GFP_THISNODE was still
      not the one we shall aim for in the long term (i.e.  the first one at
      the top).
      
      After this commit is applied, we can introduce a new multi-order
      allocator API and replace those two alloc_pages_vma calls in the page
      fault path with a single multi-order call:
      
              unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
              page = alloc_pages_multi_order(..., &order);
              if (!page)
              	goto out;
              if (!(order & (1 << 0))) {
              	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
              	/* THP fault */
              } else {
              	VM_WARN_ON(order != 1 << 0);
              	/* 4k fallback */
              }
      
      The page allocator logic has to be altered so that when it fails on any
      zone with order 9, it tries again with an order-0 allocation before
      falling back to the next zone in the zonelist.
      
      After that we need to do more measurements and evaluate if adding an
      opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
      with "DN+1 THP | DN 4k" at every NUMA distance crossing.
      
      Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8282608
    • Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 92717d42
      Authored by Andrea Arcangeli
      Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
      
      The fixes for what was originally reported as "pathological THP
      behavior" were rightfully reverted to be sure not to introduce
      regressions at the end of a merge window after a severe regression
      report from the kernel test robot.  We can safely re-apply them now
      that we have had time to analyze the problem.
      
      The mm process worked fine, because the good fixes were eventually
      committed upstream without excessive delay.
      
      The regression reported by the kernel test robot however forced us to
      revert the good fixes to be sure not to introduce regressions and to
      give us the time to analyze the issue further.  The silver lining is
      that this extra time allowed us to think more about this issue and also
      to plan a future direction to improve things further in terms of THP
      NUMA locality.
      
      This patch (of 2):
      
      This reverts commit 356ff8a9 ("Revert "mm, thp: consolidate THP
      gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
      89c83fb5 ("mm, thp: consolidate THP gfp handling into
      alloc_hugepage_direct_gfpmask").
      
      Consolidating the THP allocation flags in one place was meant as a
      cleanup to make it easier to handle otherwise scattered code that
      imposes a maintenance burden.  There were no real problems observed
      with the gfp mask consolidation, but the reversion was rushed through
      without a larger consensus regardless.
      
      This patch brings the consolidation back because this should make the
      long term maintainability easier as well as it should allow future
      changes to be less error prone.
      
      [mhocko@kernel.org: changelog additions]
      Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92717d42
    • mm: workingset: fix vmstat counters for shadow nodes · ec9f0238
      Authored by Roman Gushchin
      Memcg counters for shadow nodes are broken because the memcg pointer is
      obtained the wrong way.  The following approach is used:
              virt_to_page(xa_node)->mem_cgroup
      
      Since commit 4d96ba35 ("mm: memcg/slab: stop setting
      page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
      set for slab pages, so memcg_from_slab_page() should be used instead.
      
      Also, I doubt that it ever worked correctly: virt_to_head_page() should
      be used instead of virt_to_page().  Otherwise objects residing on tail
      pages are not accounted, because only the head page contains a valid
      mem_cgroup pointer.  That has been the case since the introduction of
      these counters in commit 68d48e6a ("mm: workingset: add vmstat counter
      for shadow nodes").
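
      A sketch of the corrected lookup (assuming the helpers named above):

      	/* before (broken): slab pages no longer carry ->mem_cgroup, and
      	 * tail pages never had a valid pointer */
      	memcg = virt_to_page(xa_node)->mem_cgroup;

      	/* after: resolve through the head page and the slab helper */
      	memcg = memcg_from_slab_page(virt_to_head_page(xa_node));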
      
      Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec9f0238
    • mm: document zone device struct page field usage · 76470ccd
      Authored by Ralph Campbell
      Patch series "mm/hmm: fixes for device private page migration", v3.
      
      Testing the latest linux git tree turned up a few bugs with page
      migration to and from ZONE_DEVICE private and anonymous pages.
      Hopefully this series also clarifies how the ZONE_DEVICE private struct
      page uses the same mapping and index fields as the source anonymous
      page mapping.
      
      This patch (of 3):
      
      Struct page for ZONE_DEVICE private pages uses the page->mapping and
      page->index fields while the source anonymous pages are migrated to
      device private memory.  This is so rmap_walk() can find the page when
      migrating the ZONE_DEVICE private page back to system memory.
      ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
      page->index fields when files are mapped into a process address space.
      
      Add comments to struct page and remove the unused "_zd_pad_1" field to
      make this more clear.
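
      A sketch of the fields being documented (layout heavily simplified):

      	struct page {
      		/* ... */
      		struct address_space *mapping;	/* ZONE_DEVICE private: the
      						 * source anon page's mapping,
      						 * so rmap_walk() can find it */
      		pgoff_t index;			/* likewise, the source page's
      						 * offset in that mapping */
      		/* ... */
      	};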
      
      Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76470ccd
  11. 12 Aug 2019 (1 commit)
  12. 11 Aug 2019 (1 commit)
    • dma-mapping: fix page attributes for dma_mmap_* · 33dcb37c
      Authored by Christoph Hellwig
      All the way back to the introduction of dma_common_mmap we've defaulted
      to marking the pages as uncached.  But this is wrong for DMA coherent
      devices.  Later on, DMA_ATTR_WRITE_COMBINE also got incorrect treatment,
      as that flag is only treated specially on the alloc side for
      non-coherent devices.
      
      Introduce a new dma_pgprot helper that deals with the check for coherent
      devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
      and we thus ensure no aliasing of page attributes happens, which makes
      the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
      remaining ones.
      
      Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
      we'll phase it out soon.
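
      A minimal sketch of the helper's shape, assuming the semantics
      described above (the upstream version also honors further config
      options omitted here):

      	pgprot_t dma_pgprot(struct device *dev, pgprot_t prot,
      			    unsigned long attrs)
      	{
      		if (dev_is_dma_coherent(dev))
      			return prot;	/* coherent: keep cacheable mapping */
      	#ifdef CONFIG_ARCH_HAS_DMA_MMAP_PGPROT
      		return arch_dma_mmap_pgprot(dev, prot, attrs);
      	#else
      		return pgprot_noncached(prot);
      	#endif
      	}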
      
      Fixes: 64ccc9c0 ("common: dma-mapping: add support for generic dma_mmap_* calls")
      Reported-by: Shawn Anastasio <shawn@anastas.io>
      Reported-by: Gavin Li <git@thegavinli.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
      33dcb37c
  13. 09 Aug 2019 (3 commits)
    • net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Authored by Jakub Kicinski
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket such as skb_orphan() or skb_clone() will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted, if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
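
      Conceptually, the added check looks like this sketch (exact placement
      inside sk_validate_xmit_skb() is assumed):

      	#ifdef CONFIG_TLS_DEVICE
      		/* an skb marked decrypted but detached from its TLS socket
      		 * would hit the wire as plain text - drop it instead */
      		if (unlikely(skb->decrypted &&
      			     (!skb->sk || !skb->sk->sk_validate_xmit_skb)))
      			return NULL;	/* caller drops the skb */
      	#endif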
      
      Second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in-order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from
      leaking clear text when an incoming, decrypted skb is redirected
      (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location which could leak the flag in is tcp_bpf_sendmsg(),
      which is taken care of by clearing the previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41477662
    • net/mlx5e: kTLS, Fix progress params context WQE layout · a9bc3390
      Authored by Tariq Toukan
      The TLS progress params context WQE should not include an
      Eth segment; drop it.
      In addition, align the tls_progress_params layout with the
      HW specification document:
      - fix the tisn field name.
      - remove the valid bit.
      
      Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
      Fixes: d2ead1f3 ("net/mlx5e: Add kTLS TX HW offload support")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      a9bc3390
    • net/mlx5: kTLS, Fix wrong TIS opmod constants · 26149e3e
      Authored by Tariq Toukan
      Fix the used constants for TLS TIS opmods, per the HW specification.
      
      Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      26149e3e
  14. 05 Aug 2019 (3 commits)
    • KVM: no need to check return value of debugfs_create functions · 3e7093d0
      Authored by Greg KH
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return
      void instead of an integer, as we should not care at all about whether
      this function actually does anything or not.
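
      After this change the call sites simply create and move on, e.g.
      (illustrative file name and fops):

      	/* no return-value check: whether debugfs works or not must not
      	 * change the code's behavior */
      	debugfs_create_file("vcpu-state", 0444, vcpu->debugfs_dentry,
      			    vcpu, &vcpu_state_fops);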
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <x86@kernel.org>
      Cc: <kvm@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3e7093d0
    • KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Authored by Paolo Bonzini
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      lets us actually simplify the code.
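
      A sketch of the resulting pattern (the exact symbol name is assumed):

      	/* arches that implement it define the symbol in asm/kvm_host.h */
      	#ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
      	void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu);
      	#else
      	static inline void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu)
      	{
      	}
      	#endif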
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      741cbbae
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Authored by Wanpeng Li
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts),
      a five-year-old bug was exposed.  Running the ebizzy benchmark in three
      80-vCPU VMs on one 80-pCPU Skylake server produced a lot of rcu_sched
      stall warning splats in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true before finish_swait() is called in
      kvm_vcpu_block().  Because voluntarily preempted vCPUs are taken into
      account by the kvm_vcpu_on_spin() loop, the probability that the
      condition kvm_arch_vcpu_runnable(vcpu) is checked and found true is
      greatly increased.  When APICv is enabled, the yield-candidate vCPU's
      VMCS RVI field then leaks (via vmx_sync_pir_to_irr()) into the
      spinning-on-a-taken-lock vCPU's current VMCS.
      
      This patch fixes it by conservatively checking only a subset of events.
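
      A hedged sketch of such a check (helper and field names assumed; the
      idea is to consult only state that is safe to read from another pCPU):

      	static bool kvm_vcpu_dy_runnable(struct kvm_vcpu *vcpu)
      	{
      		/* cheap, cross-pCPU-safe reads only; no VMCS accesses */
      		return READ_ONCE(vcpu->ready) ||
      		       kvm_test_request(KVM_REQ_EVENT, vcpu);
      	}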
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      17e433b5
  15. 03 Aug 2019 (1 commit)
  16. 02 Aug 2019 (1 commit)
  17. 01 Aug 2019 (1 commit)
  18. 31 Jul 2019 (2 commits)
    • compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · 055d8824
      Authored by Arnd Bergmann
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but it was always
      broken: only the structure is compatible, the command number is not,
      due to the size being sizeof(size_t), or at first
      sizeof(sizeof(struct sockaddr_pppox)), which is different on 64-bit
      architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
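
      A sketch of the translation (the 32-bit command number and handler
      name follow the approach described above):

      	#define PPPOEIOCSFWD32 _IOW(0xB1, 0, compat_size_t)	/* 32-bit layout */

      	static int pppox_compat_ioctl(struct socket *sock, unsigned int cmd,
      				      unsigned long arg)
      	{
      		if (cmd == PPPOEIOCSFWD32)
      			cmd = PPPOEIOCSFWD;	/* same struct, corrected number */

      		return pppox_ioctl(sock, cmd, (unsigned long)compat_ptr(arg));
      	}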
      Acked-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      055d8824
    • loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0
      Authored by Jan Kara
      Commit 33ec3e53 ("loop: Don't change loop device under exclusive
      opener") made the LOOP_SET_FD ioctl acquire an exclusive block device
      reference while it updates the loop device binding.  However, this can
      make a perfectly valid mount(2) fail with EBUSY due to a racing
      LOOP_SET_FD temporarily holding the exclusive bdev reference, in cases
      like this:
      
      for i in {a..z}{a..z}; do
              dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
              mkfs.ext2 $i.image
              mkdir mnt$i
      done
      
      echo "Run"
      for i in {a..z}{a..z}; do
              mount -o loop -t ext2 $i.image mnt$i &
      done
      
      Fix the problem by not getting full exclusive bdev reference in
      LOOP_SET_FD but instead just mark the bdev as being claimed while we
      update the binding information. This just blocks new exclusive openers
      instead of failing them with EBUSY thus fixing the problem.
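
      A sketch of the changed flow in loop_set_fd() (claiming helpers as
      described above; error handling abridged):

      	/* block new exclusive openers while the binding is updated,
      	 * without taking a full exclusive reference ourselves */
      	claim = bd_start_claiming(bdev, loop_set_fd);
      	if (IS_ERR(claim))
      		return PTR_ERR(claim);
      	/* ... update the loop device binding ... */
      	bd_finish_claiming(bdev, claim, loop_set_fd);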
      
      Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
      Cc: stable@vger.kernel.org
      Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      89e524c0
  19. 28 Jul 2019 (1 commit)
    • gpio: don't WARN() on NULL descs if gpiolib is disabled · ffe0bbab
      Authored by Bartosz Golaszewski
      If gpiolib is disabled, we use the inline stubs from gpio/consumer.h
      instead of regular definitions of GPIO API. The stubs for 'optional'
      variants of gpiod_get routines return NULL in this case as if the
      relevant GPIO wasn't found. This is correct so far.
      
      Calling other (non-gpio_get) stubs from this header triggers a warning
      because the GPIO descriptor couldn't have been requested. The warning
      however is unconditional (WARN_ON(1)) and is emitted even if the passed
      descriptor pointer is NULL.
      
      We don't want to force the users of 'optional' gpio_get to check the
      returned pointer before calling e.g. gpiod_set_value() so let's only
      WARN on non-NULL descriptors.
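
      The stubs thus change from an unconditional warning to one predicated
      on the descriptor, e.g. (sketch):

      	static inline void gpiod_set_value(struct gpio_desc *desc, int value)
      	{
      		/* NULL may come from gpiod_get_optional(); stay silent.
      		 * A non-NULL desc is impossible here, so do warn. */
      		WARN_ON(desc);
      	}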
      
      Cc: stable@vger.kernel.org
      Reported-by: Claus H. Stovgaard <cst@phaseone.com>
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      ffe0bbab
  20. 27 Jul 2019 (2 commits)
  21. 26 Jul 2019 (5 commits)
  22. 25 Jul 2019 (3 commits)
    • sched/fair: Use RCU accessors consistently for ->numa_group · cb361d8c
      Authored by Jann Horn
      The old code used RCU annotations and accessors inconsistently for
      ->numa_group, which can lead to use-after-frees and NULL dereferences.
      
      Let all accesses to ->numa_group use proper RCU helpers to prevent such
      issues.
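
      The accessor pattern now used throughout looks like this generic
      sketch (the commit adds dedicated deref helpers on top of it):

      	struct numa_group *ng;

      	rcu_read_lock();
      	ng = rcu_dereference(p->numa_group);
      	if (ng) {
      		/* ... read ng fields only under the read lock ... */
      	}
      	rcu_read_unlock();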
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 8c8a743c ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
      Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cb361d8c
    • sched/fair: Don't free p->numa_faults with concurrent readers · 16d51a59
      Authored by Jann Horn
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      
      Another way to fix this would be to make ->numa_faults RCU-managed or add
      extra locking, but it seems easier to wipe the NUMA fault statistics on
      execve.
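
      A sketch of the resulting distinction between exit and execve
      (locking omitted; the size expression numa_faults_size is a stand-in):

      	void task_numa_free(struct task_struct *p, bool final)
      	{
      		if (final) {
      			/* task exit: really free the statistics */
      			kfree(p->numa_faults);
      			p->numa_faults = NULL;
      		} else {
      			/* execve: keep the allocation, wipe the stats so
      			 * concurrent readers only ever see zeroes */
      			memset(p->numa_faults, 0, numa_faults_size);
      		}
      	}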
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      16d51a59
    • access: avoid the RCU grace period for the temporary subjective credentials · d7852fbd
      Authored by Linus Torvalds
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves an RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of secondary damage: instead of having
      a nice alloc/free pattern that hits hot per-CPU slab caches, you have
      all those delayed frees, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
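
      The union as described (a sketch matching the changelog):

      	struct cred {
      		/* ... */
      		union {
      			int		non_rcu;  /* can we skip RCU deletion? */
      			struct rcu_head	rcu;	  /* RCU deletion hook */
      		};
      	};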
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7852fbd
  23. 24 Jul 2019 (1 commit)
    • bpf: fix narrower loads on s390 · d9b8aada
      Authored by Ilya Leoshkevich
      The very first check in test_pkt_md_access is failing on s390, which
      happens because loading a part of a struct __sk_buff field produces
      an incorrect result.
      
      The preprocessed code of the check is:
      
      {
      	__u8 tmp = *((volatile __u8 *)&skb->len +
      		((sizeof(skb->len) - sizeof(__u8)) / sizeof(__u8)));
      	if (tmp != ((*(volatile __u32 *)&skb->len) & 0xFF)) return 2;
      };
      
      clang generates the following code for it:
      
            0:	71 21 00 03 00 00 00 00	r2 = *(u8 *)(r1 + 3)
            1:	61 31 00 00 00 00 00 00	r3 = *(u32 *)(r1 + 0)
            2:	57 30 00 00 00 00 00 ff	r3 &= 255
            3:	5d 23 00 1d 00 00 00 00	if r2 != r3 goto +29 <LBB0_10>
      
      Finally, verifier transforms it to:
      
        0: (61) r2 = *(u32 *)(r1 +104)
        1: (bc) w2 = w2
        2: (74) w2 >>= 24
        3: (bc) w2 = w2
        4: (54) w2 &= 255
        5: (bc) w2 = w2
      
      The problem is that when the verifier emits the code to replace a
      partial load of a struct __sk_buff field (*(u8 *)(r1 + 3)) with a full
      load of the struct sk_buff field (*(u32 *)(r1 + 104)), an optional
      shift and a bitwise AND, it assumes that the machine is little endian
      and incorrectly decides to use a shift.
      
      Adjust shift count calculation to account for endianness.
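
      A sketch of the endianness-aware shift computation (helper name and
      body follow the fix's approach, simplified):

      	static u8 bpf_ctx_narrow_load_shift(u32 off, u32 size,
      					    u32 size_default)
      	{
      		u8 load_off = off & (size_default - 1);

      	#ifdef __LITTLE_ENDIAN
      		/* low bytes sit at low bit positions */
      		return load_off * 8;
      	#else
      		/* big endian: count from the most significant end */
      		return (size_default - (load_off + size)) * 8;
      	#endif
      	}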
      
      Fixes: 31fd8581 ("bpf: permits narrower load from bpf program context fields")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d9b8aada