1. 28 Dec, 2020 1 commit
    • netfilter: nftables: add set expression flags · b4e70d8d
      Pablo Neira Ayuso committed
      The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
      supports multiple expressions per set element. In the same
      direction, NFT_DYNSET_F_EXPR specifies that dynset expression defines
      multiple expressions per set element.
      
      This allows new userspace software with old kernels to bail out with
      EOPNOTSUPP. This update is similar to ef516e86 ("netfilter:
      nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
      must be set when the NFTA_SET_EXPRESSIONS attribute is specified.
      The NFT_SET_EXPR flag is not set with NFTA_SET_EXPR, to retain
      backward compatibility with old userspace binaries.
      
      Fixes: 48b0ae04 ("netfilter: nftables: netlink support for several set element expressions")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 22 Dec, 2020 1 commit
  3. 20 Dec, 2020 1 commit
  4. 19 Dec, 2020 2 commits
  5. 17 Dec, 2020 1 commit
  6. 16 Dec, 2020 2 commits
    • userfaultfd: add UFFD_USER_MODE_ONLY · 37cd0575
      Lokesh Gidra committed
      Patch series "Control over userfaultfd kernel-fault handling", v6.
      
      This patch series is split from [1].  The other series enables SELinux
      support for userfaultfd file descriptors so that its creation and movement
      can be controlled.
      
      It has been demonstrated on various occasions that suspending kernel code
      execution for an arbitrary amount of time at any access to userspace
      memory (copy_from_user()/copy_to_user()/...) can be exploited to change
      the intended behavior of the kernel.  For instance, handling page faults
      in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
      FUSE, which is similar to userfaultfd in this respect, has been exploited
      in [4, 5] for similar outcome.
      
      This small patch series adds a new flag to userfaultfd(2) that allows
      callers to give up the ability to handle kernel-mode faults with the
      resulting UFFD file object.  It then adds a 'user-mode only' option to the
      unprivileged_userfaultfd sysctl knob to require unprivileged callers to
      use this new flag.
      
      The purpose of this new interface is to decrease the chance of an
      unprivileged userfaultfd user taking advantage of userfaultfd to enhance
      security vulnerabilities by lengthening the race window in kernel code.
      
      [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
      [2] https://duasynt.com/blog/linux-kernel-heap-spray
      [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
      [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
      
      This patch (of 2):
      
      userfaultfd handles page faults from both user and kernel code.  Add a new
      UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
      userfaultfd object refuse to handle faults from kernel mode, treating
      these faults as if SIGBUS were always raised, causing the kernel code to
      fail with EFAULT.
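
A minimal sketch of how a caller might request the new flag, assuming a Linux build environment. The helper name `open_uffd_user_mode_only` is hypothetical, and the fallback definition of `UFFD_USER_MODE_ONLY` is an assumption added only so the sketch compiles against headers that predate this commit.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: fallback for older uapi headers that lack the flag. */
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/*
 * Hypothetical helper: create a userfaultfd object that refuses to
 * handle kernel-mode faults. Returns the new file descriptor on
 * success or a negative errno value on failure.
 */
int open_uffd_user_mode_only(void)
{
#ifdef __NR_userfaultfd
    long fd = syscall(__NR_userfaultfd,
                      O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);

    return fd < 0 ? -errno : (int)fd;
#else
    /* The libc headers do not know the syscall number at all. */
    return -ENOSYS;
#endif
}
```

A kernel without this patch should reject the unknown flag, so a caller can treat a negative return value as "not available" and decide whether to fall back to a plain userfaultfd.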
      
      A future patch adds a knob allowing administrators to give some processes
      the ability to create userfaultfd file objects only if they pass
      UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
      exploit userfaultfd's ability to delay kernel page faults to open timing
      windows for future exploits.
      
      Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
      Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <calin@google.com>
      Cc: Daniel Colascione <dancol@dancol.org>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nitin Gupta <nigupta@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • uapi: move constants from <linux/kernel.h> to <linux/const.h> · a85cbe61
      Petr Vorel committed
      and include <linux/const.h> in UAPI headers instead of <linux/kernel.h>.
      
      The reason is to avoid indirect <linux/sysinfo.h> include when using
      some network headers: <linux/netlink.h> or others -> <linux/kernel.h>
      -> <linux/sysinfo.h>.
      
      On musl, this indirect include causes a redefinition of struct sysinfo
      when both <sys/sysinfo.h> and some of the UAPI headers are included:
      
          In file included from x86_64-buildroot-linux-musl/sysroot/usr/include/linux/kernel.h:5,
                           from x86_64-buildroot-linux-musl/sysroot/usr/include/linux/netlink.h:5,
                           from ../include/tst_netlink.h:14,
                           from tst_crypto.c:13:
          x86_64-buildroot-linux-musl/sysroot/usr/include/linux/sysinfo.h:8:8: error: redefinition of `struct sysinfo'
           struct sysinfo {
                  ^~~~~~~
          In file included from ../include/tst_safe_macros.h:15,
                           from ../include/tst_test.h:93,
                           from tst_crypto.c:11:
          x86_64-buildroot-linux-musl/sysroot/usr/include/sys/sysinfo.h:10:8: note: originally defined here
      
      Link: https://lkml.kernel.org/r/20201015190013.8901-1-petr.vorel@gmail.com
      Signed-off-by: Petr Vorel <petr.vorel@gmail.com>
      Suggested-by: Rich Felker <dalias@aerifal.cx>
      Acked-by: Rich Felker <dalias@libc.org>
      Cc: Peter Korsgaard <peter@korsgaard.com>
      Cc: Baruch Siach <baruch@tkos.co.il>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 15 Dec, 2020 2 commits
    • vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag · caaf95e0
      Andra Paraschiv committed
      Add the VMADDR_FLAG_TO_HOST vsock flag, which is used to set up a vsock
      connection where all the packets are forwarded to the host.
      
      Then, using this type of vsock channel, vsock communication between
      sibling VMs can be built on top of it.
      
      Changelog
      
      v3 -> v4
      
      * Update the "VMADDR_FLAG_TO_HOST" value, as the size of the field has
        been updated to 1 byte.
      
      v2 -> v3
      
      * Update comments to mention when the flag is set in the connect and
        listen paths.
      
      v1 -> v2
      
      * New patch in v2, it was split from the first patch in the series.
      * Remove the default value for the vsock flags field.
      * Update the naming for the vsock flag to "VMADDR_FLAG_TO_HOST".
      Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • vm_sockets: Add flags field in the vsock address data structure · dc8eeef7
      Andra Paraschiv committed
      vsock enables communication between virtual machines and the host they
      are running on. With the multi transport support (guest->host and
      host->guest), nested VMs can also use vsock channels for communication.
      
      In addition to this, by default, all the vsock packets are forwarded to
      the host, if no host->guest transport is loaded. This behavior can be
      implicitly used for enabling vsock communication between sibling VMs.
      
      Add a flags field in the vsock address data structure that can be used
      to explicitly mark the vsock connection as being targeted for a certain
      type of communication. This way, we can distinguish between different use
      cases such as nested VMs and sibling VMs.
      
      This field can be set when initializing the vsock address variable used
      for the connect() call.
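
A hedged sketch of initializing such an address before connect(), assuming Linux with `<linux/vm_sockets.h>` installed. The helper name `fill_to_host_addr` is hypothetical. The `svm_flags` field and the `VMADDR_FLAG_TO_HOST` flag (from the companion commit in this log) are used only when the installed headers define the flag, so the sketch still builds against older headers.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/*
 * Hypothetical helper: fill a vsock address for connect().
 * Returns 1 if the headers know VMADDR_FLAG_TO_HOST and the flag was
 * set in svm_flags, 0 if the headers predate the flags field.
 */
int fill_to_host_addr(struct sockaddr_vm *addr,
                      unsigned int cid, unsigned int port)
{
    /* Zero everything first so reserved and padding bytes are clean. */
    memset(addr, 0, sizeof(*addr));
    addr->svm_family = AF_VSOCK;
    addr->svm_cid = cid;
    addr->svm_port = port;
#ifdef VMADDR_FLAG_TO_HOST
    /* Explicitly mark the connection as one to be forwarded to the host. */
    addr->svm_flags = VMADDR_FLAG_TO_HOST;
    return 1;
#else
    return 0;
#endif
}
```

The filled structure would then be passed to connect() on an `AF_VSOCK` socket, cast to `struct sockaddr *` with `sizeof(struct sockaddr_vm)` as the length.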
      
      Changelog
      
      v3 -> v4
      
      * Update the size of "svm_flags" field to be 1 byte instead of 2 bytes.
      
      v2 -> v3
      
      * Add "svm_flags" as a new field, not reusing "svm_reserved1".
      
      v1 -> v2
      
      * Update the field name to "svm_flags".
      * Split the current patch in 2 patches.
      Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  8. 14 Dec, 2020 3 commits
  9. 13 Dec, 2020 1 commit
    • netfilter: nftables: netlink support for several set element expressions · 48b0ae04
      Pablo Neira Ayuso committed
      This patch adds three new netlink attributes to encapsulate a list of
      expressions per set element:
      
      - NFTA_SET_EXPRESSIONS: this attribute provides the set definition in
        terms of expressions. New set elements get attached the list of
        expressions that is specified by this new netlink attribute.
      - NFTA_SET_ELEM_EXPRESSIONS: this attribute allows users to restore (or
        initialize) the stateful information of set elements when adding an
        element to the set.
      - NFTA_DYNSET_EXPRESSIONS: this attribute specifies the list of
        expressions that the set element gets when it is inserted from the
        packet path.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  10. 12 Dec, 2020 2 commits
  11. 11 Dec, 2020 6 commits
    • dmaengine: idxd: add IAX configuration support in the IDXD driver · f25b4638
      Dave Jiang committed
      Add support to allow configuration of Intel Analytics Accelerator (IAX) in
      addition to the Intel Data Streaming Accelerator (DSA). The IAX hardware
      has the same configuration interface as DSA. The main difference
      is the type of operations it performs. We can support the DSA and
      IAX devices on the same driver with some tweaks.
      
      IAX has a 64B completion record that needs to be 64B aligned, as opposed to
      a 32B completion record that is 32B aligned for DSA. IAX also does not
      support token management.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/160564555488.1834439.4261958859935360473.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Vinod Koul <vkoul@kernel.org>
    • nl80211: add common API to configure SAR power limitations · 6bdb68ce
      Carl Huang committed
      NL80211_CMD_SET_SAR_SPECS is added to configure SAR from
      user space. NL80211_ATTR_SAR_SPEC is used to pass the SAR
      power specification when used with NL80211_CMD_SET_SAR_SPECS.
      
      A wireless driver needs to register its SAR type and supported frequency
      ranges with the wiphy, so that user space can query them. The index in
      the frequency range list specifies which sub-band the power
      limitation applies to. The SAR type is there for compatibility, so that
      other SAR mechanisms can be implemented later without breaking user
      space SAR applications.
      
      In the normal flow, user space queries the SAR capability, gets
      the indexes of the supported frequency ranges, associates a
      power limitation with each index, and sends the result to the kernel.
      
      Here is an example of a message sent to the kernel:
      8c 00 00 00 08 00 01 00 00 00 00 00 38 00 2b 81
      08 00 01 00 00 00 00 00 2c 00 02 80 14 00 00 80
      08 00 02 00 00 00 00 00 08 00 01 00 38 00 00 00
      14 00 01 80 08 00 02 00 01 00 00 00 08 00 01 00
      48 00 00 00
      
      NL80211_CMD_SET_SAR_SPECS:  0x8c
      NL80211_ATTR_WIPHY:     0x01(phy idx is 0)
      NL80211_ATTR_SAR_SPEC:  0x812b (NLA_NESTED)
      NL80211_SAR_ATTR_TYPE:  0x00 (NL80211_SAR_TYPE_POWER)
      NL80211_SAR_ATTR_SPECS: 0x8002 (NLA_NESTED)
      freq range 0 power: 0x38 in 0.25 dBm units (14 dBm)
      freq range 1 power: 0x48 in 0.25 dBm units (18 dBm)
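
The dump above can be walked by hand as nested netlink attributes. The following is a sketch under stated assumptions, not the kernel's parser: the helper names (`rd16`, `rd32`, `find_attr`, `sar_decode`) are hypothetical, and the numeric attribute types (0x12b for the SAR spec, 2 for the specs list, 2 for the range index, 1 for the power) are read off the example bytes rather than taken from a header.

```c
#include <stddef.h>
#include <stdint.h>

#define ATTR_HDRLEN    4u
#define ATTR_TYPE_MASK 0x3fffu               /* drops the nested/byte-order bits */
#define ATTR_ALIGN(n)  (((n) + 3u) & ~3u)    /* attributes are 4-byte aligned */

/* The 68-byte example message from the commit text (little-endian). */
static const uint8_t sar_example_msg[] = {
    0x8c, 0x00, 0x00, 0x00, 0x08, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x38, 0x00, 0x2b, 0x81,
    0x08, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x2c, 0x00, 0x02, 0x80, 0x14, 0x00, 0x00, 0x80,
    0x08, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x01, 0x00, 0x38, 0x00, 0x00, 0x00,
    0x14, 0x00, 0x01, 0x80, 0x08, 0x00, 0x02, 0x00, 0x01, 0x00, 0x00, 0x00, 0x08, 0x00, 0x01, 0x00,
    0x48, 0x00, 0x00, 0x00,
};

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the payload of the first attribute of the given type, or NULL. */
static const uint8_t *find_attr(const uint8_t *buf, size_t len,
                                uint16_t type, size_t *payload_len)
{
    size_t off = 0;

    while (off + ATTR_HDRLEN <= len) {
        uint16_t alen = rd16(buf + off);
        uint16_t atype = rd16(buf + off + 2) & ATTR_TYPE_MASK;

        if (alen < ATTR_HDRLEN || off + alen > len)
            return NULL;                     /* malformed attribute */
        if (atype == type) {
            *payload_len = alen - ATTR_HDRLEN;
            return buf + off + ATTR_HDRLEN;
        }
        off += ATTR_ALIGN(alen);
    }
    return NULL;
}

/*
 * Store the power limit (in 0.25 dBm units) of each frequency range at
 * power_qdbm[range index]. Returns the number of entries stored, or -1.
 */
int sar_decode(const uint8_t *msg, size_t len,
               int32_t *power_qdbm, size_t max_ranges)
{
    size_t spec_len, specs_len, ilen, plen, off = 0;
    const uint8_t *spec, *specs;
    int stored = 0;

    if (len < 4)                             /* 4-byte generic netlink header */
        return -1;
    spec = find_attr(msg + 4, len - 4, 0x12b, &spec_len);
    if (!spec)
        return -1;
    specs = find_attr(spec, spec_len, 2, &specs_len);
    if (!specs)
        return -1;

    /* Each nested entry carries one range index and one power value. */
    while (off + ATTR_HDRLEN <= specs_len) {
        uint16_t alen = rd16(specs + off);
        const uint8_t *body = specs + off + ATTR_HDRLEN;
        const uint8_t *idx, *pwr;

        if (alen < ATTR_HDRLEN || off + alen > specs_len)
            return -1;
        idx = find_attr(body, alen - ATTR_HDRLEN, 2, &ilen);
        pwr = find_attr(body, alen - ATTR_HDRLEN, 1, &plen);
        if (idx && pwr && ilen == 4 && plen == 4 && rd32(idx) < max_ranges) {
            power_qdbm[rd32(idx)] = (int32_t)rd32(pwr);
            stored++;
        }
        off += ATTR_ALIGN(alen);
    }
    return stored;
}
```

Dividing each decoded value by 4 gives the limit in dBm, which matches the 14 dBm and 18 dBm figures in the annotation.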
      Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
      Reviewed-by: Brian Norris <briannorris@chromium.org>
      Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
      Link: https://lore.kernel.org/r/20201203103728.3034-2-cjhuang@codeaurora.org
      [minor edits, NLA parse cleanups]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • cfg80211: support immediate reconnect request hint · 3bb02143
      Johannes Berg committed
      There are cases where it's necessary to disconnect, but an
      immediate reconnection is desired. Support a hint to userspace
      that this is the case, by including a new attribute in the
      deauth or disassoc event.
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Link: https://lore.kernel.org/r/iwlwifi.20201206145305.58d33941fb9d.I0e7168c205c7949529c8e3b86f3c9b12c01a7017@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • cfg80211: include block-tx flag in channel switch started event · 669b8413
      Johannes Berg committed
      In the NL80211_CMD_CH_SWITCH_STARTED_NOTIFY event, include the
      NL80211_ATTR_CH_SWITCH_BLOCK_TX flag attribute if block-tx was
      requested by the AP.
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8953ef22cc64.Ifee9cab337a4369938545920ba5590559e91327a@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • rfkill: add a reason to the HW rfkill state · 14486c82
      Emmanuel Grumbach committed
      The WLAN device may exist yet not be usable. This can happen
      when the WLAN device is controllable by both the host and
      some platform internal component.
      We need some arbitration that is vendor specific, but when
      the device is not available to the host, we need to reflect
      this state to user space.
      
      Add a reason field to the rfkill object (and event) so that
      userspace can know why the device is in rfkill: because some
      other platform component currently owns the device, or
      because the actual hw rfkill signal is asserted.
      
      Capable userspace can now determine the reason for the rfkill
      and possibly do some negotiation on a side band channel using
      a proprietary protocol to gain ownership on the device in case
      the device is owned by some other component. When the host
      gains ownership on the device, the kernel can remove the
      RFKILL_HARD_BLOCK_NOT_OWNER reason and the hw rfkill state
      will be off. Then, the userspace can bring the device up and
      start normal operation.
      
      The rfkill_event structure is enlarged to include the additional
      byte; it is now 9 bytes long. Old user space will ask to read
      only 8 bytes, so the kernel knows not to feed it
      more data. When user space writes 8 bytes, new kernels will
      just read what is present in the file descriptor. This new byte
      is read-only from the userspace standpoint anyway.
      
      If new user space runs on an old kernel, it will ask to read 9 bytes
      but get only 8, and so it will know that it did not get the new
      state. When it writes 9 bytes, the kernel will again ignore
      this new byte, which is read-only from the userspace standpoint.
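
The size negotiation described above can be sketched from the userspace side. This is an illustration only: the struct below is a local mirror of the 9-byte layout rather than the uapi definition, and the names `rfkill_event_v2`, `RFKILL_EVENT_SIZE_V1`, and `rfkill_parse_event` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the enlarged event: the original 8 bytes plus a reason byte. */
struct rfkill_event_v2 {
    uint32_t idx;
    uint8_t  type;
    uint8_t  op;
    uint8_t  soft;
    uint8_t  hard;
    uint8_t  hard_block_reasons;   /* the new ninth byte */
} __attribute__((packed));

#define RFKILL_EVENT_SIZE_V1 8u    /* size of the event before this change */

/*
 * Interpret the result of a read() that asked for 9 bytes.
 * Returns 1 if the reason byte was delivered (new kernel), 0 if only
 * the 8-byte event arrived (old kernel), -1 on a short read.
 */
int rfkill_parse_event(const void *buf, size_t nread,
                       struct rfkill_event_v2 *ev)
{
    if (nread < RFKILL_EVENT_SIZE_V1)
        return -1;

    /* Zero first so the reason byte reads as 0 when it was not delivered. */
    memset(ev, 0, sizeof(*ev));
    memcpy(ev, buf, nread < sizeof(*ev) ? nread : sizeof(*ev));
    return nread >= sizeof(*ev);
}
```

A caller would pass the buffer and the return value of a read() on the rfkill device, and consult `hard_block_reasons` only when the function returns 1.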
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Link: https://lore.kernel.org/r/20201104134641.28816-1-emmanuel.grumbach@intel.com
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • ppp: add PPPIOCBRIDGECHAN and PPPIOCUNBRIDGECHAN ioctls · 4cf476ce
      Tom Parkin committed
      This new ioctl pair allows two ppp channels to be bridged together:
      frames arriving in one channel are transmitted in the other channel
      and vice versa.
      
      The practical use for this is primarily to support the L2TP Access
      Concentrator use-case.  The end-user session is presented as a ppp
      channel (typically PPPoE, although it could be e.g. PPPoA, or even PPP
      over a serial link) and is switched into a PPPoL2TP session for
      transmission to the LNS.  At the LNS the PPP session is terminated in
      the ISP's network.
      
      When a PPP channel is bridged to another it takes a reference on the
      other's struct ppp_file.  This reference is dropped when the channels
      are unbridged, which can occur either explicitly on userspace calling
      the PPPIOCUNBRIDGECHAN ioctl, or implicitly when either channel in the
      bridge is unregistered.
      
      In order to implement the channel bridge, struct channel is extended
      with a new field, 'bridge', which points to the other struct channel
      making up the bridge.
      
      This pointer is RCU protected to avoid adding another lock to the data
      path.
      
      To guard against concurrent writes to the pointer, the existing struct
      channel lock 'upl' coverage is extended rather than adding a new lock.
      
      The 'upl' lock is used to protect the existing unit pointer.  Since the
      bridge effectively replaces the unit (they're mutually exclusive for a
      channel) it makes coding easier to use the same lock to cover them
      both.
      Signed-off-by: Tom Parkin <tparkin@katalix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 10 Dec, 2020 7 commits
  13. 09 Dec, 2020 1 commit
  14. 08 Dec, 2020 2 commits
  15. 07 Dec, 2020 4 commits
  16. 06 Dec, 2020 1 commit
  17. 05 Dec, 2020 3 commits
    • net-zerocopy: Defer vm zap unless actually needed. · 94ab9eb9
      Arjun Roy committed
      Zapping pages is required only if we are calling vm_insert_page into a
      region where pages had previously been mapped. Receive zerocopy allows
      reusing such regions, and hitherto called zap_page_range() before
      calling vm_insert_page() in that range.
      
      zap_page_range() can also be triggered from userspace with
      madvise(MADV_DONTNEED). If userspace is configured to call this before
      reusing a segment, or if there was nothing mapped at this virtual
      address to begin with, we can avoid calling zap_page_range() under the
      socket lock. That said, if userspace does not do that, then we are
      still responsible for calling zap_page_range().
      
      This patch adds a flag that the user can use to hint to the kernel
      that a zap is not required. If the flag is not set, or if an older
      user application does not have a flags field at all, then the kernel
      calls zap_page_range as before. Also, if the flag is set but a zap is
      still required, the kernel performs that zap as necessary. Thus
      incorrectly indicating that a zap can be avoided does not change the
      correctness of operation. It also increases the batchsize for
      vm_insert_pages and prefetches the page struct for the batch since
      we're about to bump the refcount.
      
      An alternative mechanism could be to not have a flag, assume by
      default a zap is not needed, and fall back to zapping if needed.
      However, this would harm performance for older applications for which
      a zap is necessary, and thus we implement it with an explicit flag
      so newer applications can opt in.
      
      When using RPC-style traffic with medium sized (tens of KB) RPCs, this
      change yields an efficiency improvement of about 30% for QPS/CPU usage.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy. · 18fb76ed
      Arjun Roy committed
      When TCP receive zerocopy does not successfully map the entire
      requested space, it outputs a 'hint' that the caller should recvmsg().
      
      Augment zerocopy to accept a user buffer that it tries to copy this
      hint into - if it is possible to copy the entire hint, it will do so.
      This elides a recvmsg() call for received traffic that isn't exactly
      page-aligned in size.
      
      This was tested with RPC-style traffic of arbitrary sizes. Normally,
      each received message required at least one getsockopt() call, and one
      recvmsg() call for the remaining unaligned data.
      
      With this change, almost all of the recvmsg() calls are eliminated,
      leading to a savings of about 25%-50% in number of system calls
      for RPC-style workloads.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bpf: Add a bpf_sock_from_file helper · 4f19cab7
      Florent Revest committed
      While eBPF programs can check whether a file is a socket by file->f_op
      == &socket_file_ops, they cannot convert the void private_data pointer
      to a struct socket BTF pointer. In order to do this a new helper
      wrapping sock_from_file is added.
      
      This is useful to tracing programs but also other program types
      inheriting this set of helpers such as iterators or LSM programs.
      Signed-off-by: Florent Revest <revest@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: KP Singh <kpsingh@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20201204113609.1850150-2-revest@google.com