1. 09 Feb, 2021 1 commit
  2. 06 Feb, 2021 1 commit
    • batman-adv: Drop publication years from copyright info · cfa55c6d
      Sven Eckelmann authored
      The batman-adv source code was using the year of publication (to net-next)
      as "last" year for the copyright statement. The whole source code mentioned
      in the MAINTAINERS "BATMAN ADVANCED" section was handled as a single entity
      regarding the publishing year.
      
      This avoided having outdated (in sense of year information - not copyright
      holder) publishing information inside several files. But since the simple
      "update copyright year" commit (without other changes) in the file was not
      well received in the upstream kernel, the option to not have a copyright
      year (for initial and last publication) in the files was chosen instead.
      More detailed information about the years can still be retrieved from the
      SCM system.
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
  3. 05 Feb, 2021 1 commit
    • Revert "GTP: add support for flow based tunneling API" · 49ecc587
      Jonas Bonn authored
      This reverts commit 9ab7e76a.
      
      This patch was committed without maintainer approval and despite a number
      of unaddressed concerns from review.  There are several issues that
      impede the acceptance of this patch and that make a reversion of this
      particular instance of these changes the best way forward:
      
      i)  the patch contains several logically separate changes that would be
      better served as smaller patches (for review purposes)
      ii) functionality like the handling of end markers has been introduced
      without further explanation
      iii) symmetry between the handling of GTPv0 and GTPv1 has been
      unnecessarily broken
      iv) the patchset produces 'broken' packets when extension headers are
      included
      v) there are no available userspace tools to allow for testing this
      functionality
      vi) there is an unaddressed Coverity report against the patch concerning
      memory leakage
      vii) most importantly, the patch contains a large amount of superfluous
      churn that impedes other ongoing work with this driver
      
      This patch will be reworked into a series that aligns with other
      ongoing work and facilitates review.
      Signed-off-by: Jonas Bonn <jonas@norrbonn.se>
      Acked-by: Harald Welte <laforge@gnumonks.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 04 Feb, 2021 1 commit
    • ethtool: Extend link modes settings uAPI with lanes · 012ce4dd
      Danielle Ratson authored
      Currently, when auto negotiation is on, the user can advertise all the
      link modes that correspond to a specific speed, but has no similar
      selector for the number of lanes. This matters when a specific speed can
      be achieved using different numbers of lanes, for example 2x50 or 4x25.
      
      Add 'ETHTOOL_A_LINKMODES_LANES' attribute and expand 'struct
      ethtool_link_settings' with lanes field in order to implement a new
      lanes-selector that will enable the user to advertise a specific number
      of lanes as well.
      
      When auto negotiation is off, the lanes parameter can be forced only if
      the driver supports it. Add a capability bit in 'struct ethtool_ops' that
      lets ethtool know whether the driver can handle the lanes parameter when
      auto negotiation is off; if it cannot, an error message is returned when
      trying to set lanes.
      
      Example:
      
      $ ethtool -s swp1 lanes 4
      $ ethtool swp1
        Settings for swp1:
              Supported ports: [ FIBRE ]
              Supported link modes:   1000baseKX/Full
                                      10000baseKR/Full
                                      40000baseCR4/Full
                                      40000baseSR4/Full
                                      40000baseLR4/Full
                                      25000baseCR/Full
                                      25000baseSR/Full
                                      50000baseCR2/Full
                                      100000baseSR4/Full
                                      100000baseCR4/Full
              Supported pause frame use: Symmetric Receive-only
              Supports auto-negotiation: Yes
              Supported FEC modes: Not reported
              Advertised link modes:  40000baseCR4/Full
                                      40000baseSR4/Full
                                      40000baseLR4/Full
                                      100000baseSR4/Full
                                      100000baseCR4/Full
              Advertised pause frame use: No
              Advertised auto-negotiation: Yes
              Advertised FEC modes: Not reported
              Speed: Unknown!
              Duplex: Unknown! (255)
              Auto-negotiation: on
              Port: Direct Attach Copper
              PHYAD: 0
              Transceiver: internal
              Link detected: no
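
The effect of the lanes selector in the example above can be modeled with a small sketch. It is only an illustration: the helper names `lanes_of` and `filter_by_lanes` are hypothetical, and reading the lane count from the digit at the end of the mode name is an assumed shortcut for this sketch, not how the kernel or ethtool determines it.

```python
import re

# Link modes copied from the "Supported link modes" list in the example.
SUPPORTED = [
    "1000baseKX/Full", "10000baseKR/Full",
    "40000baseCR4/Full", "40000baseSR4/Full", "40000baseLR4/Full",
    "25000baseCR/Full", "25000baseSR/Full", "50000baseCR2/Full",
    "100000baseSR4/Full", "100000baseCR4/Full",
]

# <speed>base<media letters><optional lane digit>/<duplex>
_MODE_RE = re.compile(r"^(\d+)base([A-Za-z]+)(\d*)/(Full|Half)$")


def lanes_of(mode):
    """Return the lane count implied by a link mode name (assumed rule:
    a trailing digit is the lane count, no digit means one lane)."""
    match = _MODE_RE.match(mode)
    if match is None:
        raise ValueError(f"unrecognized link mode: {mode!r}")
    return int(match.group(3) or 1)


def filter_by_lanes(modes, lanes):
    """Keep only the link modes that use exactly `lanes` lanes."""
    return [mode for mode in modes if lanes_of(mode) == lanes]


if __name__ == "__main__":
    for mode in filter_by_lanes(SUPPORTED, 4):
        print(mode)
```

Filtering the supported list for four lanes yields the same five modes that appear under "Advertised link modes" after `ethtool -s swp1 lanes 4`.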
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 30 Jan, 2021 1 commit
  6. 28 Jan, 2021 5 commits
  7. 27 Jan, 2021 2 commits
    • net: allow user to set metric on default route learned via Router Advertisement · 6b2e04bc
      Praveen Chaudhary authored
      For IPv4, the default route is learned via DHCPv4, and the user can change
      its metric through configuration such as /etc/network/interfaces. For IPv6,
      however, the default route can be learned via RA, and a fixed metric value
      of 1024 is currently used.
      
      Ideally, the user should be able to configure the metric of the IPv6
      default route just as for IPv4. This patch adds a sysctl for that purpose.
      
      Logs:
      
      For IPv4:
      
      Config in /etc/network/interfaces:
      auto eth0
      iface eth0 inet dhcp
          metric 4261413864
      
      IPv4 Kernel Route Table:
      $ ip route list
      default via 172.21.47.1 dev eth0 metric 4261413864
      
      FRR Table, if a static route is configured:
      [In a real scenario, it is useful to prefer a BGP-learned default route over the DHCPv4 default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIP,
             O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
             T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
      K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m
      
      That is, with IPv4 the user can prefer a default route learned via a
      routing protocol. Without this fix, similar behavior is not possible for IPv6.
      
      After fix [for IPv6]:
      sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489705
      
      IP monitor: [When IPv6 RA is received]
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705  pref high
      
      Kernel IPv6 routing table
      $ ip -6 route list
      default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high
      
      FRR Table, if a static route is configured:
      [In a real scenario, it is useful to prefer a BGP-learned default route over the IPv6 RA default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIPng,
             O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
             v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
      K   ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m
      
      If the metric is changed later, the change takes effect only when the next
      IPv6 RA is received, because the default route must be fully controlled by
      RA messages. Below, the metric is changed from 1996489705 to 1996489704.
      
      $ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
      net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704
      
      IP monitor:
      [On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]
      
      Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high
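
The deferred update described above can be pictured with a toy model. This is a sketch of the behavior as the commit message describes it, not kernel code: the class `RaDefaultRoute` and its method names are hypothetical, and the gateway address in the usage is a placeholder.

```python
class RaDefaultRoute:
    """Toy model of one interface's RA-learned IPv6 default route."""

    def __init__(self, metric=1024):
        # Stands in for net.ipv6.conf.<ifname>.ra_defrtr_metric;
        # 1024 is the previously fixed value mentioned in the message.
        self.sysctl_metric = metric
        self.installed = None  # (gateway, metric) once a route exists
        self.events = []       # log of route changes, like `ip monitor`

    def set_metric(self, metric):
        """Changing the sysctl alone does not touch the installed route."""
        self.sysctl_metric = metric

    def receive_ra(self, gateway):
        """On each RA, (re)install the route using the current metric."""
        wanted = (gateway, self.sysctl_metric)
        if self.installed == wanted:
            return
        if self.installed is not None:
            self.events.append(("Deleted",) + self.installed)
        self.installed = wanted
        self.events.append(("Added",) + wanted)


if __name__ == "__main__":
    route = RaDefaultRoute(metric=1996489705)
    route.receive_ra("fe80::1")       # placeholder gateway address
    route.set_metric(1996489704)
    print(route.installed)            # still the old metric
    route.receive_ra("fe80::1")
    print(route.events)
```

In this model the old metric stays installed until `receive_ra` runs again, which mirrors the delete-then-add pair shown in the IP monitor output.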
      Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
      Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • media: v4l2-subdev.h: BIT() is not available in userspace · a53e3c18
      Hans Verkuil authored
      The BIT macro is not available in userspace, so replace BIT(0) by
      0x00000001.
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Fixes: 6446ec6c ("media: v4l2-subdev: add VIDIOC_SUBDEV_QUERYCAP ioctl")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
  8. 26 Jan, 2021 1 commit
  9. 24 Jan, 2021 2 commits
    • net: mrp: move struct definitions out of uapi · 67819390
      Rasmus Villemoes authored
      None of these are actually used in the kernel/userspace interface -
      there's a userspace component of implementing MRP, and userspace will
      need to construct certain frames to put on the wire, but there's no
      reason the kernel should provide the relevant definitions in a UAPI
      header.
      
      In fact, some of those definitions were broken until the previous commit,
      so only keep the few that are actually referenced in the kernel code,
      and move them to the br_private_mrp.h header.
      Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mrp: fix definitions of MRP test packets · dc090de8
      Rasmus Villemoes authored
      Wireshark says that the MRP test packets cannot be decoded - and the
      reason for that is that there's a two-byte hole filled with garbage
      between the "transitions" and "timestamp" members.
      
      So Wireshark decodes the two garbage bytes and the top two bytes of
      the timestamp written by the kernel as the timestamp value (which thus
      fluctuates wildly), and interprets the lower two bytes of the
      timestamp as a new (type, length) pair, which is of course broken.
      
      Even though this makes the timestamp field in the struct unaligned, it
      actually makes it end up on a 32 bit boundary in the frame as mandated
      by the standard, since it is preceded by a two byte TLV header.
      
      The struct definitions live under include/uapi/, but they are not
      really part of any kernel<->userspace API/ABI, so fixing the
      definitions by adding the packed attribute should not cause any
      compatibility issues.
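
The two-byte hole and the effect of packing can be reproduced with Python's `struct` module. This is a sketch under stated assumptions: the field list (three 16-bit fields and a 6-byte address before "transitions" and "timestamp") is an illustrative layout, not a copy of the kernel header, and native alignment depends on the platform (the 16-byte offset below is what common 32-bit and 64-bit ABIs produce).

```python
import struct

# Assumed field layout, for illustration only:
#   prio (u16), sa (6 bytes), port_role (u16), state (u16),
#   transitions (u16), timestamp (u32)
FIELDS = "H6sHHHI"

# Length of the (type, length) TLV header that precedes the struct on the wire.
TLV_HEADER_LEN = 2


def timestamp_offset(prefix):
    """Offset of the trailing 32-bit timestamp.

    prefix "@" uses native alignment (like an unpacked C struct),
    prefix "=" uses no alignment (like a packed C struct).
    """
    return struct.calcsize(prefix + FIELDS) - struct.calcsize(prefix + "I")


if __name__ == "__main__":
    unpacked = timestamp_offset("@")
    packed = timestamp_offset("=")
    print("unpacked offset:", unpacked, "hole bytes:", unpacked - packed)
    print("packed offset:", packed)
    print("offset in frame:", TLV_HEADER_LEN + packed)
```

With the packed layout the timestamp sits at an odd multiple of two inside the struct, yet `TLV_HEADER_LEN + packed` is a multiple of four, which matches the alignment argument made above.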
      Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  10. 23 Jan, 2021 5 commits
    • sch_htb: Hierarchical QoS hardware offload · d03b195b
      Maxim Mikityanskiy authored
      HTB doesn't scale well because of contention on a single lock, and it
      also consumes CPU. This patch adds support for offloading HTB to
      hardware that supports hierarchical rate limiting.
      
      In the offload mode, HTB passes control commands to the driver using
      ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
      and their settings (rate, ceil) in the NIC. Every modification of the
      HTB tree caused by the admin results in ndo_setup_tc being called.
      
      After this setup, the HTB algorithm is done completely in the NIC. An SQ
      (send queue) is created for every leaf class and attached to the
      hierarchy, so that the NIC can calculate and obey aggregated rate
      limits, too. In the future, it can be changed, so that multiple SQs will
      back a single leaf class.
      
      ndo_select_queue is responsible for selecting the right queue that
      serves the traffic class of each packet.
      
      The data path works as follows: a packet is classified by clsact, the
      driver selects a hardware queue according to its class, and the packet
      is enqueued into this queue's qdisc.
      
      This solution addresses two main problems of scaling HTB:
      
      1. Contention by flow classification. Currently the filters are attached
      to the HTB instance as follows:
      
          # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 \
              classid 1:10
      
      It's possible to move classification to clsact egress hook, which is
      thread-safe and lock-free:
      
          # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
              action skbedit priority 1:10
      
      This way classification still happens in software, but the lock
      contention is eliminated, and it happens before selecting the TX queue,
      allowing the driver to translate the class to the corresponding hardware
      queue in ndo_select_queue.
      
      Note that this is already compatible with non-offloaded HTB and doesn't
      require changes to the kernel nor iproute2.
      
      2. Contention by handling packets. HTB is not multi-queue, it attaches
      to a whole net device, and handling of all packets takes the same lock.
      When HTB is offloaded, it registers itself as a multi-queue qdisc,
      similarly to mq: HTB is attached to the netdev, and each queue has its
      own qdisc.
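
The class-to-queue translation mentioned in point 1 can be sketched as follows. This is a hedged illustration, not driver code: `parse_classid` and `select_queue` are hypothetical helpers, the queue table is made up, and a real ndo_select_queue implementation may work quite differently. The only fact relied on is that tc handles are written as hexadecimal major:minor pairs packed into 32 bits.

```python
def parse_classid(text):
    """Convert a tc handle such as "1:10" to its 32-bit value.

    Major and minor are hexadecimal; the major number occupies the
    upper 16 bits and the minor number the lower 16 bits.
    """
    major, _, minor = text.partition(":")
    return (int(major, 16) << 16) | int(minor or "0", 16)


def select_queue(skb_priority, class_to_queue, default_queue=0):
    """Pick a TX queue for a packet whose priority was set by skbedit.

    `class_to_queue` is a hypothetical table the driver would maintain
    while replicating the HTB hierarchy: leaf classid -> queue index.
    """
    return class_to_queue.get(skb_priority, default_queue)


if __name__ == "__main__":
    table = {parse_classid("1:10"): 3, parse_classid("1:20"): 4}
    print(hex(parse_classid("1:10")))
    print(select_queue(parse_classid("1:10"), table))
    print(select_queue(parse_classid("1:30"), table))  # falls back to 0
```

Because classification has already written the classid into the packet's priority, the lookup itself needs no lock on the qdisc, which is the point of moving the filter to the clsact egress hook.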
      
      Some features of HTB may not be supported by particular hardware: for
      example, the maximum number of classes may be limited, or the
      granularity of the rate and ceil parameters may differ. The offload is
      therefore not enabled by default; a new parameter enables it:
      
          # tc qdisc replace dev eth0 root handle 1: htb offload
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Add receive timestamp support for receive zerocopy. · 7eeba170
      Arjun Roy authored
      tcp_recvmsg() uses the CMSG mechanism to receive control information
      like packet receive timestamps. This patch adds CMSG fields to
      struct tcp_zerocopy_receive, and provides receive timestamps
      if available to the user.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS · e7ed11ee
      Yousuk Seung authored
      This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
      the time-to-live or hop limit of the latest incoming packet with
      SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
      the sequence when incoming packets are aggregated. Exporting the
      time-to-live or hop limit value of incoming packets helps to estimate
      the hop count of the path of the flow that may change over time.
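
One way an application might turn the exported value into a hop-count estimate is sketched below. The heuristic is an assumption layered on top of the patch, not part of it: it guesses that the sender started from one of the commonly used initial TTL values (64, 128 or 255) and picks the smallest one that is not below the observed value. `estimate_hops` is a hypothetical helper name.

```python
# Commonly used initial TTL / hop limit values (assumed, not from the patch).
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(received_ttl, initial_ttls=COMMON_INITIAL_TTLS):
    """Estimate how many routers decremented the TTL on the way here.

    Picks the smallest candidate initial TTL that is >= the received
    value and returns the difference.
    """
    if not 0 <= received_ttl <= 255:
        raise ValueError("TTL / hop limit must fit in one byte")
    for initial in sorted(initial_ttls):
        if received_ttl <= initial:
            return initial - received_ttl
    raise ValueError("received TTL exceeds every candidate initial TTL")


if __name__ == "__main__":
    for ttl in (57, 120, 250):
        print(ttl, "->", estimate_hops(ttl), "hops (estimated)")
```

Tracking this estimate over the lifetime of a flow is one way to notice that the path has changed, which is the use the commit message points at.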
      Signed-off-by: Yousuk Seung <ysseung@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Support get and set state of port function · a556dded
      Parav Pandit authored
      devlink port function can be in active or inactive state.
      Allow users to get and set port function's state.
      
      When the port function is activated, its operational state may change
      after a while, once the device is created and a driver binds to it.
      The same applies to the deactivation flow.
      
      To clearly describe the state of the port function and its device's
      operational state in the host system, define state and opstate
      attributes.
      
      Example of a PCI SF port which supports a port function:
      
      $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
      
      $ devlink port show
      pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
      
      $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
      pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
        function:
          hw_addr 00:00:00:00:00:00 state inactive opstate detached
      
      $ devlink port show pci/0000:06:00.0/32768
      pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
        function:
          hw_addr 00:00:00:00:88:88 state inactive opstate detached
      
      $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active
      
      $ devlink port show pci/0000:06:00.0/32768 -jp
      {
          "port": {
              "pci/0000:06:00.0/32768": {
                  "type": "eth",
                  "netdev": "ens2f0npf0sf88",
                  "flavour": "pcisf",
                  "controller": 0,
                  "pfnum": 0,
                  "sfnum": 88,
                  "external": false,
                  "splittable": false,
                  "function": {
                      "hw_addr": "00:00:00:00:88:88",
                      "state": "active",
                      "opstate": "attached"
                  }
              }
          }
      }
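
A script that waits for a port function to become usable could read the two attributes from the JSON form shown above. The sketch below only parses text in that shape; `function_states` and `is_ready` are hypothetical helper names, and treating "active" plus "attached" as "ready" is this sketch's interpretation of the state and opstate attributes.

```python
import json

# JSON in the shape printed by `devlink port show ... -jp` above.
SAMPLE = """
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
"""


def function_states(devlink_json):
    """Map each port that has a function to its (state, opstate) pair."""
    ports = json.loads(devlink_json)["port"]
    return {
        name: (attrs["function"]["state"], attrs["function"]["opstate"])
        for name, attrs in ports.items()
        if "function" in attrs
    }


def is_ready(state, opstate):
    """Administratively active and already bound to a driver."""
    return state == "active" and opstate == "attached"


if __name__ == "__main__":
    for port, (state, opstate) in function_states(SAMPLE).items():
        print(port, state, opstate, "ready" if is_ready(state, opstate) else "pending")
```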
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Vu Pham <vuhuong@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • devlink: Introduce PCI SF port flavour and port attribute · b8288837
      Parav Pandit authored
      A PCI sub-function (SF) represents a portion of the device similar
      to PCI VF.
      
      In an eswitch, a PCI SF may have a port, which is normally represented
      by a representor netdevice.
      To give better visibility of the eswitch port, its association with the
      SF, and its representor netdevice, introduce a PCI SF port flavour.
      
      When devlink port flavour is PCI SF, fill up PCI SF attributes of the
      port.
      
      Extend port name creation using PCI PF and SF number scheme on best
      effort basis, so that vendor drivers can skip defining their own
      scheme.
      This is done as cApfNSfM, where A, N and M are controller, PCI PF and
      PCI SF number respectively.
      This is similar to existing naming for PCI PF and PCI VF ports.
      
      An example view of a PCI SF port:
      
      $ devlink port show pci/0000:06:00.0/32768
      pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
        function:
          hw_addr 00:00:00:00:88:88 state active opstate attached
      
      $ devlink port show pci/0000:06:00.0/32768 -jp
      {
          "port": {
              "pci/0000:06:00.0/32768": {
                  "type": "eth",
                  "netdev": "ens2f0npf0sf88",
                  "flavour": "pcisf",
                  "controller": 0,
                  "pfnum": 0,
                  "sfnum": 88,
                  "splittable": false,
                  "function": {
                      "hw_addr": "00:00:00:00:88:88",
                      "state": "active",
                      "opstate": "attached"
                  }
              }
          }
      }
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Vu Pham <vuhuong@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
  11. 21 Jan, 2021 1 commit
  12. 20 Jan, 2021 1 commit
    • bonding: add a vlan+srcmac tx hashing option · 7b8fc010
      Jarod Wilson authored
      This comes from an end-user request, where they're running multiple VMs on
      hosts with bonded interfaces connected to some interesting switch topologies,
      where 802.3ad isn't an option. They're currently running a proprietary
      solution that effectively achieves load-balancing of VMs and bandwidth
      utilization improvements with a similar form of transmission algorithm.
      
      Basically, each VM has its own VLAN, so it always sends its traffic out
      the same interface, unless that interface fails. Traffic gets split
      between the interfaces, maintaining a consistent path, with failover still
      available if an interface goes down.
      
      Unlike bond_eth_hash(), this hash function is using the full source MAC
      address instead of just the last byte, as there are so few components to
      the hash, and in the no-vlan case, we would be returning just the last
      byte of the source MAC as the hash value. It's entirely possible to have
      two NICs in a bond with the same last byte of their MAC, but not the same
      MAC, so this adjustment should guarantee distinct hashes in all cases.
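
The idea of hashing on the VLAN ID plus the full source MAC can be sketched as below. This is an illustration under assumptions, not the driver's code: folding the MAC into two 24-bit halves and combining everything with XOR is this sketch's choice, and the helper names `vlan_srcmac_hash` and `pick_slave` are hypothetical.

```python
def vlan_srcmac_hash(src_mac, vlan_id=None):
    """Hash a frame on its VLAN ID and its full source MAC address.

    The MAC is folded into two 24-bit halves that are XORed together;
    the VLAN ID, when present, is XORed in as well (assumed arithmetic).
    """
    octets = bytes(int(part, 16) for part in src_mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-byte MAC address")
    upper = int.from_bytes(octets[:3], "big")
    lower = int.from_bytes(octets[3:], "big")
    value = upper ^ lower
    if vlan_id is not None:
        value ^= vlan_id
    return value


def pick_slave(src_mac, vlan_id, slave_count):
    """Map the hash onto one of the bond's interfaces."""
    return vlan_srcmac_hash(src_mac, vlan_id) % slave_count


if __name__ == "__main__":
    # Two MACs that share their last byte still hash differently.
    a = vlan_srcmac_hash("52:54:00:aa:bb:01")
    b = vlan_srcmac_hash("52:54:00:aa:cc:01")
    print(hex(a), hex(b), a != b)
    # The same VM on the same VLAN always lands on the same interface.
    print(pick_slave("52:54:00:aa:bb:01", 100, 2))
```

Using every byte of the MAC is what separates the two example addresses; a hash built from the last byte alone would treat them as identical.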
      
      This has been rudimentarily tested to provide similar results to the
      proprietary solution it is aiming to replace. A patch for iproute2 is also
      posted, to properly support the new mode there as well.
      
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Thomas Davis <tadavis@lbl.gov>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 16 Jan, 2021 1 commit
  14. 15 Jan, 2021 4 commits
  15. 13 Jan, 2021 1 commit
  16. 12 Jan, 2021 1 commit
  17. 10 Jan, 2021 2 commits
    • mptcp: add set_flags command in PM netlink · 0f9f696a
      Geliang Tang authored
      This patch added a new command MPTCP_PM_CMD_SET_FLAGS in PM netlink:
      
      In mptcp_nl_cmd_set_flags, parse the input address, get the backup value
      according to whether the address's FLAG_BACKUP flag is set from the
      user-space. Then check whether this address had been added in the local
      address list. If it had been, then call mptcp_nl_addr_backup to deal with
      this address.
      
      In mptcp_nl_addr_backup, traverse all the existing msk sockets to find
      the relevant sockets, and call mptcp_pm_nl_mp_prio_send_ack to send out
      a MP_PRIO ACK packet.
      
      Finally in mptcp_nl_cmd_set_flags, set or clear the address's FLAG_BACKUP
      flag.
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket · b16671e8
      Coly Li authored
      When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
      was introduced into the incompat feature set. It used bucket_size_hi
      (which was added at the tail of struct cache_sb_disk) to extend current
      16bit bucket size to 32bit with existing bucket_size in struct
      cache_sb_disk.
      
      This is not a good idea; there are two obvious problems:
      - Bucket size is always a power of 2. If log2(bucket size) is stored in
        the existing bucket_size of struct cache_sb_disk, adding
        bucket_size_hi is unnecessary.
      - Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member of
        struct cache_sb_disk. bucket_size_hi was added after d[], which makes
        csum_set() calculate an unexpected super block checksum.
      
      To fix the above problems, this patch introduces a new incompat feature
      bit, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE. When this bit is set,
      bucket_size in struct cache_sb_disk stores the order of the power-of-2
      bucket size value. When the user specifies a bucket size larger than
      32768 sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE is set in the
      incompat feature set, and bucket_size stores log2(bucket size) rather
      than the real bucket size value.
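
The encoding rule described above can be written down as a short round-trip sketch. It follows the commit message only; `encode_bucket_size` and `decode_bucket_size` are hypothetical helper names, and the boolean stands in for the BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE bit.

```python
# Largest bucket size (in sectors) still stored as a plain value,
# per the commit message.
PLAIN_LIMIT_SECTORS = 32768


def encode_bucket_size(sectors):
    """Return (bucket_size field, log feature bit) for a bucket size."""
    if sectors <= 0 or sectors & (sectors - 1):
        raise ValueError("bucket size must be a power of 2")
    if sectors > PLAIN_LIMIT_SECTORS:
        # Store the order: log2 of a power of 2 is bit_length() - 1.
        return sectors.bit_length() - 1, True
    return sectors, False


def decode_bucket_size(field, log_feature_set):
    """Recover the bucket size in sectors from the on-disk field."""
    return 1 << field if log_feature_set else field


if __name__ == "__main__":
    for size in (1024, 32768, 65536, 1 << 20):
        field, flag = encode_bucket_size(size)
        print(size, "->", field, flag, "->", decode_bucket_size(field, flag))
```

Either form fits the existing 16-bit bucket_size field, so no extra member has to be appended after d[] and the range covered by csum_set() stays as it was.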
      
      The obsolete BCH_FEATURE_INCOMPAT_LARGE_BUCKET is no longer used. It is
      renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and is still recognized
      only by the kernel driver, for legacy compatibility. The previous
      bucket_size_hi is renamed to obso_bucket_size_hi in struct cache_sb_disk
      and is no longer used in bcache-tools.
      
      For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature,
      bcache-tools and kernel driver still recognize the feature string and
      display it as "obso_large_bucket".
      
      With this change, the unnecessary growth of the bcache on-disk super
      block is avoided, and csum_set() generates the expected checksum
      as well.
      
      Fixes: ffa47032 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 08 Jan, 2021 1 commit
    • KVM: SVM: Add support for booting APs in an SEV-ES guest · 647daca2
      Tom Lendacky authored
      Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
      where the guest vCPU register state is updated and then the vCPU is VMRUN
      to begin execution of the AP. For an SEV-ES guest, this won't work because
      the guest register state is encrypted.
      
      Following the GHCB specification, the hypervisor must not alter the guest
      register state, so KVM must track an AP/vCPU boot. Should the guest want
      to park the AP, it must use the AP Reset Hold exit event in place of, for
      example, a HLT loop.
      
      First AP boot (first INIT-SIPI-SIPI sequence):
        Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
        support. It is up to the guest to transfer control of the AP to the
        proper location.
      
      Subsequent AP boot:
        KVM will expect to receive an AP Reset Hold exit event indicating that
        the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
        awaken it. When the AP Reset Hold exit event is received, KVM will place
        the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
        sequence, KVM will make the vCPU runnable. It is again up to the guest
        to then transfer control of the AP to the proper location.
      
        To differentiate between an actual HLT and an AP Reset Hold, a new MP
        state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
        placed in upon receiving the AP Reset Hold exit event. Additionally, to
        communicate the AP Reset Hold exit event up to userspace (if needed), a
        new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
      
      A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
      to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
      original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
      a new function that, for non SEV-ES guests, invokes the original SIPI
      delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
      implements the logic above.
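
The boot and park cycle described above can be pictured as a small state machine. This is a simplified reading of the commit message, not KVM code: the class `SevEsAp`, its method names and the `UNINITIALIZED` state are hypothetical, and INIT handling is reduced to a single "deliver SIPI" step.

```python
from enum import Enum


class MpState(Enum):
    UNINITIALIZED = "uninitialized"
    RUNNABLE = "runnable"
    AP_RESET_HOLD = "ap_reset_hold"  # stands in for KVM_MP_STATE_AP_RESET_HOLD


class SevEsAp:
    """Toy model of one application processor in an SEV-ES guest."""

    def __init__(self):
        self.state = MpState.UNINITIALIZED

    def ap_reset_hold_exit(self):
        """Guest parks the AP with the AP Reset Hold exit event."""
        if self.state is not MpState.RUNNABLE:
            raise RuntimeError("only a running vCPU can park itself")
        self.state = MpState.AP_RESET_HOLD

    def deliver_sipi(self):
        """End of an INIT-SIPI-SIPI sequence.

        First boot: run the vCPU as it was initialized and measured.
        Later boots: wake a vCPU that is parked in AP Reset Hold.
        In both cases register state is left alone; the guest itself
        transfers control to the proper location.
        """
        if self.state in (MpState.UNINITIALIZED, MpState.AP_RESET_HOLD):
            self.state = MpState.RUNNABLE
        return self.state


if __name__ == "__main__":
    ap = SevEsAp()
    print(ap.deliver_sipi())   # first AP boot
    ap.ap_reset_hold_exit()    # guest parks the AP
    print(ap.state)
    print(ap.deliver_sipi())   # subsequent AP boot
```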
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  19. 06 Jan, 2021 1 commit
  20. 28 Dec, 2020 1 commit
    • netfilter: nftables: add set expression flags · b4e70d8d
      Pablo Neira Ayuso authored
      The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
      supports multiple expressions per set element. In the same
      direction, NFT_DYNSET_F_EXPR specifies that the dynset expression defines
      multiple expressions per set element.
      
      This allows new userspace software with old kernels to bail out with
      EOPNOTSUPP. This update is similar to ef516e86 ("netfilter:
      nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
      needs to be set when the NFTA_SET_EXPRESSIONS attribute is specified.
      The NFT_SET_EXPR flag is not set with NFTA_SET_EXPR, to retain
      backward compatibility with old userspace binaries.
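
Why an explicit flag lets new userspace fail cleanly on an old kernel can be shown with a minimal flag check. The sketch is illustrative: the numeric flag values and the two "kernel" masks are assumptions chosen for the example, and `check_set_flags` is a hypothetical stand-in for the validation a kernel performs on the flags it receives.

```python
import errno

# Illustrative bit values, assumed for this sketch.
NFT_SET_CONCAT = 0x80
NFT_SET_EXPR = 0x100

# Flags understood by a hypothetical old and new kernel.
OLD_KERNEL_MASK = 0xFF
NEW_KERNEL_MASK = OLD_KERNEL_MASK | NFT_SET_EXPR


def check_set_flags(requested, supported_mask):
    """Reject any flag bit the kernel does not know about."""
    if requested & ~supported_mask:
        return -errno.EOPNOTSUPP
    return 0


if __name__ == "__main__":
    # New userspace asks for multiple expressions per set element.
    print(check_set_flags(NFT_SET_EXPR, OLD_KERNEL_MASK))  # negative errno
    print(check_set_flags(NFT_SET_EXPR, NEW_KERNEL_MASK))  # 0
    # Old userspace never sets the flag, so nothing changes for it.
    print(check_set_flags(NFT_SET_CONCAT, OLD_KERNEL_MASK))
```

Without the flag, an old kernel would have no bit to reject and the unsupported request would be silently accepted in a reduced form.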
      
      Fixes: 48b0ae04 ("netfilter: nftables: netlink support for several set element expressions")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  21. 22 Dec, 2020 1 commit
  22. 19 Dec, 2020 2 commits
  23. 17 Dec, 2020 1 commit
  24. 16 Dec, 2020 2 commits
    • userfaultfd: add UFFD_USER_MODE_ONLY · 37cd0575
      Lokesh Gidra authored
      Patch series "Control over userfaultfd kernel-fault handling", v6.
      
      This patch series is split from [1].  The other series enables SELinux
      support for userfaultfd file descriptors so that its creation and movement
      can be controlled.
      
      It has been demonstrated on various occasions that suspending kernel code
      execution for an arbitrary amount of time at any access to userspace
      memory (copy_from_user()/copy_to_user()/...) can be exploited to change
      the intended behavior of the kernel.  For instance, handling page faults
      in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
      FUSE, which is similar to userfaultfd in this respect, has been exploited
      in [4, 5] for similar outcome.
      
      This small patch series adds a new flag to userfaultfd(2) that allows
      callers to give up the ability to handle kernel-mode faults with the
      resulting UFFD file object.  It then adds a 'user-mode only' option to the
      unprivileged_userfaultfd sysctl knob to require unprivileged callers to
      use this new flag.
      
      The purpose of this new interface is to decrease the chance of an
      unprivileged userfaultfd user taking advantage of userfaultfd to enhance
      security vulnerabilities by lengthening the race window in kernel code.
      
      [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
      [2] https://duasynt.com/blog/linux-kernel-heap-spray
      [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
      [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
      
      This patch (of 2):
      
      userfaultfd handles page faults from both user and kernel code.  Add a new
      UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
      userfaultfd object refuse to handle faults from kernel mode, treating
      these faults as if SIGBUS were always raised, causing the kernel code to
      fail with EFAULT.
      
      A future patch adds a knob allowing administrators to give some processes
      the ability to create userfaultfd file objects only if they pass
      UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
      exploit userfaultfd's ability to delay kernel page faults to open timing
      windows for future exploits.
      
      Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
      Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <calin@google.com>
      Cc: Daniel Colascione <dancol@dancol.org>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nitin Gupta <nigupta@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • uapi: move constants from <linux/kernel.h> to <linux/const.h> · a85cbe61
      Petr Vorel authored
      and include <linux/const.h> in UAPI headers instead of <linux/kernel.h>.
      
      The reason is to avoid indirect <linux/sysinfo.h> include when using
      some network headers: <linux/netlink.h> or others -> <linux/kernel.h>
      -> <linux/sysinfo.h>.
      
      On musl, this indirect include causes a redefinition of struct sysinfo
      when both <sys/sysinfo.h> and some of the UAPI headers are included:
      
          In file included from x86_64-buildroot-linux-musl/sysroot/usr/include/linux/kernel.h:5,
                           from x86_64-buildroot-linux-musl/sysroot/usr/include/linux/netlink.h:5,
                           from ../include/tst_netlink.h:14,
                           from tst_crypto.c:13:
          x86_64-buildroot-linux-musl/sysroot/usr/include/linux/sysinfo.h:8:8: error: redefinition of `struct sysinfo'
           struct sysinfo {
                  ^~~~~~~
          In file included from ../include/tst_safe_macros.h:15,
                           from ../include/tst_test.h:93,
                           from tst_crypto.c:11:
          x86_64-buildroot-linux-musl/sysroot/usr/include/sys/sysinfo.h:10:8: note: originally defined here
      
      Link: https://lkml.kernel.org/r/20201015190013.8901-1-petr.vorel@gmail.com
      Signed-off-by: Petr Vorel <petr.vorel@gmail.com>
      Suggested-by: Rich Felker <dalias@aerifal.cx>
      Acked-by: Rich Felker <dalias@libc.org>
      Cc: Peter Korsgaard <peter@korsgaard.com>
      Cc: Baruch Siach <baruch@tkos.co.il>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>