1. 23 Nov 2018, 1 commit
  2. 21 Nov 2018, 1 commit
  3. 14 Nov 2018, 4 commits
  4. 17 Oct 2018, 1 commit
  5. 08 Oct 2018, 3 commits
    • rxrpc: Use the UDP encap_rcv hook · 5271953c
      David Howells committed
      Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
      in which a packet is placed onto the UDP receive queue and then immediately
      removed again by rxrpc.  Going via the queue in this manner seems like it
      should be unnecessary.
      
      This does, however, require the invention of a value to place in encap_type
      as that's one of the conditions to switch packets out to the encap_rcv
      hook.  Possibly the value doesn't actually matter for anything other than
      sockopts on the UDP socket, which aren't accessible outside of rxrpc
      anyway.
      
      This seems to cut a bit of time out of the time elapsed between each
      sk_buff being timestamped and turning up in rxrpc (the final number in the
      following trace excerpts).  I measured this by making the rxrpc_rx_packet
      trace point print the time elapsed between the skb being timestamped and
      the current time (in ns), e.g.:
      
      	... 424.278721: rxrpc_rx_packet: ...  ACK 25026
      
      So doing a 512MiB DIO read from my test server, with an unmodified kernel:
      
      	N       min     max     sum		mean    stddev
      	27605   2626    7581    7.83992e+07     2840.04 181.029
      
      and with the patch applied:
      
      	N       min     max     sum		mean    stddev
      	27547   1895    12165   6.77461e+07     2459.29 255.02
      Signed-off-by: David Howells <dhowells@redhat.com>
      5271953c
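
      A minimal kernel-side sketch of the mechanism (not the rxrpc patch itself;
      my_encap_rcv and MY_ENCAP_TYPE are illustrative names):

        #include <linux/udp.h>
        #include <net/udp.h>

        static int my_encap_rcv(struct sock *sk, struct sk_buff *skb)
        {
                /* Consume the skb directly, bypassing the UDP receive queue.
                 * Returning 0 tells UDP the packet has been taken. */
                return 0;
        }

        static void my_socket_setup(struct socket *sock)
        {
                struct sock *sk = sock->sk;

                udp_sk(sk)->encap_type = MY_ENCAP_TYPE; /* invented value, as noted above */
                udp_sk(sk)->encap_rcv  = my_encap_rcv;
                udp_encap_enable();
        }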
    • net/smc: retain old name for diag_mode field · d4f0006a
      Eugene Syromiatnikov committed
      Commit c601171d ("net/smc: provide smc mode in smc_diag.c") changed
      the name of the diag_fallback field of struct smc_diag_msg
      to diag_mode.  However, this structure is part of the UAPI, and this change
      breaks user space applications that use it ([1], for example).  Since
      the new name is more suitable, convert the field to a union that provides
      access to the data via both the new and the old name.
      
      [1] https://gitlab.com/strace/strace/blob/v4.24/netlink_smc_diag.c#L165
      
      Fixes: c601171d ("net/smc: provide smc mode in smc_diag.c")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4f0006a
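
      A hedged sketch of the idea (not necessarily the exact merged hunk): the
      field becomes an anonymous union so both names address the same byte:

        struct smc_diag_msg {
                /* ...preceding members unchanged... */
                union {
                        __u8    diag_mode;
                        __u8    diag_fallback; /* old name, kept for UAPI compatibility */
                };
                /* ...following members unchanged... */
        };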
    • net/smc: use __aligned_u64 for 64-bit smc_diag fields · a21048c8
      Eugene Syromiatnikov committed
      Commit 4b1b7d3b ("net/smc: add SMC-D diag support") introduced a
      new UAPI-exposed structure, struct smcd_diag_dmbinfo.  However,
      it's not usable by compat binaries, as it has a different layout there.
      Probably, the most straightforward fix that will avoid similar issues
      in the future is to use __aligned_u64 for 64-bit fields.
      
      Fixes: 4b1b7d3b ("net/smc: add SMC-D diag support")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a21048c8
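
      An illustrative (not the real) struct showing why this matters: a plain
      __u64 is 4-byte aligned on 32-bit x86 but 8-byte aligned on 64-bit, so a
      compat binary sees a different layout; __aligned_u64 forces 8-byte
      alignment everywhere:

        struct example_dmbinfo {
                __u32           linkid;
                /* with a plain __u64 there is padding here on 64-bit only */
                __aligned_u64   token;
        };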
  6. 06 Oct 2018, 1 commit
  7. 25 Sep 2018, 1 commit
  8. 20 Sep 2018, 1 commit
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Drew Schmitt committed
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling access to reads of this MSR gives userspace the control to "expose"
      this platform-dependent information to guests in a clear way. As it exists
      today, guests that read this MSR would get unpopulated information if userspace
      hadn't already set it (and prior to this patch series, only the CPUID faulting
      information could have been populated). This existing interface could be
      confusing if guests don't handle the potential for incorrect/incomplete
      information gracefully (e.g. zero reported for base frequency).
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6fbbde9a
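
      A hedged userspace sketch of how a VMM might use the capability (vm_fd is
      assumed to be an open KVM VM file descriptor; args[0] = 0 requests that
      guest reads of the MSR fault):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        static void disable_platform_info_reads(int vm_fd)
        {
                struct kvm_enable_cap cap = {
                        .cap = KVM_CAP_MSR_PLATFORM_INFO,
                        .args[0] = 0, /* 0: guest reads of MSR_PLATFORM_INFO #GP */
                };

                ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }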
  9. 10 Sep 2018, 1 commit
  10. 05 Sep 2018, 1 commit
  11. 04 Sep 2018, 1 commit
  12. 03 Sep 2018, 1 commit
    • uapi: Fix linux/rds.h userspace compilation errors. · 59a03fea
      Vinson Lee committed
      Include linux/in6.h for struct in6_addr.
      
      /usr/include/linux/rds.h:156:18: error: field ‘laddr’ has incomplete type
        struct in6_addr laddr;
                        ^~~~~
      /usr/include/linux/rds.h:157:18: error: field ‘faddr’ has incomplete type
        struct in6_addr faddr;
                        ^~~~~
      /usr/include/linux/rds.h:178:18: error: field ‘laddr’ has incomplete type
        struct in6_addr laddr;
                        ^~~~~
      /usr/include/linux/rds.h:179:18: error: field ‘faddr’ has incomplete type
        struct in6_addr faddr;
                        ^~~~~
      /usr/include/linux/rds.h:198:18: error: field ‘bound_addr’ has incomplete type
        struct in6_addr bound_addr;
                        ^~~~~~~~~~
      /usr/include/linux/rds.h:199:18: error: field ‘connected_addr’ has incomplete type
        struct in6_addr connected_addr;
                        ^~~~~~~~~~~~~~
      /usr/include/linux/rds.h:219:18: error: field ‘local_addr’ has incomplete type
        struct in6_addr local_addr;
                        ^~~~~~~~~~
      /usr/include/linux/rds.h:221:18: error: field ‘peer_addr’ has incomplete type
        struct in6_addr peer_addr;
                        ^~~~~~~~~
      /usr/include/linux/rds.h:245:18: error: field ‘src_addr’ has incomplete type
        struct in6_addr src_addr;
                        ^~~~~~~~
      /usr/include/linux/rds.h:246:18: error: field ‘dst_addr’ has incomplete type
        struct in6_addr dst_addr;
                        ^~~~~~~~
      
      Fixes: b7ff8b10 ("rds: Extend RDS API for IPv6 support")
      Signed-off-by: Vinson Lee <vlee@freedesktop.org>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59a03fea
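
      A minimal userspace compile check (sketch): including the header on its
      own used to trigger the errors above and succeeds once rds.h pulls in
      <linux/in6.h> itself:

        #include <linux/rds.h>  /* must be self-contained after the fix */

        int main(void)
        {
                return 0;
        }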
  13. 23 Aug 2018, 1 commit
  14. 17 Aug 2018, 1 commit
    • netfilter: uapi: fix linux/netfilter/nf_osf.h userspace compilation errors · cdb2f401
      Dmitry V. Levin committed
      Move inclusion of <linux/ip.h> and <linux/tcp.h> from
      linux/netfilter/xt_osf.h to linux/netfilter/nf_osf.h to fix
      the following linux/netfilter/nf_osf.h userspace compilation errors:
      
      /usr/include/linux/netfilter/nf_osf.h:59:24: error: 'MAX_IPOPTLEN' undeclared here (not in a function)
        struct nf_osf_opt opt[MAX_IPOPTLEN];
      /usr/include/linux/netfilter/nf_osf.h:64:17: error: field 'ip' has incomplete type
        struct iphdr   ip;
      /usr/include/linux/netfilter/nf_osf.h:65:18: error: field 'tcp' has incomplete type
        struct tcphdr   tcp;
      
      Fixes: bfb15f2a ("netfilter: extract Passive OS fingerprint infrastructure from xt_osf")
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cdb2f401
  15. 13 Aug 2018, 2 commits
    • ipv6: Add icmp_echo_ignore_all support for ICMPv6 · e6f86b0f
      Virgile Jarry committed
      Preventing the kernel from responding to ICMP Echo Request messages
      can be useful in several ways. The sysctl parameter
      'icmp_echo_ignore_all' can be used to prevent the kernel from
      responding to IPv4 ICMP echo requests. For IPv6 pings, such
      a sysctl kernel parameter did not exist.
      
      Add the ability to prevent the kernel from responding to IPv6
      ICMP echo requests through the use of the following sysctl
      parameter: /proc/sys/net/ipv6/icmp/echo_ignore_all.
      Update the documentation to reflect this change.
      Signed-off-by: Virgile Jarry <virgile@acceis.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6f86b0f
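
      A hedged usage sketch (a plain write to the new sysctl file, mirroring the
      existing IPv4 knob):

        #include <stdio.h>

        static void ignore_icmpv6_echo(void)
        {
                FILE *f = fopen("/proc/sys/net/ipv6/icmp/echo_ignore_all", "w");

                if (f) {
                        fputs("1", f); /* 1 = do not answer ICMPv6 echo requests */
                        fclose(f);
                }
        }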
    • bpf: Introduce bpf_skb_ancestor_cgroup_id helper · 77236281
      Andrey Ignatov committed
      == Problem description ==
      
      It's useful to be able to identify cgroup associated with skb in TC so
      that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
      helper can help with this.
      
      In real life, though, the cgroup hierarchy and the hierarchy a policy
      applies to don't map 1:1.
      
      It's often the case that there is a container and a corresponding cgroup,
      but there are many more sub-cgroups inside the container, e.g. because it's
      delegated to the containerized application to control resources for its
      subsystems, or to separate the application inside the container from infra
      that belongs to the containerization system (e.g. sshd).
      
      At the same time it may be useful to apply a policy to the container as a
      whole.
      
      If multiple containers like this are run on a host (which is often the
      case) and many of them have sub-cgroups, it may not be possible to apply a
      per-container policy in TC with existing helpers such as
      bpf_skb_under_cgroup or bpf_skb_cgroup_id:
      
      * bpf_skb_cgroup_id will return the id of the immediate cgroup associated
        with the skb, i.e. if it's a sub-cgroup inside a container, it can't be
        used to identify the container's cgroup;
      
      * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
        i.e. if there are N containers on a host and a policy has to be
        applied to M of them (0 <= M <= N), it'd require M calls to
        bpf_skb_under_cgroup, and, if M changes, it'd require rebuilding and
        loading a new BPF program.
      
      == Solution ==
      
      The patch introduces a new helper bpf_skb_ancestor_cgroup_id that can be
      used to get the id of the cgroup v2 ancestor, at a specified level of the
      cgroup hierarchy, of the cgroup associated with the skb.
      
      That way an admin can place all containers on one level of the cgroup
      hierarchy (which is good practice in general and already used in many
      configurations) and identify the specific cgroup on this level no matter
      which sub-cgroup the skb is associated with.
      
      E.g. if there is a cgroup hierarchy:
        root/
        root/container1/
        root/container1/app11/
        root/container1/app11/sub-app-a/
        root/container1/app12/
        root/container2/
        root/container2/app21/
        root/container2/app22/
        root/container2/app22/sub-app-b/
      
      then, for an skb associated with root/container1/app11/sub-app-a/, it's
      possible to get the ancestor at level 1, which is container1, and apply a
      policy for this container, or apply another policy if it's container2.
      
      Policies can be kept e.g. in a hash map where the key is a container
      cgroup id and the value is an action.
      
      The levels where container cgroups are created are usually known in
      advance, whereas the cgroup hierarchy inside a container may be hard to
      predict, especially when its creation is delegated to the containerized
      application.
      
      == Implementation details ==
      
      The helper gets ancestor by walking parents up to specified level.
      
      Another option would be to get a different kind of "id" from
      cgroup->ancestor_ids[level] and use it with idr_find() to get the struct
      cgroup of the ancestor. But that would require a radix lookup, which
      doesn't seem to be better (at least it's not obviously better).
      
      The format of the new helper's return value is the same as that of
      bpf_skb_cgroup_id.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      77236281
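
      A hedged TC BPF sketch of the per-container policy idea described above
      (policy_map, the section name and the level value are illustrative;
      modern BTF-style map definitions are used for brevity):

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 64);
                __type(key, __u64);   /* container cgroup id */
                __type(value, __u32); /* TC action to apply */
        } policy_map SEC(".maps");

        SEC("classifier")
        int container_policy(struct __sk_buff *skb)
        {
                /* Level 1 is assumed to be where container cgroups live
                 * (root/container1/, root/container2/, ...). */
                __u64 cg = bpf_skb_ancestor_cgroup_id(skb, 1);
                __u32 *act = bpf_map_lookup_elem(&policy_map, &cg);

                return act ? *act : TC_ACT_OK;
        }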
  16. 12 Aug 2018, 3 commits
  17. 11 Aug 2018, 2 commits
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau committed
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non-SK_FILTER/CGROUP_SKB programs, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, we are in the middle of transiting
      from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this
      point, it is not always clear where the bpf context could be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[], even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is also
      not clear whether the lower layer will only ever be ipv4 and ipv6, or
      whether it will touch the cb[] again before transiting to the upper
      layer.
      
      For example, in udp_gro_receive(), it uses the 48-byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  For the above reasons, if skb->cb is
      used for the bpf ctx, saving-and-restoring is needed and likely the
      whole 48-byte cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and set the needed
      values there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also in line with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      added to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked at run
      time in "sk_select_reuseport()".  Doing the check at verification
      time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. the reuseport_id
      does not match).  The bpf prog can then decide whether it wants to do
      SK_DROP at its discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
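
      A hedged BPF-side sketch of how such a program might look (reuseport_map
      and the index-selection logic are placeholders; a real program would
      derive the index from the packet data exposed via data/data_end):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
                __uint(max_entries, 16);
                __type(key, __u32);
                __type(value, __u32); /* socket fd on update */
        } reuseport_map SEC(".maps");

        SEC("sk_reuseport")
        int select_sk(struct sk_reuseport_md *reuse_md)
        {
                __u32 index = 0; /* application-defined selection logic */

                bpf_sk_select_reuseport(reuse_md, &reuseport_map, &index, 0);
                /* SK_PASS uses the selected sk if one was set; otherwise the
                 * kernel falls back to its normal reuseport selection. */
                return SK_PASS;
        }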
    • bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau committed
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make a decision.  In this case, deciding
      which SO_REUSEPORT sk should serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection-info contract/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessarily mean the fd is closed]
      
      During map_update_elem(), only a SO_REUSEPORT sk (i.e. one which has
      already been added to a reuse->socks[]) can be used.  That means a
      SO_REUSEPORT sk on which "bind()" (UDP) or "bind()+listen()" (TCP) has
      already been done.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT sk is deleted from the "reuse->socks[]" (e.g. on "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64 bits, but that goes against the natural userspace
      expectation of a 32-bit sizeof(fd) (which other fd-based bpf maps follow).
      It may catch users by surprise if we enforce value_size=8 while
      userspace still passes a 32-bit fd during update.  Supporting different
      value_sizes between lookup and update seems unintuitive as well.
      
      We also need to consider what happens if other existing fd-based maps want
      to return a 64-bit value from the syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assumes users will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will only
      return the 64-bit value from sock_gen_cookie() when the user consciously
      chooses value_size=8 (as a signal that lookup is desired), which then
      requires a 64-bit value in both lookup and update.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
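
      A hedged userspace sketch of populating the map with libbpf (map_fd is
      assumed to be a REUSEPORT_SOCKARRAY created with value_size=4, and
      listen_fd a SO_REUSEPORT socket that is already bound/listening):

        #include <bpf/bpf.h>

        static int publish_socket(int map_fd, int listen_fd)
        {
                __u32 index = 0;          /* slot the BPF program will select */
                __u32 sk_fd = listen_fd;  /* 32-bit fd, matching value_size=4 */

                return bpf_map_update_elem(map_fd, &index, &sk_fd, BPF_ANY);
        }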
  18. 09 Aug 2018, 1 commit
  19. 08 Aug 2018, 2 commits
    • net/sched: allow flower to match tunnel options · 0a6e7778
      Pieter Jansen van Vuuren committed
      Allow matching on options in Geneve tunnel headers.
      This makes use of existing tunnel metadata support.
      
      The options can be described in the form
      CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
      represented as a 16bit hexadecimal value, TYPE as an 8bit
      hexadecimal value and DATA as a variable length hexadecimal value.
      
      e.g.
       # ip link add name geneve0 type geneve dstport 0 external
       # tc qdisc add dev geneve0 ingress
       # tc filter add dev geneve0 protocol ip parent ffff: \
           flower \
             enc_src_ip 10.0.99.192 \
             enc_dst_ip 10.0.99.193 \
             enc_key_id 11 \
             geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \
             ip_proto udp \
             action mirred egress redirect dev eth1
      
      This patch adds support for matching Geneve options in the order
      supplied by the user. This leads to an efficient implementation in
      the software datapath (and in our opinion hardware datapaths that
      offload this feature). It is also compatible with Geneve options
      matching provided by the Open vSwitch kernel datapath which is
      relevant here as the Flower classifier may be used as a mechanism
      to program flows into hardware as a form of Open vSwitch datapath
      offload (sometimes referred to as OVS-TC). The netlink
      Kernel/Userspace API may be extended, for example by adding a flag,
      if other matching options are desired, for example matching given
      options in any order. This would require an implementation in the
      TC software datapath, done in a way that allows drivers that
      facilitate offload of the Flower classifier to reject or accept
      such flows based on hardware datapath capabilities.
      
      This approach was discussed and agreed on at Netconf 2017 in Seoul.
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a6e7778
    • ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKE · 6cfef793
      Florian Fainelli committed
      Add the ability to specify through ethtool::rxnfc that a rule location is
      special and will be used to participate in Wake-on-LAN, e.g. by having a
      specific pattern be matched. When this is the case, fs->ring_cookie must
      be set to the special value RX_CLS_FLOW_WAKE.
      
      We also define an additional ethtool::wolinfo flag: WAKE_FILTER which
      can be used to configure an Ethernet adapter to allow Wake-on-LAN using
      previously programmed filters.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cfef793
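
      A hedged userspace sketch of the two pieces (flow-spec fields other than
      the wake-up ring_cookie are omitted; both structs would be submitted via
      the usual SIOCETHTOOL ioctl path):

        #include <linux/ethtool.h>

        /* Classification rule whose "ring" is the special wake-up target. */
        struct ethtool_rxnfc nfc = {
                .cmd = ETHTOOL_SRXCLSRLINS,
                .fs = {
                        .flow_type   = TCP_V4_FLOW,
                        .ring_cookie = RX_CLS_FLOW_WAKE,
                        .location    = 0,
                },
        };

        /* Arm Wake-on-LAN using previously programmed filters. */
        struct ethtool_wolinfo wol = {
                .cmd     = ETHTOOL_SWOL,
                .wolopts = WAKE_FILTER,
        };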
  20. 07 Aug 2018, 3 commits
    • netfilter: nft_ct: add ct timeout support · 7e0b2b57
      Harsha Sharma committed
      This patch allows adding, listing and deleting connection tracking timeout
      policies via the nft objref infrastructure and assigning these timeouts
      via nft rules.
      
      %./libnftnl/examples/nft-ct-timeout-add ip raw cttime tcp
      
      Ruleset:
      
      table ip raw {
         ct timeout cttime {
             protocol tcp;
             policy = {established: 111, close: 13 }
         }
      
         chain output {
             type filter hook output priority -300; policy accept;
             ct timeout set "cttime"
         }
      }
      
      %./libnftnl/examples/nft-rule-ct-timeout-add ip raw output cttime
      
      %conntrack -E
      [NEW] tcp      6 111 ESTABLISHED src=172.16.19.128 dst=172.16.19.1
      sport=22 dport=41360 [UNREPLIED] src=172.16.19.1 dst=172.16.19.128
      sport=41360 dport=22
      
      %nft delete rule ip raw output handle <handle>
      %./libnftnl/examples/nft-ct-timeout-del ip raw cttime
      
      Joint work with Pablo Neira.
      Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      7e0b2b57
    • netfilter: nft_osf: use NFT_OSF_MAXGENRELEN instead of IFNAMSIZ · 35a8a3bd
      Fernando Fernandez Mancera committed
      As no "genre" in pf.os exceeds 16 bytes in length, we reduce the
      NFT_OSF_MAXGENRELEN parameter to 16 bytes and use it instead of IFNAMSIZ.
      Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      35a8a3bd
    • vhost: switch to use new message format · 429711ae
      Jason Wang committed
      We used to have a message like:
      
      struct vhost_msg {
      	int type;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      Unfortunately, there will be a 32-bit hole on 64-bit machines because
      of the alignment. This leads to different formats between the 32-bit API
      and the 64-bit API. What's more, it will break 32-bit programs running on
      64-bit machines.
      
      Fix this by introducing a new message type with an explicit
      32-bit reserved field after the type, like:
      
      struct vhost_msg_v2 {
      	__u32 type;
      	__u32 reserved;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      We will have a consistent ABI after switching to use this. To enable
      this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURE) for
      userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).
      
      Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      429711ae
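
      A hedged sketch of how userspace might opt in (vhost_fd is assumed to be
      an open vhost device fd; the ioctl and feature names are taken from the
      commit text and may differ slightly in the merged UAPI headers):

        __u64 features = 1ULL << VHOST_BACKEND_F_IOTLB_V2;

        if (ioctl(vhost_fd, VHOST_SET_BACKEND_FEATURE, &features) == 0) {
                /* the device will now send/accept struct vhost_msg_v2 */
        }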
  21. 06 Aug 2018, 5 commits
    • KVM: X86: Implement PV IPIs in linux guest · aaffcfd1
      Wanpeng Li committed
      Implement paravirtual apic hooks to enable PV IPIs for KVM if the "send IPI"
      hypercall is available.  The hypercall lets a guest send IPIs, with
      at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
      hypercall in 32-bit mode.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aaffcfd1
    • KVM: X86: Implement "send IPI" hypercall · 4180bf1b
      Wanpeng Li committed
      Use a hypercall to send IPIs with one vmexit instead of one vmexit per
      destination in xAPIC/x2APIC physical mode, and one vmexit per cluster in
      x2APIC cluster mode. Intel guests can enter x2apic cluster mode when
      interrupt remapping is enabled in qemu; however, the latest AMD EPYC still
      only supports xapic mode, which can gain a great improvement from
      exit-less IPIs. This patchset lets a guest send multicast IPIs, with at
      most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
      hypercall in 32-bit mode.
      
      Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
      is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):
      
      x2apic cluster mode, vanilla
      
       Dry-run:                         0,            2392199 ns
       Self-IPI:                  6907514,           15027589 ns
       Normal IPI:              223910476,          251301666 ns
       Broadcast IPI:                   0,         9282161150 ns
       Broadcast lock:                  0,         8812934104 ns
      
      x2apic cluster mode, pv-ipi
      
       Dry-run:                         0,            2449341 ns
       Self-IPI:                  6720360,           15028732 ns
       Normal IPI:              228643307,          255708477 ns
       Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
       Broadcast lock:                  0,         8316124651 ns
      
      x2apic physical mode, vanilla
      
       Dry-run:                         0,            3135933 ns
       Self-IPI:                  8572670,           17901757 ns
       Normal IPI:              226444334,          255421709 ns
       Broadcast IPI:                   0,        19845070887 ns
       Broadcast lock:                  0,        19827383656 ns
      
      x2apic physical mode, pv-ipi
      
       Dry-run:                         0,            2446381 ns
       Self-IPI:                  6788217,           15021056 ns
       Normal IPI:              219454441,          249583458 ns
       Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
       Broadcast lock:                  0,         9143618799 ns
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4180bf1b
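
      A hedged guest-side sketch of the 64-bit ABI described above (argument
      layout paraphrased from the description; the bitmap and vector values are
      purely illustrative):

        unsigned long ipi_bitmap[2] = { 0 };   /* covers APIC ids [min, min + 127] */
        int min = 0;                           /* lowest APIC id in the bitmap */
        u32 icr = APIC_DM_FIXED | 0xfd;        /* delivery mode + vector */

        __set_bit(3, ipi_bitmap);              /* target APIC id 3 */
        __set_bit(70, ipi_bitmap);             /* target APIC id 70 */
        kvm_hypercall4(KVM_HC_SEND_IPI, ipi_bitmap[0], ipi_bitmap[1], min, icr);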
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Jim Mattson committed
      For nested virtualization, L0 KVM is managing a bit of state for L2
      guests; this state cannot be captured through the currently available
      IOCTLs. In fact the state captured through all of these IOCTLs is usually
      a mix of L1 and L2 state. It is also dependent on whether the L2 guest was
      running at the moment when the process was interrupted to save its state.
      
      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8fcc4b59
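
      A hedged userspace sketch of the save/restore flow (vm_fd and vcpu_fd are
      assumed open KVM file descriptors; error handling omitted):

        #include <linux/kvm.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>

        static struct kvm_nested_state *save_nested_state(int vm_fd, int vcpu_fd)
        {
                /* KVM_CHECK_EXTENSION reports the maximum state size needed. */
                int size = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NESTED_STATE);
                struct kvm_nested_state *state = calloc(1, size);

                state->size = size;
                ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state);
                return state; /* later: ioctl(vcpu_fd, KVM_SET_NESTED_STATE, state) */
        }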
    • aio: implement IOCB_CMD_POLL · bfe4037e
      Christoph Hellwig committed
      Simple one-shot poll through the io_submit() interface.  To poll for
      a file descriptor the application should submit an iocb of type
      IOCB_CMD_POLL.  It will poll the fd for the events specified in
      the first 32 bits of the aio_buf field of the iocb.
      
      Unlike poll or epoll without EPOLLONESHOT, this interface always works
      in one-shot mode, that is, once the iocb is completed, it will have to be
      resubmitted.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Avi Kivity <avi@scylladb.com>
      bfe4037e
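
      A hedged userspace sketch using the raw aio syscalls (ctx is assumed to
      come from a prior io_setup(); completions are reaped with io_getevents()
      and the iocb must be resubmitted to poll again):

        #include <linux/aio_abi.h>
        #include <poll.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static long submit_poll(aio_context_t ctx, int fd)
        {
                struct iocb cb;
                struct iocb *cbs[1] = { &cb };

                memset(&cb, 0, sizeof(cb));
                cb.aio_lio_opcode = IOCB_CMD_POLL;
                cb.aio_fildes     = fd;
                cb.aio_buf        = POLLIN; /* events in the low 32 bits */

                return syscall(__NR_io_submit, ctx, 1, cbs);
        }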
    • ip: discard IPv4 datagrams with overlapping segments. · 7969e5c4
      Peter Oskolkov committed
      This behavior is required in IPv6, and there is little need
      to tolerate overlapping fragments in IPv4. This change
      simplifies the code and eliminates potential DDoS attack vectors.
      
      Tested: ran ip_defrag selftest (not yet available upstream).
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7969e5c4
  22. 05 Aug 2018, 1 commit
  23. 04 Aug 2018, 2 commits