You need to sign in or sign up before continuing.
  1. 10 2月, 2017 5 次提交
    • J
      openvswitch: Add force commit. · dd41d33f
      Jarno Rajahalme 提交于
      Stateful network admission policy may allow connections to one
      direction and reject connections initiated in the other direction.
      After policy change it is possible that for a new connection an
      overlapping conntrack entry already exists, where the original
      direction of the existing connection is opposed to the new
      connection's initial packet.
      
      Most importantly, conntrack state relating to the current packet gets
      the "reply" designation based on whether the original direction tuple
      or the reply direction tuple matched.  If this "directionality" is
      wrong w.r.t. to the stateful network admission policy it may happen
      that packets in neither direction are correctly admitted.
      
      This patch adds a new "force commit" option to the OVS conntrack
      action that checks the original direction of an existing conntrack
      entry.  If that direction is opposed to the current packet, the
      existing conntrack entry is deleted and a new one is subsequently
      created in the correct direction.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd41d33f
    • J
      openvswitch: Add original direction conntrack tuple to sw_flow_key. · 9dd7f890
      Jarno Rajahalme 提交于
      Add the fields of the conntrack original direction 5-tuple to struct
      sw_flow_key.  The new fields are initially marked as non-existent, and
      are populated whenever a conntrack action is executed and either finds
      or generates a conntrack entry.  This means that these fields exist
      for all packets that were not rejected by conntrack as untrackable.
      
      The original tuple fields in the sw_flow_key are filled from the
      original direction tuple of the conntrack entry relating to the
      current packet, or from the original direction tuple of the master
      conntrack entry, if the current conntrack entry has a master.
      Generally, expected connections of connections having an assigned
      helper (e.g., FTP), have a master conntrack entry.
      
      The main purpose of the new conntrack original tuple fields is to
      allow matching on them for policy decision purposes, with the premise
      that the admissibility of tracked connections reply packets (as well
      as original direction packets), and both direction packets of any
      related connections may be based on ACL rules applying to the master
      connection's original direction 5-tuple.  This also makes it easier to
      make policy decisions when the actual packet headers might have been
      transformed by NAT, as the original direction 5-tuple represents the
      packet headers before any such transformation.
      
      When using the original direction 5-tuple the admissibility of return
      and/or related packets need not be based on the mere existence of a
      conntrack entry, allowing separation of admission policy from the
      established conntrack state.  While existence of a conntrack entry is
      required for admission of the return or related packets, policy
      changes can render connections that were initially admitted to be
      rejected or dropped afterwards.  If the admission of the return and
      related packets was based on mere conntrack state (e.g., connection
      being in an established state), a policy change that would make the
      connection rejected or dropped would need to find and delete all
      conntrack entries affected by such a change.  When using the original
      direction 5-tuple matching the affected conntrack entries can be
      allowed to time out instead, as the established state of the
      connection would not need to be the basis for packet admission any
      more.
      
      It should be noted that the directionality of related connections may
      be the same or different than that of the master connection, and
      neither the original direction 5-tuple nor the conntrack state bits
      carry this information.  If needed, the directionality of the master
      connection can be stored in master's conntrack mark or labels, which
      are automatically inherited by the expected related connections.
      
      The fact that neither ARP nor ND packets are trackable by conntrack
      allows mutual exclusion between ARP/ND and the new conntrack original
      tuple fields.  Hence, the IP addresses are overlaid in union with ARP
      and ND fields.  This allows the sw_flow_key to not grow much due to
      this patch, but it also means that we must be careful to never use the
      new key fields with ARP or ND packets.  ARP is easy to distinguish and
      keep mutually exclusive based on the ethernet type, but ND being an
      ICMPv6 protocol requires a bit more attention.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dd7f890
    • J
      openvswitch: Unionize ovs_key_ct_label with a u32 array. · cb80d58f
      Jarno Rajahalme 提交于
      Make the array of labels in struct ovs_key_ct_label an union, adding a
      u32 array of the same byte size as the existing u8 array.  It is
      faster to loop through the labels 32 bits at the time, which is also
      the alignment of netlink attributes.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb80d58f
    • X
      sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter · 242bd2d5
      Xin Long 提交于
      This patch is to implement Sender-Side Procedures for the Add
      Outgoing and Incoming Streams Request Parameter described in
      rfc6525 section 5.1.5-5.1.6.
      
      It is also to add sockopt SCTP_ADD_STREAMS in rfc6525 section
      6.3.4 for users.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      242bd2d5
    • X
      sctp: implement sender-side procedures for SSN/TSN Reset Request Parameter · a92ce1a4
      Xin Long 提交于
      This patch is to implement Sender-Side Procedures for the SSN/TSN
      Reset Request Parameter descibed in rfc6525 section 5.1.4.
      
      It is also to add sockopt SCTP_RESET_ASSOC in rfc6525 section 6.3.3
      for users.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a92ce1a4
  2. 04 2月, 2017 5 次提交
    • R
      bridge: uapi: add per vlan tunnel info · b3c7ef0a
      Roopa Prabhu 提交于
      New nested netlink attribute to associate tunnel info per vlan.
      This is used by bridge driver to send tunnel metadata to
      bridge ports in vlan tunnel mode. This patch also adds new per
      port flag IFLA_BRPORT_VLAN_TUNNEL to enable vlan tunnel mode.
      off by default.
      
      One example use for this is a vxlan bridging gateway or vtep
      which maps vlans to vn-segments (or vnis). User can configure
      per-vlan tunnel information which the bridge driver can use
      to bridge vlan into the corresponding vn-segment.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3c7ef0a
    • R
      vxlan: support fdb and learning in COLLECT_METADATA mode · 3ad7a4b1
      Roopa Prabhu 提交于
      Vxlan COLLECT_METADATA mode today solves the per-vni netdev
      scalability problem in l3 networks. It expects all forwarding
      information to be present in dst_metadata. This patch series
      enhances collect metadata mode to include the case where only
      vni is present in dst_metadata, and the vxlan driver can then use
      the rest of the forwarding information datbase to make forwarding
      decisions. There is no change to default COLLECT_METADATA
      behaviour. These changes only apply to COLLECT_METADATA when
      used with the bridging use-case with a special dst_metadata
      tunnel info flag (eg: where vxlan device is part of a bridge).
      For all this to work, the vxlan driver will need to now support a
      single fdb table hashed by mac + vni. This series essentially makes
      this happen.
      
      use-case and workflow:
      vxlan collect metadata device participates in bridging vlan
      to vn-segments. Bridge driver above the vxlan device,
      sends the vni corresponding to the vlan in the dst_metadata.
      vxlan driver will lookup forwarding database with (mac + vni)
      for the required remote destination information to forward the
      packet.
      
      Changes introduced by this patch:
          - allow learning and forwarding database state in vxlan netdev in
            COLLECT_METADATA mode. Current behaviour is not changed
            by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
            to support the new bridge friendly mode.
          - A single fdb table hashed by (mac, vni) to allow fdb entries with
            multiple vnis in the same fdb table
          - rx path already has the vni
          - tx path expects a vni in the packet with dst_metadata
          - prior to this series, fdb remote_dsts carried remote vni and
            the vxlan device carrying the fdb table represented the
            source vni. With the vxlan device now representing multiple vnis,
            this patch adds a src vni attribute to the fdb entry. The remote
            vni already uses NDA_VNI attribute. This patch introduces
            NDA_SRC_VNI netlink attribute to represent the src vni in a multi
            vni fdb table.
      
      iproute2 example (patched and pruned iproute2 output to just show
      relevant fdb entries):
      example shows same host mac learnt on two vni's.
      
      before (netdev per vni):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
      
      after this patch with collect metadata in bridged mode (single netdev):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3ad7a4b1
    • Y
      net/sched: act_ife: Change to use ife module · 295a6e06
      Yotam Gigi 提交于
      Use the encode/decode functionality from the ife module instead of using
      implementation inside the act_ife.
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NYotam Gigi <yotamg@mellanox.com>
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NRoman Mashak <mrv@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      295a6e06
    • Y
      net: Introduce ife encapsulation module · 1ce84604
      Yotam Gigi 提交于
      This module is responsible for the ife encapsulation protocol
      encode/decode logics. That module can:
       - ife_encode: encode skb and reserve space for the ife meta header
       - ife_decode: decode skb and extract the meta header size
       - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
         header space.
       - ife_tlv_meta_decode - decodes one tlv entry from the packet
       - ife_tlv_meta_next - advance to the next tlv
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NYotam Gigi <yotamg@mellanox.com>
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NRoman Mashak <mrv@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ce84604
    • D
      ipv6: sr: remove cleanup flag and fix HMAC computation · 013e8167
      David Lebrun 提交于
      In the latest version of the IPv6 Segment Routing IETF draft [1] the
      cleanup flag is removed and the flags field length is shrunk from 16 bits
      to 8 bits. As a consequence, the input of the HMAC computation is modified
      in a non-backward compatible way by covering the whole octet of flags
      instead of only the cleanup bit. As such, if an implementation compatible
      with the latest draft computes the HMAC of an SRH who has other flags set
      to 1, then the HMAC result would differ from the current implementation.
      
      This patch carries those modifications to prevent conflict with other
      implementations of IPv6 SR.
      
      [1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-05Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      013e8167
  3. 03 2月, 2017 2 次提交
    • E
      net: add LINUX_MIB_PFMEMALLOCDROP counter · 8fe809a9
      Eric Dumazet 提交于
      Debugging issues caused by pfmemalloc is often tedious.
      
      Add a new SNMP counter to more easily diagnose these problems.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Acked-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8fe809a9
    • A
      unix: add ioctl to open a unix socket file with O_PATH · ba94f308
      Andrey Vagin 提交于
      This ioctl opens a file to which a socket is bound and
      returns a file descriptor. The caller has to have CAP_NET_ADMIN
      in the socket network namespace.
      
      Currently it is impossible to get a path and a mount point
      for a socket file. socket_diag reports address, device ID and inode
      number for unix sockets. An address can contain a relative path or
      a file may be moved somewhere. And these properties say nothing about
      a mount namespace and a mount point of a socket file.
      
      With the introduced ioctl, we can get a path by reading
      /proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.
      
      In CRIU we are going to use this ioctl to dump and restore unix socket.
      
      Here is an example how it can be used:
      
      $ strace -e socket,bind,ioctl ./test /tmp/test_sock
      socket(AF_UNIX, SOCK_STREAM, 0)         = 3
      bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
      ioctl(3, SIOCUNIXFILE, 0)           = 4
      ^Z
      
      $ ss -a | grep test_sock
      u_str  LISTEN     0      1      test_sock 17798                 * 0
      
      $ ls -l /proc/760/fd/{3,4}
      lrwx------ 1 root root 64 Feb  1 09:41 3 -> 'socket:[17798]'
      l--------- 1 root root 64 Feb  1 09:41 4 -> /tmp/test_sock
      
      $ cat /proc/760/fdinfo/4
      pos:	0
      flags:	012000000
      mnt_id:	40
      
      $ cat /proc/self/mountinfo | grep "^40\s"
      40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw
      Signed-off-by: NAndrei Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba94f308
  4. 30 1月, 2017 2 次提交
  5. 27 1月, 2017 1 次提交
    • F
      net/ipv6: allow sysctl to change link-local address generation mode · d35a00b8
      Felix Jia 提交于
      The address generation mode for IPv6 link-local can only be configured
      by netlink messages. This patch adds the ability to change the address
      generation mode via sysctl.
      
      v1 -> v2
      Removed the rtnl lock and switch to use RCU lock to iterate through
      the netdev list.
      
      v2 -> v3
      Removed the addrgenmode variable from the idev structure and use the
      systcl storage for the flag.
      
      Simplifed the logic for sysctl handling by removing the supported
      for all operation.
      
      Added support for more types of tunnel interfaces for link-local
      address generation.
      
      Based the patches from net-next.
      
      v3 -> v4
      Removed unnecessary whitespace changes.
      Signed-off-by: NFelix Jia <felix.jia@alliedtelesis.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d35a00b8
  6. 26 1月, 2017 4 次提交
    • S
      ac79cbb9
    • S
      uapi: install batman_adv.h header · e60bf3ea
      Sven Eckelmann 提交于
      09748a22 ("batman-adv: add generic netlink family for batman-adv")
      introduced the new batman_adv.h which describes the netlink attributes and
      commands of batman-adv. But the Kbuild entry to install the header was not
      added.
      
      All currently known tools ship their own copy of batman_adv.h but it should
      be installed anyway to later be able to migrate to the system batman_adv.h.
      Signed-off-by: NSven Eckelmann <sven@narfation.org>
      Signed-off-by: NSimon Wunderlich <sw@simonwunderlich.de>
      e60bf3ea
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
    • J
      net sched actions: Add support for user cookies · 1045ba77
      Jamal Hadi Salim 提交于
      Introduce optional 128-bit action cookie.
      Like all other cookie schemes in the networking world (eg in protocols
      like http or existing kernel fib protocol field, etc) the idea is to save
      user state that when retrieved serves as a correlator. The kernel
      _should not_ intepret it.  The user can store whatever they wish in the
      128 bits.
      
      Sample exercise(showing variable length use of cookie)
      
      .. create an accept action with cookie a1b2c3d4
      sudo $TC actions add action ok index 1 cookie a1b2c3d4
      
      .. dump all gact actions..
      sudo $TC -s actions ls action gact
      
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 1 bind 0 installed 5 sec used 5 sec
          Action statistics:
          Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
          backlog 0b 0p requeues 0
          cookie a1b2c3d4
      
      .. bind the accept action to a filter..
      sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
      u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1
      
      ... send some traffic..
      $ ping 127.0.0.1 -c 3
      PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
      64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
      64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
      64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1045ba77
  7. 25 1月, 2017 5 次提交
    • L
      netfilter: nft_log: restrict the log prefix length to 127 · 5ce6b04c
      Liping Zhang 提交于
      First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127,
      at nf_log_packet(), so the extra part is useless.
      
      Second, after adding a log rule with a very very long prefix, we will
      fail to dump the nft rules after this _special_ one, but acctually,
      they do exist. For example:
        # name_65000=$(printf "%0.sQ" {1..65000})
        # nft add rule filter output log prefix "$name_65000"
        # nft add rule filter output counter
        # nft add rule filter output counter
        # nft list chain filter output
        table ip filter {
            chain output {
                type filter hook output priority 0; policy accept;
            }
        }
      
      So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5ce6b04c
    • D
      bpf: allow option for setting bpf_l4_csum_replace from scratch · d1b662ad
      Daniel Borkmann 提交于
      When programs need to calculate the csum from scratch for small UDP
      packets and use bpf_l4_csum_replace() to feed the result from helpers
      like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0
      that would ignore the case of current csum being 0, and which would
      still allow for the helper to set the csum and transform when needed
      to CSUM_MANGLED_0.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1b662ad
    • Y
      net/sched: Introduce sample tc action · 5c5670fa
      Yotam Gigi 提交于
      This action allows the user to sample traffic matched by tc classifier.
      The sampling consists of choosing packets randomly and sampling them using
      the psample module. The user can configure the psample group number, the
      sampling rate and the packet's truncation (to save kernel-user traffic).
      
      Example:
      To sample ingress traffic from interface eth1, one may use the commands:
      
      tc qdisc add dev eth1 handle ffff: ingress
      
      tc filter add dev eth1 parent ffff: \
      	   matchall action sample rate 12 group 4
      
      Where the first command adds an ingress qdisc and the second starts
      sampling randomly with an average of one sampled packet per 12 packets on
      dev eth1 to psample group 4.
      Signed-off-by: NYotam Gigi <yotamg@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c5670fa
    • Y
      net: Introduce psample, a new genetlink channel for packet sampling · 6ae0a628
      Yotam Gigi 提交于
      Add a general way for kernel modules to sample packets, without being tied
      to any specific subsystem. This netlink channel can be used by tc,
      iptables, etc. and allow to standardize packet sampling in the kernel.
      
      For every sampled packet, the psample module adds the following metadata
      fields:
      
      PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable
      
      PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable
      
      PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
         truncated during sampling
      
      PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
         user who initiated the sampling. This field allows the user to
         differentiate between several samplers working simultaneously and
         filter packets relevant to him
      
      PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
         sequence is kept for each group
      
      PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
      
      PSAMPLE_ATTR_DATA - the actual packet bits
      
      The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
      group. In addition, add the GET_GROUPS netlink command which allows the
      user to see the current sample groups, their refcount and sequence number.
      This command currently supports only netlink dump mode.
      Signed-off-by: NYotam Gigi <yotamg@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ae0a628
    • F
      bridge: multicast to unicast · 6db6f0ea
      Felix Fietkau 提交于
      Implements an optional, per bridge port flag and feature to deliver
      multicast packets to any host on the according port via unicast
      individually. This is done by copying the packet per host and
      changing the multicast destination MAC to a unicast one accordingly.
      
      multicast-to-unicast works on top of the multicast snooping feature of
      the bridge. Which means unicast copies are only delivered to hosts which
      are interested in it and signalized this via IGMP/MLD reports
      previously.
      
      This feature is intended for interface types which have a more reliable
      and/or efficient way to deliver unicast packets than broadcast ones
      (e.g. wifi).
      
      However, it should only be enabled on interfaces where no IGMPv2/MLDv1
      report suppression takes place. This feature is disabled by default.
      
      The initial patch and idea is from Felix Fietkau.
      Signed-off-by: NFelix Fietkau <nbd@nbd.name>
      [linus.luessing@c0d3.blue: various bug + style fixes, commit message]
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6db6f0ea
  8. 24 1月, 2017 3 次提交
  9. 21 1月, 2017 2 次提交
    • J
      tipc: make replicast a user selectable option · 01fd12bb
      Jon Paul Maloy 提交于
      If the bearer carrying multicast messages supports broadcast, those
      messages will be sent to all cluster nodes, irrespective of whether
      these nodes host any actual destinations socket or not. This is clearly
      wasteful if the cluster is large and there are only a few real
      destinations for the message being sent.
      
      In this commit we extend the eligibility of the newly introduced
      "replicast" transmit option. We now make it possible for a user to
      select which method he wants to be used, either as a mandatory setting
      via setsockopt(), or as a relative setting where we let the broadcast
      layer decide which method to use based on the ratio between cluster
      size and the message's actual number of destination nodes.
      
      In the latter case, a sending socket must stick to a previously
      selected method until it enters an idle period of at least 5 seconds.
      This eliminates the risk of message reordering caused by method change,
      i.e., when changes to cluster size or number of destinations would
      otherwise mandate a new method to be used.
      Reviewed-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01fd12bb
    • G
      bpf: add bpf_probe_read_str helper · a5e8c070
      Gianluca Borello 提交于
      Provide a simple helper with the same semantics of strncpy_from_unsafe():
      
      int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)
      
      This gives more flexibility to a bpf program. A typical use case is
      intercepting a file name during sys_open(). The current approach is:
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	char buf[PATHLEN]; // PATHLEN is defined to 256
      	bpf_probe_read(buf, sizeof(buf), ctx->di);
      
      	/* consume buf */
      }
      
      This is suboptimal because the size of the string needs to be estimated
      at compile time, causing more memory to be copied than often necessary,
      and can become more problematic if further processing on buf is done,
      for example by pushing it to userspace via bpf_perf_event_output(),
      since the real length of the string is unknown and the entire buffer
      must be copied (and defining an unrolled strnlen() inside the bpf
      program is a very inefficient and unfeasible approach).
      
      With the new helper, the code can easily operate on the actual string
      length rather than the buffer size:
      
      SEC("kprobe/sys_open")
      void bpf_sys_open(struct pt_regs *ctx)
      {
      	char buf[PATHLEN]; // PATHLEN is defined to 256
      	int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);
      
      	/* consume buf, for example push it to userspace via
      	 * bpf_perf_event_output(), but this time we can use
      	 * res (the string length) as event size, after checking
      	 * its boundaries.
      	 */
      }
      
      Another useful use case is when parsing individual process arguments or
      individual environment variables navigating current->mm->arg_start and
      current->mm->env_start: using this helper and the return value, one can
      quickly iterate at the right offset of the memory area.
      
      The code changes simply leverage the already existent
      strncpy_from_unsafe() kernel function, which is safe to be called from a
      bpf program as it is used in bpf_trace_printk().
      Signed-off-by: NGianluca Borello <g.borello@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5e8c070
  10. 19 1月, 2017 2 次提交
  11. 18 1月, 2017 3 次提交
    • L
      bridge: sparse fixes in br_ip6_multicast_alloc_query() · 53631a5f
      Lance Richardson 提交于
      Changed type of csum field in struct igmpv3_query from __be16 to
      __sum16 to eliminate type warning, made same change in struct
      igmpv3_report for consistency.
      
      Fixed up an ntohs() where htons() should have been used instead.
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53631a5f
    • R
      mpls: Packet stats · 27d69105
      Robert Shearman 提交于
      Having MPLS packet stats is useful for observing network operation and
      for diagnosing network problems. In the absence of anything better,
      RFC2863 and RFC3813 are used for guidance for which stats to expose
      and the semantics of them. In particular rx_noroutes maps to in
      unknown protos in RFC2863. The stats are exposed to userspace via
      AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
      RTM_GETSTATS messages.
      
      All the introduced fields are 64-bit, even error ones, to ensure no
      overflow with long uptimes. Per-CPU counters are used to avoid
      cache-line contention on the commonly used fields. The other fields
      have also been made per-CPU for code to avoid performance problems in
      error conditions on the assumption that on some platforms the cost of
      atomic operations could be more expensive than sending the packet
      (which is what would be done in the success case). If that's not the
      case, we could instead not use per-CPU counters for these fields.
      
      Only unicast and non-fragment are exposed at the moment, but other
      counters can be exposed in the future either by adding to the end of
      struct mpls_link_stats or by additional netlink attributes in the
      AF_MPLS IFLA_STATS_AF_SPEC nested attribute.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27d69105
    • R
      net: AF-specific RTM_GETSTATS attributes · aefb4d4a
      Robert Shearman 提交于
      Add the functionality for including address-family-specific per-link
      stats in RTM_GETSTATS messages. This is done through adding a new
      IFLA_STATS_AF_SPEC attribute under which address family attributes are
      nested and then the AF-specific attributes can be further nested. This
      follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
      advantage of presenting an easily extended hierarchy. The rtnl_af_ops
      structure is extended to provide AFs with the opportunity to fill and
      provide the size of their stats attributes.
      
      One alternative would have been to provide AFs with the ability to add
      attributes directly into the RTM_GETSTATS message without a nested
      hierarchy. I discounted this approach as it increases the rate at
      which the 32 attribute number space is used up and it makes
      implementation a little more tricky for stats dump resuming (at the
      moment the order in which attributes are added to the message has to
      match the numeric order of the attributes).
      
      Another alternative would have been to register per-AF RTM_GETSTATS
      handlers. I discounted this approach as I perceived a common use-case
      to be getting all the stats for an interface and this approach would
      necessitate multiple requests/dumps to retrieve them all.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aefb4d4a
  12. 17 1月, 2017 2 次提交
    • D
      ipv6: sr: add missing Kbuild export for header files · a50a05f4
      David Lebrun 提交于
      Add missing IPv6-SR header files in include/uapi/linux/Kbuild.
      
      Also, prevent seg6_lwt_headroom() from being exported and add
      missing linux/types.h include.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a50a05f4
    • D
      bpf: rework prog_digest into prog_tag · f1f7714e
      Daniel Borkmann 提交于
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to
      admittedly suboptimal name of "prog_digest" in combination
      with sha1 hash usage, thus inevitably and rightfully concerns
      about its security in terms of collision resistance were
      raised with regards to use-cases.
      
      The intended use cases are for debugging resp. introspection
      only for providing a stable "tag" over the instruction sequence
      that both kernel and user space can calculate independently.
      It's not usable at all for making a security relevant decision.
      So collisions where two different instruction sequences generate
      the same tag can happen, but ideally at a rather low rate. The
      "tag" will be dumped in hex and is short enough to introspect
      in tracepoints or kallsyms output along with other data such
      as stack trace, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to
      make it obvious it's not collision-free.
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1f7714e
  13. 16 1月, 2017 1 次提交
  14. 13 1月, 2017 3 次提交