1. 18 Feb, 2017 1 commit
  2. 15 Feb, 2017 1 commit
    • uapi: fix linux/if_pppol2tp.h userspace compilation errors · a725eb15
      Dmitry V. Levin committed
      Because of <linux/libc-compat.h> interface limitations, <netinet/in.h>
      provided by libc cannot be included after <linux/in.h>, therefore any
      header that includes <netinet/in.h> cannot be included after <linux/in.h>.
      
      Change uapi/linux/l2tp.h, the last uapi header that includes
      <netinet/in.h>, to include <linux/in.h> and <linux/in6.h> instead of
      <netinet/in.h> and use __SOCK_SIZE__ instead of sizeof(struct sockaddr)
      the same way as uapi/linux/in.h does, to fix linux/if_pppol2tp.h userspace
      compilation errors like this:
      
      In file included from /usr/include/linux/l2tp.h:12:0,
                       from /usr/include/linux/if_pppol2tp.h:21,
      /usr/include/netinet/in.h:31:8: error: redefinition of 'struct in_addr'
      
      Fixes: 47c3e778 ("net: l2tp: deprecate PPPOL2TP_MSG_* in favour of L2TP_MSG_*")
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 14 Feb, 2017 1 commit
  4. 13 Feb, 2017 1 commit
    • bpf: introduce BPF_F_ALLOW_OVERRIDE flag · 7f677633
      Alexei Starovoitov committed
      If the BPF_F_ALLOW_OVERRIDE flag is passed to the BPF_PROG_ATTACH command
      for a given cgroup, descendant cgroups will be able to override the
      effective bpf program that was inherited from this cgroup.
      By default the flag is not passed, therefore override is disallowed.
      
      Examples:
      1.
      prog X attached to /A with default
      prog Y fails to attach to /A/B and /A/B/C
      Everything under /A runs prog X
      
      2.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A/B with default (non-override)
      prog M attached to /A/B with allow_override.
      Everything under /A/B runs prog M only.
      
      3.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A with default.
      The user has to detach first to switch the mode.
      
      In the future this behavior may be extended with a chain of
      non-overridable programs.
      
      Also fix the bug where detaching from a cgroup with nothing attached
      did not return an error. Return ENOENT in that case.
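The three examples and the detach fix above can be traced with a small toy model. This is only a sketch of the rules as the commit message describes them, not the kernel code; the class `CgroupBpfModel` and its method names are hypothetical, and cgroups are represented as plain path strings.

```python
import errno


class CgroupBpfModel:
    """Toy model of the attach rules described above (not kernel code)."""

    def __init__(self):
        # cgroup path -> (program name, allow_override flag)
        self.attached = {}

    def _inherited(self, path):
        """Return (prog, allow_override) of the nearest ancestor, or None."""
        parts = path.strip("/").split("/")
        for depth in range(len(parts) - 1, 0, -1):
            ancestor = "/" + "/".join(parts[:depth])
            if ancestor in self.attached:
                return self.attached[ancestor]
        return None

    def attach(self, path, prog, allow_override=False):
        inherited = self._inherited(path)
        if inherited is not None:
            _, parent_overridable = inherited
            # A non-overridable ancestor blocks every descendant, and an
            # overridable ancestor only accepts overridable descendants.
            if not parent_overridable or not allow_override:
                raise OSError(errno.EPERM, "override not allowed")
        current = self.attached.get(path)
        if current is not None and current[1] != allow_override:
            # Switching mode requires an explicit detach first.
            raise OSError(errno.EPERM, "detach before changing mode")
        self.attached[path] = (prog, allow_override)

    def detach(self, path):
        if path not in self.attached:
            raise OSError(errno.ENOENT, "nothing attached")
        del self.attached[path]

    def effective(self, path):
        """Program that runs for `path`: its own, else the inherited one."""
        own = self.attached.get(path)
        if own is not None:
            return own[0]
        inherited = self._inherited(path)
        return inherited[0] if inherited else None
```

In this model, attaching X to /A with `allow_override=True` and then M to /A/B with `allow_override=True` makes `effective("/A/B/C")` return M, matching example 2.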
      
      Add several testcases and adjust libbpf.
      
      Fixes: 30070984 ("cgroup: add support for eBPF programs")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Daniel Mack <daniel@zonque.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 12 Feb, 2017 2 commits
    • netfilter: nf_tables: add NFTA_RULE_ID attribute · 1a94e38d
      Pablo Neira Ayuso committed
      This new attribute allows us to uniquely identify a rule in a transaction.
      Robots may trigger an insertion followed by a deletion in a batch; in that
      scenario we still don't have a public rule handle that we can use to
      delete the rule. This is similar to the NFTA_SET_ID attribute that
      allows us to refer to an anonymous set from a batch.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nfnetlink: allow to check for generation ID · 8c4d4e8b
      Pablo Neira Ayuso committed
      This patch allows userspace to specify the generation ID that has been
      used to build an incremental batch update.
      
      If userspace specifies the generation ID in the batch message as
      attribute, then nfnetlink compares it to the current generation ID so
      you make sure that you work against the right baseline. Otherwise, bail
      out with ERESTART so userspace knows that its changeset is stale and
      needs to respin. Userspace can do this transparently at the cost of
      taking slightly more time to refresh caches and rework the changeset.
      
      This check is optional: if there is no NFNL_BATCH_GENID attribute in the
      batch begin message, then no check is performed.
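The respin described above is an optimistic-concurrency loop. The following is a minimal sketch of that loop against a toy in-memory ruleset; `RulesetModel`, `commit` and `apply_with_retry` are hypothetical names for illustration and do not correspond to any nfnetlink or libnftnl API.

```python
import errno

# ERESTART is 85 on Linux; fall back to that value if errno lacks the name.
ERESTART = getattr(errno, "ERESTART", 85)


class RulesetModel:
    """Toy stand-in for the kernel side (hypothetical, for illustration)."""

    def __init__(self):
        self.genid = 1
        self.rules = []

    def commit(self, batch, genid=None):
        # The check is optional: no generation ID means no comparison.
        if genid is not None and genid != self.genid:
            raise OSError(ERESTART, "stale generation ID")
        self.rules.extend(batch)
        self.genid += 1


def apply_with_retry(ruleset, build_batch, max_attempts=5):
    """Rebuild the batch against a fresh baseline until it applies."""
    for attempt in range(1, max_attempts + 1):
        baseline = ruleset.genid                  # refresh the cache
        batch = build_batch(list(ruleset.rules))  # rework the changeset
        try:
            ruleset.commit(batch, genid=baseline)
            return attempt
        except OSError as exc:
            if exc.errno != ERESTART:
                raise
            # Changeset was stale: loop and respin.
    raise RuntimeError("gave up after repeated stale batches")
```

The return value is the number of attempts needed, which makes the cost of a respin visible to the caller.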
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  6. 11 Feb, 2017 3 commits
    • devlink: fix the name of eswitch commands · adf200f3
      Jiri Pirko committed
      The eswitch_[gs]et command is supposed to be similar to port_[gs]et
      command - for multiple eswitch attributes. However, when it was introduced
      by 08f4b591 ("net/devlink: Add E-Switch mode control") it was wrongly
      named with the word "mode" in it. So fix this now, keeping the original
      enum value in place but marking it obsolete.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/act_pedit: Introduce 'add' operation · 853a14ba
      Amir Vadai committed
      This command could be useful to inc/dec fields.
      
      For example, to forward any TCP packet and decrease its TTL:
      $ tc filter add dev enp0s9 protocol ip parent ffff: \
          flower ip_proto tcp \
          action pedit munge ip ttl add 0xff pipe \
          action mirred egress redirect dev veth0
      
      In the example above, adding 0xff to this u8 field is actually
      decreasing it by one, since the operation is masked.
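The arithmetic behind "adding 0xff decreases by one" is plain modulo-256 addition. A minimal sketch, assuming the field is an unsigned 8-bit value and the mask keeps only the low 8 bits (the function name `pedit_add_u8` is illustrative, not part of tc or the kernel):

```python
def pedit_add_u8(value, addend):
    """Add to an 8-bit field and keep only the low 8 bits of the result."""
    return (value + addend) & 0xFF


# TTL 64 plus 0xff is 319; masked to 8 bits that is 63, i.e. 64 - 1.
ttl_after = pedit_add_u8(64, 0xFF)
```

Note that the same wraparound means a TTL of 0 would become 255 under this operation.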
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/act_pedit: Support using offset relative to the conventional network headers · 71d0ed70
      Amir Vadai committed
      Extend pedit to let the user set an offset relative to the network
      headers. This makes it possible to work with more complex header
      schemes (vs the simple IPv4 case) where setting a fixed offset relative
      to the network header is not enough.
      
      After this patch, the action has information about the exact header type
      and field inside this header. This information could be used later on
      for hardware offloading of pedit.
      
      Backward compatibility is kept for:
      1. Old kernel <-> new userspace
      2. New kernel <-> old userspace
      3. add rule using new userspace <-> dump using old userspace
      4. add rule using old userspace <-> dump using new userspace
      
      When the extended API is used, new netlink attributes are used. This
      way, the operation will fail in (1) and (3), and no malformed rule will
      be added or dumped. Of course, new user space that doesn't need the new
      functionality can use the old netlink attributes and the operation will
      succeed.
      Since the action supports both APIs, (2) should work, and it is easy to
      write the new user space so that (4) works.
      
      The action strictly checks that only header types and commands
      it can handle are accepted. This way future additions will be much
      easier.
      
      Usage example:
      $ tc filter add dev enp0s9 protocol ip parent ffff: \
        flower \
          ip_proto tcp \
          dst_port 80 \
        action pedit munge tcp dport set 8080 pipe \
        action mirred egress redirect dev veth0
      
      This forwards TCP packets whose original destination port is 80, while
      modifying the destination port to 8080.
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 10 Feb, 2017 5 commits
    • openvswitch: Add force commit. · dd41d33f
      Jarno Rajahalme committed
      Stateful network admission policy may allow connections to one
      direction and reject connections initiated in the other direction.
      After policy change it is possible that for a new connection an
      overlapping conntrack entry already exists, where the original
      direction of the existing connection is opposed to the new
      connection's initial packet.
      
      Most importantly, conntrack state relating to the current packet gets
      the "reply" designation based on whether the original direction tuple
      or the reply direction tuple matched.  If this "directionality" is
      wrong w.r.t. the stateful network admission policy, it may happen
      that packets in neither direction are correctly admitted.
      
      This patch adds a new "force commit" option to the OVS conntrack
      action that checks the original direction of an existing conntrack
      entry.  If that direction is opposed to the current packet, the
      existing conntrack entry is deleted and a new one is subsequently
      created in the correct direction.
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Joe Stringer <joe@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Add original direction conntrack tuple to sw_flow_key. · 9dd7f890
      Jarno Rajahalme committed
      Add the fields of the conntrack original direction 5-tuple to struct
      sw_flow_key.  The new fields are initially marked as non-existent, and
      are populated whenever a conntrack action is executed and either finds
      or generates a conntrack entry.  This means that these fields exist
      for all packets that were not rejected by conntrack as untrackable.
      
      The original tuple fields in the sw_flow_key are filled from the
      original direction tuple of the conntrack entry relating to the
      current packet, or from the original direction tuple of the master
      conntrack entry, if the current conntrack entry has a master.
      Generally, expected connections of connections having an assigned
      helper (e.g., FTP), have a master conntrack entry.
      
      The main purpose of the new conntrack original tuple fields is to
      allow matching on them for policy decision purposes, with the premise
      that the admissibility of tracked connections reply packets (as well
      as original direction packets), and both direction packets of any
      related connections may be based on ACL rules applying to the master
      connection's original direction 5-tuple.  This also makes it easier to
      make policy decisions when the actual packet headers might have been
      transformed by NAT, as the original direction 5-tuple represents the
      packet headers before any such transformation.
      
      When using the original direction 5-tuple the admissibility of return
      and/or related packets need not be based on the mere existence of a
      conntrack entry, allowing separation of admission policy from the
      established conntrack state.  While existence of a conntrack entry is
      required for admission of the return or related packets, policy
      changes can render connections that were initially admitted to be
      rejected or dropped afterwards.  If the admission of the return and
      related packets was based on mere conntrack state (e.g., connection
      being in an established state), a policy change that would make the
      connection rejected or dropped would need to find and delete all
      conntrack entries affected by such a change.  When using the original
      direction 5-tuple matching the affected conntrack entries can be
      allowed to time out instead, as the established state of the
      connection would not need to be the basis for packet admission any
      more.
      
      It should be noted that the directionality of related connections may
      be the same or different than that of the master connection, and
      neither the original direction 5-tuple nor the conntrack state bits
      carry this information.  If needed, the directionality of the master
      connection can be stored in master's conntrack mark or labels, which
      are automatically inherited by the expected related connections.
      
      The fact that neither ARP nor ND packets are trackable by conntrack
      allows mutual exclusion between ARP/ND and the new conntrack original
      tuple fields.  Hence, the IP addresses are overlaid in union with ARP
      and ND fields.  This allows the sw_flow_key to not grow much due to
      this patch, but it also means that we must be careful to never use the
      new key fields with ARP or ND packets.  ARP is easy to distinguish and
      keep mutually exclusive based on the ethernet type, but ND being an
      ICMPv6 protocol requires a bit more attention.
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Acked-by: Joe Stringer <joe@ovn.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Unionize ovs_key_ct_label with a u32 array. · cb80d58f
      Jarno Rajahalme committed
      Make the array of labels in struct ovs_key_ct_label a union, adding a
      u32 array of the same byte size as the existing u8 array.  It is
      faster to loop through the labels 32 bits at a time, which is also
      the alignment of netlink attributes.
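The overlay can be illustrated with a `ctypes.Union`. This is only a sketch: the 16-byte label size and the member names `ct_labels` and `ct_labels_32` are assumptions made for the example, since the commit message does not spell them out.

```python
import ctypes

CT_LABELS_LEN = 16  # assumed label size in bytes


class CtLabels(ctypes.Union):
    """Byte view and 32-bit word view over the same storage."""

    _fields_ = [
        ("ct_labels", ctypes.c_uint8 * CT_LABELS_LEN),
        ("ct_labels_32", ctypes.c_uint32 * (CT_LABELS_LEN // 4)),
    ]


def labels_nonzero(labels):
    """Check for any set bit by walking four words instead of sixteen bytes."""
    return any(word != 0 for word in labels.ct_labels_32)
```

Writing through the byte view (for example `labels.ct_labels[5] = 1`) is visible through the word view because both members share the same memory.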
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Acked-by: Joe Stringer <joe@ovn.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter · 242bd2d5
      Xin Long committed
      This patch is to implement Sender-Side Procedures for the Add
      Outgoing and Incoming Streams Request Parameter described in
      rfc6525 section 5.1.5-5.1.6.
      
      It is also to add sockopt SCTP_ADD_STREAMS in rfc6525 section
      6.3.4 for users.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: implement sender-side procedures for SSN/TSN Reset Request Parameter · a92ce1a4
      Xin Long committed
      This patch is to implement Sender-Side Procedures for the SSN/TSN
      Reset Request Parameter described in rfc6525 section 5.1.4.
      
      It is also to add sockopt SCTP_RESET_ASSOC in rfc6525 section 6.3.3
      for users.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 09 Feb, 2017 1 commit
    • cfg80211: fix NAN bands definition · 8585989d
      Luca Coelho committed
      The nl80211_nan_dual_band_conf enumeration doesn't make much sense.
      The default value is assigned to a bit, which makes it weird if the
      default bit and other bits are set at the same time.
      
      To improve this, get rid of NL80211_NAN_BAND_DEFAULT and add a wiphy
      configuration to let the drivers define which bands are supported.
      This is exposed to the userspace, which then can make a decision on
      which band(s) to use.  Additionally, rename all "dual_band" elements
      to "bands", to make things clearer.
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  9. 08 Feb, 2017 4 commits
  10. 04 Feb, 2017 5 commits
    • bridge: uapi: add per vlan tunnel info · b3c7ef0a
      Roopa Prabhu committed
      New nested netlink attribute to associate tunnel info per vlan.
      This is used by bridge driver to send tunnel metadata to
      bridge ports in vlan tunnel mode. This patch also adds a new per-port
      flag, IFLA_BRPORT_VLAN_TUNNEL, to enable vlan tunnel mode. It is
      off by default.
      
      One example use for this is a vxlan bridging gateway or vtep
      which maps vlans to vn-segments (or vnis). User can configure
      per-vlan tunnel information which the bridge driver can use
      to bridge vlan into the corresponding vn-segment.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: support fdb and learning in COLLECT_METADATA mode · 3ad7a4b1
      Roopa Prabhu committed
      Vxlan COLLECT_METADATA mode today solves the per-vni netdev
      scalability problem in l3 networks. It expects all forwarding
      information to be present in dst_metadata. This patch series
      enhances collect metadata mode to include the case where only
      vni is present in dst_metadata, and the vxlan driver can then use
      the rest of the forwarding information database to make forwarding
      decisions. There is no change to default COLLECT_METADATA
      behaviour. These changes only apply to COLLECT_METADATA when
      used with the bridging use-case with a special dst_metadata
      tunnel info flag (eg: where vxlan device is part of a bridge).
      For all this to work, the vxlan driver will need to now support a
      single fdb table hashed by mac + vni. This series essentially makes
      this happen.
      
      use-case and workflow:
      vxlan collect metadata device participates in bridging vlan
      to vn-segments. Bridge driver above the vxlan device,
      sends the vni corresponding to the vlan in the dst_metadata.
      vxlan driver will lookup forwarding database with (mac + vni)
      for the required remote destination information to forward the
      packet.
      
      Changes introduced by this patch:
          - allow learning and forwarding database state in vxlan netdev in
            COLLECT_METADATA mode. Current behaviour is not changed
            by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
            to support the new bridge friendly mode.
          - A single fdb table hashed by (mac, vni) to allow fdb entries with
            multiple vnis in the same fdb table
          - rx path already has the vni
          - tx path expects a vni in the packet with dst_metadata
          - prior to this series, fdb remote_dsts carried remote vni and
            the vxlan device carrying the fdb table represented the
            source vni. With the vxlan device now representing multiple vnis,
            this patch adds a src vni attribute to the fdb entry. The remote
            vni already uses NDA_VNI attribute. This patch introduces
            NDA_SRC_VNI netlink attribute to represent the src vni in a multi
            vni fdb table.
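The idea of a single table keyed by (mac, vni) can be sketched with a dictionary. This is a toy model only: `VxlanFdb`, `learn` and `lookup` are hypothetical names, and the real driver uses its own hash table and entry structures.

```python
class VxlanFdb:
    """Toy forwarding database keyed by (mac, source vni)."""

    def __init__(self):
        self.entries = {}  # (mac, src_vni) -> remote destination IP

    def learn(self, mac, src_vni, remote_ip):
        """Record where a MAC seen on a given VNI can be reached."""
        self.entries[(mac.lower(), src_vni)] = remote_ip

    def lookup(self, mac, src_vni):
        """Return the remote destination, or None if the pair is unknown."""
        return self.entries.get((mac.lower(), src_vni))


# The same host MAC learnt on two VNIs lives in one table, as two entries.
fdb = VxlanFdb()
fdb.learn("00:02:00:00:00:03", 1000, "12.0.0.8")
fdb.learn("00:02:00:00:00:03", 1001, "12.0.0.8")
```

Keying on the pair rather than on the MAC alone is what lets one vxlan device represent multiple VNIs without the entries colliding.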
      
      iproute2 example (patched and pruned iproute2 output to just show
      relevant fdb entries):
      example shows same host mac learnt on two vni's.
      
      before (netdev per vni):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
      
      after this patch with collect metadata in bridged mode (single netdev):
      $bridge fdb show | grep "00:02:00:00:00:03"
      00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
      00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ife: Change to use ife module · 295a6e06
      Yotam Gigi committed
      Use the encode/decode functionality from the ife module instead of the
      implementation inside act_ife.
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Roman Mashak <mrv@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce ife encapsulation module · 1ce84604
      Yotam Gigi committed
      This module is responsible for the ife encapsulation protocol
      encode/decode logic. The module provides:
       - ife_encode: encode skb and reserve space for the ife meta header
       - ife_decode: decode skb and extract the meta header size
       - ife_tlv_meta_encode: encode one tlv entry into the reserved ife
         header space
       - ife_tlv_meta_decode: decode one tlv entry from the packet
       - ife_tlv_meta_next: advance to the next tlv
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Roman Mashak <mrv@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: remove cleanup flag and fix HMAC computation · 013e8167
      David Lebrun committed
      In the latest version of the IPv6 Segment Routing IETF draft [1] the
      cleanup flag is removed and the flags field length is shrunk from 16 bits
      to 8 bits. As a consequence, the input of the HMAC computation is modified
      in a non-backward compatible way by covering the whole octet of flags
      instead of only the cleanup bit. As such, if an implementation compatible
      with the latest draft computes the HMAC of an SRH that has other flags set
      to 1, then the HMAC result would differ from the current implementation.
      
      This patch carries those modifications to prevent conflict with other
      implementations of IPv6 SR.
      
      [1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-05
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 03 Feb, 2017 2 commits
    • net: add LINUX_MIB_PFMEMALLOCDROP counter · 8fe809a9
      Eric Dumazet committed
      Debugging issues caused by pfmemalloc is often tedious.
      
      Add a new SNMP counter to more easily diagnose these problems.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Acked-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • unix: add ioctl to open a unix socket file with O_PATH · ba94f308
      Andrey Vagin committed
      This ioctl opens a file to which a socket is bound and
      returns a file descriptor. The caller has to have CAP_NET_ADMIN
      in the socket network namespace.
      
      Currently it is impossible to get a path and a mount point
      for a socket file. socket_diag reports address, device ID and inode
      number for unix sockets. An address can contain a relative path or
      a file may be moved somewhere. And these properties say nothing about
      a mount namespace and a mount point of a socket file.
      
      With the introduced ioctl, we can get a path by reading
      /proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.
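Reading the path and mnt_id for a descriptor from procfs can be sketched as follows. The example uses an ordinary file descriptor as a stand-in, since it does not issue the new ioctl itself; `describe_fd` is a hypothetical helper name, and the code assumes a Linux system with /proc mounted.

```python
import os


def describe_fd(fd):
    """Return (path, mnt_id) for an open descriptor using procfs."""
    # The symlink target is the path the descriptor refers to.
    path = os.readlink(f"/proc/self/fd/{fd}")
    mnt_id = None
    with open(f"/proc/self/fdinfo/{fd}") as info:
        for line in info:
            key, _, value = line.partition(":")
            if key == "mnt_id":
                mnt_id = int(value)
    return path, mnt_id
```

With the ioctl from this commit, the descriptor it returns would be passed to such a helper in place of a regular file descriptor.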
      
      In CRIU we are going to use this ioctl to dump and restore unix socket.
      
      Here is an example of how it can be used:
      
      $ strace -e socket,bind,ioctl ./test /tmp/test_sock
      socket(AF_UNIX, SOCK_STREAM, 0)         = 3
      bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
      ioctl(3, SIOCUNIXFILE, 0)           = 4
      ^Z
      
      $ ss -a | grep test_sock
      u_str  LISTEN     0      1      test_sock 17798                 * 0
      
      $ ls -l /proc/760/fd/{3,4}
      lrwx------ 1 root root 64 Feb  1 09:41 3 -> 'socket:[17798]'
      l--------- 1 root root 64 Feb  1 09:41 4 -> /tmp/test_sock
      
      $ cat /proc/760/fdinfo/4
      pos:	0
      flags:	012000000
      mnt_id:	40
      
      $ cat /proc/self/mountinfo | grep "^40\s"
      40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 30 Jan, 2017 2 commits
  13. 27 Jan, 2017 1 commit
    • net/ipv6: allow sysctl to change link-local address generation mode · d35a00b8
      Felix Jia committed
      The address generation mode for IPv6 link-local can only be configured
      by netlink messages. This patch adds the ability to change the address
      generation mode via sysctl.
      
      v1 -> v2
      Removed the rtnl lock and switch to use RCU lock to iterate through
      the netdev list.
      
      v2 -> v3
      Removed the addrgenmode variable from the idev structure and use the
      sysctl storage for the flag.
      
      Simplified the logic for sysctl handling by removing the support
      for the "all" operation.
      
      Added support for more types of tunnel interfaces for link-local
      address generation.
      
      Based the patches on net-next.
      
      v3 -> v4
      Removed unnecessary whitespace changes.
      Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 26 Jan, 2017 4 commits
    • ac79cbb9
    • uapi: install batman_adv.h header · e60bf3ea
      Sven Eckelmann committed
      09748a22 ("batman-adv: add generic netlink family for batman-adv")
      introduced the new batman_adv.h which describes the netlink attributes and
      commands of batman-adv. But the Kbuild entry to install the header was not
      added.
      
      All currently known tools ship their own copy of batman_adv.h but it should
      be installed anyway to later be able to migrate to the system batman_adv.h.
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
    • net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang committed
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences is not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
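A minimal sketch of the call sequence described above, for a blocking socket: one extra setsockopt() and an otherwise unchanged connect()/send/recv flow. The numeric fallback 30 is the Linux value of TCP_FASTOPEN_CONNECT and is used only if the Python build does not export the name; `enable_tfo_connect` and `request` are hypothetical helper names, and non-blocking error handling (EINPROGRESS, EAGAIN) is left out.

```python
import socket

# Linux value from <linux/tcp.h>; used when the socket module lacks the name.
TCP_FASTOPEN_CONNECT = getattr(socket, "TCP_FASTOPEN_CONNECT", 30)


def enable_tfo_connect(sock):
    """Try to turn on TCP_FASTOPEN_CONNECT; report whether the kernel took it."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN_CONNECT, 1)
        return True
    except OSError:
        # Kernel too old, TFO disabled, or a non-Linux platform.
        return False


def request(host, port, payload):
    """Usual connect()/sendall()/recv() sequence with one added setsockopt."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        enable_tfo_connect(sock)
        sock.connect((host, port))   # may return before any SYN is sent
        sock.sendall(payload)        # no MSG_FASTOPEN flag needed
        return sock.recv(4096)
```

If the option cannot be set, the sketch falls back to an ordinary three-way handshake, so callers see the same behaviour either way.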
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net sched actions: Add support for user cookies · 1045ba77
      Jamal Hadi Salim committed
      Introduce an optional 128-bit action cookie.
      Like all other cookie schemes in the networking world (e.g. in protocols
      like http or the existing kernel fib protocol field, etc.) the idea is to
      save user state that when retrieved serves as a correlator. The kernel
      _should not_ interpret it.  The user can store whatever they wish in the
      128 bits.
      
      Sample exercise (showing variable length use of cookie)
      
      .. create an accept action with cookie a1b2c3d4
      sudo $TC actions add action ok index 1 cookie a1b2c3d4
      
      .. dump all gact actions..
      sudo $TC -s actions ls action gact
      
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 1 bind 0 installed 5 sec used 5 sec
          Action statistics:
          Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
          backlog 0b 0p requeues 0
          cookie a1b2c3d4
      
      .. bind the accept action to a filter..
      sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
      u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1
      
      ... send some traffic..
      $ ping 127.0.0.1 -c 3
      PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
      64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
      64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
      64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 25 Jan, 2017 5 commits
    • netfilter: nft_log: restrict the log prefix length to 127 · 5ce6b04c
      Liping Zhang committed
      First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127,
      at nf_log_packet(), so the extra part is useless.
      
      Second, after adding a log rule with a very long prefix, we will
      fail to dump the nft rules after this _special_ one, but actually,
      they do exist. For example:
        # name_65000=$(printf "%0.sQ" {1..65000})
        # nft add rule filter output log prefix "$name_65000"
        # nft add rule filter output counter
        # nft add rule filter output counter
        # nft list chain filter output
        table ip filter {
            chain output {
                type filter hook output priority 0; policy accept;
            }
        }
      
      So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • bpf: allow option for setting bpf_l4_csum_replace from scratch · d1b662ad
      Daniel Borkmann committed
      When programs need to calculate the csum from scratch for small UDP
      packets and use bpf_l4_csum_replace() to feed the result from helpers
      like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0
      that would ignore the case of current csum being 0, and which would
      still allow for the helper to set the csum and transform when needed
      to CSUM_MANGLED_0.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: Introduce sample tc action · 5c5670fa
      Yotam Gigi committed
      This action allows the user to sample traffic matched by tc classifier.
      The sampling consists of choosing packets randomly and sampling them using
      the psample module. The user can configure the psample group number, the
      sampling rate and the packet's truncation (to save kernel-user traffic).
      
      Example:
      To sample ingress traffic from interface eth1, one may use the commands:
      
      tc qdisc add dev eth1 handle ffff: ingress
      
      tc filter add dev eth1 parent ffff: \
      	   matchall action sample rate 12 group 4
      
      Where the first command adds an ingress qdisc and the second starts
      sampling randomly with an average of one sampled packet per 12 packets on
      dev eth1 to psample group 4.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c5670fa
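The "one sampled packet per `rate` packets on average" behaviour from the commit above can be sketched as a random draw per packet. This is an illustrative userspace sketch, not the action's actual code: the xorshift generator, the function names, and the handling of a zero rate are all assumptions made here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Small xorshift32 generator so the sketch is self-contained and repeatable. */
static uint32_t sketch_rand_state = 2463534242u;

static uint32_t sketch_rand(void)
{
	uint32_t x = sketch_rand_state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	sketch_rand_state = x;
	return x;
}

/*
 * Decide whether the current packet is sampled. With rate N, each packet
 * is picked independently with probability of about 1/N.
 */
static bool sketch_should_sample(uint32_t rate)
{
	if (rate == 0)
		return false; /* assumption of this sketch: 0 disables sampling */
	return sketch_rand() % rate == 0;
}
```

With `rate 12` as in the tc example, roughly one call in twelve returns true, and a rate of 1 samples every packet.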
    • Y
      net: Introduce psample, a new genetlink channel for packet sampling · 6ae0a628
      Yotam Gigi committed
      Add a general way for kernel modules to sample packets, without being tied
      to any specific subsystem. This netlink channel can be used by tc,
      iptables, etc. and allows packet sampling in the kernel to be standardized.
      
      For every sampled packet, the psample module adds the following metadata
      fields:
      
      PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable
      
      PSAMPLE_ATTR_OIFINDEX - the packet's output ifindex, if applicable
      
      PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
         truncated during sampling
      
      PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
         user who initiated the sampling. This field allows the user to
         differentiate between several samplers working simultaneously and
         filter packets relevant to him
      
      PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
         sequence is kept for each group
      
      PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
      
      PSAMPLE_ATTR_DATA - the actual packet bits
      
      The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
      group. In addition, add the GET_GROUPS netlink command which allows the
      user to see the current sample groups, their refcount and sequence number.
      This command currently supports only netlink dump mode.
      Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ae0a628
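To make the metadata fields listed in the commit above concrete, the following userspace sketch collects them in a plain struct and shows how truncation and the per-group sequence counter interact. The struct layout, the names, and the fixed buffer size are hypothetical; the real module emits these values as netlink attributes rather than as a C struct.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_DATA 256 /* arbitrary buffer size for this sketch */

/* Hypothetical per-group state: group number and its sequence counter. */
struct sketch_group {
	uint32_t group_num;
	uint32_t seq;
};

/* One field per PSAMPLE_ATTR_* item named in the commit message. */
struct sketch_sample {
	int32_t iifindex;      /* PSAMPLE_ATTR_IIFINDEX */
	int32_t oifindex;      /* PSAMPLE_ATTR_OIFINDEX */
	uint32_t origsize;     /* PSAMPLE_ATTR_ORIGSIZE */
	uint32_t sample_group; /* PSAMPLE_ATTR_SAMPLE_GROUP */
	uint32_t group_seq;    /* PSAMPLE_ATTR_GROUP_SEQ */
	uint32_t sample_rate;  /* PSAMPLE_ATTR_SAMPLE_RATE */
	uint32_t data_len;     /* length of the (possibly truncated) data */
	uint8_t data[SKETCH_MAX_DATA]; /* PSAMPLE_ATTR_DATA */
};

static void sketch_fill_sample(struct sketch_sample *s, struct sketch_group *g,
			       const uint8_t *pkt, uint32_t pkt_len,
			       uint32_t trunc_size, int32_t iif, int32_t oif,
			       uint32_t rate)
{
	uint32_t len = pkt_len;

	/* Truncate to the configured size, then to the sketch's buffer. */
	if (len > trunc_size)
		len = trunc_size;
	if (len > SKETCH_MAX_DATA)
		len = SKETCH_MAX_DATA;

	s->iifindex = iif;
	s->oifindex = oif;
	s->origsize = pkt_len;       /* original size survives truncation */
	s->sample_group = g->group_num;
	s->group_seq = g->seq++;     /* sequence is kept per group */
	s->sample_rate = rate;
	s->data_len = len;
	memcpy(s->data, pkt, len);
}
```

A receiver can compare `origsize` with `data_len` to tell whether a sample was truncated, and can use gaps in `group_seq` to detect lost samples within a group.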
    • F
      bridge: multicast to unicast · 6db6f0ea
      Felix Fietkau committed
      Implements an optional, per bridge port flag and feature to deliver
      multicast packets individually via unicast to each host on the
      corresponding port. This is done by copying the packet per host and
      changing the multicast destination MAC to a unicast one accordingly.
      
      multicast-to-unicast works on top of the multicast snooping feature of
      the bridge. This means unicast copies are only delivered to hosts that
      are interested in the traffic and have previously signalled this via
      IGMP/MLD reports.
      
      This feature is intended for interface types which have a more reliable
      and/or efficient way to deliver unicast packets than broadcast ones
      (e.g. wifi).
      
      However, it should only be enabled on interfaces where no IGMPv2/MLDv1
      report suppression takes place. This feature is disabled by default.
      
      The initial patch and idea are from Felix Fietkau.
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      [linus.luessing@c0d3.blue: various bug + style fixes, commit message]
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6db6f0ea
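The per-host copy and destination MAC rewrite described in the commit above can be sketched in userspace as follows. The function name and the flat output buffer are hypothetical simplifications; the bridge works on socket buffers and takes the list of interested hosts from its multicast snooping state.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6 /* length of an Ethernet MAC address */

/*
 * Copy `frame` once per interested host into `out`, replacing the
 * multicast destination MAC (the first 6 bytes of an Ethernet frame)
 * with that host's unicast MAC. Returns the number of copies written.
 */
static size_t sketch_mcast_to_ucast(const uint8_t *frame, size_t frame_len,
				    const uint8_t (*hosts)[SKETCH_ETH_ALEN],
				    size_t n_hosts,
				    uint8_t *out, size_t out_cap)
{
	size_t i, written = 0;

	if (frame_len < SKETCH_ETH_ALEN)
		return 0;

	for (i = 0; i < n_hosts; i++) {
		uint8_t *copy;

		if ((written + 1) * frame_len > out_cap)
			break; /* output buffer exhausted */
		copy = out + written * frame_len;
		memcpy(copy, frame, frame_len);
		memcpy(copy, hosts[i], SKETCH_ETH_ALEN);
		written++;
	}
	return written;
}
```

Only the destination address changes in each copy; the source MAC, EtherType, and payload stay identical to the original multicast frame.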
  16. 24 Jan, 2017 2 commits
    • M
      can: dev: add CAN interface API for fixed bitrates · 431af779
      Marc Kleine-Budde committed
      Some CAN interfaces only support fixed bitrates. This patch adds a
      netlink interface to get the list of the CAN interface's fixed bitrates and
      data bitrates.
      
      Inside the driver, arrays of the supported bitrate and data bitrate
      values are defined:
      
      const u32 drvname_bitrate[] = { 20000, 50000, 100000 };
      const u32 drvname_data_bitrate[] = { 200000, 500000, 1000000 };
      
      struct drvname_priv *priv;
      priv = netdev_priv(dev);
      
      priv->bitrate_const = drvname_bitrate;
      priv->bitrate_const_cnt = ARRAY_SIZE(drvname_bitrate);
      priv->data_bitrate_const = drvname_data_bitrate;
      priv->data_bitrate_const_cnt = ARRAY_SIZE(drvname_data_bitrate);
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      431af779
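With a fixed-bitrate table like the one in the commit above, a requested bitrate is only acceptable if it appears in the table. A minimal userspace sketch of that lookup follows; the function name is hypothetical and the table simply reuses the example values from the commit message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Example table, mirroring drvname_bitrate[] from the commit message. */
static const uint32_t sketch_bitrate[] = { 20000, 50000, 100000 };

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Return 0 if `requested` is one of the fixed bitrates, -EINVAL otherwise. */
static int sketch_validate_bitrate(uint32_t requested,
				   const uint32_t *table, size_t cnt)
{
	size_t i;

	for (i = 0; i < cnt; i++)
		if (table[i] == requested)
			return 0;
	return -EINVAL;
}
```

For example, `sketch_validate_bitrate(50000, sketch_bitrate, SKETCH_ARRAY_SIZE(sketch_bitrate))` succeeds, while a request for 125000 is rejected because the interface cannot provide it.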
    • O
      can: dev: add CAN interface termination API · 12a6075c
      Oliver Hartkopp committed
      This patch adds a netlink interface to configure the CAN bus termination of
      CAN interfaces.
      
      Inside the driver an array of supported termination values is defined:
      
      const u16 drvname_termination[] = { 60, 120, CAN_TERMINATION_DISABLED };
      
      struct drvname_priv *priv;
      priv = netdev_priv(dev);
      
      priv->termination_const = drvname_termination;
      priv->termination_const_cnt = ARRAY_SIZE(drvname_termination);
      priv->termination = CAN_TERMINATION_DISABLED;
      
      And the function to set the value has to be defined:
      
      priv->do_set_termination = drvname_set_termination;
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Reviewed-by: Ramesh Shanmugasundaram <Ramesh.shanmugasundaram@bp.renesas.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      12a6075c
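The way the termination table and the `do_set_termination` callback from the commit above might work together can be sketched in userspace as below. The struct, the stub callback, and the value 0 for the "disabled" constant are assumptions of this sketch rather than the kernel's definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for CAN_TERMINATION_DISABLED. */
#define SKETCH_TERMINATION_DISABLED 0

/* Hypothetical, reduced version of the driver's private data. */
struct sketch_priv {
	const uint16_t *termination_const;
	size_t termination_const_cnt;
	uint16_t termination;
	int (*do_set_termination)(struct sketch_priv *priv, uint16_t term);
};

static const uint16_t sketch_termination[] = {
	60, 120, SKETCH_TERMINATION_DISABLED
};

/* Stub driver callback: a real driver would switch the resistor here. */
static int sketch_hw_set_termination(struct sketch_priv *priv, uint16_t term)
{
	(void)priv;
	(void)term;
	return 0;
}

/* Validate `term` against the table, call the driver, then record it. */
static int sketch_change_termination(struct sketch_priv *priv, uint16_t term)
{
	size_t i;
	int err;

	if (!priv->do_set_termination)
		return -EOPNOTSUPP;

	for (i = 0; i < priv->termination_const_cnt; i++)
		if (priv->termination_const[i] == term)
			break;
	if (i >= priv->termination_const_cnt)
		return -EINVAL;

	err = priv->do_set_termination(priv, term);
	if (err)
		return err;

	priv->termination = term;
	return 0;
}
```

The stored value is only updated after the callback succeeds, so a failed or unsupported request leaves the previously configured termination in place.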