1. 28 8月, 2015 5 次提交
    • T
      bridge: Add netlink support for vlan_protocol attribute · d2d427b3
      Toshiaki Makita 提交于
      This enables bridge vlan_protocol to be configured through netlink.
      
      When CONFIG_BRIDGE_VLAN_FILTERING is disabled, kernel behaves the
      same way as this feature is not implemented.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2d427b3
    • J
      openvswitch: Allow attaching helpers to ct action · cae3a262
      Joe Stringer 提交于
      Add support for using conntrack helpers to assist protocol detection.
      The new OVS_CT_ATTR_HELPER attribute of the CT action specifies a helper
      to be used for this connection. If no helper is specified, then helpers
      will be automatically applied as per the sysctl configuration of
      net.netfilter.nf_conntrack_helper.
      
      The helper may be specified as part of the conntrack action, eg:
      ct(helper=ftp). Initial packets for related connections should be
      committed to allow later packets for the flow to be considered
      established.
      
      Example ovs-ofctl flows allowing FTP connections from ports 1->2:
      in_port=1,tcp,action=ct(helper=ftp,commit),2
      in_port=2,tcp,ct_state=-trk,action=ct(recirc)
      in_port=2,tcp,ct_state=+trk-new+est,action=1
      in_port=2,tcp,ct_state=+trk+rel,action=1
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cae3a262
    • J
      openvswitch: Allow matching on conntrack label · c2ac6673
      Joe Stringer 提交于
      Allow matching and setting the ct_label field. As with ct_mark, this is
      populated by executing the CT action. The label field may be modified by
      specifying a label and mask nested under the CT action. It is stored as
      metadata attached to the connection. Label modification occurs after
      lookup, and will only persist when the conntrack entry is committed by
      providing the COMMIT flag to the CT action. Labels are currently fixed
      to 128 bits in size.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c2ac6673
    • J
      openvswitch: Allow matching on conntrack mark · 182e3042
      Joe Stringer 提交于
      Allow matching and setting the ct_mark field. As with ct_state and
      ct_zone, these fields are populated when the CT action is executed. To
      write to this field, a value and mask can be specified as a nested
      attribute under the CT action. This data is stored with the conntrack
      entry, and is executed after the lookup occurs for the CT action. The
      conntrack entry itself must be committed using the COMMIT flag in the CT
      action flags for this change to persist.
      Signed-off-by: NJustin Pettit <jpettit@nicira.com>
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      182e3042
    • J
      openvswitch: Add conntrack action · 7f8a436e
      Joe Stringer 提交于
      Expose the kernel connection tracker via OVS. Userspace components can
      make use of the CT action to populate the connection state (ct_state)
      field for a flow. This state can be subsequently matched.
      
      Exposed connection states are OVS_CS_F_*:
      - NEW (0x01) - Beginning of a new connection.
      - ESTABLISHED (0x02) - Part of an existing connection.
      - RELATED (0x04) - Related to an established connection.
      - INVALID (0x20) - Could not track the connection for this packet.
      - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
      - TRACKED (0x80) - This packet has been sent through conntrack.
      
      When the CT action is executed by itself, it will send the packet
      through the connection tracker and populate the ct_state field with one
      or more of the connection state flags above. The CT action will always
      set the TRACKED bit.
      
      When the COMMIT flag is passed to the conntrack action, this specifies
      that information about the connection should be stored. This allows
      subsequent packets for the same (or related) connections to be
      correlated with this connection. Sending subsequent packets for the
      connection through conntrack allows the connection tracker to consider
      the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.
      
      The CT action may optionally take a zone to track the flow within. This
      allows connections with the same 5-tuple to be kept logically separate
      from connections in other zones. If the zone is specified, then the
      "ct_zone" match field will be subsequently populated with the zone id.
      
      IP fragments are handled by transparently assembling them as part of the
      CT action. The maximum received unit (MRU) size is tracked so that
      refragmentation can occur during output.
      
      IP frag handling contributed by Andy Zhou.
      
      Based on original design by Justin Pettit.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Signed-off-by: NJustin Pettit <jpettit@nicira.com>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f8a436e
  2. 21 8月, 2015 1 次提交
  3. 18 8月, 2015 6 次提交
    • T
      net: Identifier Locator Addressing module · 65d7ab8d
      Tom Herbert 提交于
      Adding new module name ila. This implements ILA translation. Light
      weight tunnel redirection is used to perform the translation in
      the data path. This is configured by the "ip -6 route" command
      using the "encap ila <locator>" option, where <locator> is the
      value to set in destination locator of the packet. e.g.
      
      ip -6 route add 3333:0:0:1:5555:0:1:0/128 \
            encap ila 2001:0:0:1 via 2401:db00:20:911a:face:0:25:0
      
      Sets a route where 3333:0:0:1 will be overwritten by
      2001:0:0:1 on output.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65d7ab8d
    • D
      netfilter: nf_conntrack: add efficient mark to zone mapping · 5e8018fc
      Daniel Borkmann 提交于
      This work adds the possibility of deriving the zone id from the skb->mark
      field in a scalable manner. This allows for having only a single template
      serving hundreds/thousands of different zones, for example, instead of the
      need to have one match for each zone as an extra CT jump target.
      
      Note that we'd need to have this information attached to the template as at
      the time when we're trying to lookup a possible ct object, we already need
      to know zone information for a possible match when going into
      __nf_conntrack_find_get(). This work provides a minimal implementation for
      a possible mapping.
      
      In order to not add/expose an extra ct->status bit, the zone structure has
      been extended to carry a flag for deriving the mark.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5e8018fc
    • D
      netfilter: nf_conntrack: add direction support for zones · deedb590
      Daniel Borkmann 提交于
      This work adds a direction parameter to netfilter zones, so identity
      separation can be performed only in original/reply or both directions
      (default). This basically opens up the possibility of doing NAT with
      conflicting IP address/port tuples from multiple, isolated tenants
      on a host (e.g. from a netns) without requiring each tenant to NAT
      twice resp. to use its own dedicated IP address to SNAT to, meaning
      overlapping tuples can be made unique with the zone identifier in
      original direction, where the NAT engine will then allocate a unique
      tuple in the commonly shared default zone for the reply direction.
      In some restricted, local DNAT cases, also port redirection could be
      used for making the reply traffic unique w/o requiring SNAT.
      
      The consensus we've reached and discussed at NFWS and since the initial
      implementation [1] was to directly integrate the direction meta data
      into the existing zones infrastructure, as opposed to the ct->mark
      approach we proposed initially.
      
      As we pass the nf_conntrack_zone object directly around, we don't have
      to touch all call-sites, but only those, that contain equality checks
      of zones. Thus, based on the current direction (original or reply),
      we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
      CT expectations are direction-agnostic entities when expectations are
      being compared among themselves, so we can only use the identifier
      in this case.
      
      Note that zone identifiers can not be included into the hash mix
      anymore as they don't contain a "stable" value that would be equal
      for both directions at all times, f.e. if only zone->id would
      unconditionally be xor'ed into the table slot hash, then replies won't
      find the corresponding conntracking entry anymore.
      
      If no particular direction is specified when configuring zones, the
      behaviour is exactly as we expect currently (both directions).
      
      Support has been added for the CT netlink interface as well as the
      x_tables raw CT target, which both already offer existing interfaces
      to user space for the configuration of zones.
      
      Below a minimal, simplified collision example (script in [2]) with
      netperf sessions:
      
        +--- tenant-1 ---+   mark := 1
        |    netperf     |--+
        +----------------+  |                CT zone := mark [ORIGINAL]
         [ip,sport] := X   +--------------+  +--- gateway ---+
                           | mark routing |--|     SNAT      |-- ... +
                           +--------------+  +---------------+       |
        +--- tenant-2 ---+  |                                     ~~~|~~~
        |    netperf     |--+                +-----------+           |
        +----------------+   mark := 2       | netserver |------ ... +
         [ip,sport] := X                     +-----------+
                                              [ip,port] := Y
      On the gateway netns, example:
      
        iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
        iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully
      
        iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
        iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark
      
      conntrack dump from gateway netns:
      
        netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns
      
        tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                              src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                              src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2
      
      Taking this further, test script in [2] creates 200 tenants and runs
      original-tuple colliding netperf sessions each. A conntrack -L dump in
      the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
      state as expected.
      
      I also did run various other tests with some permutations of the script,
      to mention some: SNAT in random/random-fully/persistent mode, no zones (no
      overlaps), static zones (original, reply, both directions), etc.
      
        [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
        [2] https://paste.fedoraproject.org/242835/65657871/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      deedb590
    • W
      packet: add extended BPF fanout mode · f2e52095
      Willem de Bruijn 提交于
      Add fanout mode PACKET_FANOUT_EBPF that accepts an en extended BPF
      program to select a socket.
      
      Update the internal eBPF program by passing to socket option
      SOL_PACKET/PACKET_FANOUT_DATA a file descriptor returned by bpf().
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2e52095
    • W
      packet: add classic BPF fanout mode · 47dceb8e
      Willem de Bruijn 提交于
      Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
      to select a socket.
      
      This avoids having to keep adding special case fanout modes. One
      example use case is application layer load balancing. The QUIC
      protocol, for instance, encodes a connection ID in UDP payload.
      
      Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
      associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
      only user so far.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47dceb8e
    • J
      lwtunnel: rename ip lwtunnel attributes · a1c234f9
      Jiri Benc 提交于
      We already have IFLA_IPTUN_ netlink attributes. The IP_TUN_ attributes look
      very similar, yet they serve very different purpose. This is confusing for
      anyone trying to implement a user space tool supporting lwt.
      
      As the IP_TUN_ attributes are used only for the lightweight tunnels, prefix
      them with LWTUNNEL_IP_ instead to make their purpose clear. Also, it's more
      logical to have them in lwtunnel.h together with the encap enum.
      
      Fixes: 3093fbe7 ("route: Per route IP tunnel metadata via lightweight tunnel")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1c234f9
  4. 14 8月, 2015 2 次提交
    • D
      net: Introduce VRF related flags and helpers · 4e3c8992
      David Ahern 提交于
      Add a VRF_MASTER flag for interfaces and helper functions for determining
      if a device is a VRF_MASTER.
      
      Add link attribute for passing VRF_TABLE id.
      
      Add vrf_ptr to netdevice.
      
      Add various macros for determining if a device is a VRF device, the index
      of the master VRF device and table associated with VRF device.
      Signed-off-by: NShrijeet Mukherjee <shm@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e3c8992
    • A
      net: ipv6 sysctl option to ignore routes when nexthop link is down · 35103d11
      Andy Gospodarek 提交于
      Like the ipv4 patch with a similar title, this adds a sysctl to allow
      the user to change routing behavior based on whether or not the
      interface associated with the nexthop was an up or down link.  The
      default setting preserves the current behavior, but anyone that enables
      it will notice that nexthops on down interfaces will no longer be
      selected:
      
      net.ipv6.conf.all.ignore_routes_with_linkdown = 0
      net.ipv6.conf.default.ignore_routes_with_linkdown = 0
      net.ipv6.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, not only will link status be reported to
      userspace, but an indication that a nexthop is dead and will not be used
      is also reported.
      
      1000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
      1000::/8 via 8000::2 dev p8p1  metric 1024  pref medium
      7000::/8 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
      8000::/8 dev p8p1  proto kernel  metric 256  pref medium
      9000::/8 via 8000::2 dev p8p1  metric 2048  pref medium
      9000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
      fe80::/64 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
      fe80::/64 dev p8p1  proto kernel  metric 256  pref medium
      
      This also adds devconf support and notification when sysctl values
      change.
      
      v2: drop use of rt6i_nhflags since it is not needed right now
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35103d11
  5. 11 8月, 2015 3 次提交
  6. 10 8月, 2015 2 次提交
  7. 08 8月, 2015 1 次提交
  8. 07 8月, 2015 3 次提交
  9. 04 8月, 2015 1 次提交
  10. 03 8月, 2015 1 次提交
    • D
      ebpf: add skb->hash to offset map for usage in {cls, act}_bpf or filters · ba7591d8
      Daniel Borkmann 提交于
      Add skb->hash to the __sk_buff offset map, so it can be accessed from
      an eBPF program. We currently already do this for classic BPF filters,
      but not yet on eBPF, it might be useful as a demuxer in combination with
      helpers like bpf_clone_redirect(), toy example:
      
        __section("cls-lb") int ingress_main(struct __sk_buff *skb)
        {
          unsigned int which = 3 + (skb->hash & 7);
          /* bpf_skb_store_bytes(skb, ...); */
          /* bpf_l{3,4}_csum_replace(skb, ...); */
          bpf_clone_redirect(skb, which, 0);
          return -1;
        }
      
      I was thinking whether to add skb_get_hash(), but then concluded the
      raw skb->hash seems fine in this case: we can directly access the hash
      w/o extra eBPF helper function call, it's filled out by many NICs on
      ingress, and in case the entropy level would not be sufficient, people
      can still implement their own specific sw fallback hash mix anyway.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba7591d8
  11. 01 8月, 2015 3 次提交
  12. 31 7月, 2015 1 次提交
    • H
      net/ipv6: add sysctl option accept_ra_min_hop_limit · 8013d1d7
      Hangbin Liu 提交于
      Commit 6fd99094 ("ipv6: Don't reduce hop limit for an interface")
      disabled accept hop limit from RA if it is smaller than the current hop
      limit for security stuff. But this behavior kind of break the RFC definition.
      
      RFC 4861, 6.3.4.  Processing Received Router Advertisements
         A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
         and Retrans Timer) may contain a value denoting that it is
         unspecified.  In such cases, the parameter should be ignored and the
         host should continue using whatever value it is already using.
      
         If the received Cur Hop Limit value is non-zero, the host SHOULD set
         its CurHopLimit variable to the received value.
      
      So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
      hop limit value they can accept from RA. And set default to 1 to meet RFC
      standards.
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Acked-by: NYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8013d1d7
  13. 30 7月, 2015 1 次提交
    • M
      netfilter: nf_ct_sctp: minimal multihoming support · d7ee3519
      Michal Kubeček 提交于
      Currently nf_conntrack_proto_sctp module handles only packets between
      primary addresses used to establish the connection. Any packets between
      secondary addresses are classified as invalid so that usual firewall
      configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
      establish a new conntrack would allow traffic between secondary
      addresses to pass through. A more sophisticated solution based on the
      addresses advertised in the initial handshake (and possibly also later
      dynamic address addition and removal) would be much harder to implement.
      Moreover, in general we cannot assume to always see the initial
      handshake as it can be routed through a different path.
      
      The patch adds two new conntrack states:
      
        SCTP_CONNTRACK_HEARTBEAT_SENT  - a HEARTBEAT chunk seen but not acked
        SCTP_CONNTRACK_HEARTBEAT_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK
      
      State transition rules:
      
      - HEARTBEAT_SENT responds to usual chunks the same way as NONE (so that
        the behaviour changes as little as possible)
      - HEARTBEAT_ACKED responds to usual chunks the same way as ESTABLISHED
        does, except the resulting state is HEARTBEAT_ACKED rather than
        ESTABLISHED
      - previously existing states except NONE are preserved when HEARTBEAT or
        HEARTBEAT-ACK is seen
      - NONE (in the initial direction) changes to HEARTBEAT_SENT on HEARTBEAT
        and to CLOSED on HEARTBEAT-ACK
      - HEARTBEAT_SENT changes to HEARTBEAT_ACKED on HEARTBEAT-ACK in the
        reply direction
      - HEARTBEAT_SENT and HEARTBEAT_ACKED are preserved on HEARTBEAT and
        HEARTBEAT-ACK otherwise
      
      Normally, vtag is set from the INIT chunk for the reply direction and
      from the INIT-ACK chunk for the originating direction (i.e. each of
      these defines vtag value for the opposite direction). For secondary
      conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
      seen them, we would need to connect two different conntracks. Therefore
      simplified logic is applied: vtag of first packet in each direction
      (HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
      saved and all following packets in that direction are compared with this
      saved value. While INIT and INIT-ACK define vtag for the opposite
      direction, vtags extracted from HEARTBEAT and HEARTBEAT-ACK are always
      for their direction.
      
      Default timeout values for new states are
      
        HEARTBEAT_SENT: 30 seconds (default hb_interval)
        HEARTBEAT_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)
      
      (We cannot expect to see the shutdown sequence so that, unlike
      ESTABLISHED, the HEARTBEAT_ACKED timeout shouldn't be too long.)
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d7ee3519
  14. 27 7月, 2015 1 次提交
  15. 23 7月, 2015 1 次提交
  16. 22 7月, 2015 8 次提交