1. 19 11月, 2012 2 次提交
    • E
      userns: make each net (net_ns) belong to a user_ns · 038e7332
      Eric W. Biederman 提交于
      The user namespace which creates a new network namespace owns that
      namespace and all resources created in it.  This way we can target
      capability checks for privileged operations against network resources to
      the user_ns which created the network namespace in which the resource
      lives.  Privilege to the user namespace which owns the network
      namespace, or any parent user namespace thereof, provides the same
      privilege to the network resource.
      
      This patch is reworked from a version originally by
      Serge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      038e7332
    • E
      netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS · d727abcb
      Eric W. Biederman 提交于
      The copy of copy_net_ns used when the network stack is not
      built is broken as it does not return -EINVAL when attempting
      to create a new network namespace.  We don't even have
      a previous network namespace.
      
      Since we need a copy of copy_net_ns in net/net_namespace.h that is
      available when the networking stack is not built at all move the
      correct version of copy_net_ns from net_namespace.c into net_namespace.h
      Leaving us with just 2 versions of copy_net_ns.  One version for when
      we compile in network namespace suport and another stub for all other
      occasions.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      d727abcb
  2. 15 10月, 2012 1 次提交
  3. 09 10月, 2012 2 次提交
    • J
      ipv4: Add FLOWI_FLAG_KNOWN_NH · c92b9655
      Julian Anastasov 提交于
      Add flag to request that output route should be
      returned with known rt_gateway, in case we want to use
      it as nexthop for neighbour resolving.
      
      	The returned route can be cached as follows:
      
      - in NH exception: because the cached routes are not shared
      	with other destinations
      - in FIB NH: when using gateway because all destinations for
      	NH share same gateway
      
      	As last option, to return rt_gateway!=0 we have to
      set DST_NOCACHE.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c92b9655
    • J
      ipv4: introduce rt_uses_gateway · 155e8336
      Julian Anastasov 提交于
      Add new flag to remember when route is via gateway.
      We will use it to allow rt_gateway to contain address of
      directly connected host for the cases when DST_NOCACHE is
      used or when the NH exception caches per-destination route
      without DST_NOCACHE flag, i.e. when routes are not used for
      other destinations. By this way we force the neighbour
      resolving to work with the routed destination but we
      can use different address in the packet, feature needed
      for IPVS-DR where original packet for virtual IP is routed
      via route to real IP.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      155e8336
  4. 06 10月, 2012 1 次提交
  5. 05 10月, 2012 2 次提交
    • N
      sctp: check src addr when processing SACK to update transport state · edfee033
      Nicolas Dichtel 提交于
      Suppose we have an SCTP connection with two paths. After connection is
      established, path1 is not available, thus this path is marked as inactive. Then
      traffic goes through path2, but for some reasons packets are delayed (after
      rto.max). Because packets are delayed, the retransmit mechanism will switch
      again to path1. At this time, we receive a delayed SACK from path2. When we
      update the state of the path in sctp_check_transmitted(), we do not take into
      account the source address of the SACK, hence we update the wrong path.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edfee033
    • E
      ipv4: add a fib_type to fib_info · f4ef85bb
      Eric Dumazet 提交于
      commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops.)
      introduced a regression for forwarding.
      
      This was hard to reproduce but the symptom was that packets were
      delivered to local host instead of being forwarded.
      
      David suggested to add fib_type to fib_info so that we dont
      inadvertently share same fib_info for different purposes.
      
      With help from Julian Anastasov who provided very helpful
      hints, reproduced here :
      
      <quote>
              Can it be a problem related to fib_info reuse
      from different routes. For example, when local IP address
      is created for subnet we have:
      
      broadcast 192.168.0.255 dev DEV  proto kernel  scope link  src
      192.168.0.1
      192.168.0.0/24 dev DEV  proto kernel  scope link  src 192.168.0.1
      local 192.168.0.1 dev DEV  proto kernel  scope host  src 192.168.0.1
      
              The "dev DEV  proto kernel  scope link  src 192.168.0.1" is
      a reused fib_info structure where we put cached routes.
      The result can be same fib_info for 192.168.0.255 and
      192.168.0.0/24. RTN_BROADCAST is cached only for input
      routes. Incoming broadcast to 192.168.0.255 can be cached
      and can cause problems for traffic forwarded to 192.168.0.0/24.
      So, this patch should solve the problem because it
      separates the broadcast from unicast traffic.
      
              And the ip_route_input_slow caching will work for
      local and broadcast input routes (above routes 1 and 3) just
      because they differ in scope and use different fib_info.
      
      </quote>
      
      Many thanks to Chris Clayton for his patience and help.
      Reported-by: NChris Clayton <chris2553@googlemail.com>
      Bisected-by: NChris Clayton <chris2553@googlemail.com>
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Tested-by: NChris Clayton <chris2553@googlemail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4ef85bb
  6. 02 10月, 2012 2 次提交
    • E
      ipv4: gre: add GRO capability · 60769a5d
      Eric Dumazet 提交于
      Add GRO capability to IPv4 GRE tunnels, using the gro_cells
      infrastructure.
      
      Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
      checking GRO is building large packets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60769a5d
    • E
      net: add gro_cells infrastructure · c9e6bc64
      Eric Dumazet 提交于
      This adds a new include file (include/net/gro_cells.h), to bring GRO
      (Generic Receive Offload) capability to tunnels, in a modular way.
      
      Because tunnels receive path is lockless, and GRO adds a serialization
      using a napi_struct, I chose to add an array of up to
      DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi queue devices wont be
      slowed down because of GRO layer.
      
      skb_get_rx_queue() is used as selector.
      
      In the future, we might add optional fanout capabilities, using rxhash
      for example.
      
      With help from Ben Hutchings who reminded me
      netif_get_num_default_rss_queues() function.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9e6bc64
  7. 28 9月, 2012 2 次提交
    • E
      net: remove sk_init() helper · e2bcabec
      Eric Dumazet 提交于
      It seems sk_init() has no value today and even does strange things :
      
      # grep . /proc/sys/net/core/?mem_*
      /proc/sys/net/core/rmem_default:212992
      /proc/sys/net/core/rmem_max:131071
      /proc/sys/net/core/wmem_default:212992
      /proc/sys/net/core/wmem_max:131071
      
      We can remove it completely.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NShan Wei <davidshan@tencent.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2bcabec
    • S
      tunnel: drop packet if ECN present with not-ECT · eccc1bb8
      stephen hemminger 提交于
      Linux tunnels were written before RFC6040 and therefore never
      implemented the corner case of ECN getting set in the outer header
      and the inner header not being ready for it.
      
      Section 4.2.  Default Tunnel Egress Behaviour.
       o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
            propagate any other ECN codepoint onwards.  This is because the
            inner Not-ECT marking is set by transports that rely on dropped
            packets as an indication of congestion and would not understand or
            respond to any other ECN codepoint [RFC4774].  Specifically:
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               CE, the decapsulator MUST drop the packet.
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
               outgoing packet with the ECN field cleared to Not-ECT.
      
      This patch moves the ECN decap logic out of the individual tunnels
      into a common place.
      
      It also adds logging to allow detecting broken systems that
      set ECN bits incorrectly when tunneling (or an intermediate
      router might be changing the header).
      
      Overloads rx_frame_error to keep track of ECN related error.
      
      Thanks to Chris Wright who caught this while reviewing the new VXLAN
      tunnel.
      
      This code was tested by injecting faulty logic in other end GRE
      to send incorrectly encapsulated packets.
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eccc1bb8
  8. 25 9月, 2012 13 次提交
  9. 23 9月, 2012 2 次提交
  10. 20 9月, 2012 4 次提交
    • A
      ipv6: unify fragment thresh handling code · 6b102865
      Amerigo Wang 提交于
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b102865
    • A
      ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static inline · d4915c08
      Amerigo Wang 提交于
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4915c08
    • A
      ipv6: unify conntrack reassembly expire code with standard one · b836c99f
      Amerigo Wang 提交于
      Two years ago, Shan Wei tried to fix this:
      http://patchwork.ozlabs.org/patch/43905/
      
      The problem is that RFC2460 requires an ICMP Time
      Exceeded -- Fragment Reassembly Time Exceeded message should be
      sent to the source of that fragment, if the defragmentation
      times out.
      
      "
         If insufficient fragments are received to complete reassembly of a
         packet within 60 seconds of the reception of the first-arriving
         fragment of that packet, reassembly of that packet must be
         abandoned and all the fragments that have been received for that
         packet must be discarded.  If the first fragment (i.e., the one
         with a Fragment Offset of zero) has been received, an ICMP Time
         Exceeded -- Fragment Reassembly Time Exceeded message should be
         sent to the source of that fragment.
      "
      
      As Herbert suggested, we could actually use the standard IPv6
      reassembly code which follows RFC2460.
      
      With this patch applied, I can see ICMP Time Exceeded sent
      from the receiver when the sender sent out 3/4 fragmented
      IPv6 UDP packet.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b836c99f
    • A
      ipv6: add a new namespace for nf_conntrack_reasm · c038a767
      Amerigo Wang 提交于
      As pointed by Michal, it is necessary to add a new
      namespace for nf_conntrack_reasm code, this prepares
      for the second patch.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c038a767
  11. 19 9月, 2012 7 次提交
  12. 18 9月, 2012 2 次提交
    • E
      userns: Convert the audit loginuid to be a kuid · e1760bd5
      Eric W. Biederman 提交于
      Always store audit loginuids in type kuid_t.
      
      Print loginuids by converting them into uids in the appropriate user
      namespace, and then printing the resulting uid.
      
      Modify audit_get_loginuid to return a kuid_t.
      
      Modify audit_set_loginuid to take a kuid_t.
      
      Modify /proc/<pid>/loginuid on read to convert the loginuid into the
      user namespace of the opener of the file.
      
      Modify /proc/<pid>/loginud on write to convert the loginuid
      rom the user namespace of the opener of the file.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com> ?
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      e1760bd5
    • C
      include/net/sock.h: squelch compiler warning in sk_rmem_schedule() · 35c448a8
      Chuck Lever 提交于
      This warning:
      
        In file included from linux/include/linux/tcp.h:227:0,
                         from linux/include/linux/ipv6.h:221,
                         from linux/include/net/ipv6.h:16,
                         from linux/include/linux/sunrpc/clnt.h:26,
                         from linux/net/sunrpc/stats.c:22:
        linux/include/net/sock.h: In function `sk_rmem_schedule':
        linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
      
      is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the
      -Wextra option.
      
      Commit c76562b6 ("netvm: prevent a stream-specific deadlock")
      accidentally replaced the "size" parameter of sk_rmem_schedule() with an
      unsigned int.  This changes the semantics of the comparison in the
      return statement.
      
      In sk_wmem_schedule we have syntactically the same comparison, but
      "size" is a signed integer.  In addition, __sk_mem_schedule() takes a
      signed integer for its "size" parameter, so there is an implicit type
      conversion in sk_rmem_schedule() anyway.
      
      Revert the "size" parameter back to a signed integer so that the
      semantics of the expressions in both sk_[rw]mem_schedule() are exactly
      the same.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35c448a8