1. 05 Oct, 2012 · 1 commit
    • ipv4: add a fib_type to fib_info · f4ef85bb
      Eric Dumazet committed
      commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops.)
      introduced a regression for forwarding.
      
      This was hard to reproduce but the symptom was that packets were
      delivered to local host instead of being forwarded.
      
      David suggested adding fib_type to fib_info so that we don't
      inadvertently share the same fib_info for different purposes.
      
      With help from Julian Anastasov who provided very helpful
      hints, reproduced here :
      
      <quote>
              Can it be a problem related to fib_info reuse
      from different routes. For example, when local IP address
      is created for subnet we have:
      
      broadcast 192.168.0.255 dev DEV  proto kernel  scope link  src
      192.168.0.1
      192.168.0.0/24 dev DEV  proto kernel  scope link  src 192.168.0.1
      local 192.168.0.1 dev DEV  proto kernel  scope host  src 192.168.0.1
      
              The "dev DEV  proto kernel  scope link  src 192.168.0.1" is
      a reused fib_info structure where we put cached routes.
      The result can be same fib_info for 192.168.0.255 and
      192.168.0.0/24. RTN_BROADCAST is cached only for input
      routes. Incoming broadcast to 192.168.0.255 can be cached
      and can cause problems for traffic forwarded to 192.168.0.0/24.
      So, this patch should solve the problem because it
      separates the broadcast from unicast traffic.
      
              And the ip_route_input_slow caching will work for
      local and broadcast input routes (above routes 1 and 3) just
      because they differ in scope and use different fib_info.
      
      </quote>
      
      Many thanks to Chris Clayton for his patience and help.
      Reported-by: Chris Clayton <chris2553@googlemail.com>
      Bisected-by: Chris Clayton <chris2553@googlemail.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
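The sharing problem described in the quote can be illustrated with a small model. This is only a sketch: `struct fi_key`, `fi_key_equal_old()` and `fi_key_equal_new()` are hypothetical names and do not reproduce the kernel's `struct fib_info` or its lookup code. It shows that the broadcast route and the subnet route from the example compare equal until the route type becomes part of the comparison.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the fields that decide whether two
 * routes may share one fib_info. This is not the kernel's struct fib_info. */
enum route_type { ROUTE_UNICAST, ROUTE_LOCAL, ROUTE_BROADCAST };
enum route_scope { SCOPE_LINK, SCOPE_HOST };

struct fi_key {
    const char *dev;        /* output device name */
    const char *prefsrc;    /* preferred source address */
    enum route_scope scope;
    enum route_type type;   /* models the fib_type field this commit adds */
};

/* Comparison that ignores the route type, modelling the behaviour
 * before the fix. */
static bool fi_key_equal_old(const struct fi_key *a, const struct fi_key *b)
{
    return strcmp(a->dev, b->dev) == 0 &&
           strcmp(a->prefsrc, b->prefsrc) == 0 &&
           a->scope == b->scope;
}

/* Comparison that also checks the type, so broadcast and unicast routes
 * can no longer end up sharing one entry. */
static bool fi_key_equal_new(const struct fi_key *a, const struct fi_key *b)
{
    return fi_key_equal_old(a, b) && a->type == b->type;
}
```

In this model the local route already differs from the other two by scope, which matches the remark in the quote that routes 1 and 3 use different fib_info structures.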
  2. 02 Oct, 2012 · 2 commits
    • ipv4: gre: add GRO capability · 60769a5d
      Eric Dumazet committed
      Add GRO capability to IPv4 GRE tunnels, using the gro_cells
      infrastructure.
      
      Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
      checking GRO is building large packets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add gro_cells infrastructure · c9e6bc64
      Eric Dumazet committed
      This adds a new include file (include/net/gro_cells.h), to bring GRO
      (Generic Receive Offload) capability to tunnels, in a modular way.
      
      Because the tunnel receive path is lockless, and GRO adds a serialization
      using a napi_struct, I chose to add an array of up to
      DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi queue devices won't be
      slowed down because of the GRO layer.
      
      skb_get_rx_queue() is used as selector.
      
      In the future, we might add optional fanout capabilities, using rxhash
      for example.
      
      With help from Ben Hutchings who reminded me of the
      netif_get_num_default_rss_queues() function.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
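The idea of spreading work over an array of cells selected by receive queue can be sketched as follows. All names here (`struct cell_array`, `cell_select()`, `MAX_CELLS`) are hypothetical, and the modulo mapping is an assumption made for illustration; the commit message only states that skb_get_rx_queue() is used as the selector, not how the index is reduced.

```c
#include <stddef.h>

/* Illustrative cap; stands in for DEFAULT_MAX_NUM_RSS_QUEUES. */
#define MAX_CELLS 8

/* One cell per receive queue, so queues do not serialize on each other. */
struct cell {
    unsigned long packets;
};

struct cell_array {
    struct cell cells[MAX_CELLS];
    size_t count;               /* cells in use, between 1 and MAX_CELLS */
};

/* Size the array from the number of receive queues, clamped to the cap. */
static void cell_array_init(struct cell_array *a, size_t nr_queues)
{
    size_t i;

    if (nr_queues < 1)
        nr_queues = 1;
    if (nr_queues > MAX_CELLS)
        nr_queues = MAX_CELLS;
    a->count = nr_queues;
    for (i = 0; i < MAX_CELLS; i++)
        a->cells[i].packets = 0;
}

/* Pick the cell for a packet from the queue it arrived on.
 * Assumption: a plain modulo is used to fold the queue index. */
static struct cell *cell_select(struct cell_array *a, unsigned int rx_queue)
{
    return &a->cells[rx_queue % a->count];
}
```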
  3. 28 Sep, 2012 · 2 commits
    • net: remove sk_init() helper · e2bcabec
      Eric Dumazet committed
      It seems sk_init() has no value today and even does strange things :
      
      # grep . /proc/sys/net/core/?mem_*
      /proc/sys/net/core/rmem_default:212992
      /proc/sys/net/core/rmem_max:131071
      /proc/sys/net/core/wmem_default:212992
      /proc/sys/net/core/wmem_max:131071
      
      We can remove it completely.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shan Wei <davidshan@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tunnel: drop packet if ECN present with not-ECT · eccc1bb8
      stephen hemminger committed
      Linux tunnels were written before RFC6040 and therefore never
      implemented the corner case of ECN getting set in the outer header
      and the inner header not being ready for it.
      
      Section 4.2.  Default Tunnel Egress Behaviour.
       o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
            propagate any other ECN codepoint onwards.  This is because the
            inner Not-ECT marking is set by transports that rely on dropped
            packets as an indication of congestion and would not understand or
            respond to any other ECN codepoint [RFC4774].  Specifically:
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               CE, the decapsulator MUST drop the packet.
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
               outgoing packet with the ECN field cleared to Not-ECT.
      
      This patch moves the ECN decap logic out of the individual tunnels
      into a common place.
      
      It also adds logging to allow detecting broken systems that
      set ECN bits incorrectly when tunneling (or an intermediate
      router might be changing the header).
      
      Overloads rx_frame_error to keep track of ECN-related errors.
      
      Thanks to Chris Wright who caught this while reviewing the new VXLAN
      tunnel.
      
      This code was tested by injecting faulty logic into the GRE code at
      the other end so that it sends incorrectly encapsulated packets.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
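The quoted RFC 6040 rules for an inner Not-ECT packet can be written as a small decision function. This is a sketch under stated assumptions, not the kernel's common decapsulation helper: `enum ecn` and `decap_ecn()` are hypothetical names, and besides the two quoted rules only the propagation of an outer CE mark onto an ECN-capable inner packet is modelled. Mixed ECT(0)/ECT(1) combinations are left as the inner value.

```c
#include <stdbool.h>

/* ECN codepoints as two-bit values (RFC 3168). */
enum ecn {
    ECN_NOT_ECT = 0x0,
    ECN_ECT_1   = 0x1,
    ECN_ECT_0   = 0x2,
    ECN_CE      = 0x3
};

/* Decide what a decapsulator does with a packet.
 * Returns false if the packet must be dropped; otherwise returns true and
 * stores the ECN field of the forwarded packet in *out. */
static bool decap_ecn(enum ecn outer, enum ecn inner, enum ecn *out)
{
    if (inner == ECN_NOT_ECT) {
        /* Inner transport does not understand ECN: a CE mark on the
         * outer header can only be honoured by dropping. */
        if (outer == ECN_CE)
            return false;
        /* Any other outer codepoint is not propagated. */
        *out = ECN_NOT_ECT;
        return true;
    }

    /* Inner packet is ECN-capable: carry a congestion mark inward. */
    if (outer == ECN_CE) {
        *out = ECN_CE;
        return true;
    }

    /* Simplification: remaining combinations keep the inner field. */
    *out = inner;
    return true;
}
```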
  4. 25 Sep, 2012 · 13 commits
  5. 23 Sep, 2012 · 2 commits
  6. 20 Sep, 2012 · 4 commits
    • ipv6: unify fragment thresh handling code · 6b102865
      Amerigo Wang committed
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static inline · d4915c08
      Amerigo Wang committed
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: unify conntrack reassembly expire code with standard one · b836c99f
      Amerigo Wang committed
      Two years ago, Shan Wei tried to fix this:
      http://patchwork.ozlabs.org/patch/43905/
      
      The problem is that RFC2460 says an ICMP Time
      Exceeded -- Fragment Reassembly Time Exceeded message should be
      sent to the source of that fragment if the defragmentation
      times out.
      
      "
         If insufficient fragments are received to complete reassembly of a
         packet within 60 seconds of the reception of the first-arriving
         fragment of that packet, reassembly of that packet must be
         abandoned and all the fragments that have been received for that
         packet must be discarded.  If the first fragment (i.e., the one
         with a Fragment Offset of zero) has been received, an ICMP Time
         Exceeded -- Fragment Reassembly Time Exceeded message should be
         sent to the source of that fragment.
      "
      
      As Herbert suggested, we could actually use the standard IPv6
      reassembly code which follows RFC2460.
      
      With this patch applied, I can see ICMP Time Exceeded sent
      from the receiver when the sender sends out only 3/4 of a
      fragmented IPv6 UDP packet.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
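The RFC2460 paragraph quoted in this commit boils down to a small decision at expiry time. The sketch below models it with hypothetical names (`struct frag_queue_state`, `reasm_expire()`); it is not the kernel's reassembly code and leaves out the actual timers and the ICMP transmission.

```c
#include <stdbool.h>

/* Reassembly timeout from the quoted RFC text, in seconds. */
#define REASM_TIMEOUT_SECONDS 60

/* Hypothetical snapshot of one reassembly queue. */
struct frag_queue_state {
    unsigned int age_seconds;   /* time since the first-arriving fragment */
    bool complete;              /* all fragments have been received */
    bool have_first_fragment;   /* fragment with offset zero was received */
};

enum expire_action {
    NO_ACTION,                  /* finished, or still within the timeout */
    DISCARD_SILENTLY,           /* drop fragments, no ICMP error */
    DISCARD_AND_SEND_ICMP       /* drop fragments, send Time Exceeded */
};

/* Decide what to do with a queue when its timer is checked. */
static enum expire_action reasm_expire(const struct frag_queue_state *q)
{
    if (q->complete || q->age_seconds < REASM_TIMEOUT_SECONDS)
        return NO_ACTION;

    /* Timed out: the error is only sent if the offset-zero fragment
     * was received. */
    return q->have_first_fragment ? DISCARD_AND_SEND_ICMP
                                  : DISCARD_SILENTLY;
}
```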
    • ipv6: add a new namespace for nf_conntrack_reasm · c038a767
      Amerigo Wang committed
      As pointed out by Michal, it is necessary to add a new
      namespace for the nf_conntrack_reasm code. This prepares
      for the second patch.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 19 Sep, 2012 · 7 commits
  8. 18 Sep, 2012 · 3 commits
    • userns: Convert the audit loginuid to be a kuid · e1760bd5
      Eric W. Biederman committed
      Always store audit loginuids in type kuid_t.
      
      Print loginuids by converting them into uids in the appropriate user
      namespace, and then printing the resulting uid.
      
      Modify audit_get_loginuid to return a kuid_t.
      
      Modify audit_set_loginuid to take a kuid_t.
      
      Modify /proc/<pid>/loginuid on read to convert the loginuid into the
      user namespace of the opener of the file.
      
      Modify /proc/<pid>/loginuid on write to convert the loginuid
      from the user namespace of the opener of the file.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com> ?
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • include/net/sock.h: squelch compiler warning in sk_rmem_schedule() · 35c448a8
      Chuck Lever committed
      This warning:
      
        In file included from linux/include/linux/tcp.h:227:0,
                         from linux/include/linux/ipv6.h:221,
                         from linux/include/net/ipv6.h:16,
                         from linux/include/linux/sunrpc/clnt.h:26,
                         from linux/net/sunrpc/stats.c:22:
        linux/include/net/sock.h: In function `sk_rmem_schedule':
        linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
      
      is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the
      -Wextra option.
      
      Commit c76562b6 ("netvm: prevent a stream-specific deadlock")
      accidentally replaced the "size" parameter of sk_rmem_schedule() with an
      unsigned int.  This changes the semantics of the comparison in the
      return statement.
      
      In sk_wmem_schedule we have syntactically the same comparison, but
      "size" is a signed integer.  In addition, __sk_mem_schedule() takes a
      signed integer for its "size" parameter, so there is an implicit type
      conversion in sk_rmem_schedule() anyway.
      
      Revert the "size" parameter back to a signed integer so that the
      semantics of the expressions in both sk_[rw]mem_schedule() are exactly
      the same.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
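The change in semantics mentioned in this commit comes from C's usual arithmetic conversions: when a signed and an unsigned int are compared, the signed operand is converted to unsigned. The sketch below uses hypothetical helpers (`fits_signed()`, `fits_unsigned()`) and a hypothetical signed `budget` value, not the real sk_rmem_schedule(). The cast in the second helper spells out the conversion the compiler would otherwise do implicitly, which is what -Wsign-compare warns about.

```c
#include <stdbool.h>

/* Both operands signed: a negative budget never satisfies the check. */
static bool fits_signed(int size, int budget)
{
    return size <= budget;
}

/* size is unsigned, so budget is converted to unsigned before comparing.
 * The cast makes that conversion explicit; a negative budget turns into
 * a very large value and the check passes unexpectedly. */
static bool fits_unsigned(unsigned int size, int budget)
{
    return size <= (unsigned int)budget;
}
```

For non-negative budgets the two helpers agree; they only diverge once the signed value goes below zero.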
  9. 15 Sep, 2012 · 4 commits
    • cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Daniel Wagner committed
      WARNING: With this change it is no longer possible to load externally
      built controllers.
      
      In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
      set, corresponding subsys_id should also be a constant. Up to now,
      net_prio_subsys_id and net_cls_subsys_id would be of the type int and
      the value would be assigned during runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_id will have constant value. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary at the RCU part, which was introduced by
      the following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        This code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      This comes without an explicit explanation of why the RCU part is
      needed. (The rcu_dereference() was fixed by changing it to
      rcu_dereference_index_check() in a later commit, but that is a minor
      detail.)

      So here is my reasoning about why it was introduced and why it is safe
      to remove it now. Note that this code was copied over to net_prio, so
      the reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that the id is assigned only after the module
      is loaded and the subsystem is registered.

      Before unregistering the module, all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing a
      synchronize_rcu(). Any new readers won't call task_subsys_state()
      anymore, and therefore it is safe to unregister the subsystem.

      The new code relies on the same trick, but it looks at the subsys
      pointer returned by task_subsys_state() (remember the id is constant
      and therefore we always have a valid index into the subsys
      array).

      No precautions need to be taken during module
      loading. Eventually, all CPUs will get a valid pointer back from
      task_subsys_state() because rebind_subsystem(), which is called after
      the module init() function, will assign subsys[net_cls_subsys_id] the
      newly loaded module subsystem pointer.

      When the subsystem is about to be removed, rebind_subsystem() will be
      called before the module exit() function. In this case,
      rebind_subsystem() will assign subsys[net_cls_subsys_id] a NULL pointer
      and then call synchronize_rcu(). All old readers have left the
      critical section by then. Any new reader won't access the subsystem
      anymore.  At this point we are safe to unregister the subsystem. No
      synchronize_rcu() call is needed.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
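The difference between the old scheme (a runtime id that may be -1) and the new one (a constant id with a pointer that may be NULL) can be shown with a single-threaded toy model. Everything here is hypothetical: `NET_CLS_ID`, `subsys[]`, `struct cls_state` and `lookup_classid()` are illustration-only names, the fallback value 0 is an assumption, and the RCU read-side locking and synchronize_rcu() discussed in the commit message are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical compile-time subsystem ids: always valid array indices. */
enum { NET_CLS_ID, SUBSYS_COUNT };

struct cls_state {
    unsigned int classid;
};

/* Slot is NULL while the (modular) subsystem is not bound. */
static struct cls_state *subsys[SUBSYS_COUNT];

/* Reader side: the index is a constant, so only the pointer is checked.
 * Assumption: 0 is returned while the subsystem is unavailable. */
static unsigned int lookup_classid(void)
{
    struct cls_state *s = subsys[NET_CLS_ID];

    return s ? s->classid : 0;
}
```

Binding corresponds to storing a pointer in the slot and unbinding to storing NULL; in the kernel those stores and the reader's load are ordered by RCU, which this sketch does not attempt to model.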
    • cgroup: net_prio: Do not define task_netprioidx() when not selected · 51e4e7fa
      Daniel Wagner committed
      task_netprioidx() should not be defined in case the configuration is
      CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
      net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
      When net_prio is not built at all, any caller should only get an empty
      task_netprioidx() without any references to net_prio_subsys_id.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
    • cgroup: net_cls: Do not define task_cls_classid() when not selected · 8fb974c9
      Daniel Wagner committed
      task_cls_classid() should not be defined in case the configuration is
      CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
      net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
      When net_cls is not built at all, a caller should only get an empty
      task_cls_classid() without any references to net_cls_subsys_id.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
    • cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h · f3419807
      Daniel Wagner committed
      The only user of sock_update_classid() is net/socket.c which happens
      to include cls_cgroup.h directly.
      
      tj: Fix build breakage due to missing cls_cgroup.h inclusion in
          drivers/net/tun.c reported in linux-next by Stephen.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
  10. 11 Sep, 2012 · 1 commit
  11. 10 Sep, 2012 · 1 commit
    • mac80211: add key flag for management keys · e548c49e
      Johannes Berg committed
      Mark keys that might be used to receive management
      frames so drivers can fall back on software crypto
      for them if they don't support hardware offload.
      As the new flag is only set correctly for RX keys
      and the existing IEEE80211_KEY_FLAG_SW_MGMT flag
      can only affect TX, also rename the latter to
      IEEE80211_KEY_FLAG_SW_MGMT_TX.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>