1. 18 4月, 2017 1 次提交
  2. 14 4月, 2017 2 次提交
  3. 10 3月, 2017 1 次提交
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
  4. 08 3月, 2017 1 次提交
  5. 02 3月, 2017 1 次提交
  6. 25 12月, 2016 1 次提交
  7. 18 12月, 2016 1 次提交
  8. 15 11月, 2016 1 次提交
  9. 06 7月, 2016 1 次提交
  10. 11 4月, 2016 1 次提交
  11. 15 12月, 2015 1 次提交
    • H
      net: add validation for the socket syscall protocol argument · 79462ad0
      Hannes Frederic Sowa 提交于
      郭永刚 reported that one could simply crash the kernel as root by
      using a simple program:
      
      	int socket_fd;
      	struct sockaddr_in addr;
      	addr.sin_port = 0;
      	addr.sin_addr.s_addr = INADDR_ANY;
      	addr.sin_family = 10;
      
      	socket_fd = socket(10,3,0x40000000);
      	connect(socket_fd , &addr,16);
      
      AF_INET, AF_INET6 sockets actually only support 8-bit protocol
      identifiers. inet_sock's skc_protocol field thus is sized accordingly,
      thus larger protocol identifiers simply cut off the higher bits and
      store a zero in the protocol fields.
      
      This could lead to e.g. NULL function pointer because as a result of
      the cut off inet_num is zero and we call down to inet_autobind, which
      is NULL for raw sockets.
      
      kernel: Call Trace:
      kernel:  [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
      kernel:  [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
      kernel:  [<ffffffff81645069>] SYSC_connect+0xd9/0x110
      kernel:  [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
      kernel:  [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
      kernel:  [<ffffffff81645e0e>] SyS_connect+0xe/0x10
      kernel:  [<ffffffff81779515>] tracesys_phase2+0x84/0x89
      
      I found no particular commit which introduced this problem.
      
      CVE: CVE-2015-8543
      Cc: Cong Wang <cwang@twopensource.com>
      Reported-by: N郭永刚 <guoyonggang@360.cn>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79462ad0
  12. 02 12月, 2015 1 次提交
    • E
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet 提交于
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cd3e072
  13. 17 10月, 2015 1 次提交
    • A
      netfilter: turn NF_HOOK into an inline function · 008027c3
      Arnd Bergmann 提交于
      A recent change to the dst_output handling caused a new warning
      when the call to NF_HOOK() is the only used of a local variable
      passed as 'dev', and CONFIG_NETFILTER is disabled:
      
      net/ipv6/ip6_output.c: In function 'ip6_output':
      net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]
      
      The reason for this is that the NF_HOOK macro in this case does
      not reference the variable at all, and the call to dev_net(dev)
      got removed from the ip6_output function. To avoid that warning now
      and in the future, this changes the macro into an equivalent
      inline function, which tells the compiler that the variable is
      passed correctly but still unused.
      
      The dn_forward function apparently had the same problem in
      the past and added a local workaround that no longer works
      with the inline function. In order to avoid a regression, we
      have to also remove the #ifdef from decnet in the same patch.
      
      Fixes: ede2059d ("dst: Pass net into dst->output")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      008027c3
  14. 08 10月, 2015 2 次提交
  15. 19 9月, 2015 1 次提交
  16. 18 9月, 2015 3 次提交
    • E
      netfilter: Pass net into okfn · 0c4b51f0
      Eric W. Biederman 提交于
      This is immediately motivated by the bridge code that chains functions that
      call into netfilter.  Without passing net into the okfns the bridge code would
      need to guess about the best expression for the network namespace to process
      packets in.
      
      As net is frequently one of the first things computed in continuation functions
      after netfilter has done it's job passing in the desired network namespace is in
      many cases a code simplification.
      
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c4b51f0
    • E
      netfilter: Pass struct net into the netfilter hooks · 29a26a56
      Eric W. Biederman 提交于
      Pass a network namespace parameter into the netfilter hooks.  At the
      call site of the netfilter hooks the path a packet is taking through
      the network stack is well known which allows the network namespace to
      be easily and reliabily.
      
      This allows the replacement of magic code like
      "dev_net(state->in?:state->out)" that appears at the start of most
      netfilter hooks with "state->net".
      
      In almost all cases the network namespace passed in is derived
      from the first network device passed in, guaranteeing those
      paths will not see any changes in practice.
      
      The exceptions are:
      xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
      ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
      ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
      ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
      ipv6/ip6_output.c:ip6_xmit()			sock_net(sk)
      ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
      ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
      br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
      
      In all cases these exceptions seem to be a better expression for the
      network namespace the packet is being processed in then the historic
      "dev_net(in?in:out)".  I am documenting them in case something odd
      pops up and someone starts trying to track down what happened.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29a26a56
    • E
      net: Merge dst_output and dst_output_sk · 5a70649e
      Eric W. Biederman 提交于
      Add a sock paramter to dst_output making dst_output_sk superfluous.
      Add a skb->sk parameter to all of the callers of dst_output
      Have the callers of dst_output_sk call dst_output.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a70649e
  17. 10 9月, 2015 1 次提交
    • P
      net: ipv6: use common fib_default_rule_pref · f53de1e9
      Phil Sutter 提交于
      This switches IPv6 policy routing to use the shared
      fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
      multicast routing for IPv4 as well as IPv6.
      
      The motivation for this patch is a complaint about iproute2 behaving
      inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
      IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
      assigned priority value was decreased with each rule added.
      
      Since then all users of the default_pref field have been converted to
      assign the generic function fib_default_rule_pref(), fib_nl_newrule()
      may just use it directly instead. Therefore get rid of the function
      pointer altogether and make fib_default_rule_pref() static, as it's not
      used outside fib_rules.c anymore.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f53de1e9
  18. 11 5月, 2015 1 次提交
  19. 08 4月, 2015 1 次提交
    • D
      netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller 提交于
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one the encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7026b1dd
  20. 05 4月, 2015 1 次提交
  21. 03 4月, 2015 1 次提交
    • W
      net: move fib_rules_unregister() under rtnl lock · 419df12f
      WANG Cong 提交于
      We have to hold rtnl lock for fib_rules_unregister()
      otherwise the following race could happen:
      
      fib_rules_unregister():	fib_nl_delrule():
      ...				...
      ...				ops = lookup_rules_ops();
      list_del_rcu(&ops->list);
      				list_for_each_entry(ops->rules) {
      fib_rules_cleanup_ops(ops);	  ...
        list_del_rcu();		  list_del_rcu();
      				}
      
      Note, net->rules_mod_lock is actually not needed at all,
      either upper layer netns code or rtnl lock guarantees
      we are safe.
      
      Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      419df12f
  22. 10 3月, 2015 1 次提交
  23. 07 3月, 2015 1 次提交
    • E
      DECnet: Only use neigh_ops for adding the link layer header · aaa4e704
      Eric W. Biederman 提交于
      Other users users of the neighbour table use neigh->output as the method
      to decided when and which link-layer header to place on a packet.
      DECnet has been using neigh->output to decide which DECnet headers to
      place on a packet depending which neighbour the packet is destined for.
      
      The DECnet usage isn't totally wrong but it can run into problems if the
      neighbour output function is run for a second time as the teql driver
      and the bridge netfilter code can do.
      
      Therefore to avoid pathologic problems later down the line and make the
      neighbour code easier to understand by refactoring the decnet output
      code to only use a neighbour method to add a link layer header to a
      packet.
      
      This is done by moving the neigbhour operations lookup from
      dn_to_neigh_output to dn_neigh_output_packet.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aaa4e704
  24. 04 3月, 2015 1 次提交
    • E
      neigh: Factor out ___neigh_lookup_noref · 60395a20
      Eric W. Biederman 提交于
      While looking at the mpls code I found myself writing yet another
      version of neigh_lookup_noref.  We currently have __ipv4_lookup_noref
      and __ipv6_lookup_noref.
      
      So to make my work a little easier and to make it a smidge easier to
      verify/maintain the mpls code in the future I stopped and wrote
      ___neigh_lookup_noref.  Then I rewote __ipv4_lookup_noref and
      __ipv6_lookup_noref in terms of this new function.  I tested my new
      version by verifying that the same code is generated in
      ip_finish_output2 and ip6_finish_output2 where these functions are
      inlined.
      
      To get to ___neigh_lookup_noref I added a new neighbour cache table
      function key_eq.  So that the static size of the key would be
      available.
      
      I also added __neigh_lookup_noref for people who want to to lookup
      a neighbour table entry quickly but don't know which neibhgour table
      they are going to look up.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60395a20
  25. 03 3月, 2015 2 次提交
  26. 24 2月, 2015 1 次提交
  27. 19 1月, 2015 1 次提交
  28. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  29. 06 1月, 2015 1 次提交
    • D
      net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann 提交于
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided to expose this implementation detail to user space, thus
      instead, we chose the netlink attribute that is being exchanged between
      user space to be the actual congestion control algorithm name, similarly
      as in the setsockopt(2) API in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead, we first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than worth it.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea697639
  30. 24 11月, 2014 2 次提交
  31. 12 11月, 2014 1 次提交
  32. 23 8月, 2014 3 次提交