1. 07 12月, 2015 1 次提交
    • R
      core: enable more fine-grained datagram reception control · ea3793ee
      Rainer Weikusat 提交于
      The __skb_recv_datagram routine in core/ datagram.c provides a general
      skb reception factility supposed to be utilized by protocol modules
      providing datagram sockets. It encompasses both the actual recvmsg code
      and a surrounding 'sleep until data is available' loop. This is
      inconvenient if a protocol module has to use additional locking in order
      to maintain some per-socket state the generic datagram socket code is
      unaware of (as the af_unix code does). The patch below moves the recvmsg
      proper code into a new __skb_try_recv_datagram routine which doesn't
      sleep and renames wait_for_more_packets to
      __skb_wait_for_more_packets, both routines being exported interfaces. The
      original __skb_recv_datagram routine is reimplemented on top of these
      two functions such that its user-visible behaviour remains unchanged.
      Signed-off-by: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea3793ee
  2. 06 12月, 2015 3 次提交
  3. 04 12月, 2015 18 次提交
    • J
      tipc: fix node reference count bug · dc8d1eb3
      Jon Paul Maloy 提交于
      Commit 5405ff6e ("tipc: convert node lock to rwlock")
      introduced a bug to the node reference counter handling. When a
      message is successfully sent in the function tipc_node_xmit(),
      we return directly after releasing the node lock, instead of
      continuing and decrementing the node reference counter as we
      should do.
      
      This commit fixes this bug.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc8d1eb3
    • G
      pppox: use standard module auto-loading feature · 681b4d88
      Guillaume Nault 提交于
      * Register PF_PPPOX with pppox module rather than with pppoe,
          so that pppoe doesn't get loaded for any PF_PPPOX socket.
      
        * Register PX_PROTO_* with standard MODULE_ALIAS_NET_PF_PROTO()
          instead of using pppox's own naming scheme.
      
        * While there, add auto-loading feature for pptp.
      Signed-off-by: NGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      681b4d88
    • A
      VSOCK: Add Makefile and Kconfig · 8a2a2029
      Asias He 提交于
      Enable virtio-vsock and vhost-vsock.
      Signed-off-by: NAsias He <asias@redhat.com>
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a2a2029
    • A
      VSOCK: Introduce virtio-vsock.ko · 32e61b06
      Asias He 提交于
      VM sockets virtio transport implementation. This module runs in guest
      kernel.
      Signed-off-by: NAsias He <asias@redhat.com>
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32e61b06
    • A
      VSOCK: Introduce virtio-vsock-common.ko · 80a19e33
      Asias He 提交于
      This module contains the common code and header files for the following
      virtio-vsock and virtio-vhost kernel modules.
      Signed-off-by: NAsias He <asias@redhat.com>
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80a19e33
    • A
    • R
      mpls: support for dead routes · c89359a4
      Roopa Prabhu 提交于
      Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls
      routes due to link events. Also adds code to ignore dead
      routes during route selection.
      
      Unlike ip routes, mpls routes are not deleted when the route goes
      dead. This is current mpls behaviour and this patch does not change
      that. With this patch however, routes will be marked dead.
      dead routes are not notified to userspace (this is consistent with ipv4
      routes).
      
      dead routes:
      -----------
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      
      $ip link set dev swp1 down
      
      $ip link show dev swp1
      4: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode
      DEFAULT group default qlen 1000
          link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
      
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1 dead linkdown
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      
      linkdown routes:
      ----------------
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      
      $ip link show dev swp1
      4: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
      state UP mode DEFAULT group default qlen 1000
          link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
      
      /* carrier goes down */
      $ip link show dev swp1
      4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
      state DOWN mode DEFAULT group default qlen 1000
          link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
      
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1 linkdown
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c89359a4
    • E
      net_sched: fix qdisc_tree_decrease_qlen() races · 4eaf3b84
      Eric Dumazet 提交于
      qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
      devices.
      
      One problem is that it updates sch->q.qlen and sch->qstats.drops
      on the mq/mqprio root qdisc, while it should not : Daniele
      reported underflows errors :
      [  681.774821] PAX: sch->q.qlen: 0 n: 1
      [  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
      [  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
      [  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
      [  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
      [  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
      [  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
      [  681.774962] Call Trace:
      [  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
      [  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
      [  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
      [  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
      [  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
      [  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
      [  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
      [  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
      [  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
      [  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
      [  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
      [  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
      [  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
      [  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
      
      mq/mqprio have their own ways to report qlen/drops by folding stats on
      all their queues, with appropriate locking.
      
      A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
      without proper locking : concurrent qdisc updates could corrupt the list
      that qdisc_match_from_root() parses to find a qdisc given its handle.
      
      Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
      qdisc_tree_decrease_qlen() can use to abort its tree traversal,
      as soon as it meets a mq/mqprio qdisc children.
      
      Second problem can be fixed by RCU protection.
      Qdisc are already freed after RCU grace period, so qdisc_list_add() and
      qdisc_list_del() simply have to use appropriate rcu list variants.
      
      A future patch will add a per struct netdev_queue list anchor, so that
      qdisc_tree_decrease_qlen() can have more efficient lookups.
      Reported-by: NDaniele Fucini <dfucini@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eaf3b84
    • P
      net: ipv6: restrict hop_limit sysctl setting to range [1; 255] · d6df198d
      Phil Sutter 提交于
      Setting a value bigger than 255 resulted in using only the lower eight
      bits of that value as it is assigned to the u8 header field. To avoid
      this unexpected result, reject such values.
      
      Setting a value of zero is technically possible, but hosts receiving
      such a packet have to treat it like hop_limit was set to one, according
      to RFC2460. Therefore I don't see a use-case for that.
      
      Setting a route's hop_limit to zero in iproute2 means to use the sysctl
      default, which is not the case here: Setting e.g.
      net.conf.eth0.hop_limit=0 will not make the kernel use
      net.conf.all.hop_limit for outgoing packets on eth0. To avoid these
      kinds of confusion, reject zero.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6df198d
    • P
      openvswitch: fix hangup on vxlan/gre/geneve device deletion · 13175303
      Paolo Abeni 提交于
      Each openvswitch tunnel vport (vxlan,gre,geneve) holds a reference
      to the underlying tunnel device, but never released it when such
      device is deleted.
      Deleting the underlying device via the ip tool cause the kernel to
      hangup in the netdev_wait_allrefs() loop.
      This commit ensure that on device unregistration dp_detach_port_notify()
      is called for all vports that hold the device reference, properly
      releasing it.
      
      Fixes: 614732ea ("openvswitch: Use regular VXLAN net_device device")
      Fixes: b2acd1dc ("openvswitch: Use regular GRE net_device instead of vport")
      Fixes: 6b001e68 ("openvswitch: Use Geneve device.")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NFlavio Leitner <fbl@sysclose.org>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13175303
    • A
      ipv4: igmp: Allow removing groups from a removed interface · 4eba7bb1
      Andrew Lunn 提交于
      When a multicast group is joined on a socket, a struct ip_mc_socklist
      is appended to the sockets mc_list containing information about the
      joined group.
      
      If the interface is hot unplugged, this entry becomes stale. Prior to
      commit 52ad353a ("igmp: fix the problem when mc leave group") it
      was possible to remove the stale entry by performing a
      IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
      the interface. However, this fix enforces that the interface must
      still exist. Thus with time, the number of stale entries grows, until
      sysctl_igmp_max_memberships is reached and then it is not possible to
      join and more groups.
      
      The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is
      performed without specifying the interface, either by ifindex or ip
      address. However here we do supply one of these. So loosen the
      restriction on device existence to only apply when the interface has
      not been specified. This then restores the ability to clean up the
      stale entries.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Fixes: 52ad353a "(igmp: fix the problem when mc leave group")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eba7bb1
    • E
      ipv6: sctp: implement sctp_v6_destroy_sock() · 602dd62d
      Eric Dumazet 提交于
      Dmitry Vyukov reported a memory leak using IPV6 SCTP sockets.
      
      We need to call inet6_destroy_sock() to properly release
      inet6 specific fields.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      602dd62d
    • J
      net: introduce change lower state notifier · 04d48266
      Jiri Pirko 提交于
      When lower device like bonding slave, team/bridge port, etc changes its
      state, it is useful for others to notice this change. Currently this is
      implemented specificly for bonding as NETDEV_BONDING_INFO notifier. This
      patch aims to replace this specific usage and make this more generic to
      be used for all upper-lower devices.
      
      Introduce NETDEV_CHANGELOWERSTATE netdev notifier type and
      netdev_lower_state_changed() helper.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04d48266
    • J
      net: add possibility to pass information about upper device via notifier · 29bf24af
      Jiri Pirko 提交于
      Sometimes the drivers and other code would find it handy to know some
      internal information about upper device being changed. So allow upper-code
      to pass information down to notifier listeners during linking.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29bf24af
    • J
      net: propagate upper priv via netdev_master_upper_dev_link · 6dffb044
      Jiri Pirko 提交于
      Eliminate netdev_master_upper_dev_link_private and pass priv directly as
      a parameter of netdev_master_upper_dev_link.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dffb044
    • I
      net: Check CHANGEUPPER notifier return value · b03804e7
      Ido Schimmel 提交于
      switchdev drivers reflect the newly requested topology to hardware when
      CHANGEUPPER is received, after software links were already formed.
      However, the operation can fail and user will not be notified, as the
      return value of the notifier is not checked.
      
      Add this check and rollback software links if necessary.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b03804e7
    • E
      ipv6: kill sk_dst_lock · 6bd4f355
      Eric Dumazet 提交于
      While testing the np->opt RCU conversion, I found that UDP/IPv6 was
      using a mixture of xchg() and sk_dst_lock to protect concurrent changes
      to sk->sk_dst_cache, leading to possible corruptions and crashes.
      
      ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
      way to fix the mess is to remove sk_dst_lock completely, as we did for
      IPv4.
      
      __ip6_dst_store() and ip6_dst_store() share same implementation.
      
      sk_setup_caps() being called with socket lock being held or not,
      we have to use sk_dst_set() instead of __sk_dst_set()
      
      Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
      in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
      
      This is because ip6_dst_store() can be called from process context,
      without any lock held.
      
      As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
      from another cpu doing a concurrent ip6_dst_store()
      
      Doing the dst dereference before doing the install is needed to make
      sure no use after free would trigger.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bd4f355
    • E
      ipv6: sctp: add rcu protection around np->opt · c836a8ba
      Eric Dumazet 提交于
      This patch completes the work I did in commit 45f6fad8
      ("ipv6: add complete rcu protection around np->opt"), as I missed
      sctp part.
      
      This simply makes sure np->opt is used with proper RCU locking
      and accessors.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c836a8ba
  4. 03 12月, 2015 8 次提交
  5. 02 12月, 2015 4 次提交
    • E
      net: fix sock_wake_async() rcu protection · ceb5d58b
      Eric Dumazet 提交于
      Dmitry provided a syzkaller (http://github.com/google/syzkaller)
      triggering a fault in sock_wake_async() when async IO is requested.
      
      Said program stressed af_unix sockets, but the issue is generic
      and should be addressed in core networking stack.
      
      The problem is that by the time sock_wake_async() is called,
      we should not access the @flags field of 'struct socket',
      as the inode containing this socket might be freed without
      further notice, and without RCU grace period.
      
      We already maintain an RCU protected structure, "struct socket_wq"
      so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
      is the safe route.
      
      It also reduces number of cache lines needing dirtying, so might
      provide a performance improvement anyway.
      
      In followup patches, we might move remaining flags (SOCK_NOSPACE,
      SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
      being mostly read and let it being shared between cpus.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ceb5d58b
    • E
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet 提交于
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cd3e072
    • N
      Revert "ipv6: ndisc: inherit metadata dst when creating ndisc requests" · 304d888b
      Nicolas Dichtel 提交于
      This reverts commit ab450605.
      
      In IPv6, we cannot inherit the dst of the original dst. ndisc packets
      are IPv6 packets and may take another route than the original packet.
      
      This patch breaks the following scenario: a packet comes from eth0 and
      is forwarded through vxlan1. The encapsulated packet triggers an NS
      which cannot be sent because of the wrong route.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      304d888b
    • R
      unix: use wq_has_sleeper in unix_dgram_recvmsg · 77b75f4d
      Rainer Weikusat 提交于
      The current unix_dgram_recvmsg does a wake up for every received
      datagram. This seems wasteful as only SOCK_DGRAM client sockets in an
      n:1 association with a server socket will ever wait because of the
      associated condition. The patch below changes the function such that the
      wake up only happens if wq_has_sleeper indicates that someone actually
      wants to be notified. Testing with SOCK_SEQPACKET and SOCK_DGRAM socket
      seems to confirm that this is an improvment.
      Signed-Off-By: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77b75f4d
  6. 01 12月, 2015 6 次提交