1. 25 Nov 2016, 3 commits
    • devlink: Add E-Switch inline mode control · 59bfde01
      Committed by Roi Dayan
      Some HWs need the VF driver to put part of the packet headers on the
      TX descriptor so the e-switch can do proper matching and steering.
      
      The supported modes: none, link, network, transport.
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59bfde01
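      
      A minimal sketch of the uAPI this introduces, inferred from the mode
      names in the message (exact identifiers assumed, not verbatim):
      
      	enum devlink_eswitch_inline_mode {
      		DEVLINK_ESWITCH_INLINE_MODE_NONE,      /* nothing on the TX descriptor */
      		DEVLINK_ESWITCH_INLINE_MODE_LINK,      /* L2 headers */
      		DEVLINK_ESWITCH_INLINE_MODE_NETWORK,   /* up to L3 headers */
      		DEVLINK_ESWITCH_INLINE_MODE_TRANSPORT, /* up to L4 headers */
      	};
      
      	/* drivers expose get/set hooks via struct devlink_ops, roughly: */
      	int (*eswitch_inline_mode_get)(struct devlink *devlink, u8 *p_inline_mode);
      	int (*eswitch_inline_mode_set)(struct devlink *devlink, u8 inline_mode);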
    • net: Add net-device param to the get offloaded stats ndo · 3df5b3c6
      Committed by Or Gerlitz
      Some drivers need to check a few internal matters for
      that. To be used in a downstream mlx5 commit.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3df5b3c6
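      
      For context, a sketch of the ndo after this change; the net_device
      parameter is the addition (exact prototypes assumed):
      
      	/* in struct net_device_ops, roughly: */
      	bool (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id);
      	int  (*ndo_get_offload_stats)(int attr_id,
      				      const struct net_device *dev,
      				      void *attr_data);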
    • tcp: enhance tcp_collapse_retrans() with skb_shift() · f8071cde
      Committed by Eric Dumazet
      In commit 2331ccc5 ("tcp: enhance tcp collapsing"),
      we made a first step, allowing copying of the right skb into the
      left skb's head.
      
      Since all skbs in the socket write queue are headless (except
      possibly the very first one), this strategy often does not work.
      
      This patch extends tcp_collapse_retrans() to perform frag shifting,
      thanks to the skb_shift() helper.
      
      The helper therefore must not BUG on non-headless skbs, as callers
      are fine with that.
      
      Tested:
      
      The following packetdrill test now passes:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.099 < . 1:1(0) ack 201 win 257
      +.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401>
         +0 > P. 201:1001(800) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8071cde
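      
      The core of the change, sketched (assuming the usual skb helpers; not
      the verbatim patch):
      
      	/* in tcp_collapse_retrans(), roughly: copy into the linear area
      	 * when it fits, otherwise shift frags from next_skb into skb.
      	 */
      	if (next_skb_size <= skb_availroom(skb))
      		skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
      			      next_skb_size);
      	else if (!skb_shift(skb, next_skb, next_skb_size))
      		return false;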
  2. 20 Nov 2016, 2 commits
  3. 19 Nov 2016, 3 commits
  4. 18 Nov 2016, 2 commits
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Committed by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and
      thus is an unsigned entity. Using a negative value is an out-of-bounds
      access by definition.
      
      2)
      On x86_64, unsigned 32-bit data that is mixed with pointers
      via array indexing, or via offsets added to or subtracted from
      pointers, is preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      The patch shaves ~1730 bytes off an allyesconfig kernel (excluding all
      the junk that messes with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the
      register allocator being used differently: gcc decides that some
      variable needs to live in the new r8+ registers, so every access now
      requires a REX prefix. Or it is shifted into r12, so the [r12+0]
      addressing mode has to be used, which is longer than [r8].
      
      However, the overall balance is negative:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d03a00
    • net: check dead netns for peernet2id_alloc() · cfc44a4d
      Committed by WANG Cong
      Andrei reports that we still allocate a netns ID from the idr after
      we destroy it in cleanup_net().
      
      cleanup_net():
        ...
        idr_destroy(&net->netns_ids);
        ...
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_exit_list(ops, &net_exit_list);
            -> rollback_registered_many()
              -> rtmsg_ifinfo_build_skb()
               -> rtnl_fill_ifinfo()
                 -> peernet2id_alloc()
      
      After that point we should not even access net->netns_ids; we should
      check for the death of the current netns as early as we can in
      peernet2id_alloc().
      
      For net-next we can consider avoiding sending the rtmsg entirely;
      that would be a good optimization for the netns teardown path.
      
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Reported-by: Andrei Vagin <avagin@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfc44a4d
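      
      A sketch of the early check this adds (return value taken from the
      netns uAPI; exact placement assumed):
      
      	/* in peernet2id_alloc(), roughly: bail out before touching
      	 * net->netns_ids if the netns is already being destroyed.
      	 */
      	if (atomic_read(&net->count) == 0)
      		return NETNSA_NSID_NOT_ASSIGNED;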
  5. 17 Nov 2016, 3 commits
    • netpoll: more efficient locking · 89c4b442
      Committed by Eric Dumazet
      Callers of netpoll_poll_lock() own NAPI_STATE_SCHED.
      
      Callers of netpoll_poll_unlock() have BH blocked between
      NAPI_STATE_SCHED being cleared and poll_lock being released.
      
      We can avoid the spinlock, which has no contention, and instead use
      cmpxchg() on poll_owner, which we need to set anyway.
      
      This removes a possible lockdep violation after the cited commit,
      since sk_busy_loop() re-enables BH before calling busy_poll_stop().
      
      Fixes: 217f6974 ("net: busy-poll: allow preemption in sk_busy_loop()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89c4b442
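      
      A sketch of the lock-free ownership handoff (assuming the existing
      poll_owner field, where -1 means "no owner"):
      
      	/* netpoll_poll_lock(), roughly: spin with cmpxchg() instead of
      	 * taking a spinlock we know is uncontended.
      	 */
      	int owner = smp_processor_id();
      
      	while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
      		cpu_relax();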
    • net: busy-poll: return busypolling status to drivers · 364b6055
      Committed by Eric Dumazet
      NAPI drivers use napi_complete_done() or napi_complete() when they
      have drained the RX ring, right before re-enabling device interrupts.
      
      In busy polling, we can avoid interrupts being delivered, since
      we are polling the RX ring in a controlled loop.
      
      Drivers can choose to use the napi_complete_done() return value
      to reduce interrupt overhead while busy polling is active.
      
      This is optional; legacy drivers will work fine even
      if not updated.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Adam Belay <abelay@google.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
      Cc: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      364b6055
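      
      A sketch of the driver-side pattern this enables; the helpers marked
      hypothetical are stand-ins for a real driver's functions:
      
      	static int my_drv_poll(struct napi_struct *napi, int budget)
      	{
      		int work_done = my_drv_clean_rx(napi, budget); /* hypothetical */
      
      		if (work_done < budget) {
      			/* returns false while busy polling owns the napi,
      			 * in which case we skip re-arming the interrupt.
      			 */
      			if (napi_complete_done(napi, work_done))
      				my_drv_enable_irq(napi); /* hypothetical */
      		}
      		return work_done;
      	}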
    • net: busy-poll: allow preemption in sk_busy_loop() · 217f6974
      Committed by Eric Dumazet
      After commit 4cd13c21 ("softirq: Let ksoftirqd do its job"),
      sk_busy_loop() needs a bit of care:
      softirqs might be delayed, since we do not allow preemption yet.
      
      This patch adds preemption points in sk_busy_loop(),
      and makes sure no unnecessary cache line dirtying
      or atomic operations are done while looping.
      
      A new flag is added to napi->state: NAPI_STATE_IN_BUSY_POLL.
      
      This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED,
      so that sk_busy_loop() does not have to grab it again.
      
      Similarly, netpoll_poll_lock() is done only once.
      
      This gives about a 10 to 20% improvement in various busy polling
      tests, especially when many threads are busy polling in
      configurations with a large number of NIC queues.
      
      This should allow experimenting with bigger delays without
      hurting overall latencies.
      
      Tested:
       On a 40Gb mlx4 NIC, 32 RX/TX queues.
      
       echo 70 >/proc/sys/net/core/busy_read
       for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done
      
          Before:      After:
       1:   90072   92819
       2:  157289  184007
       3:  235772  213504
       4:  344074  357513
       5:  394755  458267
       6:  461151  487819
       7:  549116  625963
       8:  544423  716219
       9:  720460  738446
      10:  794686  837612
      11:  915998  923960
      12:  937507  925107
      13: 1019677  971506
      14: 1046831 1113650
      15: 1114154 1148902
      16: 1105221 1179263
      17: 1266552 1299585
      18: 1258454 1383817
      19: 1341453 1312194
      20: 1363557 1488487
      21: 1387979 1501004
      22: 1417552 1601683
      23: 1550049 1642002
      24: 1568876 1601915
      25: 1560239 1683607
      26: 1640207 1745211
      27: 1706540 1723574
      28: 1638518 1722036
      29: 1734309 1757447
      30: 1782007 1855436
      31: 1724806 1888539
      32: 1717716 1944297
      33: 1778716 1869118
      34: 1805738 1983466
      35: 1815694 2020758
      36: 1893059 2035632
      37: 1843406 2034653
      38: 1888830 2086580
      39: 1972827 2143567
      40: 1877729 2181851
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Adam Belay <abelay@google.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
      Cc: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      217f6974
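      
      A sketch of how the new flag keeps NAPI_STATE_SCHED held across loop
      iterations, combined with the return-status change above (exact code
      assumed):
      
      	/* in napi_complete_done(), roughly: while busy polling owns the
      	 * napi, do not clear SCHED -- sk_busy_loop() will poll us again.
      	 */
      	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
      				 NAPIF_STATE_IN_BUSY_POLL)))
      		return false;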
  6. 16 Nov 2016, 2 commits
  7. 15 Nov 2016, 1 commit
  8. 13 Nov 2016, 2 commits
  9. 10 Nov 2016, 5 commits
    • net: napi_hash_add() is no longer exported · 149d6ad8
      Committed by Eric Dumazet
      There are no more users outside net/core/dev.c, so
      napi_hash_add() can now be static.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      149d6ad8
    • ipv6: sr: add support for SRH encapsulation and injection with lwtunnels · 6c8702c6
      Committed by David Lebrun
      This patch creates a new type of interfaceless lightweight tunnel (SEG6),
      enabling the encapsulation and injection of SRH within locally emitted
      packets and forwarded packets.
      
      From a configuration viewpoint, a seg6 tunnel would be configured as follows:
      
        ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0
      
      Any packet whose destination address is fc00::1 would thus be encapsulated
      within an outer IPv6 header containing the SRH with three segments, and would
      actually be routed to the first segment of the list. If 'mode inline' were
      specified instead of 'mode encap', the SRH would be inserted directly
      after the IPv6 header, without outer encapsulation.
      
      The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. The
      feature was made configurable because direct header insertion may break
      several mechanisms, such as PMTUD or IPsec AH.
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c8702c6
    • rtnl: reset calcit fptr in rtnl_unregister() · f567e950
      Committed by Mathias Krause
      To avoid having dangling function pointers left behind, reset calcit in
      rtnl_unregister(), too.
      
      This is no issue so far, as only the rtnl core registers a netlink
      handler with a calcit hook, and that one won't be unregistered; but it
      may become an issue if new code makes use of the calcit hook.
      
      Fixes: c7ac8679 ("rtnetlink: Compute and store minimum ifinfo...")
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f567e950
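      
      The fix, sketched (handler table layout as in rtnetlink.c of that era,
      assumed):
      
      	/* in rtnl_unregister(), roughly: */
      	rtnl_msg_handlers[protocol][msgindex].doit   = NULL;
      	rtnl_msg_handlers[protocol][msgindex].dumpit = NULL;
      	rtnl_msg_handlers[protocol][msgindex].calcit = NULL; /* was left dangling */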
    • net-gro: avoid reorders · d61d072e
      Committed by Eric Dumazet
      Receiving a GSO packet in dev_gro_receive() is not uncommon
      with stacked devices, or with devices partially implementing LRO/GRO
      like bnx2x. GRO then performs the aggregation the device
      was not able to do itself.
      
      The current code causes reorders, as in the following case:
      
      For a given flow where the sender sent four packets P1, P2, P3, P4:
      
      The receiver might receive P1 as a single packet, which is stored in the GRO engine.
      
      Then P2-P4 are received as a single GSO packet and immediately given to
      the upper stack, while P1 is still held in the GRO engine.
      
      This patch makes sure P1 is given to the upper stack, then P2-P4
      immediately after.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d61d072e
    • net: core: add missing check for uid_range in rule_exists. · 35b80733
      Committed by Lorenzo Colitti
      Without this check, it is not possible to create two rules that
      are identical except for their UID ranges. For example:
      
      root@net-test:/# ip rule add prio 1000 lookup 300
      root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300
      RTNETLINK answers: File exists
      root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100
      root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200
      root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100
      RTNETLINK answers: File exists
      
      Tested: https://android-review.googlesource.com/#/c/299980/
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35b80733
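      
      A sketch of the missing comparison (assuming the uid_range field added
      by the FRA_UID_RANGE series below):
      
      	/* in rule_exists(), roughly: rules differing only in UID range
      	 * are not duplicates.
      	 */
      	if (!uid_eq(r->uid_range.start, rule->uid_range.start) ||
      	    !uid_eq(r->uid_range.end, rule->uid_range.end))
      		continue;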
  10. 08 Nov 2016, 3 commits
    • sock: do not set sk_err in sock_dequeue_err_skb · f5f99309
      Committed by Soheil Hassas Yeganeh
      Do not set sk_err when dequeuing errors from the error queue.
      Doing so results in:
      a) Bugs: By overwriting existing sk_err values, it possibly
         hides legitimate errors. It is also incorrect when local
         errors are queued with ip_local_error. That happens in the
         context of a system call, which already returns the error
         code.
      b) Inconsistent behavior: When there are pending errors on
         the error queue, sk_err is sometimes 0 (e.g., for
         the first timestamp on the error queue) and sometimes
         set to an error code (after dequeuing the first
         timestamp).
      c) Suboptimality: Setting sk_err to ENOMSG on simple
         TX timestamps can abort parallel reads and writes.
      
      Removing this line doesn't break userspace. This is because
      userspace code cannot rely on sk_err for detecting whether
      there is something on the error queue. Except for ICMP messages
      received for UDP and RAW, sk_err is not set at enqueue time,
      and as a result sk_err can be 0 while there are plenty of
      errors on the error queue.
      
      For ICMP packets in UDP and RAW, sk_err is set when they are
      enqueued on the error queue, but that does not result in aborting
      reads and writes. For such cases, sk_err is only readable via
      getsockopt(SO_ERROR) which will reset the value of sk_err on
      its own. More importantly, prior to this patch,
      recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
      sk_err is set by sock_dequeue_err_skb without atomic ops or
      locks) which can store 0 in sk_err even when we have ICMP
      messages pending. Removing this line from sock_dequeue_err_skb
      eliminates that race.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5f99309
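      
      The change itself is a one-line deletion; sketched, the dequeue path
      no longer stores to sk_err (not the verbatim function):
      
      	/* sock_dequeue_err_skb(), roughly: */
      	spin_lock_irqsave(&q->lock, flags);
      	skb = __skb_dequeue(q);
      	if (skb && (skb_next = skb_peek(q)))
      		err = SKB_EXT_ERR(skb_next)->ee.ee_errno;
      	spin_unlock_irqrestore(&q->lock, flags);
      
      	/* removed: sk->sk_err = err;  -- the racy, lockless store */
      	if (err)
      		sk->sk_error_report(sk);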
    • net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len · 11597084
      Committed by Jesper Dangaard Brouer
      The flag IFF_NO_QUEUE marks virtual device drivers that don't need a
      default qdisc attached, given they will be backed by a physical device
      that already has a qdisc attached for pushback.
      
      It is still supported to attach a qdisc to an IFF_NO_QUEUE device, as
      this can be useful for different policy reasons (e.g. bandwidth-limiting
      containers).  For this to work, tx_queue_len needs to have
      a sane value, because some qdiscs inherit/copy the tx_queue_len
      (namely pfifo, bfifo, gred, htb, plug and sfb).
      
      Commit a813104d ("IFF_NO_QUEUE: Fix for drivers not calling
      ether_setup()") caught situations where some drivers didn't initialize
      tx_queue_len.  The problem with that commit was choosing 1 as the
      fallback value.
      
      A qdisc queue length of 1 causes more harm than good, because it
      creates hard-to-debug situations for userspace: it gives userspace a
      false sense of a working config after attaching a qdisc. Low-volume
      traffic that doesn't activate the qdisc policy, like ping, works,
      while traffic that e.g. needs shaping cannot reach the configured
      policy levels, given the queue length is too small.
      
      This patch changes the value to DEFAULT_TX_QUEUE_LEN, given that other
      IFF_NO_QUEUE devices (those that call ether_setup()) also use this value.
      
      Fixes: a813104d ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11597084
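      
      The new fallback, sketched (exact location in register_netdevice()
      assumed):
      
      	/* roughly: give IFF_NO_QUEUE devices a sane default instead of 1,
      	 * so qdiscs that copy tx_queue_len (pfifo, bfifo, ...) behave.
      	 */
      	if (!dev->tx_queue_len) {
      		dev->priv_flags |= IFF_NO_QUEUE;
      		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; /* was 1 */
      	}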
    • udp: do fwd memory scheduling on dequeue · 7c13f97f
      Committed by Paolo Abeni
      A new argument is added to __skb_recv_datagram to provide
      an explicit skb destructor, invoked under the receive queue
      lock.
      The UDP protocol uses this argument to perform memory
      reclaiming on dequeue, so that the UDP protocol no longer
      sets skb->destructor.
      Instead, explicit memory reclaiming is performed at close() time and
      when skbs are removed from the receive queue.
      In-kernel UDP users now need to call the skb_recv_udp()
      variant instead of skb_recv_datagram() to properly perform
      memory accounting on dequeue.
      
      Overall, this allows acquiring the receive queue lock only once
      per dequeue.
      
      Tested using pktgen with random source ports and 64-byte packets at
      wire speed on a 10G link as the sender, with udp_sink as the receiver,
      using an L4-tuple rxhash to stress contention, and one or more
      udp_sink instances with reuseport.
      
      nr sinks	vanilla		patched
      1		440		560
      3		2150		2300
      6		3650		3800
      9		4450		4600
      12		6250		6450
      
      v1 -> v2:
       - do rmem and allocated memory scheduling under the receive lock
       - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
       - avoid the typedef for the dequeue callback
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c13f97f
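      
      A sketch of the in-kernel caller side (signature as introduced here,
      assumed):
      
      	/* e.g. in a kernel UDP user, roughly: */
      	struct sk_buff *skb;
      	int err;
      
      	skb = skb_recv_udp(sk, 0, 1 /* noblock */, &err);
      	if (!skb)
      		return err;
      	/* forward memory was already reclaimed under the queue lock */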
  11. 05 Nov 2016, 2 commits
    • net: core: add UID to flows, rules, and routes · 622ec2c9
      Committed by Lorenzo Colitti
      - Define a new FIB rule attribute, FRA_UID_RANGE, to describe a
        range of UIDs.
      - Define a RTA_UID attribute for per-UID route lookups and dumps.
      - Support passing these attributes to and from userspace via
        rtnetlink. The value INVALID_UID indicates no UID was
        specified.
      - Add a UID field to the flow structures.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      622ec2c9
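      
      Sketch of the FRA_UID_RANGE payload (struct name taken from the
      upstream uAPI, assumed):
      
      	struct fib_rule_uid_range {
      		__u32 start;
      		__u32 end;
      	};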
    • net: core: Add a UID field to struct sock. · 86741ec2
      Committed by Lorenzo Colitti
      Protocol sockets (struct sock) don't have UIDs, but most of the
      time, they map 1:1 to userspace sockets (struct socket) which do.
      
      Various operations such as the iptables xt_owner match need
      access to the "UID of a socket", and do so by following the
      backpointer to the struct socket. This involves taking
      sk_callback_lock and doesn't work when there is no socket
      because userspace has already called close().
      
      Simplify this by adding a sk_uid field to struct sock whose value
      matches the UID of the corresponding struct socket. The semantics
      are as follows:
      
      1. Whenever sk_socket is non-null: sk_uid is the same as the UID
         in sk_socket, i.e., matches the return value of sock_i_uid.
         Specifically, the UID is set when userspace calls socket(),
         fchown(), or accept().
      2. When sk_socket is NULL, sk_uid is defined as follows:
         - For a socket that no longer has a sk_socket because
           userspace has called close(): the previous UID.
         - For a cloned socket (e.g., an incoming connection that is
           established but on which userspace has not yet called
           accept): the UID of the socket it was cloned from.
         - For a socket that has never had an sk_socket: UID 0 inside
           the user namespace corresponding to the network namespace
           the socket belongs to.
      
      Kernel sockets created by sock_create_kern are a special case
      of #1 and sk_uid is the user that created them. For kernel
      sockets created at network namespace creation time, such as the
      per-processor ICMP and TCP sockets, this is the user that created
      the network namespace.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86741ec2
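      
      A sketch of a helper reflecting the NULL-sk_socket fallback in rule #2
      (helper name assumed):
      
      	/* read the socket UID, falling back to UID 0 in the netns'
      	 * user namespace when there is no socket.
      	 */
      	static inline kuid_t sock_net_uid(const struct net *net,
      					  const struct sock *sk)
      	{
      		return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
      	}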
  12. 04 Nov 2016, 1 commit
    • dccp: do not release listeners too soon · c3f24cfb
      Committed by Eric Dumazet
      Andrey Konovalov reported the following error while fuzzing with syzkaller:
      
      IPv4: Attempt to release alive inet socket ffff880068e98940
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff88006b9e0000 task.stack: ffff880068770000
      RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
      selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
      RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
      RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
      RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
      RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
      R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
      R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
      FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
      Stack:
       ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
       ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
       ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
      Call Trace:
       [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
       [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
       [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
       [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
       [<     inline     >] dst_input ./include/net/dst.h:507
       [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
       [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
       [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
       [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
       [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
       [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
       [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
       [<     inline     >] new_sync_write fs/read_write.c:499
       [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
       [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
       [<     inline     >] SYSC_write fs/read_write.c:607
       [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
       [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It turns out DCCP calls __sk_receive_skb(), and this broke when
      lookups no longer took a reference on listeners.
      
      Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
      so that sock_put() is used only when needed.
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3f24cfb
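      
      The fix, sketched (parameter as described above; exact prototype
      assumed):
      
      	int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
      			     const int nested, unsigned int trim_cap,
      			     bool refcounted);
      
      	/* ...and in its exit path, roughly: */
      	if (refcounted)
      		sock_put(sk); /* skipped for unreferenced listeners */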
  13. 01 Nov 2016, 6 commits
  14. 30 Oct 2016, 3 commits
  15. 28 Oct 2016, 2 commits
    • net: skip generating uevents for network namespaces that are exiting · 002d8a1a
      Committed by Andrey Vagin
      No one can see these events, because a network namespace cannot be
      destroyed while it still has sockets.
      
      Unlike other devices, uevents for network devices are generated
      only inside their network namespaces; they are filtered out in
      kobj_bcast_filter().
      
      My experiments show that net namespaces are destroyed more than 30%
      faster with this optimization.
      
      Here is a perf output for destroying network namespaces without this
      patch.
      
      -   94.76%     0.02%  kworker/u48:1  [kernel.kallsyms]     [k] cleanup_net
         - 94.74% cleanup_net
            - 94.64% ops_exit_list.isra.4
               - 41.61% default_device_exit_batch
                  - 41.47% unregister_netdevice_many
                     - rollback_registered_many
                        - 40.36% netdev_unregister_kobject
                           - 14.55% device_del
                              + 13.71% kobject_uevent
                           - 13.04% netdev_queue_update_kobjects
                              + 12.96% kobject_put
                           - 12.72% net_rx_queue_update_kobjects
                                kobject_put
                              - kobject_release
                                 + 12.69% kobject_uevent
                        + 0.80% call_netdevice_notifiers_info
               + 19.57% nfsd_exit_net
               + 11.15% tcp_net_metrics_exit
               + 8.25% rpcsec_gss_exit_net
      
      It is critical to optimize the exit path for network namespaces,
      because they are destroyed under net_mutex and many namespaces can be
      destroyed in one iteration.
      
      v2: use dev_set_uevent_suppress()
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      002d8a1a
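      
      A sketch of the v2 approach with dev_set_uevent_suppress() (placement
      in netdev_unregister_kobject() assumed):
      
      	/* roughly: nobody can see uevents from a dying netns,
      	 * so do not bother generating them.
      	 */
      	if (!atomic_read(&dev_net(ndev)->count))
      		dev_set_uevent_suppress(&ndev->dev, 1);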
    • flow_dissector: fix vlan tag handling · bc72f3dd
      Committed by Arnd Bergmann
      gcc warns about an uninitialized pointer dereference in the vlan
      priority handling:
      
      net/core/flow_dissector.c: In function '__skb_flow_dissect':
      net/core/flow_dissector.c:281:61: error: 'vlan' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      As pointed out by Jiri Pirko, the variable is never actually used
      without being initialized first, as the only way it ends up
      uninitialized is with skb_vlan_tag_present(skb) == true, and in that
      case it does not get accessed.
      
      However, the warning hints at some related issues that I'm addressing
      here:
      
      - the second check for the vlan tag is different from the first one
        that tests the skb for being NULL first, causing both the warning
        and a possible NULL pointer dereference that was not entirely fixed.
      - The same patch that introduced the NULL pointer check dropped an
        earlier optimization that skipped the repeated check of the
        protocol type.
      - The local '_vlan' variable is referenced through the 'vlan' pointer
        but the variable has gone out of scope by the time that it is
        accessed, causing undefined behavior
      
      Caching the result of the 'skb && skb_vlan_tag_present(skb)' check
      in a local variable allows the compiler to further optimize the
      later check. With those changes, the warning also disappears.
      
      Fixes: 3805a938 ("flow_dissector: Check skb for VLAN only if skb specified.")
      Fixes: d5709f7a ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Eric Garver <e@erig.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc72f3dd
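      
      A sketch of the caching described above (local variable name assumed):
      
      	/* in __skb_flow_dissect()'s VLAN case, roughly: */
      	bool vlan_tag_present = skb && skb_vlan_tag_present(skb);
      
      	if (vlan_tag_present)
      		proto = skb->protocol;
      
      	if (!vlan_tag_present || eth_type_vlan(skb->protocol)) {
      		vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
      					    data, hlen, &_vlan);
      		if (!vlan)
      			goto out_bad;
      	}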