1. 02 Oct, 2017 1 commit
  2. 01 Oct, 2017 3 commits
  3. 29 Sep, 2017 1 commit
  4. 28 Sep, 2017 2 commits
  5. 27 Sep, 2017 1 commit
  6. 26 Sep, 2017 2 commits
  7. 22 Sep, 2017 3 commits
    • net: prevent dst uses after free · 222d7dbd
      Committed by Eric Dumazet
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that the input path attached a non-refcounted dst
      to an skb. Later, because the packet is forwarded and hits
      skb_dst_force() before exiting the RCU section, we might try to take a
      refcount on a dst that is about to be freed, if another cpu saw the
      1 -> 0 transition in dst_release() and queued the dst for freeing after
      one RCU grace period.
      
      Let's unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against the dst refcount, and not
      assume it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: Paweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
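The "never go from 0 back to 1" rule described in this commit can be illustrated in user space. The following is a minimal sketch using C11 atomics, not the kernel code: `struct demo_dst`, `demo_dst_hold_safe()` and `demo_dst_release()` are hypothetical names, and the compare-and-swap loop only stands in for the kernel's `atomic_inc_not_zero()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted routing entry (not struct dst_entry). */
struct demo_dst {
	atomic_int refcnt;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns false when the object is already on its way to being freed,
 * so the count never moves from 0 back to 1.
 */
static bool demo_dst_hold_safe(struct demo_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller released the last one. */
static bool demo_dst_release(struct demo_dst *dst)
{
	int old = atomic_fetch_sub(&dst->refcnt, 1);

	assert(old > 0); /* releasing an unreferenced object is a bug */
	return old == 1;
}
```

A caller that gets `false` from `demo_dst_hold_safe()` must treat the pointer as gone rather than attach it to a packet.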
    • ipv4: Move fib_has_custom_local_routes outside of IP_MULTIPLE_TABLES. · a1f3316d
      Committed by David S. Miller
      > net/ipv4/fib_frontend.c: In function 'fib_validate_source':
      > net/ipv4/fib_frontend.c:411:16: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
      >    if (net->ipv4.fib_has_custom_local_routes)
      >                 ^
      > net/ipv4/fib_frontend.c: In function 'inet_rtm_newroute':
      > net/ipv4/fib_frontend.c:773:12: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
      >    net->ipv4.fib_has_custom_local_routes = true;
      >             ^
      
      Fixes: 6e617de8 ("net: avoid a full fib lookup when rp_filter is disabled.")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: avoid a full fib lookup when rp_filter is disabled. · 6e617de8
      Committed by Paolo Abeni
      Since commit 1dced6a8 ("ipv4: Restore accept_local behaviour
      in fib_validate_source()") a full fib lookup is needed even if
      the rp_filter is disabled, if accept_local is false - which is
      the default.
      
      What we really need in the above scenario is just checking
      that the source IP address is not local, and in most cases we
      can do that in a cheaper way by looking up the ifaddr hash table.
      
      This commit adds a helper for such lookup, and uses it to
      validate the src address when rp_filter is disabled and no
      'local' routes are created by the user space in the relevant
      namespace.
      
      A new ipv4 netns flag is added to account for such routes.
      We need that to preserve the same behavior we had before this
      patch.
      
      It also drops the checks to bail early from __fib_validate_source,
      added by the commit 1dced6a8 ("ipv4: Restore accept_local
      behaviour in fib_validate_source()"): they do not give any
      measurable performance improvement, because if we do the lookup we
      are already on a slower path.
      
      This improves UDP performance for unconnected sockets
      when rp_filter is disabled by 5% and also gives a small but
      measurable performance improvement for TCP flood scenarios.
      
      v1 -> v2:
       - use the ifaddr lookup helper in __ip_dev_find(), as suggested
         by Eric
       - fall-back to full lookup if custom local routes are present
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
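The fast path this commit describes can be sketched roughly as follows, assuming accept_local is off. Everything here is hypothetical and simplified: `struct demo_netns`, its fields, the tiny open-addressed table and the "full lookup" placeholder are illustrations, not the kernel's ifaddr hash or fib code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_SIZE 16

/* Hypothetical per-namespace state; field names are illustrative only. */
struct demo_netns {
	uint32_t local_addrs[DEMO_HASH_SIZE]; /* open-addressed set, 0 = empty */
	bool has_custom_local_routes;         /* set once user space adds a 'local' route */
	unsigned int full_lookups;            /* counts slow-path calls */
};

static size_t demo_hash(uint32_t addr)
{
	return (addr * 2654435761u) % DEMO_HASH_SIZE;
}

/* Insert a local address; the caller keeps the table less than full. */
static void demo_add_local(struct demo_netns *ns, uint32_t addr)
{
	size_t i = demo_hash(addr);

	assert(addr != 0); /* 0 marks an empty slot */
	while (ns->local_addrs[i] != 0 && ns->local_addrs[i] != addr)
		i = (i + 1) % DEMO_HASH_SIZE;
	ns->local_addrs[i] = addr;
}

/* Cheap check: is 'addr' one of our own addresses? */
static bool demo_addr_is_local(const struct demo_netns *ns, uint32_t addr)
{
	size_t i = demo_hash(addr);

	for (size_t n = 0; n < DEMO_HASH_SIZE; n++, i = (i + 1) % DEMO_HASH_SIZE) {
		if (ns->local_addrs[i] == 0)
			return false;
		if (ns->local_addrs[i] == addr)
			return true;
	}
	return false;
}

/* Placeholder for the expensive full route lookup. */
static bool demo_full_lookup_is_local(struct demo_netns *ns, uint32_t src)
{
	ns->full_lookups++;
	return demo_addr_is_local(ns, src);
}

/* Returns 0 when 'src' is acceptable, -1 when it is a local address. */
static int demo_validate_source(struct demo_netns *ns, uint32_t src, bool rp_filter)
{
	if (!rp_filter && !ns->has_custom_local_routes)
		return demo_addr_is_local(ns, src) ? -1 : 0;
	return demo_full_lookup_is_local(ns, src) ? -1 : 0;
}
```

The flag matters because a user-created 'local' route can make an address local without it appearing in the address table, so the sketch falls back to the slow path once the flag is set.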
  8. 20 Sep, 2017 5 commits
    • ipv4: speedup ipv6 tunnels dismantle · 64bc1781
      Committed by Eric Dumazet
      Implement exit_batch() method to dismantle more devices
      per round.
      
      (rtnl_lock() ...
       unregister_netdevice_many() ...
       rtnl_unlock())
      
      Tested:
      $ cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
      done
      wait ; grep net_namespace /proc/slabinfo
      
      Before patch :
      $ time ./add_del_unshare.sh
      net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0
      
      real    1m38.965s
      user    0m0.688s
      sys     0m37.017s
      
      After patch:
      $ time ./add_del_unshare.sh
      net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0
      
      real	0m22.117s
      user	0m0.728s
      sys	0m35.328s
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: addrlabel: per netns list · a90c9347
      Committed by Eric Dumazet
      Having a global list of labels does not scale to thousands of
      netns in the cloud era. This causes quadratic behavior on
      netns creation and deletion.
      
      It is time to have a per-netns list of ~10 labels.
      
      Tested:
      
      $ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]
      
      real    0m20.837s # instead of 0m24.227s
      user    0m0.328s
      sys     0m20.338s # instead of 0m23.753s
      
          16.17%       ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
          12.30%       ip  [kernel.kallsyms]  [k] netlink_has_listeners
           6.76%       ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
           5.78%       ip  [kernel.kallsyms]  [k] memset_erms
           5.77%       ip  [kernel.kallsyms]  [k] kobject_uevent_env
           5.18%       ip  [kernel.kallsyms]  [k] refcount_sub_and_test
           4.96%       ip  [kernel.kallsyms]  [k] _raw_read_lock
           3.82%       ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
           3.33%       ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
           2.11%       ip  [kernel.kallsyms]  [k] unmap_page_range
           1.77%       ip  [kernel.kallsyms]  [k] __wake_up
           1.69%       ip  [kernel.kallsyms]  [k] strlen
           1.17%       ip  [kernel.kallsyms]  [k] __wake_up_common
           1.09%       ip  [kernel.kallsyms]  [k] insert_header
           1.04%       ip  [kernel.kallsyms]  [k] page_remove_rmap
           1.01%       ip  [kernel.kallsyms]  [k] consume_skb
           0.98%       ip  [kernel.kallsyms]  [k] netlink_trim
           0.51%       ip  [kernel.kallsyms]  [k] kernfs_link_sibling
           0.51%       ip  [kernel.kallsyms]  [k] filemap_map_pages
           0.46%       ip  [kernel.kallsyms]  [k] memcpy_erms
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: no need to free qdisc in RCU callback · 752fbcc3
      Committed by Cong Wang
      The gen estimator has been rewritten in commit 1c0d32fd
      ("net_sched: gen_estimator: complete rewrite of rate estimators"),
      so the caller no longer needs to wait for a grace period. This
      patch gets rid of that wait.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: remove copy of master ethtool_ops · f5619866
      Committed by Vivien Didelot
      There is no need to store a copy of the master ethtool ops, storing the
      original pointer in DSA and the new one in the master netdev itself is
      enough.
      
      In the meantime, set orig_ethtool_ops to NULL when restoring the master
      ethtool ops and check the presence of the master original ethtool ops as
      well as its needed functions before calling them.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sk_buff rbnode reorg · bffa72cf
      Committed by Eric Dumazet
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan to use an RB tree for the TCP retransmit queue to speed
      up SACK processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 19 Sep, 2017 2 commits
  10. 16 Sep, 2017 1 commit
    • sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb
      Committed by Xin Long
      Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
      dump") tried to fix a use-after-free issue by checking !sctp_sk(sk)->ep
      while holding the sock and the sock lock.
      
      But Paolo noticed that the endpoint could be destroyed in sctp_rcv
      without sock lock protection. This means the use-after-free issue could
      still be triggered when sctp_rcv puts and destroys the ep after
      sctp_sock_dump checks !ep, although it's pretty hard to reproduce.
      
      I could reproduce it by adding an mdelay in sctp_rcv and a long msleep
      in sctp_close and sctp_sock_dump.
      
      This patch adds another param, cb_done, to sctp_for_each_transport
      and dumps ep->assocs while holding tsp after jumping out of the
      transport traversal, to avoid this issue.
      
      It also makes the sctp diag dump run faster, as there is no need
      to save sk into cb->args[5] and keep calling sctp_for_each_transport
      any more.
      
      This patch also uses int * instead of int for the pos argument
      in sctp_for_each_transport, so that the position is incremented only
      in sctp_for_each_transport and there is no need to keep changing
      cb->args[2] in sctp_sock_filter and sctp_sock_dump any more.
      
      Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 14 Sep, 2017 1 commit
    • sctp: potential read out of bounds in sctp_ulpevent_type_enabled() · fa5f7b51
      Committed by Dan Carpenter
      This code causes a static checker warning because Smatch doesn't trust
      anything that comes from skb->data.  I've reviewed this code and I do
      think skb->data can be controlled by the user here.
      
      The sctp_event_subscribe struct has 13 __u8 fields and we want to see
      if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
      We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
      either before the start of the struct or after the end.
      
      This is a very old bug and it's surprising that it would go undetected
      for so long but my theory is that it just doesn't have a big impact so
      it would be hard to notice.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
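The out-of-bounds read described here comes from indexing a 13-byte mask with `sn_type - base` where `sn_type` spans the full 16-bit range. A minimal sketch of a bounds-checked lookup follows; `struct demo_event_subscribe`, `DEMO_SN_TYPE_BASE` and `demo_event_type_enabled()` are hypothetical names modelled on the description, not the kernel definitions or the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SN_TYPE_BASE (1 << 15)

/* Hypothetical mask with 13 one-byte flags, mirroring the size described above. */
struct demo_event_subscribe {
	uint8_t flags[13];
};

/* Returns 1 if the event type is enabled, 0 if disabled or out of range. */
static int demo_event_type_enabled(uint16_t sn_type,
				   const struct demo_event_subscribe *mask)
{
	int offset = (int)sn_type - DEMO_SN_TYPE_BASE;

	assert(mask != NULL);
	/* Reject types below the base (negative offset) or past the last flag. */
	if (offset < 0 || (size_t)offset >= sizeof(mask->flags))
		return 0;
	return mask->flags[offset] != 0;
}
```

Treating an out-of-range type as "not enabled" is a design choice of this sketch; it keeps the function total without needing an error path.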
  12. 13 Sep, 2017 1 commit
    • net_sched: get rid of tcfa_rcu · d7fb60b9
      Committed by Cong Wang
      The gen estimator has been rewritten in commit 1c0d32fd
      ("net_sched: gen_estimator: complete rewrite of rate estimators"),
      so the caller no longer needs to wait for a grace period.
      This patch gets rid of that wait.
      
      This also completely closes a race condition between action free
      path and filter chain add/remove path for the following patch.
      Because otherwise the nested RCU callback can't be caught by
      rcu_barrier().
      
      Please see also the comments in code.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 09 Sep, 2017 1 commit
    • netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable" · e1bf1687
      Committed by Florian Westphal
      This reverts commit 870190a9.
      
      It was not a good idea. The custom hash table was a much better
      fit for this purpose.
      
      A fast lookup is not essential, in fact for most cases there is no lookup
      at all because original tuple is not taken and can be used as-is.
      What needs to be fast is insertion and deletion.
      
      rhlist removal however requires a rhlist walk.
      We can have thousands of entries in such a list if source port/addresses
      are reused for multiple flows, if this happens removal requests are so
      expensive that deletions of a few thousand flows can take several
      seconds(!).
      
      The advantages that we got from rhashtable are:
      1) table auto-sizing
      2) multiple locks
      
      1) would be nice to have, but it is not essential as we have at
      most one lookup per new flow, so even a million flows in the bysource
      table are not a problem compared to current deletion cost.
      2) is easy to add to custom hash table.
      
      I tried to add hlist_node to rhlist to speed up rhltable_remove but this
      isn't doable without changing semantics.  rhltable_remove_fast will
      check that the to-be-deleted object is part of the table and that
      requires a list walk that we want to avoid.
      
      Furthermore, using hlist_node increases size of struct rhlist_head, which
      in turn increases nf_conn size.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196821
      Reported-by: Ivan Babrou <ibobrik@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
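The point about removal cost can be made concrete with a small sketch, assuming chained buckets in the style of the kernel's hlist, where each node stores a pointer to whatever points at it. The `demo_` names are hypothetical and this is not the netfilter code; it only shows why such a node can be unlinked without walking the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal hlist-style node: 'pprev' points at the pointer that refers to this node. */
struct demo_hnode {
	struct demo_hnode *next;
	struct demo_hnode **pprev;
};

struct demo_hhead {
	struct demo_hnode *first;
};

/* Insert at the head of a bucket in O(1). */
static void demo_hlist_add_head(struct demo_hnode *n, struct demo_hhead *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink in O(1): no walk from the bucket head is required. */
static void demo_hlist_del(struct demo_hnode *n)
{
	assert(n->pprev != NULL); /* node must currently be on a list */
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}
```

With thousands of entries sharing one chain, deletion cost here stays constant per entry, whereas a removal that first has to find the entry by walking the list grows with the chain length.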
  14. 06 Sep, 2017 3 commits
    • net: dsa: Allow switch drivers to indicate number of TX queues · 55199df6
      Committed by Florian Fainelli
      Let switch drivers indicate how many TX queues they support. Some
      switches, such as Broadcom Starfighter 2 are designed with 8 egress
      queues. Future changes will allow us to leverage the queue mapping and
      direct the transmission towards a particular queue.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • flow_dissector: Cleanup control flow · 3a1214e8
      Committed by Tom Herbert
      __skb_flow_dissect is riddled with gotos that make discerning the flow,
      debugging, and extending the capability difficult. This patch
      reorganizes things so that we only perform goto's after the two main
      switch statements (no gotos within the cases now). It also eliminates
      several goto labels so that only two labels remain as goto targets.
      Reported-by: Alexander Popov <alex.popov@linux.com>
      Signed-off-by: Tom Herbert <tom@quantonium.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: fix ncsi_vlan_rx_{add,kill}_vid references · fd0c88b7
      Committed by Arnd Bergmann
      We get a new link error in allmodconfig kernels after ftgmac100
      started using the ncsi helpers:
      
      ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!
      ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!
      
      Related to that, we get another error when CONFIG_NET_NCSI is disabled:
      
      drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean 'ncsi_start_dev'?
      drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean 'ncsi_vlan_rx_add_vid'?
      
      This fixes both problems at once, using a 'static inline' stub helper
      for the disabled case, and exporting the functions when they are present.
      
      Fixes: 51564585 ("ftgmac100: Support NCSI VLAN filtering when available")
      Fixes: 21acf630 ("net/ncsi: Configure VLAN tag filter")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
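The 'static inline' stub pattern mentioned in this commit looks roughly like the sketch below. `DEMO_HAVE_NCSI`, `demo_vlan_rx_add_vid()` and `demo_driver_add_vid()` are hypothetical names, the single-argument signature is simplified, and the `-EINVAL` return value is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* DEMO_HAVE_NCSI is a hypothetical stand-in for a Kconfig-derived macro. */
#ifdef DEMO_HAVE_NCSI
/* Feature enabled: the real function lives in another object and must be exported. */
int demo_vlan_rx_add_vid(unsigned short vid);
#else
/* Feature disabled: a stub keeps callers compiling and linking. */
static inline int demo_vlan_rx_add_vid(unsigned short vid)
{
	(void)vid;
	return -EINVAL; /* assumed error code, for illustration only */
}
#endif

/* A caller that builds identically in both configurations. */
static int demo_driver_add_vid(unsigned short vid)
{
	assert(vid < 4096); /* VLAN IDs are 12 bits wide */
	return demo_vlan_rx_add_vid(vid);
}
```

Because the stub is defined in the header, the driver needs no `#ifdef` of its own and there is no undefined symbol left for the linker to report.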
  15. 05 Sep, 2017 1 commit
    • mac80211: fix VLAN handling with TXQs · 53168215
      Committed by Johannes Berg
      With TXQs, the AP_VLAN interfaces are resolved to their owner AP
      interface when enqueuing the frame, which makes sense since the
      frame really goes out on that as far as the driver is concerned.
      
      However, this introduces a problem: frames to be encrypted with
      a VLAN-specific GTK will now be encrypted with the AP GTK, since
      the information about which virtual interface to use to select
      the key is taken from the TXQ.
      
      Fix this by preserving info->control.vif and using that in the
      dequeue function. This now requires doing the driver-mapping
      in the dequeue as well.
      
      Since there's no way to filter the frames that are sitting on a
      TXQ, drop all frames, which may affect other interfaces, when an
      AP_VLAN is removed.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  16. 04 Sep, 2017 7 commits
  17. 02 Sep, 2017 2 commits
  18. 01 Sep, 2017 2 commits
    • devlink: Add IPv6 header for dpipe · 1797f5b3
      Committed by Arkadi Sharshevsky
      This will be used by the IPv6 host table which will be introduced in the
      following patches. The fields in the header are added per-use. This header
      is global and can be reused by many drivers.
      Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: add reverse binding for tc class · 07d79fc7
      Committed by Cong Wang
      TC filters when used as classifiers are bound to TC classes.
      However, there is a hidden difference when adding them in different
      orders:
      
      1. If we add tc classes before their filters, everything is fine.
         Logically, the classes exist before we specify their IDs in
         filters, so it is easy to bind them together, just as in the
         current code base.
      
      2. If we add tc filters before the tc classes they bind, we have to
         do dynamic lookup in fast path. What's worse, this happens all
         the time not just once, because on fast path tcf_result is passed
         on stack, there is no way to propagate back to the one in tc filters.
      
      This hidden difference hurts performance silently if we have many tc
      classes in hierarchy.
      
      This patch intends to close this gap by doing the reverse binding when
      we create a new class, in this case we can actually search all the
      filters in its parent, match and fixup by classid. And because
      tcf_result is specific to each type of tc filter, we have to introduce
      a new ops for each filter to tell how to bind the class.
      
      Note, we still can NOT totally get rid of those class lookups in
      ->enqueue() because cgroup and flow filters have no way to determine
      the classid at setup time; they still have to go through dynamic lookup.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
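The reverse binding idea can be sketched as a one-time fixup at class creation: walk the filters attached to the parent and cache a pointer to the new class in every filter that names its classid. The types and the function below are hypothetical simplifications; the real patch does this through a per-filter-type operation because each filter stores its result differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified class and filter records. */
struct demo_class {
	uint32_t classid;
};

struct demo_filter {
	uint32_t classid;         /* class this filter refers to */
	struct demo_class *bound; /* cached pointer, NULL until resolved */
};

/*
 * Called when a new class is created: bind every still-unbound filter
 * that names this classid. Returns the number of filters fixed up.
 */
static size_t demo_bind_new_class(struct demo_filter *filters, size_t n,
				  struct demo_class *cl)
{
	size_t bound = 0;

	assert(cl != NULL);
	for (size_t i = 0; i < n; i++) {
		if (filters[i].classid == cl->classid && filters[i].bound == NULL) {
			filters[i].bound = cl;
			bound++;
		}
	}
	return bound;
}
```

After this fixup, the fast path can follow the cached pointer instead of looking the class up by ID for every packet, regardless of the order in which classes and filters were added.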
  19. 31 Aug, 2017 1 commit