1. 29 9月, 2017 3 次提交
    • M
      bpf: Add name, load_time, uid and map_ids to bpf_prog_info · cb4d2b3f
      Martin KaFai Lau 提交于
      The patch adds name and load_time to struct bpf_prog_aux.  They
      are also exported to bpf_prog_info.
      
      The bpf_prog's name is passed by userspace during BPF_PROG_LOAD.
      The kernel only stores the first (BPF_PROG_NAME_LEN - 1) bytes
      and the name stored in the kernel is always \0 terminated.
      
      The kernel will reject name that contains characters other than
      isalnum() and '_'.  It will also reject name that is not null
      terminated.
      
      The existing 'user->uid' of the bpf_prog_aux is also exported to
      the bpf_prog_info as created_by_uid.
      
      The existing 'used_maps' of the bpf_prog_aux is exported to
      the newly added members 'nr_map_ids' and 'map_ids' of
      the bpf_prog_info.  On the input, nr_map_ids tells how
      big the userspace's map_ids buffer is.  On the output,
      nr_map_ids tells the exact user_map_cnt and it will only
      copy up to the userspace's map_ids buffer is allowed.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb4d2b3f
    • N
      net: bridge: add per-port group_fwd_mask with less restrictions · 5af48b59
      Nikolay Aleksandrov 提交于
      We need to be able to transparently forward most link-local frames via
      tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
      mask which restricts the forwarding of STP and LACP, but we need to be able
      to forward these over tunnels and control that forwarding on a per-port
      basis thus add a new per-port group_fwd_mask option which only disallows
      mac pause frames to be forwarded (they're always dropped anyway).
      The patch does not change the current default situation - all of the others
      are still restricted unless configured for forwarding.
      We have successfully tested this patch with LACP and STP forwarding over
      VxLAN and qinq tunnels.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5af48b59
    • A
      arp: make arp_hdr_len() return unsigned int · 6ade97da
      Alexey Dobriyan 提交于
      Negative ARP header length are not a thing.
      
      Constify arguments while I'm at it.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
      	function                        old     new   delta
      	arpt_do_table                  1163    1160      -3
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ade97da
  2. 28 9月, 2017 5 次提交
  3. 27 9月, 2017 4 次提交
    • J
      net: sched: introduce helper to identify gact pass action · 3b8e9238
      Jiri Pirko 提交于
      Introduce a helper called is_tcf_gact_pass which could be used to
      tell if the action is gact pass or not.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b8e9238
    • D
      bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann 提交于
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from it's
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de8f3a83
    • D
      bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6
      Daniel Borkmann 提交于
      Just do the rename into bpf_compute_data_pointers() as we'll add
      one more pointer here to recompute.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6aaae2b6
    • M
      qed: iWARP - Add check for errors on a SYN packet · 1e99c497
      Michal Kalderon 提交于
      A SYN packet which arrives with errors from FW should be dropped.
      This required adding an additional field to the ll2
      rx completion data.
      Signed-off-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: NAriel Elior <Ariel.Elior@cavium.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e99c497
  4. 26 9月, 2017 4 次提交
    • A
      neigh: make strucrt neigh_table::entry_size unsigned int · 01ccdf12
      Alexey Dobriyan 提交于
      Key length can't be negative.
      
      Leave comparisons against nla_len() signed just in case truncated attribute
      can sneak in there.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-7 (-7)
      	function                                     old     new   delta
      	pneigh_delete                                273     272      -1
      	mlx5e_rep_netevent_event                    1415    1414      -1
      	mlx5e_create_encap_header_ipv6              1194    1193      -1
      	mlx5e_create_encap_header_ipv4              1071    1070      -1
      	cxgb4_l2t_get                               1104    1103      -1
      	__pneigh_lookup                               69      68      -1
      	__neigh_create                              2452    2451      -1
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01ccdf12
    • A
      neigh: make struct neigh_table::entry_size unsigned int · e451ae8e
      Alexey Dobriyan 提交于
      Neigh entry size can't be negative.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/5 up/down: 0/-7 (-7)
      	function                                     old     new   delta
      	lowpan_neigh_construct                        25      24      -1
      	clip_seq_sub_iter                            152     151      -1
      	clip_ioctl                                  1475    1474      -1
      	clip_constructor                              93      92      -1
      	__neigh_create                              2455    2452      -3
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e451ae8e
    • P
      tun: enable napi_gro_frags() for TUN/TAP driver · 90e33d45
      Petar Penkov 提交于
      Add a TUN/TAP receive mode that exercises the napi_gro_frags()
      interface. This mode is available only in TAP mode, as the interface
      expects packets with Ethernet headers.
      
      Furthermore, packets follow the layout of the iovec_iter that was
      received. The first iovec is the linear data, and every one after the
      first is a fragment. If there are more fragments than the max number,
      drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
      dissector code and to verify that the header resides in the linear data.
      
      The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
      This is imposed because this mode is intended for testing via tools like
      syzkaller and packetdrill, and the increased flexibility it provides can
      introduce security vulnerabilities. This flag is accepted only if the
      device is in TAP mode and has the IFF_NAPI flag set as well. This is
      done because both of these are explicit requirements for correct
      operation in this mode.
      Signed-off-by: NPetar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: NMahesh Bandewar <maheshb@google,com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90e33d45
    • P
      tun: enable NAPI for TUN/TAP driver · 94317099
      Petar Penkov 提交于
      Changes TUN driver to use napi_gro_receive() upon receiving packets
      rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
      changes and operation is not affected if the flag is disabled.  SKBs
      are constructed upon packet arrival and are queued to be processed
      later.
      
      The new path was evaluated with a benchmark with the following setup:
      Open two tap devices and a receiver thread that reads in a loop for
      each device. Start one sender thread and pin all threads to different
      CPUs. Send 1M minimum UDP packets to each device and measure sending
      time for each of the sending methods:
      	napi_gro_receive():	4.90s
      	netif_rx_ni():		4.90s
      	netif_receive_skb():	7.20s
      Signed-off-by: NPetar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94317099
  5. 23 9月, 2017 1 次提交
  6. 22 9月, 2017 5 次提交
    • E
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet 提交于
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      222d7dbd
    • D
      ipv4: Move fib_has_custom_local_routes outside of IP_MULTIPLE_TABLES. · a1f3316d
      David S. Miller 提交于
      > net/ipv4/fib_frontend.c: In function 'fib_validate_source':
      > net/ipv4/fib_frontend.c:411:16: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
      >    if (net->ipv4.fib_has_custom_local_routes)
      >                 ^
      > net/ipv4/fib_frontend.c: In function 'inet_rtm_newroute':
      > net/ipv4/fib_frontend.c:773:12: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
      >    net->ipv4.fib_has_custom_local_routes = true;
      >             ^
      
      Fixes: 6e617de8 ("net: avoid a full fib lookup when rp_filter is disabled.")
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1f3316d
    • D
      Input: uinput - avoid FF flush when destroying device · e8b95728
      Dmitry Torokhov 提交于
      Normally, when input device supporting force feedback effects is being
      destroyed, we try to "flush" currently playing effects, so that the
      physical device does not continue vibrating (or executing other effects).
      Unfortunately this does not work well for uinput as flushing of the effects
      deadlocks with the destroy action:
      
      - if device is being destroyed because the file descriptor is being closed,
        then there is noone to even service FF requests;
      
      - if device is being destroyed because userspace sent UI_DEV_DESTROY,
        while theoretically it could be possible to service FF requests,
        userspace is unlikely to do so (they'd need to make sure FF handling
        happens on a separate thread) even if kernel solves the issue with FF
        ioctls deadlocking with UI_DEV_DESTROY ioctl on udev->mutex.
      
      To avoid lockups like the one below, let's install a custom input device
      flush handler, and avoid trying to flush force feedback effects when we
      destroying the device, and instead rely on uinput to shut off the device
      properly.
      
      NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
      ...
       <<EOE>>  [<ffffffff817a0307>] _raw_spin_lock_irqsave+0x37/0x40
       [<ffffffff810e633d>] complete+0x1d/0x50
       [<ffffffffa00ba08c>] uinput_request_done+0x3c/0x40 [uinput]
       [<ffffffffa00ba587>] uinput_request_submit.part.7+0x47/0xb0 [uinput]
       [<ffffffffa00bb62b>] uinput_dev_erase_effect+0x5b/0x76 [uinput]
       [<ffffffff815d91ad>] erase_effect+0xad/0xf0
       [<ffffffff815d929d>] flush_effects+0x4d/0x90
       [<ffffffff815d4cc0>] input_flush_device+0x40/0x60
       [<ffffffff815daf1c>] evdev_cleanup+0xac/0xc0
       [<ffffffff815daf5b>] evdev_disconnect+0x2b/0x60
       [<ffffffff815d74ac>] __input_unregister_device+0xac/0x150
       [<ffffffff815d75f7>] input_unregister_device+0x47/0x70
       [<ffffffffa00bac45>] uinput_destroy_device+0xb5/0xc0 [uinput]
       [<ffffffffa00bb2de>] uinput_ioctl_handler.isra.9+0x65e/0x740 [uinput]
       [<ffffffff811231ab>] ? do_futex+0x12b/0xad0
       [<ffffffffa00bb3f8>] uinput_ioctl+0x18/0x20 [uinput]
       [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
       [<ffffffff81337553>] ? security_file_ioctl+0x43/0x60
       [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
       [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      Reported-by: NRodrigo Rivas Costa <rodrigorivascosta@gmail.com>
      Reported-by: NClément VUCHENER <clement.vuchener@gmail.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=193741Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      e8b95728
    • F
      net: ethtool: Add back transceiver type · 19cab887
      Florian Fainelli 提交于
      Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      deprecated the ethtool_cmd::transceiver field, which was fine in
      premise, except that the PHY library was actually using it to report the
      type of transceiver: internal or external.
      
      Use the first word of the reserved field to put this __u8 transceiver
      field back in. It is made read-only, and we don't expect the
      ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
      is mostly for the legacy path where we do:
      
      ethtool_get_settings()
      -> dev->ethtool_ops->get_link_ksettings()
         -> convert_link_ksettings_to_legacy_settings()
      
      to have no information loss compared to the legacy get_settings API.
      
      Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19cab887
    • P
      net: avoid a full fib lookup when rp_filter is disabled. · 6e617de8
      Paolo Abeni 提交于
      Since commit 1dced6a8 ("ipv4: Restore accept_local behaviour
      in fib_validate_source()") a full fib lookup is needed even if
      the rp_filter is disabled, if accept_local is false - which is
      the default.
      
      What we really need in the above scenario is just checking
      that the source IP address is not local, and in most case we
      can do that is a cheaper way looking up the ifaddr hash table.
      
      This commit adds a helper for such lookup, and uses it to
      validate the src address when rp_filter is disabled and no
      'local' routes are created by the user space in the relevant
      namespace.
      
      A new ipv4 netns flag is added to account for such routes.
      We need that to preserve the same behavior we had before this
      patch.
      
      It also drops the checks to bail early from __fib_validate_source,
      added by the commit 1dced6a8 ("ipv4: Restore accept_local
      behaviour in fib_validate_source()") they do not give any
      measurable performance improvement: if we do the lookup with are
      on a slower path.
      
      This improves UDP performances for unconnected sockets
      when rp_filter is disabled by 5% and also gives small but
      measurable performance improvement for TCP flood scenarios.
      
      v1 -> v2:
       - use the ifaddr lookup helper in __ip_dev_find(), as suggested
         by Eric
       - fall-back to full lookup if custom local routes are present
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e617de8
  7. 21 9月, 2017 1 次提交
    • Y
      bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352
      Yonghong Song 提交于
      This patch fixes a bug exhibited by the following scenario:
        1. fd1 = perf_event_open with attr.config = ID1
        2. attach bpf program prog1 to fd1
        3. fd2 = perf_event_open with attr.config = ID1
           <this will be successful>
        4. user program closes fd2 and prog1 is detached from the tracepoint.
        5. user program with fd1 does not work properly as tracepoint
           no output any more.
      
      The issue happens at step 4. Multiple perf_event_open can be called
      successfully, but only one bpf prog pointer in the tp_event. In the
      current logic, any fd release for the same tp_event will free
      the tp_event->prog.
      
      The fix is to free tp_event->prog only when the closing fd
      corresponds to the one which registered the program.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec9dd352
  8. 20 9月, 2017 6 次提交
    • E
      ipv4: speedup ipv6 tunnels dismantle · 64bc1781
      Eric Dumazet 提交于
      Implement exit_batch() method to dismantle more devices
      per round.
      
      (rtnl_lock() ...
       unregister_netdevice_many() ...
       rtnl_unlock())
      
      Tested:
      $ cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
      done
      wait ; grep net_namespace /proc/slabinfo
      
      Before patch :
      $ time ./add_del_unshare.sh
      net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0
      
      real    1m38.965s
      user    0m0.688s
      sys     0m37.017s
      
      After patch:
      $ time ./add_del_unshare.sh
      net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0
      
      real	0m22.117s
      user	0m0.728s
      sys	0m35.328s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64bc1781
    • E
      ipv6: addrlabel: per netns list · a90c9347
      Eric Dumazet 提交于
      Having a global list of labels do not scale to thousands of
      netns in the cloud era. This causes quadratic behavior on
      netns creation and deletion.
      
      This is time having a per netns list of ~10 labels.
      
      Tested:
      
      $ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]
      
      real    0m20.837s # instead of 0m24.227s
      user    0m0.328s
      sys     0m20.338s # instead of 0m23.753s
      
          16.17%       ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
          12.30%       ip  [kernel.kallsyms]  [k] netlink_has_listeners
           6.76%       ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
           5.78%       ip  [kernel.kallsyms]  [k] memset_erms
           5.77%       ip  [kernel.kallsyms]  [k] kobject_uevent_env
           5.18%       ip  [kernel.kallsyms]  [k] refcount_sub_and_test
           4.96%       ip  [kernel.kallsyms]  [k] _raw_read_lock
           3.82%       ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
           3.33%       ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
           2.11%       ip  [kernel.kallsyms]  [k] unmap_page_range
           1.77%       ip  [kernel.kallsyms]  [k] __wake_up
           1.69%       ip  [kernel.kallsyms]  [k] strlen
           1.17%       ip  [kernel.kallsyms]  [k] __wake_up_common
           1.09%       ip  [kernel.kallsyms]  [k] insert_header
           1.04%       ip  [kernel.kallsyms]  [k] page_remove_rmap
           1.01%       ip  [kernel.kallsyms]  [k] consume_skb
           0.98%       ip  [kernel.kallsyms]  [k] netlink_trim
           0.51%       ip  [kernel.kallsyms]  [k] kernfs_link_sibling
           0.51%       ip  [kernel.kallsyms]  [k] filemap_map_pages
           0.46%       ip  [kernel.kallsyms]  [k] memcpy_erms
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a90c9347
    • C
      net_sched: no need to free qdisc in RCU callback · 752fbcc3
      Cong Wang 提交于
      gen estimator has been rewritten in commit 1c0d32fd
      ("net_sched: gen_estimator: complete rewrite of rate estimators"),
      the caller no longer needs to wait for a grace period. So this
      patch gets rid of it.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      752fbcc3
    • V
      net: dsa: remove copy of master ethtool_ops · f5619866
      Vivien Didelot 提交于
      There is no need to store a copy of the master ethtool ops, storing the
      original pointer in DSA and the new one in the master netdev itself is
      enough.
      
      In the meantime, set orig_ethtool_ops to NULL when restoring the master
      ethtool ops and check the presence of the master original ethtool ops as
      well as its needed functions before calling them.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5619866
    • E
      net: sk_buff rbnode reorg · bffa72cf
      Eric Dumazet 提交于
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan using an RB tree for TCP retransmit queue to speedup SACK
      processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bffa72cf
    • J
      ACPI / bus: Make ACPI_HANDLE() work for non-GPL code again · 9e987b70
      John Hubbard 提交于
      Due to commit db3e50f3 (device property: Get rid of struct
      fwnode_handle type field), ACPI_HANDLE() inadvertently became
      a GPL-only call. The call path that led to that was:
      
      ACPI_HANDLE()
          ACPI_COMPANION()
              to_acpi_device_node()
                  is_acpi_device_node()
                      acpi_device_fwnode_ops
                          DECLARE_ACPI_FWNODE_OPS(acpi_device_fwnode_ops);
      
      ...and the new DECLARE_ACPI_FWNODE_OPS() includes
      EXPORT_SYMBOL_GPL, whereas previously it was a static struct.
      
      In order to avoid changing any of that, let's instead provide ever
      so slightly better encapsulation of those struct fwnode_operations
      instances. Those do not really need to be directly used in
      inline function calls in header files. Simply moving two small
      functions (is_acpi_device_node and is_acpi_data_node) out of
      acpi_bus.h, and into a .c file, does that.
      
      That leaves the internals of struct fwnode_operations as GPL-only
      (which I think was the intent all along), but un-breaks any driver
      code out there that relies on the ACPI subsystem's being (historically)
      an EXPORT_SYMBOL-usable system. By that, I mean, ACPI_HANDLE() and
      other basic ACPI calls were non-GPL-protected.
      
      Also, while I'm there, remove a tiny bit of redundancy that was missed
      in the earlier commit, by having is_acpi_node() use the other two
      routines, instead of checking fwnode directly.
      
      Fixes: db3e50f3 (device property: Get rid of struct fwnode_handle type field)
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: NSakari Ailus <sakari.ailus@linux.intel.com>
      Acked-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      9e987b70
  9. 19 9月, 2017 3 次提交
  10. 18 9月, 2017 1 次提交
  11. 16 9月, 2017 1 次提交
    • X
      sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb
      Xin Long 提交于
      Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
      dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep
      with holding sock and sock lock.
      
      But Paolo noticed that endpoint could be destroyed in sctp_rcv without
      sock lock protection. It means the use-after-free issue still could be
      triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks
      !ep, although it's pretty hard to reproduce.
      
      I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close
      and sctp_sock_dump long time.
      
      This patch is to add another param cb_done to sctp_for_each_transport
      and dump ep->assocs with holding tsp after jumping out of transport's
      traversal in it to avoid this issue.
      
      It can also improve sctp diag dump to make it run faster, as no need
      to save sk into cb->args[5] and keep calling sctp_for_each_transport
      any more.
      
      This patch is also to use int * instead of int for the pos argument
      in sctp_for_each_transport, which could make postion increment only
      in sctp_for_each_transport and no need to keep changing cb->args[2]
      in sctp_sock_filter and sctp_sock_dump any more.
      
      Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d25adbeb
  12. 15 9月, 2017 5 次提交
  13. 14 9月, 2017 1 次提交
    • M
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko 提交于
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replace all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect shorterm
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was been brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ee931c4