1. 16 Jul, 2015 2 commits
  2. 11 Jul, 2015 2 commits
    • net: call rcu_read_lock early in process_backlog · 2c17d27c
      Committed by Julian Anastasov
      An incoming packet should be either in the backlog queue or
      in an RCU read-side section. Otherwise, the final sequence of
      flush_backlog() and synchronize_net() may miss packets
      that can run without a device reference:
      
      CPU 1                  CPU 2
                             skb->dev: no reference
                             process_backlog:__skb_dequeue
                             process_backlog:local_irq_enable
      
      on_each_cpu for
      flush_backlog =>       IPI(hardirq): flush_backlog
                             - packet not found in backlog
      
                             CPU delayed ...
      synchronize_net
      - no ongoing RCU
      read-side sections
      
      netdev_run_todo,
      rcu_barrier: no
      ongoing callbacks
                             __netif_receive_skb_core:rcu_read_lock
                             - too late
      free dev
                             process packet for freed dev
      
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: do not process device backlog during unregistration · e9e4dd32
      Committed by Julian Anastasov
      Commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where a processed packet comes from a device
      with a destroyed inetdev (dev->ip_ptr). This is not expected
      because inetdev_destroy is called in the NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). The above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: the backlog can keep
      packets for a long time and they do not hold a reference to
      the device. Such packets are then delivered to upper levels
      at the same time as the device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts for all packets from the backlog, but before that some
      packets continue to be delivered to upper levels long after the
      synchronize_net call, which is supposed to wait for the last
      ones. Also, as Eric pointed out, processed packets, mostly
      from other devices, can continue to add new packets to the backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use a netif_running check to make sure we do not add more
      packets to the backlog. We have to do it in enqueue_to_backlog
      context when the local IRQ is disabled. As a result, after the
      flush_backlog and synchronize_net sequence all packets
      should be accounted for.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
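The ordering described in the commit above (stop the device, flush its backlog, and refuse new packets for a device that is no longer running) can be illustrated with a small user-space model. This is only a sketch in Python, not the kernel code: the `Device` and `Backlog` classes and their method names are hypothetical, and the real check runs in enqueue_to_backlog with local IRQs disabled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for a network device."""
    name: str
    running: bool = True


@dataclass
class Backlog:
    """Hypothetical stand-in for a per-CPU backlog queue."""
    queue: deque = field(default_factory=deque)
    dropped: int = 0

    def enqueue(self, dev, payload):
        # Models the netif_running check: once the device is stopped,
        # no further packets for it may enter the backlog.
        if not dev.running:
            self.dropped += 1
            return False
        self.queue.append((dev, payload))
        return True

    def flush(self, dev):
        # Models flush_backlog: drop queued packets that belong to `dev`
        # and keep packets of all other devices.
        kept = deque(p for p in self.queue if p[0] is not dev)
        removed = len(self.queue) - len(kept)
        self.queue = kept
        self.dropped += removed
        return removed
```

With this model, stopping a device and then flushing leaves no packet of that device in the queue, and later enqueue attempts for it are rejected, which is the property the commit message relies on before synchronize_net() is called.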
  3. 10 Jul, 2015 3 commits
  4. 09 Jul, 2015 5 commits
  5. 01 Jul, 2015 1 commit
    • sock_diag: don't broadcast kernel sockets · b922622e
      Committed by Craig Gallek
      Kernel sockets do not hold a reference for the network namespace to
      which they point.  Socket destruction broadcasting relies on the
      network namespace and will cause the splat below when a kernel socket
      is destroyed.
      
      This fix simply ignores kernel sockets when they are destroyed.
      
      Reported as:
      general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1
      Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
      Stack:
       ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90
       ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90
       ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8
      Call Trace:
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag]
       [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160
       [<ffffffff8408ea97>] process_one_work+0x147/0x420
       [<ffffffff8408f0f9>] worker_thread+0x69/0x470
       [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0
       [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320
       [<ffffffff84093cd7>] kthread+0x107/0x120
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
       [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
      
      Tested:
        Using a debug kernel while 'ss -E' is running:
        ip netns add test-ns
        ip netns delete test-ns
      
      Fixes: eb4cb008 sock_diag: define destruction multicast groups
      Fixes: 26abe143 net: Modify sk_alloc to not reference count the
        netns of kernel sockets.
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 29 Jun, 2015 2 commits
  7. 23 Jun, 2015 1 commit
    • switchdev; add VLAN support for port's bridge_getlink · 7d4f8d87
      Committed by Scott Feldman
      One more missing piece of the puzzle.  Add vlan dump support to switchdev
      port's bridge_getlink.  iproute2 "bridge vlan show" cmd already knows how
      to show the vlans installed on the bridge and the device, but (until now)
      no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
      msg.  Before this patch, "bridge vlan show":
      
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlans
      		 57
      
      	sw1p1				<< device side vlans (missing)
      
      	sw1p2    57
      
      	sw1p2
      
      	sw1p3
      
      	sw1p4
      
      	br0     None
      
      (When the port is bridged, the output repeats the vlan list for the vlans
      on the bridge side of the port and the vlans on the device side of the
      port.  The listing above shows no vlans for the device side even though they
      are installed).
      
      After this patch:
      
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlan
      		 57
      
      	sw1p1    30-34			<< device side vlans
      		 57
      		 3840 PVID
      
      	sw1p2    57
      
      	sw1p2    57
      		 3840 PVID
      
      	sw1p3    3842 PVID
      
      	sw1p4    3843 PVID
      
      	br0     None
      
      I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
      switchdev support adds an obj dump for VLAN objects, using the same
      call-back scheme as FDB dump.  Support included for both compressed and
      un-compressed vlan dumps.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 22 Jun, 2015 1 commit
    • neigh: do not modify unlinked entries · 2c51a97f
      Committed by Julian Anastasov
      The lockless lookups can return an entry that is unlinked.
      Sometimes they get a reference before the last neigh_cleanup_and_release,
      sometimes they do not need a reference. Later, any
      modification attempts may result in the following problems:
      
      1. The entry is not destroyed immediately because neigh_update
      can start the timer for a dead entry, e.g. on change to the NUD_REACHABLE
      state. As a result, the entry lives for some time but is invisible
      and out of control.
      
      2. __neigh_event_send can run in parallel with neigh_destroy
      while refcnt=0, but if the timer is started and expires, refcnt can
      reach 0 a second time, leading to a second neigh_destroy and
      a possible crash.
      
      Thanks to Eric Dumazet and Ying Xue for their work and analysis
      on the __neigh_event_send change.
      
      Fixes: 767e97e1 ("neigh: RCU conversion of struct neighbour")
      Fixes: a263b309 ("ipv4: Make neigh lookups directly in output packet path.")
      Fixes: 6fd6ce20 ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
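The rule in the commit above (an entry that has already been unlinked must not be modified, so no timer is started for it) can be modelled in a few lines. This is a hedged sketch, not the kernel implementation: the `Neigh` class, its `dead` and `timer_armed` attributes, and both helper functions are hypothetical names chosen for illustration.

```python
class Neigh:
    """Toy neighbour entry; `dead` marks an entry already unlinked from the table."""

    def __init__(self):
        self.state = "NUD_STALE"
        self.dead = False
        self.timer_armed = False


def neigh_unlink(entry):
    # Models removal from the table: the entry may still be referenced
    # by a lockless reader, but it must not be revived.
    entry.dead = True
    entry.timer_armed = False


def neigh_update(entry, new_state):
    # Refuse to touch an unlinked entry, so a dead entry never gets
    # its timer started again.
    if entry.dead:
        return False
    entry.state = new_state
    if new_state == "NUD_REACHABLE":
        entry.timer_armed = True
    return True
```

A reader that obtained the entry before it was unlinked and then calls `neigh_update` gets `False` back and leaves the entry untouched, which avoids both problems listed in the commit message.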
  9. 16 Jun, 2015 5 commits
  10. 13 Jun, 2015 3 commits
  11. 12 Jun, 2015 1 commit
    • net: don't wait for order-3 page allocation · fb05e7a8
      Committed by Shaohua Li
      We saw excessive direct memory compaction triggered by skb_page_frag_refill.
      This causes performance issues and adds latency. Commit 5640f768
      introduced the order-3 allocation. According to its changelog, the order-3
      allocation is not a must-have; it is there to improve performance. But direct
      memory compaction has high overhead, and the benefit of the order-3 allocation
      cannot compensate for it.
      
      This patch makes the order-3 page allocation atomic. If there is no memory
      pressure and memory isn't fragmented, the allocation will still succeed, so we
      don't sacrifice the order-3 benefit here. If the atomic allocation fails,
      direct memory compaction will not be triggered and skb_page_frag_refill will
      fall back to order-0 immediately, hence the direct memory compaction overhead is
      avoided. In the allocation failure case, kswapd is woken up and does
      compaction, so chances are the allocation could succeed next time.
      
      alloc_skb_with_frags is handled the same way.
      
      The mellanox driver does a similar thing; if this is accepted, we must fix
      the driver too.
      
      V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
      V2: make the changelog clearer
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Debabrata Banerjee <dbavatar@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
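The allocation strategy in the commit above (one non-blocking order-3 attempt, then an immediate fall back to order-0) can be sketched as follows. This is a user-space illustration in Python under stated assumptions: `refill`, `FakeAllocator`, and the `may_block` argument are hypothetical and stand in for the kernel allocator and its flags; the assumption that the order-0 attempt keeps the caller's blocking behaviour is mine, not something the commit message spells out.

```python
PAGE_SIZE = 4096


class FakeAllocator:
    """Hypothetical allocator that records every request it receives."""

    def __init__(self, high_order_available):
        self.high_order_available = high_order_available
        self.calls = []

    def __call__(self, order, may_block):
        self.calls.append((order, may_block))
        if order > 0 and not self.high_order_available:
            return None  # fragmented memory: no contiguous run of pages
        return bytearray(PAGE_SIZE << order)


def refill(alloc, may_block=True, max_order=3):
    """Try one high-order allocation without blocking, then fall back to order 0."""
    if max_order > 0:
        # Atomic attempt: never waits, so it cannot trigger direct compaction.
        page = alloc(order=max_order, may_block=False)
        if page is not None:
            return page, max_order
    # Immediate fallback; assumed to keep the caller's own blocking behaviour.
    return alloc(order=0, may_block=may_block), 0
```

Calling `refill(FakeAllocator(high_order_available=False))` issues exactly two requests, a non-blocking order-3 one and then an order-0 one, and returns a single page. When high-order memory is available, the first request succeeds and the order-3 benefit is kept.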
  12. 11 Jun, 2015 2 commits
    • net/ethtool: Add current supported tunable options · a4244b0c
      Committed by Hadar Hen Zion
      Add strings array of the current supported tunable options.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap · 5d753610
      Committed by Mel Gorman
      Jeff Layton reported the following:
      
       [   74.232485] ------------[ cut here ]------------
       [   74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
       [   74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
       [   74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
       [   74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       [   74.245546]  0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
       [   74.246786]  0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
       [   74.248175]  0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
       [   74.249483] Call Trace:
       [   74.249872]  [<ffffffff8179263d>] dump_stack+0x45/0x57
       [   74.250703]  [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
       [   74.251655]  [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
       [   74.252585]  [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
       [   74.253519]  [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
       [   74.254537]  [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
       [   74.255610]  [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
       [   74.256582]  [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
       [   74.257496]  [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
       [   74.258318]  [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
       [   74.259253]  [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
       [   74.260158] ---[ end trace 2530722966429f10 ]---
      
      The warning in question was unnecessary but with Jeff's series the rules
      are also clearer.  This patch removes the warning and updates the comment
      to explain why sk_mem_reclaim() may still be called.
      
      [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon
                points out that it's not needed.]
      
      Cc: Leon Romanovsky <leon@leon.nu>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 09 Jun, 2015 1 commit
  14. 07 Jun, 2015 2 commits
    • bpf: allow programs to write to certain skb fields · d691f9e8
      Committed by Alexei Starovoitov
      Allow programs to read/write skb->mark, tc_index fields and
      ((struct qdisc_skb_cb *)cb)->data.
      
      mark and tc_index are generically useful in TC.
      cb[0]-cb[4] are primarily used to pass arguments from one
      program to another called via bpf_tail_call() which can
      be seen in sockex3_kern.c example.
      
      All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
      mark, tc_index are writeable from tc_cls_act only.
      cb[0]-cb[4] are writeable by both sockets and tc_cls_act.
      
      Add verifier tests and improve sample code.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
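The access rules stated in the commit above can be written down as a small lookup. The sketch below is a Python model of those rules only, not the verifier: the program-type labels `"socket"` and `"tc_cls_act"`, the function name `may_access`, and the choice of `len` and `protocol` as example read-only fields are assumptions made for illustration, and `FIELDS` is deliberately a subset of `struct __sk_buff`.

```python
PROG_TYPES = {"socket", "tc_cls_act"}            # labels assumed for this sketch
CB_FIELDS = {f"cb[{i}]" for i in range(5)}       # cb[0]-cb[4]
TC_ONLY_WRITABLE = {"mark", "tc_index"}
# Illustrative subset of fields; `len` and `protocol` stand in for the rest.
FIELDS = {"len", "protocol"} | TC_ONLY_WRITABLE | CB_FIELDS


def may_access(prog_type, field, access):
    """Return True if `prog_type` may perform `access` ('read' or 'write') on `field`."""
    if prog_type not in PROG_TYPES or field not in FIELDS:
        return False
    if access == "read":
        return True                               # every field is readable
    if access != "write":
        return False
    if field in CB_FIELDS:
        return True                               # writable by both program types
    # mark and tc_index are writable from tc_cls_act only
    return field in TC_ONLY_WRITABLE and prog_type == "tc_cls_act"
```

For example, `may_access("socket", "mark", "write")` is `False`, while `may_access("socket", "cb[0]", "write")` and `may_access("tc_cls_act", "mark", "write")` are both `True`.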
    • bpf: make programs see skb->data == L2 for ingress and egress · 3431205e
      Committed by Alexei Starovoitov
      eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
      For ingress the L2 header is already pulled, whereas for egress it's present.
      This is known to program writers, who are currently forced to use the
      BPF_LL_OFF workaround.
      Since programs don't change skb internal pointers, it is safe to do
      pull/push right around invocation of the program; earlier taps and
      later pt->func() will not be affected.
      Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
      around run_filter/BPF_PROG_RUN even if skb_shared.
      
      This fix finally allows programs to use optimized LD_ABS/IND instructions
      without BPF_LL_OFF for higher performance.
      tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
             w/o JIT   w/JIT
      before  20.5     23.6 Mpps
      after   21.8     26.6 Mpps
      
      Old programs with BPF_LL_OFF will still work as-is.
      
      We can now undo most of the earlier workaround commit:
      a166151c ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
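The push/pull trick described in the commit above can be modelled with a buffer and a data offset. This is a minimal Python sketch, not the kernel code: the `Skb` class, `run_at_l2`, and the fixed 14-byte Ethernet header length are assumptions for illustration. On ingress the model pushes the L2 header back before running the program and pulls it again afterwards, so the program sees the same layout in both directions.

```python
ETH_HLEN = 14  # Ethernet header length, assumed for this sketch


class Skb:
    """Toy packet buffer: `data_off` plays the role of skb->data."""

    def __init__(self, frame, data_off):
        self.frame = frame
        self.data_off = data_off

    def push(self, n):
        if n > self.data_off:
            raise ValueError("not enough headroom")
        self.data_off -= n

    def pull(self, n):
        if self.data_off + n > len(self.frame):
            raise ValueError("pull past end of frame")
        self.data_off += n

    @property
    def data(self):
        return self.frame[self.data_off:]


def run_at_l2(prog, skb, ingress, mac_len=ETH_HLEN):
    """Run `prog` with skb.data pointing at the L2 header in both directions."""
    if ingress:
        skb.push(mac_len)      # L2 header was already pulled on ingress
    try:
        return prog(skb)
    finally:
        if ingress:
            skb.pull(mac_len)  # restore the offset later receivers expect
```

A program that reads the EtherType at `skb.data[12:14]` then returns the same value for an ingress packet (offset 14 before the call) and an egress packet (offset 0), and the ingress offset is back at 14 once `run_at_l2` returns.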
  15. 05 Jun, 2015 9 commits