1. 11 Jun, 2016 15 commits
    • net_sched: remove generic throttled management · 45f50bed
      Eric Dumazet committed
      __QDISC_STATE_THROTTLED bit manipulation is rather expensive
      for HTB and a few others.
      
      I already removed it for sch_fq in commit f2600cf0
      ("net: sched: avoid costly atomic operation in fq_dequeue()")
      and so far nobody complained.
      
      When one or more packets are stuck in one or more throttled
      HTB classes, an HTB dequeue() performs two atomic operations
      to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
      lock is held.
      
      Removing this pair of atomic operations brings me an 8% performance
      increase on 200 TCP_RR tests, in the presence of throttled classes.
      
      This patch has no side effect, since nothing actually uses
      qdisc_is_throttled() anymore.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: netem: remove qdisc_is_throttled() use · 42117927
      Eric Dumazet committed
      Looks like it is only there as some optimization attempt.
      
      Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
      and netem is the last user, just remove this check.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: cbq: remove a flaky use of qdisc_is_throttled() · cca605dd
      Eric Dumazet committed
      So far no qdisc ever unset the throttled bit at enqueue() time,
      so CBQ usage of qdisc_is_throttled() was flaky.
      
      Since __QDISC_STATE_THROTTLED set/unset is way too expensive
      considering that only CBQ was eventually caring for this status,
      it would make sense to implement a Qdisc ops ->is_throttled()
      if we find that this is needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_plug: use a private throttled status · 8fe6a79f
      Eric Dumazet committed
      We want to get rid of generic qdisc throttled management,
      so this qdisc has to use a private flag.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: sctp should change socket state when shutdown is received · d46e416c
      Xin Long committed
      Now sctp doesn't change socket state upon shutdown reception. It changes
      just the assoc state, even though it's a TCP-style socket.
      
      For cases where we really need to check sk->sk_state, this issue has
      to be fixed; at the least, dumps from ss or netstat will then give
      more exact information.
      
      As an improvement, we will change sk->sk_state when we change asoc->state
      to SHUTDOWN_RECEIVED, and also do it in sctp_shutdown to keep consistent
      with sctp_close.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add NV congestion control · 699fafaf
      Lawrence Brakmo committed
      TCP-NV (New Vegas) is a major update to TCP-Vegas.
      An earlier version of NV was presented at 2010's LPC.
      It is a delay-based congestion avoidance scheme for the
      data center. This version has been tested within a
      10G rack where the HW RTTs are 20-50us and with
      1 to 400 flows.
      
      A description of TCP-NV, including implementation
      details as well as experimental results, can be found at:
      http://www.brakmo.org/networking/tcp-nv/TCPNV.html
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add in_flight to tcp_skb_cb · 6f094b9e
      Lawrence Brakmo committed
      Add in_flight (bytes in flight when packet was sent) field
      to tx component of tcp_skb_cb and make it available to
      congestion modules' pkts_acked() function through the
      ack_sample function argument.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: use common code for virtio_net_hdr and skb GSO conversion · 1276f24e
      Mike Rapoport committed
      Replace the open-coded conversion between virtio_net_hdr and skb GSO
      info with virtio_net_hdr_from_skb().
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: IB: Remove deprecated create_workqueue · 231edca9
      Bhaktipriya Shridhar committed
      alloc_workqueue replaces deprecated create_workqueue().
      
      The driver is InfiniBand, which can be used as a block device, and the
      workqueue seems involved in the regular operation of the device, so a
      dedicated workqueue is used with WQ_MEM_RECLAIM set to guarantee
      forward progress under memory pressure.
      Since there are only a fixed number of work items, explicit concurrency
      limit is unnecessary here.
      Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Limit the listening backlog · 0e119b41
      David Howells committed
      Limit the socket incoming call backlog queue size so that a remote client
      can't pump in sufficient new calls that the server runs out of memory.  Note
      that this is partially theoretical at the moment since whilst the number of
      calls is limited, the number of packets trying to set up new calls is not.
      This will be addressed in a later patch.
      
      If the caller of listen() specifies a backlog of INT_MAX, then they get the
      current maximum; anything else greater than max_backlog or anything
      negative incurs EINVAL.
      
      The limit on the maximum queue size can be set by:
      
      	echo N >/proc/sys/net/rxrpc/max_backlog
      
      where 4<=N<=32.
      
      Further, set the default backlog to 0, requiring listen() to be called
      before we actually start queueing new calls.  Whilst this is kind of a
      change in the UAPI, the caller can't actually *accept* new calls anyway
      unless they've first called listen() to put the socket into the LISTENING
      state - thus the aforementioned new calls would otherwise just sit there,
      eating up kernel memory.  (Note that sockets that don't have a non-zero
      service ID bound don't get incoming calls anyway.)
      
      Given that the default backlog is now 0, make the AFS filesystem call
      kernel_listen() to set the maximum backlog for itself.
      
      Possible improvements include:
      
       (1) Trimming a too-large backlog to max_backlog when listen is called.
      
       (2) Trimming the backlog value whenever the value is used so that changes
           to max_backlog are applied to an open socket automatically.  Note that
           the AFS filesystem opens one socket and keeps it open for extended
           periods, so would miss out on changes to max_backlog.
      
       (3) Having a separate setting for the AFS filesystem.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
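
The backlog rules described in this commit can be sketched in Python. This is an illustrative model only, not the kernel code: `resolve_backlog` and `max_backlog_is_valid` are hypothetical names, and returning a negative errno is a convention chosen for the sketch.

```python
import errno

INT_MAX = 2**31 - 1  # value of a 32-bit signed INT_MAX


def max_backlog_is_valid(n: int) -> bool:
    """Range accepted for /proc/sys/net/rxrpc/max_backlog per the commit text."""
    return 4 <= n <= 32


def resolve_backlog(requested: int, max_backlog: int) -> int:
    """Return the effective backlog, or a negative errno on rejection."""
    if requested == INT_MAX:
        # INT_MAX means "give me the current maximum".
        return max_backlog
    if requested < 0 or requested > max_backlog:
        # Anything negative or above the maximum is rejected.
        return -errno.EINVAL
    return requested
```

With `max_backlog` set to 32, a request for `INT_MAX` resolves to 32, while a request for 33 or -1 is rejected with `EINVAL`.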
    • rxrpc: Trim line-terminal whitespace · bc6e1ea3
      David Howells committed
      Trim line-terminal whitespace in net/rxrpc/
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, cls: allow for deleting all filters for given parent · ea7f8277
      Daniel Borkmann committed
      Add a possibility where the user can just specify the parent and
      all filters under that parent are then being purged. Currently,
      for example for scripting, one needs to specify pref/prio to have
      a well-defined number for 'tc filter del' command for addressing
      the previously created instance or additionally filter handle in
      case of priorities being the same. Improve usage by allowing the
      option for tc to specify the parent and removing the whole chain
      for that given parent.
      
      Example usage after patch, no tc changes required:
      
        # tc qdisc replace dev foo clsact
        # tc filter add dev foo egress bpf da obj ./bpf.o
        # tc filter add dev foo egress bpf da obj ./bpf.o
        # tc filter show dev foo egress
        filter protocol all pref 49151 bpf
        filter protocol all pref 49151 bpf handle 0x1 bpf.o:[classifier] direct-action
        filter protocol all pref 49152 bpf
        filter protocol all pref 49152 bpf handle 0x1 bpf.o:[classifier] direct-action
        # tc filter del dev foo egress
        # tc filter show dev foo egress
        #
      
      Previously, RTM_DELTFILTER requests with invalid prio of 0 were
      rejected, so only netlink requests with RTM_NEWTFILTER and NLM_F_CREATE
      flag were allowed where the kernel would auto-generate a pref/prio.
      We can piggyback on that and use prio of 0 as a wildcard for
      requests of RTM_DELTFILTER.
      
      For notifying tc netlink monitoring users (e.g. libnl uses this
      for caching), there are two options, that is, sending individual
      tfilter_notify() notifications for each tcf_proto, or sending a
      single one indicating wildcard removal. I tried both and there
      are pros and cons for each; eventually I decided on sending
      individual tfilter_notify(), so that user space can support this
      seamlessly and there won't be a mess of changing each and every
      application to make sure expectations from the kernel won't break
      when they don't understand single notification. Since linear chains
      don't really scale, I expect only a handful of classifiers to be
      attached at max for a given parent anyway.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
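
A minimal Python sketch of the wildcard delete described in this commit, assuming a per-parent chain keyed by pref/prio. `FilterChain` and its methods are hypothetical, and the `ENOENT` return for a missing prio is an assumption of this sketch rather than something the commit message states.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class FilterChain:
    """Hypothetical stand-in for the filters attached to one parent."""
    filters: dict = field(default_factory=dict)        # prio -> description
    notifications: list = field(default_factory=list)  # events seen by monitors

    def add(self, prio: int, desc: str) -> None:
        self.filters[prio] = desc

    def delete(self, prio: int) -> int:
        if prio == 0:
            # prio 0 acts as a wildcard: purge the whole chain, but still
            # emit one notification per removed filter so that existing
            # listeners need no changes.
            for p in sorted(self.filters):
                self.notifications.append(("del", p))
            self.filters.clear()
            return 0
        if prio not in self.filters:
            return -errno.ENOENT  # assumed error code for this sketch
        del self.filters[prio]
        self.notifications.append(("del", prio))
        return 0
```

Deleting with prio 0 on a chain holding prefs 49151 and 49152 empties it and records two separate delete notifications, mirroring the choice of individual tfilter_notify() events over a single wildcard event.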
    • bpf: reject wrong sized filters earlier · f7bd9e36
      Daniel Borkmann committed
      Add a bpf_check_basics_ok() and reject filters that are of invalid
      size much earlier, so we don't do any useless work such as invoking
      bpf_prog_alloc(). Currently, rejection happens in bpf_check_classic()
      only, but it's really unnecessarily late and they should be rejected
      at the earliest point. While at it, also clean up one bpf_prog_size() to
      make it consistent with the remaining invocations.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
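
The idea of rejecting badly sized filters before any allocation can be sketched as follows. This is a Python model, not the kernel function: the exact checks (a missing filter, a zero length, a length above `BPF_MAXINSNS` = 4096) are assumptions based on the classic BPF limits, and `prepare_filter` is a hypothetical caller.

```python
import errno

BPF_MAXINSNS = 4096  # classic BPF instruction limit (assumed to be the bound checked)


def check_basics_ok(insns, flen: int) -> bool:
    """Cheap sanity checks that need no allocation."""
    if insns is None:
        return False
    if flen == 0 or flen > BPF_MAXINSNS:
        return False
    return True


def prepare_filter(insns, flen: int, allocations: list) -> int:
    """Hypothetical caller: validate first, allocate only afterwards."""
    if not check_basics_ok(insns, flen):
        return -errno.EINVAL
    allocations.append(flen)  # stands in for the program allocation
    return 0
```

Because the size check runs first, an oversized or empty filter never reaches the allocation step.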
    • bpf: enforce recursion limit on redirects · a70b506e
      Daniel Borkmann committed
      Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
      Currently, they are not handled by the limiter when attached to clsact's
      egress parent, for example, and a buggy program redirecting it to the
      same device again could run into stack overflow eventually. It would be
      good if we could notify an admin to give him a chance to react. We reuse
      xmit_recursion instead of having one private to eBPF, so that the stack's
      current recursion depth will be taken into account as well. Follow-up to
      commit 3896d655 ("bpf: introduce bpf_clone_redirect() helper") and
      27b29f63 ("bpf: add bpf_redirect() helper").
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
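
A rough Python model of why a shared recursion counter stops a program that keeps redirecting a packet to the same device. `XmitPath`, the `"redirect"` return value, and the limit of 10 are all assumptions made for illustration; the kernel's per-CPU xmit_recursion handling differs in detail.

```python
class XmitPath:
    """Hypothetical transmit path with a recursion limiter."""

    def __init__(self, limit: int = 10):  # limit value assumed for the sketch
        self.limit = limit
        self.depth = 0      # current nesting of dev_queue_xmit() calls
        self.sent = []
        self.dropped = 0

    def dev_queue_xmit(self, packet, egress_prog) -> bool:
        if self.depth >= self.limit:
            # Too deeply nested: drop instead of overflowing the stack.
            self.dropped += 1
            return False
        self.depth += 1
        try:
            if egress_prog(packet) == "redirect":
                # The program sends the packet back to the same device.
                return self.dev_queue_xmit(packet, egress_prog)
            self.sent.append(packet)
            return True
        finally:
            self.depth -= 1
```

A program that always redirects ends with one dropped packet and the depth counter back at zero, while a well-behaved program transmits normally.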
    • openvswitch: Add packet truncation support. · f2a4d086
      William Tu committed
      The patch adds a new OVS action, OVS_ACTION_ATTR_TRUNC, in order to
      truncate packets. A 'max_len' is added for setting the maximum
      packet size, and a 'cutlen' field records the number of bytes
      to trim from the packet when it is output to a port or sent to
      userspace.
      Signed-off-by: William Tu <u9012063@gmail.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
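
The relationship between 'max_len' and 'cutlen' can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes that cutlen is simply the excess of the packet length over max_len, applied at output time.

```python
def compute_cutlen(packet_len: int, max_len: int) -> int:
    """Bytes to trim so that the packet is at most max_len bytes long."""
    return packet_len - max_len if packet_len > max_len else 0


def output_packet(packet: bytes, cutlen: int) -> bytes:
    """Apply the recorded cutlen when the packet leaves (port or userspace)."""
    return packet[: len(packet) - cutlen] if cutlen else packet
```

For a 100-byte packet and a max_len of 64, the recorded cutlen is 36 and the emitted packet is 64 bytes; packets already within the limit pass through unchanged.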
  2. 10 Jun, 2016 7 commits
    • packet: compat support for sock_fprog · 719c44d3
      Willem de Bruijn committed
      Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
      if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
      a pointer into user memory. If userland is 32-bit and kernel is 64-bit
      the two disagree about the layout of struct sock_fprog.
      
      Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
      a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
      SO_REUSEPORT added in commit 19575988 ("soreuseport: add compat
      case for setsockopt SO_ATTACH_REUSEPORT_CBPF").
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
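
The layout mismatch behind this commit can be shown with Python's `struct` module. The sketch assumes a little-endian machine with the usual alignment rules, where struct sock_fprog is a 16-bit length followed by a pointer; `compat_to_native` is a hypothetical helper, not the kernel's conversion routine.

```python
import struct

# u16 len, 2 bytes of padding, 32-bit user pointer (32-bit userland view)
COMPAT32 = struct.Struct("<HxxI")
# u16 len, 6 bytes of padding, 64-bit user pointer (64-bit kernel view)
NATIVE64 = struct.Struct("<H6xQ")


def compat_to_native(raw: bytes) -> bytes:
    """Re-pack a 32-bit layout into the 64-bit one, zero-extending the pointer."""
    length, ptr32 = COMPAT32.unpack(raw)
    return NATIVE64.pack(length, ptr32)
```

The two layouts are 8 and 16 bytes long respectively, so reading the 32-bit structure with the 64-bit layout would pick up the wrong pointer bits; the conversion step keeps the length and widens the pointer.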
    • net: vrf: Fix crash when IPv6 is disabled at boot time · e4348637
      David Ahern committed
      Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is
      disabled at boot using the kernel option ipv6.disable=1. Using
      current net-next with the boot option:
      
      $ ip link add red type vrf table 1001
      
      Generates:
      [12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748
      [12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
      [12210.922537] PGD b79e3067 PUD bb32b067 PMD 0
      [12210.923479] Oops: 0000 [#1] SMP
      [12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc
      [12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235
      [12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000
      [12210.929328] RIP: 0010:[<ffffffff814b30e3>]  [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
      [12210.930697] RSP: 0018:ffff8800bacaf888  EFLAGS: 00010202
      [12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28
      [12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286
      [12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b
      [12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9
      [12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000
      [12210.937103] FS:  00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000
      [12210.938321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0
      [12210.940278] Stack:
      [12210.940603]  ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135
      [12210.941818]  ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800
      [12210.943040]  ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280
      [12210.944288] Call Trace:
      [12210.944688]  [<ffffffff814b3135>] fib6_new_table+0x24/0x8a
      [12210.945516]  [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162
      [12210.946328]  [<ffffffff814091e1>] register_netdevice+0x100/0x396
      [12210.947209]  [<ffffffff8139823d>] vrf_newlink+0x40/0xb3
      [12210.948001]  [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5
      ...
      
      The problem above is due to the fact that the fib hash table is not
      allocated when IPv6 is disabled at boot.
      
      As for the VRF driver it should not do any IPv6 initializations if IPv6
      is disabled, so it needs to know if IPv6 is disabled at boot. The disable
      parameter is private to the IPv6 module, so provide an accessor for
      modules to determine if IPv6 was disabled at boot time.
      
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Simplify connect() implementation and simplify sendmsg() op · 2341e077
      David Howells committed
      Simplify the RxRPC connect() implementation.  It will just note the
      destination address it is given, and if a sendmsg() comes along with no
      address, this will be assigned as the address.  No transport struct will be
      held internally, which will allow us to remove this later.
      
      Simplify sendmsg() also.  Whilst a call is active, userspace refers to it
      by a private unique user ID specified in a control message.  When sendmsg()
      sees a user ID that doesn't map to an extant call, it creates a new call
      for that user ID and attempts to add it.  If, when we try to add it, the
      user ID is now registered, we now reject the message with -EEXIST.  We
      should never see this situation unless two threads are racing, trying to
      create a call with the same ID - which would be an error.
      
      It also isn't required to provide sendmsg() with an address - provided the
      control message data holds a user ID that maps to a currently active call.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
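
The connect()/sendmsg() behaviour described in this commit can be modelled roughly as below. `RxrpcSocketModel` is hypothetical, and the `EDESTADDRREQ` return when no address is available at all is an assumption of the sketch; only the `-EEXIST` case comes from the commit message.

```python
import errno


class RxrpcSocketModel:
    """Hypothetical model: calls are looked up by a user-chosen ID."""

    def __init__(self):
        self.default_addr = None  # address noted by connect()
        self.calls = {}           # user ID -> destination address of the call

    def connect(self, addr) -> None:
        # connect() only notes the destination; nothing else is set up.
        self.default_addr = addr

    def add_call(self, user_id, addr) -> int:
        if user_id in self.calls:
            # Another thread registered the same ID first.
            return -errno.EEXIST
        self.calls[user_id] = addr
        return 0

    def sendmsg(self, user_id, addr=None) -> int:
        if user_id in self.calls:
            return 0  # active call: no address needed
        target = addr if addr is not None else self.default_addr
        if target is None:
            return -errno.EDESTADDRREQ  # assumed error for this sketch
        return self.add_call(user_id, target)
```

After `connect("peer")`, a `sendmsg()` with a new user ID creates a call to that peer, later messages with the same ID reuse it, and a racing attempt to register the same ID again fails with `EEXIST`.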
    • F
      21aff3b9
    • net: add netdev_lockdep_set_classes() helper · d3fff6c4
      Eric Dumazet committed
      It is time to add netdev_lockdep_set_classes() helper
      so that lockdep annotations per device type are easier to manage.
      
      This removes a lot of copies and missing annotations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix qdisc->running lockdep annotations · 52fbb290
      Eric Dumazet committed
      1) qdisc_run_begin() is really using the equivalent of a trylock.
        Instead of using write_seqcount_begin(), use a combination of
        raw_write_seqcount_begin() and correct lockdep annotation.
      
      2) sch_direct_xmit() should use regular spin_lock(root_lock)
      
      Fixes: f9eb8aea ("net_sched: transform qdisc running bit into a seqcount")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sit: remove unnecessary protocol check in ipip6_tunnel_xmit() · adba931f
      Simon Horman committed
      ipip6_tunnel_xmit() is called immediately after checking that
      skb->protocol is htons(ETH_P_IPV6), so there is no need
      to check it a second time.
      
      Found by inspection.
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 09 Jun, 2016 18 commits