1. 29 Sep 2014, 2 commits
  2. 27 Sep 2014, 5 commits
  3. 26 Sep 2014, 4 commits
    • net: sched: use pinned timers · 4a8e320c
      Eric Dumazet authored
      While using an MQ + NETEM setup, I had confirmation that the default
      timer migration (/proc/sys/kernel/timer_migration) is killing us.
      
      Installing this on the receiver side of a TCP_STREAM test (the NIC
      has 8 TX queues):
      
      EST="est 1sec 4sec"
      for ETH in eth1
      do
       tc qd del dev $ETH root 2>/dev/null
       tc qd add dev $ETH root handle 1: mq
       tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
       tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
       tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
       tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
       tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
       tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
       tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
       tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
      done
      
      We can see that the timers get migrated onto a single cpu, presumably
      the one that was idle at the time the timers were set up. All qdisc
      dequeues then run from this cpu, and huge lock contention follows:
      the single cpu is stuck in softirq mode and cannot dequeue fast
      enough.
      
          39.24%  [kernel]          [k] _raw_spin_lock
           2.65%  [kernel]          [k] netem_enqueue
           1.80%  [kernel]          [k] netem_dequeue
           1.63%  [kernel]          [k] copy_user_enhanced_fast_string
           1.45%  [kernel]          [k] _raw_spin_lock_bh
      
      By pinning the qdisc timers to the cpu running the qdisc, we respect
      the XPS settings and remove this lock contention.
      
           5.84%  [kernel]          [k] netem_enqueue
           4.83%  [kernel]          [k] _raw_spin_lock
           2.92%  [kernel]          [k] copy_user_enhanced_fast_string
      
      Current qdiscs that benefit from this change are:
      
      	netem, cbq, fq, hfsc, tbf, htb.
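      
      As a quick sketch of the mechanism (the actual change is in
      qdisc_watchdog_init() and friends in net/sched/sch_api.c), arming the
      watchdog hrtimer in a pinned mode keeps expiry on the CPU that armed
      it:
      
      /* Sketch only: pin a qdisc watchdog timer to the arming CPU by
       * using a _PINNED hrtimer mode instead of HRTIMER_MODE_ABS. */
      #include <net/pkt_sched.h>
      
      static void watchdog_init_pinned(struct qdisc_watchdog *wd,
                                       struct Qdisc *qdisc,
                                       enum hrtimer_restart (*fn)(struct hrtimer *))
      {
              hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
              wd->timer.function = fn;        /* the qdisc watchdog callback */
              wd->qdisc = qdisc;
      }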
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Remove gso_send_check as an offload callback · 53e50398
      Tom Herbert authored
      The send_check logic was only interesting in the cases of TCP offload
      and UDP UFO, where the checksum needed to be initialized to the
      pseudo-header checksum. Now that we have moved that logic into the
      related gso_segment functions, gso_send_check is no longer needed.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: move logic out of udp[46]_ufo_send_check · f71470b3
      Tom Herbert authored
      In udp[46]_ufo_send_check the UDP checksum is initialized to the
      pseudo-header checksum. We can move this logic into
      udp[46]_ufo_fragment. After this change udp[46]_ufo_send_check is a
      no-op.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: move logic out of tcp_v[46]_gso_send_check · d020f8f7
      Tom Herbert authored
      In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
      pseudo-header checksum using __tcp_v[46]_send_check. We can move this
      logic into the new tcp[46]_gso_segment functions, to be done when
      ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
      the common case, and is possibly always true when taking the GSO
      path). After this change tcp_v[46]_gso_send_check is a no-op.
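      
      The moved logic, roughly (a fragment sketched from memory, close to
      the new tcp4_gso_segment() in net/ipv4/tcp_offload.c; the IPv6
      variant is analogous):
      
      /* Inside the gso_segment path: if the stack did not hand us a
       * CHECKSUM_PARTIAL skb, set up the pseudo-header checksum here,
       * right before the actual segmentation. */
      if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {
              const struct iphdr *iph = ip_hdr(skb);
              struct tcphdr *th = tcp_hdr(skb);
      
              th->check = 0;
              skb->ip_summed = CHECKSUM_PARTIAL;
              __tcp_v4_send_check(skb, iph->saddr, iph->daddr);
      }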
      Signed-off-by: Tom Herbert <therbert@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 24 Sep 2014, 2 commits
    • tcp: add coalescing attempt in tcp_ofo_queue() · bd1e75ab
      Eric Dumazet authored
      In order to make TCP more resilient in the presence of reordering, we
      need to allow coalescing to happen when skbs from the out-of-order
      queue are transferred into the receive queue. LRO/GRO can be
      completely canceled in some pathological cases, like per-packet load
      balancing on aggregated links.
      
      I had to move tcp_try_coalesce() up in the file, above
      tcp_ofo_queue().
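      
      The transfer now looks roughly like this (a sketch of the changed
      loop in tcp_ofo_queue(), not the verbatim patch):
      
      /* Try to glue the out-of-order skb onto the tail of the receive
       * queue before falling back to queueing it as a separate skb. */
      tail = skb_peek_tail(&sk->sk_receive_queue);
      eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
      if (!eaten)
              __skb_queue_tail(&sk->sk_receive_queue, skb);
      else
              kfree_skb_partial(skb, fragstolen);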
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • icmp: add a global rate limitation · 4cdf507d
      Eric Dumazet authored
      Current ICMP rate limiting uses the inetpeer cache, which is an RBL
      tree protected by a lock, meaning that hosts can be stuck hard if all
      cpus want to check ICMP limits.
      
      When, say, a DNS or NTP server process is restarted, the inetpeer
      tree grows quickly and the machine comes to its knees.
      
      iptables cannot help because the bottleneck happens before ICMP
      messages are even cooked and sent.
      
      This patch adds a new global limitation, using a token bucket filter,
      controlled by two new sysctls:
      
      icmp_msgs_per_sec - INTEGER
          Limit maximal number of ICMP packets sent per second from this host.
          Only messages whose type matches icmp_ratemask are
          controlled by this limit.
          Default: 1000
      
      icmp_msgs_burst - INTEGER
          icmp_msgs_per_sec controls number of ICMP packets sent per second,
          while icmp_msgs_burst controls the burst size of these packets.
          Default: 50
      
      Note that if we really want to send millions of ICMP messages per
      second, we might extend the idea and infrastructure added in commit
      04ca6973 ("ip: make IP identifiers less predictable"): add a token
      bucket in the ip_idents hash and no longer rely on inetpeer.
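      
      A simplified sketch of the token bucket (the real implementation is
      icmp_global_allow() in net/ipv4/icmp.c, which additionally handles
      locking and jiffies wraparound; the sysctl variable names follow the
      patch):
      
      static int sysctl_icmp_msgs_per_sec = 1000;   /* new sysctl, default */
      static int sysctl_icmp_msgs_burst = 50;       /* new sysctl, default */
      
      static struct {
              unsigned long stamp;    /* last refill time, in jiffies */
              unsigned long credit;   /* tokens currently available */
      } icmp_global;
      
      static bool icmp_global_allow_sketch(void)
      {
              unsigned long now = jiffies;
              unsigned long delta = now - icmp_global.stamp;
      
              if (delta) {
                      /* refill in proportion to elapsed time, capped at
                       * the burst size */
                      unsigned long incr = sysctl_icmp_msgs_per_sec * delta / HZ;
      
                      icmp_global.credit = min(icmp_global.credit + incr,
                                               (unsigned long)sysctl_icmp_msgs_burst);
                      icmp_global.stamp = now;
              }
              if (!icmp_global.credit)
                      return false;   /* over the limit: drop this message */
              icmp_global.credit--;   /* each ICMP message costs one token */
              return true;
      }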
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 23 Sep 2014, 12 commits
    • ipv4: do not use this_cpu_ptr() in preemptible context · a35165ca
      Eric Dumazet authored
      Using this_cpu_ptr() in preemptible context is generally bad:
      
      Sep 22 05:05:55 br kernel: [   94.608310] BUG: using smp_processor_id() in preemptible [00000000] code: ip/2261
      Sep 22 05:05:55 br kernel: [   94.608316] caller is tunnel_dst_set.isra.28+0x20/0x60 [ip_tunnel]
      Sep 22 05:05:55 br kernel: [   94.608319] CPU: 3 PID: 2261 Comm: ip Not tainted 3.17.0-rc5 #82
      
      We can simply use raw_cpu_ptr(), as preemption is safe in these
      contexts.
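      
      The shape of the fix, as a sketch (the real change is a one-liner in
      tunnel_dst_set() and friends in net/ipv4/ip_tunnel.c; the dst_cache
      field name is from that file):
      
      /* Any CPU's cache slot is an acceptable target here, so use
       * raw_cpu_ptr(), which skips the smp_processor_id() debug check
       * that this_cpu_ptr() performs in preemptible context. */
      static void tunnel_dst_set_sketch(struct ip_tunnel *t,
                                        struct dst_entry *dst)
      {
              struct ip_tunnel_dst *idst = raw_cpu_ptr(t->dst_cache);
      
              /* publish the new cached route, releasing the old one */
              dst_release(xchg(&idst->dst, dst));
      }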
      
      Should fix https://bugzilla.kernel.org/show_bug.cgi?id=84991
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Joe <joe9mail@gmail.com>
      Fixes: 9a4aa9af ("ipv4: Use percpu Cache route in IP tunnels")
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix compile warning in cls_u32 · a2aeb02a
      Eric Dumazet authored
      $ grep CONFIG_CLS_U32_MARK .config
      # CONFIG_CLS_U32_MARK is not set
      
      net/sched/cls_u32.c: In function 'u32_change':
      net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used [-Wunused-label]
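      
      The warning pattern, reduced to a standalone example (not the
      cls_u32 code itself): a label must be compiled in only when the goto
      that targets it is.
      
      /* build with: gcc -Wunused-label -DUSE_MARK_SKETCH label.c */
      #include <stdlib.h>
      
      int change_sketch(int fail)
      {
              void *n = malloc(16);
      
              if (!n)
                      return -1;
      #ifdef USE_MARK_SKETCH
              if (fail)
                      goto errout;    /* only exists under this config */
      #endif
              free(n);
              return 0;
      
      #ifdef USE_MARK_SKETCH
      errout:                         /* guard the label the same way */
              free(n);
              return -1;
      #endif
      }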
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid possible arithmetic overflows · fcdd1cf4
      Eric Dumazet authored
      icsk_rto is a 32-bit field, and icsk_backoff can reach 15 by default,
      or more if some sysctls (e.g. tcp_retries2) are changed.
      
      Better to use 64-bit arithmetic for the icsk_rto << icsk_backoff
      operations.
      
      As Joe Perches suggested, add a helper for this.
      
      Yuchung spotted the tcp_v4_err() case.
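      
      The helper, as added to include/net/inet_connection_sock.h
      (reproduced from memory, so treat it as a sketch): the shift is
      widened to 64 bits before clamping back to unsigned long.
      
      static inline unsigned long
      inet_csk_rto_backoff(const struct inet_connection_sock *icsk,
                           unsigned long max_when)
      {
              u64 when = (u64)icsk->icsk_rto << icsk->icsk_backoff;
      
              return (unsigned long)min_t(u64, when, max_when);
      }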
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: mld: answer mldv2 queries with mldv1 reports in mldv1 fallback · 35f7aa53
      Daniel Borkmann authored
      RFC2710 (MLDv1), section 3.7. says:
      
        The length of a received MLD message is computed by taking the
        IPv6 Payload Length value and subtracting the length of any IPv6
        extension headers present between the IPv6 header and the MLD
        message. If that length is greater than 24 octets, that indicates
        that there are other fields present *beyond* the fields described
        above, perhaps belonging to a *future backwards-compatible* version
        of MLD. An implementation of the version of MLD specified in this
        document *MUST NOT* send an MLD message longer than 24 octets and
        MUST ignore anything past the first 24 octets of a received MLD
        message.
      
      RFC3810 (MLDv2), section 8.2.1. states for *listeners* regarding
      presence of MLDv1 routers:
      
        In order to be compatible with MLDv1 routers, MLDv2 hosts MUST
        operate in version 1 compatibility mode. [...] When Host
        Compatibility Mode is MLDv2, a host acts using the MLDv2 protocol
        on that interface. When Host Compatibility Mode is MLDv1, a host
        acts in MLDv1 compatibility mode, using *only* the MLDv1 protocol,
        on that interface. [...]
      
      While section 8.3.1. specifies *router* behaviour regarding presence
      of MLDv1 routers:
      
        MLDv2 routers may be placed on a network where there is at least
        one MLDv1 router. The following requirements apply:
      
        If an MLDv1 router is present on the link, the Querier MUST use
        the *lowest* version of MLD present on the network. This must be
        administratively assured. Routers that desire to be compatible
        with MLDv1 MUST have a configuration option to act in MLDv1 mode;
        if an MLDv1 router is present on the link, the system administrator
        must explicitly configure all MLDv2 routers to act in MLDv1 mode.
        When in MLDv1 mode, the Querier MUST send periodic General Queries
        truncated at the Multicast Address field (i.e., 24 bytes long),
        and SHOULD also warn about receiving an MLDv2 Query (such warnings
        must be rate-limited). The Querier MUST also fill in the Maximum
        Response Delay in the Maximum Response Code field, i.e., the
        exponential algorithm described in section 5.1.3. is not used. [...]
      
      That means that we should not get queries from different versions of
      MLD. When an MLDv1 router is present, MLDv2 enforces truncation and
      MRC == MRD (both fields overlap within the 24 octet range).
      
      Section 8.3.2. specifies behaviour in the presence of MLDv1 multicast
      address *listeners*:
      
        MLDv2 routers may be placed on a network where there are hosts
        that have not yet been upgraded to MLDv2. In order to be compatible
        with MLDv1 hosts, MLDv2 routers MUST operate in version 1 compatibility
        mode. MLDv2 routers keep a compatibility mode per multicast address
        record. The compatibility mode of a multicast address is determined
        from the Multicast Address Compatibility Mode variable, which can be
        in one of the two following states: MLDv1 or MLDv2.
      
        The Multicast Address Compatibility Mode of a multicast address
        record is set to MLDv1 whenever an MLDv1 Multicast Listener Report is
        *received* for that multicast address. At the same time, the Older
        Version Host Present timer for the multicast address is set to Older
        Version Host Present Timeout seconds. The timer is re-set whenever a
        new MLDv1 Report is received for that multicast address. If the Older
        Version Host Present timer expires, the router switches back to
        Multicast Address Compatibility Mode of MLDv2 for that multicast
        address. [...]
      
      That means the following scenario can happen: hosts act in MLDv1
      compatibility mode because they previously received an MLDv1 query
      (or simply operate in MLDv1-only mode), and at the same time an MLDv2
      router starts up and transmits MLDv2 startup query messages while
      being unaware of the current operational mode.
      
      Given RFC2710, section 3.7, we would need to answer that with an
      MLDv1 listener report, so that the router, per RFC3810, section
      8.3.2., would receive it and internally switch to MLDv1 compatibility
      as well.
      
      Right now (and, I believe, since the initial implementation of
      MLDv2), Linux hosts just silently drop such MLDv2 queries instead of
      replying with an MLDv1 listener report, which prevents an MLDv2
      router from going into fallback mode (until it receives other MLDv1
      queries).
      
      Since the mapping of MRC to MRD in exactly such cases can make use of
      the exponential algorithm from section 5.1.3, we cannot, strictly
      speaking, know in MLDv1 how the MRC is encoded; the RFC does not seem
      to mention this either. Since the encodings are identical up to
      32767, assume that value as a hard upper limit to which we clamp in
      such a situation. We have asked one of the RFC authors about this,
      and he mentioned that there seem to be no implementations that make
      use of the exponential algorithm in startup messages. In any case,
      this patch fixes this MLD interoperability issue.
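      
      A hypothetical helper illustrating the clamp described above (the
      names here are ours, not the patch's):
      
      /* Treat an MLDv2 Maximum Response Code as an MLDv1 Maximum Response
       * Delay. The encodings agree up to 32767; above that, MLDv2
       * switches to an exponential form that MLDv1 cannot represent, so
       * clamp to the shared linear range. */
      static unsigned long mldv1_mrd_from_v2_mrc(u16 mrc)
      {
              if (mrc > 32767)
                      mrc = 32767;
              return msecs_to_jiffies(mrc);
      }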
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rfkill: gpio: Fix clock status · fa5c107c
      Loic Poulain authored
      The clock is disabled when the device is blocked, so clock_enabled is
      the logical negation of "blocked".
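      
      The fix is essentially this one line in rfkill_gpio_set_power()
      (a sketch; the field name is from net/rfkill/rfkill-gpio.c):
      
      /* the clock runs exactly when the device is not blocked */
      rfkill->clk_enabled = !blocked;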
      Signed-off-by: Loic Poulain <loic.poulain@intel.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
    • net: sched: cls_u32 changes to knode must appear atomic to readers · de5df632
      John Fastabend authored
      Changes to the cls_u32 classifier must appear atomic to the readers.
      Before this patch, if a change was requested for both the exts and
      the ifindex, first the ifindex was updated and then the exts with
      tcf_exts_change(). This opened a small window where a reader could
      see an exts chain with an incorrect ifindex, violating the RCU
      semantics.
      
      Here we resolve this by always passing u32_set_parms() a copy of the
      tc_u_knode to work on, and then inserting it into the hash table
      after the updates have been successfully applied (sketched below,
      after the test script).
      
      Tested with the following short script:
      
      #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \
      	       u32 divisor 256
      
      #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \
      	       u32 link 1: hashkey mask ffffff00 at 12    \
      	       match ip src 192.168.8.0/2
      
      #tc filter add dev p3p2 parent 8001:0 protocol ip prio 102    \
      	       handle 1::10 u32 classid 1:2 ht 1: 	      \
      	       match ip src 192.168.8.0/8 match ip tos 0x0a 1e
      
      #tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \
      		 handle 1::10 u32 classid 1:2 ht 1:        \
      		 match ip src 1.1.0.0/8 match ip tos 0x0b 1e
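      
      The copy/replace pattern, roughly (a fragment sketching the new
      u32_change() flow; the helper names follow the patch):
      
      /* Work on a private copy so concurrent readers always see either
       * the old knode or the fully updated one, never a mix. */
      new = u32_init_knode(tp, n);            /* duplicate existing knode */
      if (!new)
              return -ENOMEM;
      
      err = u32_set_parms(net, tp, base, ht, new, tb, tca[TCA_RATE], ovr);
      if (err) {
              u32_destroy_key(tp, new, false);
              return err;
      }
      
      u32_replace_knode(tp, tp_c, new);       /* rcu_assign_pointer() publish */
      call_rcu(&n->rcu, u32_delete_key_rcu);  /* free old copy after grace period */
      return 0;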
      
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cls_u32: fix missed pcpu_success free_percpu · a1ddcfee
      John Fastabend authored
      This fixes a missed free_percpu of pcpu_success in the error-unwind
      path and when keys are destroyed.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Need to make ip6_udp_tunnel.c have GPL license · 3fcb95a8
      Tom Herbert authored
      Unable to load various tunneling modules without this:
      
      [   80.679049] fou: Unknown symbol udp_sock_create6 (err 0)
      [   91.439939] ip6_udp_tunnel: Unknown symbol ip6_local_out (err 0)
      [   91.439954] ip6_udp_tunnel: Unknown symbol __put_net (err 0)
      [   91.457792] vxlan: Unknown symbol udp_sock_create6 (err 0)
      [   91.457831] vxlan: Unknown symbol udp_tunnel6_xmit_skb (err 0)
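      
      In essence, the missing piece is a module license tag; without one,
      the module cannot use GPL-only exports such as ip6_local_out():
      
      /* at the end of net/ipv6/ip6_udp_tunnel.c */
      MODULE_LICENSE("GPL");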
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: keep original skb which only needs header checking during software GSO · cecda693
      Jason Wang authored
      Commit ce93718f ("net: Don't keep around original SKB when we
      software segment GSO frames") frees the original skb after software
      GSO even for dodgy gso skbs. This breaks stream throughput from
      untrusted sources, since for such skbs only header checking was done
      during software GSO instead of a true segmentation. This patch fixes
      this by freeing the original gso skb only when it was really
      segmented by software.
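      
      The resulting logic, roughly (a fragment sketching the
      validate_xmit_skb() path after this change):
      
      segs = skb_gso_segment(skb, features);
      if (IS_ERR(segs)) {
              goto out_kfree_skb;     /* segmentation failed */
      } else if (segs) {
              consume_skb(skb);       /* really segmented: drop original */
              skb = segs;
      }
      /* else: dodgy gso skb, only its headers were checked; keep the
       * original skb and transmit it as-is */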
      
      Fixes: ce93718f ("net: Don't keep around original SKB when we
      software segment GSO frames.")
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: add {get, set}_wol callbacks to slave devices · 19e57c4e
      Florian Fainelli authored
      Allow switch drivers to implement per-port Wake-on-LAN getters and
      setters.
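      
      A driver-side sketch (the hook signatures follow the new
      dsa_switch_driver fields; my_port_wol_* are hypothetical driver
      helpers):
      
      static void my_get_wol(struct dsa_switch *ds, int port,
                             struct ethtool_wolinfo *w)
      {
              w->supported = WAKE_MAGIC;
              w->wolopts = my_port_wol_enabled(ds, port) ? WAKE_MAGIC : 0;
      }
      
      static int my_set_wol(struct dsa_switch *ds, int port,
                            struct ethtool_wolinfo *w)
      {
              if (w->wolopts & ~WAKE_MAGIC)
                      return -EOPNOTSUPP;
              return my_port_wol_set(ds, port, !!(w->wolopts & WAKE_MAGIC));
      }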
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: allow switch drivers to implement suspend/resume hooks · 24462549
      Florian Fainelli authored
      Add an abstraction layer to suspend/resume switch devices, doing the
      following split:
      
      - suspend/resume the slave network devices and their corresponding PHY
        devices
      - suspend/resume the switch hardware using switch driver callbacks
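      
      A sketch of that split (approximate; the field and helper names are
      from the DSA core of this era and may differ slightly):
      
      int dsa_switch_suspend_sketch(struct dsa_switch *ds)
      {
              int i, ret = 0;
      
              /* first the slave net_devices and their PHYs ... */
              for (i = 0; i < DSA_MAX_PORTS; i++) {
                      if (!(ds->phys_port_mask & (1 << i)))
                              continue;
                      ret = dsa_slave_suspend(ds->ports[i]);
                      if (ret)
                              return ret;
              }
      
              /* ... then the switch hardware, via the driver callback */
              if (ds->drv->suspend)
                      ret = ds->drv->suspend(ds);
      
              return ret;
      }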
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: shrink struct qdisc_skb_cb to 28 bytes · 25711786
      Eric Dumazet authored
      We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
      or increasing skb->cb[] size.
      
      Commit e0f31d84 ("flow_keys: Record IP layer protocol in
      skb_flow_dissect()") broke IPoIB.
      
      The only current offender is sch_choke, and it does not need an
      absolutely precise flow key.
      
      If we store 17 bytes of the flow key, it is more than enough. (That
      is the actual size of flow_keys if it were a packed structure, but we
      might add new fields at the end of it later.)
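      
      The size constraint is enforced at build time; the guard in
      include/net/sch_generic.h (reproduced from memory, so treat it as a
      sketch) reads:
      
      static inline void qdisc_cb_private_validate(const struct sk_buff *skb,
                                                   int sz)
      {
              struct qdisc_skb_cb *qcb;
      
              /* a qdisc's private cb area must fit inside skb->cb[] ... */
              BUILD_BUG_ON(sizeof(skb->cb) < offsetof(struct qdisc_skb_cb, data) + sz);
              /* ... and inside qdisc_skb_cb's data[] member */
              BUILD_BUG_ON(sizeof(qcb->data) < sz);
      }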
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 20 Sep 2014, 15 commits