1. 29 9月, 2014 40 次提交
    • D
      Merge branch 'dctcp' · a11238ec
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      net: tcp: DCTCP congestion control algorithm
      
      This patch series adds support for the DataCenter TCP (DCTCP) congestion
      control algorithm. Please see individual patches for the details.
      
      The last patch adds DCTCP as a congestion control module, and previous
      ones add needed infrastructure to extend the congestion control framework.
      
      Joint work between Florian Westphal, Daniel Borkmann and Glenn Judd.
      
      v3 -> v2:
       - No changes anywhere, just a resend as requested by Dave
       - Added Stephen's ACK
      v1 -> v2:
       - Rebased to latest net-next
       - Addressed Eric's feedback, thanks!
        - Update stale comment wrt. DCTCP ECN usage
        - Don't call INET_ECN_xmit for every packet
       - Add dctcp ss/inetdiag support to expose internal stats to userspace
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a11238ec
    • D
      net: tcp: add DCTCP congestion control algorithm · e3118e83
      Daniel Borkmann 提交于
      This work adds the DataCenter TCP (DCTCP) congestion control
      algorithm [1], which has been first published at SIGCOMM 2010 [2],
      resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
      recently as an informational IETF draft available at [4]).
      
      DCTCP is an enhancement to the TCP congestion control algorithm for
      data center networks. Typical data center workloads are i.e.
      i) partition/aggregate (queries; bursty, delay sensitive), ii) short
      messages e.g. 50KB-1MB (for coordination and control state; delay
      sensitive), and iii) large flows e.g. 1MB-100MB (data update;
      throughput sensitive). DCTCP has therefore been designed for such
      environments to provide/achieve the following three requirements:
      
        * High burst tolerance (incast due to partition/aggregate)
        * Low latency (short flows, queries)
        * High throughput (continuous data updates, large file
          transfers) with commodity, shallow buffered switches
      
      The basic idea of its design consists of two fundamentals: i) on the
      switch side, packets are being marked when its internal queue
      length > threshold K (K is chosen so that a large enough headroom
      for marked traffic is still available in the switch queue); ii) the
      sender/host side maintains a moving average of the fraction of marked
      packets, so each RTT, F is being updated as follows:
      
       F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
       alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
      
      The resulting alpha (iow: probability that switch queue is congested)
      is then being used in order to adaptively decrease the congestion
      window W:
      
       W := (1 - (alpha / 2)) * W
      
      The means for receiving marked packets resp. marking them on switch
      side in DCTCP is the use of ECN.
      
      RFC3168 describes a mechanism for using Explicit Congestion Notification
      from the switch for early detection of congestion, rather than waiting
      for segment loss to occur.
      
      However, this method only detects the presence of congestion, not
      the *extent*. In the presence of mild congestion, it reduces the TCP
      congestion window too aggressively and unnecessarily affects the
      throughput of long flows [4].
      
      DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
      processing to estimate the fraction of bytes that encounter congestion,
      rather than simply detecting that some congestion has occurred. DCTCP
      then scales the TCP congestion window based on this estimate [4],
      thus it can derive multibit feedback from the information present in
      the single-bit sequence of marks in its control law. And thus act in
      *proportion* to the extent of congestion, not its *presence*.
      
      Switches therefore set the Congestion Experienced (CE) codepoint in
      packets when internal queue lengths exceed threshold K. Resulting,
      DCTCP delivers the same or better throughput than normal TCP, while
      using 90% less buffer space.
      
      It was found in [2] that DCTCP enables the applications to handle 10x
      the current background traffic, without impacting foreground traffic.
      Moreover, a 10x increase in foreground traffic did not cause any
      timeouts, and thus largely eliminates TCP incast collapse problems.
      
      The algorithm itself has already seen deployments in large production
      data centers since then.
      
      We did a long-term stress-test and analysis in a data center, short
      summary of our TCP incast tests with iperf compared to cubic:
      
      This test measured DCTCP throughput and latency and compared it with
      CUBIC throughput and latency for an incast scenario. In this test, 19
      senders sent at maximum rate to a single receiver. The receiver simply
      ran iperf -s.
      
      The senders ran iperf -c <receiver> -t 30. All senders started
      simultaneously (using local clocks synchronized by ntp).
      
      This test was repeated multiple times. Below shows the results from a
      single test. Other tests are similar. (DCTCP results were extremely
      consistent, CUBIC results show some variance induced by the TCP timeouts
      that CUBIC encountered.)
      
      For this test, we report statistics on the number of TCP timeouts,
      flow throughput, and traffic latency.
      
      1) Timeouts (total over all flows, and per flow summaries):
      
                  CUBIC            DCTCP
        Total     3227             25
        Mean       169.842          1.316
        Median     183              1
        Max        207              5
        Min        123              0
        Stddev      28.991          1.600
      
      Timeout data is taken by measuring the net change in netstat -s
      "other TCP timeouts" reported. As a result, the timeout measurements
      above are not restricted to the test traffic, and we believe that it
      is likely that all of the "DCTCP timeouts" are actually timeouts for
      non-test traffic. We report them nevertheless. CUBIC will also include
      some non-test timeouts, but they are drawfed by bona fide test traffic
      timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
      TCP timeouts. DCTCP reduces timeouts by at least two orders of
      magnitude and may well have eliminated them in this scenario.
      
      2) Throughput (per flow in Mbps):
      
                  CUBIC            DCTCP
        Mean      521.684          521.895
        Median    464              523
        Max       776              527
        Min       403              519
        Stddev    105.891            2.601
        Fairness    0.962            0.999
      
      Throughput data was simply the average throughput for each flow
      reported by iperf. By avoiding TCP timeouts, DCTCP is able to
      achieve much better per-flow results. In CUBIC, many flows
      experience TCP timeouts which makes flow throughput unpredictable and
      unfair. DCTCP, on the other hand, provides very clean predictable
      throughput without incurring TCP timeouts. Thus, the standard deviation
      of CUBIC throughput is dramatically higher than the standard deviation
      of DCTCP throughput.
      
      Mean throughput is nearly identical because even though cubic flows
      suffer TCP timeouts, other flows will step in and fill the unused
      bandwidth. Note that this test is something of a best case scenario
      for incast under CUBIC: it allows other flows to fill in for flows
      experiencing a timeout. Under situations where the receiver is issuing
      requests and then waiting for all flows to complete, flows cannot fill
      in for timed out flows and throughput will drop dramatically.
      
      3) Latency (in ms):
      
                  CUBIC            DCTCP
        Mean      4.0088           0.04219
        Median    4.055            0.0395
        Max       4.2              0.085
        Min       3.32             0.028
        Stddev    0.1666           0.01064
      
      Latency for each protocol was computed by running "ping -i 0.2
      <receiver>" from a single sender to the receiver during the incast
      test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
      that traffic traversed the DCTCP queue and was not dropped when the
      queue size was greater than the marking threshold. The summary
      statistics above are over all ping metrics measured between the single
      sender, receiver pair.
      
      The latency results for this test show a dramatic difference between
      CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
      which incurs the maximum queue latency (more buffer memory will lead
      to high latency.) DCTCP, on the other hand, deliberately attempts to
      keep queue occupancy low. The result is a two orders of magnitude
      reduction of latency with DCTCP - even with a switch with relatively
      little RAM. Switches with larger amounts of RAM will incur increasing
      amounts of latency for CUBIC, but not for DCTCP.
      
      4) Convergence and stability test:
      
      This test measured the time that DCTCP took to fairly redistribute
      bandwidth when a new flow commences. It also measured DCTCP's ability
      to remain stable at a fair bandwidth distribution. DCTCP is compared
      with CUBIC for this test.
      
      At the commencement of this test, a single flow is sending at maximum
      rate (near 10 Gbps) to a single receiver. One second after that first
      flow commences, a new flow from a distinct server begins sending to
      the same receiver as the first flow. After the second flow has sent
      data for 10 seconds, the second flow is terminated. The first flow
      sends for an additional second. Ideally, the bandwidth would be evenly
      shared as soon as the second flow starts, and recover as soon as it
      stops.
      
      The results of this test are shown below. Note that the flow bandwidth
      for the two flows was measured near the same time, but not
      simultaneously.
      
      DCTCP performs nearly perfectly within the measurement limitations
      of this test: bandwidth is quickly distributed fairly between the two
      flows, remains stable throughout the duration of the test, and
      recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
      fairly, and has trouble remaining stable.
      
        CUBIC                      DCTCP
      
        Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
         0       9.93    0          0       9.92    0
         0.5     9.87    0          0.5     9.86    0
         1       8.73    2.25       1       6.46    4.88
         1.5     7.29    2.8        1.5     4.9     4.99
         2       6.96    3.1        2       4.92    4.94
         2.5     6.67    3.34       2.5     4.93    5
         3       6.39    3.57       3       4.92    4.99
         3.5     6.24    3.75       3.5     4.94    4.74
         4       6       3.94       4       5.34    4.71
         4.5     5.88    4.09       4.5     4.99    4.97
         5       5.27    4.98       5       4.83    5.01
         5.5     4.93    5.04       5.5     4.89    4.99
         6       4.9     4.99       6       4.92    5.04
         6.5     4.93    5.1        6.5     4.91    4.97
         7       4.28    5.8        7       4.97    4.97
         7.5     4.62    4.91       7.5     4.99    4.82
         8       5.05    4.45       8       5.16    4.76
         8.5     5.93    4.09       8.5     4.94    4.98
         9       5.73    4.2        9       4.92    5.02
         9.5     5.62    4.32       9.5     4.87    5.03
        10       6.12    3.2       10       4.91    5.01
        10.5     6.91    3.11      10.5     4.87    5.04
        11       8.48    0         11       8.49    4.94
        11.5     9.87    0         11.5     9.9     0
      
      SYN/ACK ECT test:
      
      This test demonstrates the importance of ECT on SYN and SYN-ACK packets
      by measuring the connection probability in the presence of competing
      flows for a DCTCP connection attempt *without* ECT in the SYN packet.
      The test was repeated five times for each number of competing flows.
      
                    Competing Flows  1 |    2 |    4 |    8 |   16
                                     ------------------------------
      Mean Connection Probability    1 | 0.67 | 0.45 | 0.28 |    0
      Median Connection Probability  1 | 0.65 | 0.45 | 0.25 |    0
      
      As the number of competing flows moves beyond 1, the connection
      probability drops rapidly.
      
      Enabling DCTCP with this patch requires the following steps:
      
      DCTCP must be running both on the sender and receiver side in your
      data center, i.e.:
      
        sysctl -w net.ipv4.tcp_congestion_control=dctcp
      
      Also, ECN functionality must be enabled on all switches in your
      data center for DCTCP to work. The default ECN marking threshold (K)
      heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
      1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
      
      In above tests, for each switch port, traffic was segregated into two
      queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
      0x04 - the packet was placed into the DCTCP queue. All other packets
      were placed into the default drop-tail queue. For the DCTCP queue,
      RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
      More details however, we refer you to the paper [2] under section 3).
      
      There are no code changes required to applications running in user
      space. DCTCP has been implemented in full *isolation* of the rest of
      the TCP code as its own congestion control module, so that it can run
      without a need to expose code to the core of the TCP stack, and thus
      nothing changes for non-DCTCP users.
      
      Changes in the CA framework code are minimal, and DCTCP algorithm
      operates on mechanisms that are already available in most Silicon.
      The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
      the paper, but we leave the option that it can be chosen carefully
      to a different value by the user.
      
      In case DCTCP is being used and ECN support on peer site is off,
      DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
      
      ss {-4,-6} -t -i diag interface:
      
        ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
        ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
        send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
        reordering:101 rcv_space:29200
      
        ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
        cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
        325.5Mbps rcv_rtt:1.5 rcv_space:29200
      
      More information about DCTCP can be found in [1-4].
      
        [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
        [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
        [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
        [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3118e83
    • F
      net: tcp: more detailed ACK events and events for CE marked packets · 9890092e
      Florian Westphal 提交于
      DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
      and ACK properties, e.g. ACK that updates window is treated differently
      than DUPACK.
      
      Also DCTCP needs information whether ACK was delayed ACK. Furthermore,
      DCTCP also implements a CE state machine that keeps track of CE markings
      of incoming packets.
      
      Therefore, extend the congestion control framework to provide these
      event types, so that DCTCP can be properly implemented as a normal
      congestion algorithm module outside of the core stack.
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9890092e
    • F
      net: tcp: split ack slow/fast events from cwnd_event · 7354c8c3
      Florian Westphal 提交于
      The congestion control ops "cwnd_event" currently supports
      CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
      Both FAST and SLOW_ACK are only used by Westwood congestion
      control algorithm.
      
      This removes both flags from cwnd_event and adds a new
      in_ack_event callback for this. The goal is to be able to
      provide more detailed information about ACKs, such as whether
      ECE flag was set, or whether the ACK resulted in a window
      update.
      
      It is required for DataCenter TCP (DCTCP) congestion control
      algorithm as it makes a different choice depending on ECE being
      set or not.
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7354c8c3
    • D
      net: tcp: add flag for ca to indicate that ECN is required · 30e502a3
      Daniel Borkmann 提交于
      This patch adds a flag to TCP congestion algorithms that allows
      for requesting to mark IPv4/IPv6 sockets with transport as ECN
      capable, that is, ECT(0), when required by a congestion algorithm.
      
      It is currently used and needed in DataCenter TCP (DCTCP), as it
      requires both peers to assert ECT on all IP packets sent - it
      uses ECN feedback (i.e. CE, Congestion Encountered information)
      from switches inside the data center to derive feedback to the
      end hosts.
      
      Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
      algorithm/behaviour slightly diverges from RFC3168, therefore this
      is only (!) enabled iff the assigned congestion control ops module
      has requested this. By that, we can tightly couple this logic really
      only to the provided congestion control ops.
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30e502a3
    • F
      net: tcp: assign tcp cong_ops when tcp sk is created · 55d8694f
      Florian Westphal 提交于
      Split assignment and initialization from one into two functions.
      
      This is required by followup patches that add Datacenter TCP
      (DCTCP) congestion control algorithm - we need to be able to
      determine if the connection is moderated by DCTCP before the
      3WHS has finished.
      
      As we walk the available congestion control list during the
      assignment, we are always guaranteed to have Reno present as
      it's fixed compiled-in. Therefore, since we're doing the
      early assignment, we don't have a real use for the Reno alias
      tcp_init_congestion_ops anymore and can thus remove it.
      
      Actual usage of the congestion control operations are being
      made after the 3WHS has finished, in some cases however we
      can access get_info() via diag if implemented, therefore we
      need to zero out the private area for those modules.
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55d8694f
    • J
      net: sched: cls_rcvp, complete rcu conversion · 53dfd501
      John Fastabend 提交于
      This completes the cls_rsvp conversion to RCU safe
      copy, update semantics.
      
      As a result all cases of tcf_exts_change occur on
      empty lists now.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53dfd501
    • E
      dql: dql_queued() should write first to reduce bus transactions · 3d9a0d2f
      Eric Dumazet 提交于
      While doing high throughput test on a BQL enabled NIC,
      I found a very high cost in ndo_start_xmit() when accessing BQL data.
      
      It turned out the problem was caused by compiler trying to be
      smart, but involving a bad MESI transaction :
      
        0.05 │  mov    0xc0(%rax),%edi    // LOAD dql->num_queued
        0.48 │  mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
       58.23 │  add    %edx,%edi
        0.58 │  cmp    %edi,0xc4(%rax)
        0.76 │  mov    %edi,0xc0(%rax)    // STORE dql->num_queued += count
        0.72 │  js     bd8
      
      I got an incredible 10 % gain [1] by making sure cpu do not attempt
      to get the cache line in Shared mode, but directly requests for
      ownership.
      
      New code :
      	mov    %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
      	add    %edx,0xc0(%rax)  // RMW   dql->num_queued += count
      	mov    0xc4(%rax),%ecx  // LOAD dql->adj_limit
      	mov    0xc0(%rax),%edx  // LOAD dql->num_queued
      	cmp    %edx,%ecx
      
      The TX completion was running from another cpu, with high interrupts
      rate.
      
      Note that I am using barrier() as a soft hint, as mb() here could be
      too heavy cost.
      
      [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d9a0d2f
    • W
      net_sched: fix another regression in cls_tcindex · 68f6a7c6
      WANG Cong 提交于
      Clearly the following change is not expected:
      
      	-       if (!cp.perfect && !cp.h)
      	-               cp.alloc_hash = cp.hash;
      	+       if (!cp->perfect && cp->h)
      	+               cp->alloc_hash = cp->hash;
      
      Fixes: commit 331b7292 ("net: sched: RCU cls_tcindex")
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68f6a7c6
    • W
      net_sched: fix errno in tcindex_set_parms() · 02c5e844
      WANG Cong 提交于
      When kmemdup() fails, we should return -ENOMEM.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02c5e844
    • D
      Merge branch 'cxgb4-next' · c01035f1
      David S. Miller 提交于
      Hariprasad Shenai says:
      
      ====================
      cxgb4: Use new BAR2 GTS for T5, adds adaptive rx and few Device ID's
      
      This patch series adds support to use new BAR2 GTS for T5 adapter.
      Adds support for adaptive rx. Remove redundant variable from a macro of
      cxgb4vf driver. Adds Device ID for new adapters.
      
      The patches series is created against 'net-next' tree.
      And includes patches on cxgb4 and cxgb4vf driver.
      
      We have included all the maintainers of respective drivers. Kindly review the
      change and let us know in case of any review comments.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c01035f1
    • H
      cxgb4: Add support for adaptive rx · e553ec3f
      Hariprasad Shenai 提交于
      Based on original work by Kumar Sanghvi <kumaras@chelsio.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e553ec3f
    • H
    • H
      cxgb4vf: Remove superfluous "idx" parameter of CH_DEVICE() macro. · b961f9a4
      Hariprasad Shenai 提交于
      Remove redundant idx parameter of CH_DEVICE() macro, its always zero.
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b961f9a4
    • H
      cxgb4: Use BAR2 Going To Sleep (GTS) for T5 and later. · d63a6dcf
      Hariprasad Shenai 提交于
      Use BAR2 GTS for T5. If we are on T4 use the old doorbell mechanism;
      otherwise ue the new BAR2 mechanism. Use BAR2 doorbells for refilling FL's.
      
      Based on original work by Casey Leedom <leedom@chelsio.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d63a6dcf
    • R
      arp: Do not perturb drop profiles with ignored ARP packets · 825bae5d
      Rick Jones 提交于
      We do not wish to disturb dropwatch or perf drop profiles with an ARP
      we will ignore.
      Signed-off-by: NRick Jones <rick.jones2@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      825bae5d
    • W
      net_sched: remove the first parameter from tcf_exts_destroy() · 18d0264f
      WANG Cong 提交于
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <hadi@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18d0264f
    • E
      mlx4: exploit skb->xmit_more to conditionally send doorbell · 5804283d
      Eric Dumazet 提交于
      skb->xmit_more tells us if another skb is coming next.
      
      We need to send doorbell when : xmit_more is not set,
      or txqueue is stopped (preventing next skb to come immediately)
      
      Tested with a modified pktgen version, I got a 40% increase of
      throughput.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5804283d
    • D
      Merge branch 'r8152' · a8404ce5
      David S. Miller 提交于
      Hayes Wang says:
      
      ====================
      r8152: support setting eee by ethtool
      
      Modify some definitions about EEE, and add the support of setting
      the EEE through ethtool.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8404ce5
    • H
      r8152: support ethtool eee · df35d283
      hayeswang 提交于
      Support get_eee() and set_eee() of ethtool_ops.
      Signed-off-by: NHayes Wang <hayeswang@realtek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df35d283
    • H
      r8152: add functions to set EEE · d24f6134
      hayeswang 提交于
      Add functions to enable EEE and set EEE advertisement.
      Signed-off-by: NHayes Wang <hayeswang@realtek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d24f6134
    • H
      r8152: change the EEE definition · 4c4a6b1b
      hayeswang 提交于
      Replace the EEE definitions with the ones which is declared
      in "mdio.h".
      
      Chage some definitions to make them readable.
      Signed-off-by: NHayes Wang <hayeswang@realtek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c4a6b1b
    • D
      Merge branch 'defxx-next' · 18c565eb
      David S. Miller 提交于
      Maciej W. Rozycki says:
      
      ====================
      defxx: DEFEA fixes and updates
      
       I have finally got my hands on an EISA variation of the board (DEC
      FDDIcontroller/EISA aka DEFEA) and was able to do some testing.  Here are
      initial updates to the driver that address problems I encountered so far.
      More to come later on as I get back to the system that I have in a remote
      location -- I need to double-check MMIO support and see what might have
      been causing spurious interrupts I saw with the 8259A PIC the board's
      interrupt line has been routed to.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18c565eb
    • M
      defxx: DEFEA's ESIC port I/O decoding cleanup · b98dfaf2
      Maciej W. Rozycki 提交于
      Use the slot-specific I/O range for decoding accesses to PDQ ASIC
      registers (IOCS0) and the discrete Burst Holdoff register (IOCS1) as per
      the "HD64981F EISA Slave Interface Controller (ESIC)" datasheet.  Use
      disjoint decode ranges now that the assignment of chip selects is known.
      Update the span of the port I/O resource requested accordingly.
      Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b98dfaf2
    • M
      defxx: DEFEA's Burst Holdoff register initialization fix · b1a6d3ec
      Maciej W. Rozycki 提交于
      Use the mask rather than bit number macro to initialize the chip select
      control bit for PDQ register space decoding in the Burst Holdoff register.
      Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1a6d3ec
    • M
      defxx: Correct DEFEA's ESIC port I/O accesses · 8a189f12
      Maciej W. Rozycki 提交于
      Reverse the order of arguments to `outb', data to write comes first.
      Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a189f12
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next · f5c7e1a4
      David S. Miller 提交于
      Steffen Klassert says:
      
      ====================
      pull request (net-next): ipsec-next 2014-09-25
      
      1) Remove useless hash_resize_mutex in xfrm_hash_resize().
         This mutex is used only there, but xfrm_hash_resize()
         can't be called concurrently at all. From Ying Xue.
      
      2) Extend policy hashing to prefixed policies based on
         prefix lenght thresholds. From Christophe Gouault.
      
      3) Make the policy hash table thresholds configurable
         via netlink. From Christophe Gouault.
      
      4) Remove the maximum authentication length for AH.
         This was needed to limit stack usage. We switched
         already to allocate space, so no need to keep the
         limit. From Herbert Xu.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5c7e1a4
    • D
      Merge branch 'dsa_eee' · fe2c5fb1
      David S. Miller 提交于
      Florian Fainelli says:
      
      ====================
      net: dsa: EEE and other PM features
      
      This patch set allows DSA switch drivers to enable/disable/query EEE on a
      per-port level, as well as control precisely which switch ports are
      enable/disabled.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe2c5fb1
    • F
      net: dsa: bcm_sf2: add support for controlling EEE · 450b05c1
      Florian Fainelli 提交于
      When EEE is enabled, negotiate this feature with the PHY and make sure
      that the capability checking, local EEE advertisement, link partner EEE
      advertisement and auto-negotiation resolution returned by phy_init_eee()
      is positive, and enable EEE at the switch level.
      
      While querying the current EEE settings, verify the low-power indication
      and indicate its status.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      450b05c1
    • F
      net: dsa: allow switches driver to implement get/set EEE · 7905288f
      Florian Fainelli 提交于
      Allow switches driver to query and enable/disable EEE on a per-port
      basis by implementing the ethtool_{get,set}_eee settings and delegating
      these operations to the switch driver.
      
      set_eee() will need to coordinate with the PHY driver to make sure that
      EEE is enabled, the link-partner supports it and the auto-negotiation
      result is satisfactory.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7905288f
    • F
      net: dsa: bcm_sf2: add port_enable/disable callbacks · b6d045db
      Florian Fainelli 提交于
      The SF2 switch driver is already architected around per-port
      enable/disable callbacks, so we just need a slight update to our
      existing bcm_sf2_port_setup() resp. bcm_sf2_port_disable() functions to
      be suitable as callbacks for port_enable/port_disable.
      
      We need to shuffle a little the code that does the per-port VLAN
      configuration/isolation since ports can now be brought up/down
      separately, so we need to make sure that IMP (CPU, management) port is
      always included in that specific port setup.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6d045db
    • F
      net: dsa: bcm_sf2: disable RGMII interface(s) when link is down · 7de1557c
      Florian Fainelli 提交于
      When the link is down, disable the RGMII interface to conserve as much
      power as possible. We re-enable the RGMII interface whenever the link is
      detected.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7de1557c
    • F
      net: dsa: allow enabling and disable switch ports · b2f2af21
      Florian Fainelli 提交于
      Whenever a per-port network device is used/unused, invoke the switch
      driver port_enable/port_disable callbacks to allow saving as much power
      as possible by disabling unused parts of the switch (RX/TX logic, memory
      arrays, PHYs...). We supply a PHY device argument to make sure the
      switch driver can act on the PHY device if needed (like putting/taking
      the PHY out of deep low power mode).
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2f2af21
    • F
      net: dsa: start and stop the PHY state machine · f7f1de51
      Florian Fainelli 提交于
      dsa_slave_open() should start the PHY library state machine for its PHY
      interface, and dsa_slave_close() should stop the PHY library state
      machine accordingly.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7f1de51
    • P
      tcp: use tcp_flags in tcp_data_queue() · 155c6e1a
      Peter Pan(潘卫平) 提交于
      This patch is a cleanup which follows the idea in commit e11ecddf (tcp: use
      TCP_SKB_CB(skb)->tcp_flags in input path),
      and it may reduce register pressure since skb->cb[] access is fast,
      bacause skb is probably in a register.
      
      v2: remove variable th
      v3: reword the changelog
      Signed-off-by: NWeiping Pan <panweiping3@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      155c6e1a
    • E
      tcp: change tcp_skb_pcount() location · cd7d8498
      Eric Dumazet 提交于
      Our goal is to access no more than one cache line access per skb in
      a write or receive queue when doing the various walks.
      
      After recent TCP_SKB_CB() reorganizations, it is almost done.
      
      Last part is tcp_skb_pcount() which currently uses
      skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
      3 cache lines in current kernel (skb->head, skb->end, and
      shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
      
      This very simple patch reuses space currently taken by tcp_tw_isn
      only in input path, as tcp_skb_pcount is only needed for skb stored in
      write queue.
      
      This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
      to get SKBTX_ACK_TSTAMP, which seems possible.
      
      This also speeds up all sack processing in general.
      
      This speeds up tcp_sendmsg() because it no longer has to access/dirty
      shinfo.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd7d8498
    • D
      Merge branch 'tcp_skb_cb' · dc83d4d8
      David S. Miller 提交于
      Eric Dumazet says:
      
      ====================
      tcp: better TCP_SKB_CB layout
      
      TCP had the assumption that IPCB and IP6CB are first members of skb->cb[]
      
      This is fine, except that IPCB/IP6CB are used in TCP for a very short time
      in input path.
      
      What really matters for TCP stack is to get skb->next,
      TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq in the same cache line.
      
      skb that are immediately consumed do not care because whole skb->cb[] is
      hot in cpu cache, while skb that sit in wocket write queue or receive queues
      do not need TCP_SKB_CB(skb)->header at all.
      
      This patch set implements the prereq for IPv4, IPv6, and TCP to make this
      possible. This makes TCP more efficient.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc83d4d8
    • E
      tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec
      Eric Dumazet 提交于
      TCP maintains lists of skb in write queue, and in receive queues
      (in order and out of order queues)
      
      Scanning these lists both in input and output path usually requires
      access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq
      
      These fields are currently in two different cache lines, meaning we
      waste lot of memory bandwidth when these queues are big and flows
      have either packet drops or packet reorders.
      
      We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
      this header is not used in fast path. This allows TCP to search much faster
      in the skb lists.
      
      Even with regular flows, we save one cache line miss in fast path.
      
      Thanks to Christoph Paasch for noticing we need to cleanup
      skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
      and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      971f10ec
    • E
      ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted() · a224772d
      Eric Dumazet 提交于
      ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm
      that it needs. Lets not assume this, as TCP stack might use a different
      place.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a224772d
    • E
      ipv4: rename ip_options_echo to __ip_options_echo() · 24a2d43d
      Eric Dumazet 提交于
      ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
      Lets break this assumption, but provide a helper to not change all call points.
      
      ip_send_unicast_reply() gets a new struct ip_options pointer.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24a2d43d