1. 19 Apr, 2017 2 commits
    • X
      sctp: process duplicated strreset in and addstrm in requests correctly · d0f025e6
      Xin Long committed
      This patch is to fix the replay attack issue for strreset and addstrm in
      requests.
      
      When a duplicated strreset in or addstrm in request is received, reply to it
      with bad seqno if its seqno < asoc->strreset_inseq - 2, and reply to it with
      the result saved in asoc if its seqno >= asoc->strreset_inseq - 2.
      
      For strreset in or addstrm in request, if the receiver side processes it
      successfully, a strreset out or addstrm out request(as a response for that
      request) will be sent back to peer. reconf_time will retransmit the out
      request even if it's lost.
      
      So when a duplicated strreset in or addstrm in request is received and its
      result was performed, the receiver shouldn't reply to this request, but drop it instead.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • X
      sctp: process duplicated strreset out and addstrm out requests correctly · e4dc99c7
      Xin Long committed
      Now sctp stream reconf will process a request again even if its seqno is
      less than asoc->strreset_inseq.
      
      If one request has been done successfully and some data chunks have been
      accepted and then a duplicated strreset out request comes, the streamin's
      ssn will be cleared. That stream will then never receive chunks any more
      because its ssn is out of sync, which allows a replay attack.
      
      A similar issue also exists when processing addstrm out requests. It will
      cause extra streams to be added.
      
      This patch is to fix it by saving the last 2 results into asoc. When a
      duplicated strreset out or addstrm out request is received, reply to it with
      bad seqno if its seqno < asoc->strreset_inseq - 2, and reply to it with the
      result saved in asoc if its seqno >= asoc->strreset_inseq - 2.
      
      Note that it saves last 2 results instead of only last 1 result, because
      two requests can be sent together in one chunk.
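
The seqno rule described in this message can be sketched in plain C. This is a minimal illustration under stated assumptions, not the code of the patch: `toy_asoc`, `classify_request`, `seq_lt` and the result names are hypothetical, and rejecting a seqno that is ahead of the expected one is an assumption added for completeness.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch; not the kernel's definitions. */
enum reconf_result {
	RESULT_PERFORMED,
	RESULT_DENIED,
	RESULT_BAD_SEQNO,
	RESULT_PROCESS_NEW	/* not a duplicate: handle the request normally */
};

/* Toy stand-in for the relevant fields of the association. */
struct toy_asoc {
	uint32_t strreset_inseq;		/* next expected request seqno */
	enum reconf_result strreset_result[2];	/* [0] is the most recent result */
};

/* Serial-number comparison: true if a comes before b, modulo 2^32. */
static int seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

enum reconf_result classify_request(const struct toy_asoc *asoc,
				    uint32_t request_seq)
{
	/* Ahead of what we expect (assumption), or older than the two
	 * saved results: answer with bad seqno. */
	if (seq_lt(asoc->strreset_inseq, request_seq) ||
	    seq_lt(request_seq, asoc->strreset_inseq - 2))
		return RESULT_BAD_SEQNO;

	/* Duplicate of one of the last two requests: replay the saved result. */
	if (seq_lt(request_seq, asoc->strreset_inseq))
		return asoc->strreset_result[asoc->strreset_inseq - request_seq - 1];

	return RESULT_PROCESS_NEW;
}
```

Keeping two slots matches the note above: with two requests in one chunk, a retransmitted chunk can carry duplicates of both seqno - 1 and seqno - 2.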
      
      And note that when receiving a duplicated request, the receiver side will
      still reply to it even if the peer has received the response. It's safe, as
      the response will be dropped by the peer.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 18 Apr, 2017 30 commits
    • C
      bonding: deliver link-local packets with skb->dev set to link that packets arrived on · b89f04c6
      Chonggang Li committed
      Bonding driver changes the skb->dev to the bonding-master before
      passing the packet to stack for further processing. This, however,
      does not make sense for link-local packets, as the packet loses "the
      link info" once its skb->dev is changed to bonding-master.  This
      patch changes this behavior for link-local packets by not changing
      the skb->dev to the bonding-master and maintaining it as it is,
      i.e. the link on which the packet arrived.
      Signed-off-by: Chonggang Li <chonggangli@google.com>
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: rtnetlink: plumb extended ack to doit function · c21ef3e3
      David Ahern committed
      Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
      for doit functions that call it directly.
      
      This is the first step to using extended error reporting in rtnetlink.
      From here individual subsystems can be updated to set netlink_ext_ack as
      needed.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: sr: fix BUG due to headroom too small after SRH push · af3b5158
      David Lebrun committed
      When a locally generated packet receives an SRH with two or more segments,
      the remaining headroom is too small to push an ethernet header. This patch
      ensures that the headroom is large enough after SRH push.
      
      The BUG generated the following trace.
      
      [  192.950285] skbuff: skb_under_panic: text:ffffffff81809675 len:198 put:14 head:ffff88006f306400 data:ffff88006f3063fa tail:0xc0 end:0x2c0 dev:A-1
      [  192.952456] ------------[ cut here ]------------
      [  192.953218] kernel BUG at net/core/skbuff.c:105!
      [  192.953411] invalid opcode: 0000 [#1] PREEMPT SMP
      [  192.953411] Modules linked in:
      [  192.953411] CPU: 5 PID: 3433 Comm: ping6 Not tainted 4.11.0-rc3+ #237
      [  192.953411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
      [  192.953411] task: ffff88007c2d42c0 task.stack: ffffc90000ef4000
      [  192.953411] RIP: 0010:skb_panic+0x61/0x70
      [  192.953411] RSP: 0018:ffffc90000ef7900 EFLAGS: 00010286
      [  192.953411] RAX: 0000000000000085 RBX: 00000000000086dd RCX: 0000000000000201
      [  192.953411] RDX: 0000000080000201 RSI: ffffffff81d104c5 RDI: 00000000ffffffff
      [  192.953411] RBP: ffffc90000ef7920 R08: 0000000000000001 R09: 0000000000000000
      [  192.953411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [  192.953411] R13: ffff88007c5a4000 R14: ffff88007b363d80 R15: 00000000000000b8
      [  192.953411] FS:  00007f94b558b700(0000) GS:ffff88007fd40000(0000) knlGS:0000000000000000
      [  192.953411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  192.953411] CR2: 00007fff5ecd5080 CR3: 0000000074141000 CR4: 00000000001406e0
      [  192.953411] Call Trace:
      [  192.953411]  skb_push+0x3b/0x40
      [  192.953411]  eth_header+0x25/0xc0
      [  192.953411]  neigh_resolve_output+0x168/0x230
      [  192.953411]  ? ip6_finish_output2+0x242/0x8f0
      [  192.953411]  ip6_finish_output2+0x242/0x8f0
      [  192.953411]  ? ip6_finish_output2+0x76/0x8f0
      [  192.953411]  ip6_finish_output+0xa8/0x1d0
      [  192.953411]  ip6_output+0x64/0x2d0
      [  192.953411]  ? ip6_output+0x73/0x2d0
      [  192.953411]  ? ip6_dst_check+0xb5/0xc0
      [  192.953411]  ? dst_cache_per_cpu_get.isra.2+0x40/0x80
      [  192.953411]  seg6_output+0xb0/0x220
      [  192.953411]  lwtunnel_output+0xcf/0x210
      [  192.953411]  ? lwtunnel_output+0x59/0x210
      [  192.953411]  ip6_local_out+0x38/0x70
      [  192.953411]  ip6_send_skb+0x2a/0xb0
      [  192.953411]  ip6_push_pending_frames+0x48/0x50
      [  192.953411]  rawv6_sendmsg+0xa39/0xf10
      [  192.953411]  ? __lock_acquire+0x489/0x890
      [  192.953411]  ? __mutex_lock+0x1fc/0x970
      [  192.953411]  ? __lock_acquire+0x489/0x890
      [  192.953411]  ? __mutex_lock+0x1fc/0x970
      [  192.953411]  ? tty_ioctl+0x283/0xec0
      [  192.953411]  inet_sendmsg+0x45/0x1d0
      [  192.953411]  ? _copy_from_user+0x54/0x80
      [  192.953411]  sock_sendmsg+0x33/0x40
      [  192.953411]  SYSC_sendto+0xef/0x170
      [  192.953411]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
      [  192.953411]  ? trace_hardirqs_on_caller+0x12b/0x1b0
      [  192.953411]  ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  192.953411]  SyS_sendto+0x9/0x10
      [  192.953411]  entry_SYSCALL_64_fastpath+0x1f/0xc2
      [  192.953411] RIP: 0033:0x7f94b453db33
      [  192.953411] RSP: 002b:00007fff5ecd0578 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [  192.953411] RAX: ffffffffffffffda RBX: 00007fff5ecd16e0 RCX: 00007f94b453db33
      [  192.953411] RDX: 0000000000000040 RSI: 000055a78352e9c0 RDI: 0000000000000003
      [  192.953411] RBP: 00007fff5ecd1690 R08: 000055a78352c940 R09: 000000000000001c
      [  192.953411] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a783321e10
      [  192.953411] R13: 000055a7839890c0 R14: 0000000000000004 R15: 0000000000000000
      [  192.953411] Code: 00 00 48 89 44 24 10 8b 87 c4 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 90 58 d2 81 48 89 04 24 31 c0 e8 4f 70 9a ff <0f> 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 97 d8 00 00
      [  192.953411] RIP: skb_panic+0x61/0x70 RSP: ffffc90000ef7900
      [  193.000186] ---[ end trace bd0b89fabdf2f92c ]---
      [  193.000951] Kernel panic - not syncing: Fatal exception in interrupt
      [  193.001137] Kernel Offset: disabled
      [  193.001169] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
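
The first line of the trace already shows how much headroom was missing. A minimal sketch of that arithmetic, using only the `head`, `data` and `put` values printed by `skb_under_panic`; the helper names are hypothetical and this is not kernel code:

```c
#include <stdint.h>

/* Values copied from the skb_under_panic line of the trace. */
#define TRACE_HEAD 0xffff88006f306400ULL	/* start of the buffer */
#define TRACE_DATA 0xffff88006f3063faULL	/* data pointer after the failed push */
#define TRACE_PUT  14				/* bytes pushed: one Ethernet header */

/* Headroom that was available before the push (data - head at that time). */
int64_t headroom_before_push(uint64_t head, uint64_t data_after, unsigned int put)
{
	return (int64_t)(data_after + put - head);
}

/* How many bytes the push overran the start of the buffer (0 if it fit). */
int64_t headroom_shortfall(uint64_t head, uint64_t data_after)
{
	return data_after < head ? (int64_t)(head - data_after) : 0;
}
```

With the trace values, `headroom_before_push()` gives 8 and `headroom_shortfall()` gives 6: only 8 bytes remained after the SRH push, while the Ethernet header needs 14.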
      
      Fixes: 19d5a26f ("ipv6: sr: expand skb head only if necessary")
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      gso: Validate assumption of frag_list segementation · 7a7a9bd7
      Ilan Tayari committed
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it is not true.
      
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: Ilan Tayari <ilant@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • X
      sctp: get list_of_streams of strreset outreq earlier · edb12f2d
      Xin Long committed
      Now when processing strreset out responses, it gets outreq->list_of_streams
      only when result is performed. But if result is not performed, str_p will
      be NULL. It will cause panic in sctp_ulpevent_make_stream_reset_event if
      nums is not 0.
      
      This patch is to fix it by getting outreq->list_of_streams earlier, and
      also to improve some code for the strreset inreq process.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      Add uid and cookie bpf helper to cg_skb_func_proto · 9fd0f315
      Chenbo Feng committed
      BPF helper functions get_socket_cookie and get_socket_uid can be
      used for network traffic classifications, among others. Expose
      them also to programs of type BPF_PROG_TYPE_CGROUP_SKB. As of
      commit 8f917bba ("bpf: pass sk to helper functions") the
      required skb->sk function is available at both cgroup bpf ingress
      and egress hooks. With these two new helpers, cg_skb_func_proto is
      effectively the same as sk_filter_func_proto.
      
      Change since V1:
      Instead of adding the helpers to cg_skb_func_proto, redirect
      cg_skb_func_proto to sk_filter_func_proto since all helper functions
      in sk_filter_func_proto are applicable to cg_skb_func_proto now.
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      hv_netvsc: change netvsc device default duplex to FULL · f3c9d40e
      Simon Xiao committed
      The netvsc device supports full duplex by default.
      This fixes warnings in the log from the bonding device, which did not like
      seeing UNKNOWN duplex.
      Signed-off-by: Simon Xiao <sixiao@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      netvsc: fix RCU warning in get_stats · 776e726b
      stephen hemminger committed
      The statistics function is called with RTNL held during probe
      but with RCU held during access from /proc and elsewhere.
      This is safe, so update the lockdep annotation.
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: phy: test the right variable in phy_write_mmd() · 1dbba4cb
      Dan Carpenter committed
      This is a copy and paste buglet.  We meant to test for ->write_mmd but
      we test for ->read_mmd.
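
A minimal sketch of this class of bug, assuming a toy driver structure with optional hooks. All names here (`toy_phy_driver`, `toy_write_mmd`, the sample hooks and return codes) are hypothetical and are not the phylib definitions:

```c
#include <stddef.h>

#define TOY_GENERIC 0	/* returned by the generic fallback path */
#define TOY_DRIVER  1	/* returned by the sample driver hook */

/* Toy driver with optional MMD accessors; not the kernel's struct phy_driver. */
struct toy_phy_driver {
	int (*read_mmd)(int devad, int regnum);
	int (*write_mmd)(int devad, int regnum, int val);
};

/* Sample hooks used to exercise the dispatcher. */
int toy_sample_read(int devad, int regnum)
{
	(void)devad;
	return regnum;
}

int toy_sample_write(int devad, int regnum, int val)
{
	(void)devad; (void)regnum; (void)val;
	return TOY_DRIVER;
}

/* Hypothetical generic fallback used when the driver has no write hook. */
static int toy_generic_write(int devad, int regnum, int val)
{
	(void)devad; (void)regnum; (void)val;
	return TOY_GENERIC;
}

int toy_write_mmd(const struct toy_phy_driver *drv, int devad, int regnum, int val)
{
	/* Test the hook we are about to call. Testing read_mmd here would
	 * call a NULL write_mmd on a driver that implements only reads. */
	if (drv->write_mmd)
		return drv->write_mmd(devad, regnum, val);
	return toy_generic_write(devad, regnum, val);
}
```

A driver that sets only `read_mmd` takes the generic path; with the wrong test it would dereference a NULL function pointer.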
      
      Fixes: 1ee6b9bc ("net: phy: make phy_(read|write)_mmd() generic MMD accessors")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'for-upstream' of... · 450cc8cc
      David S. Miller committed
      Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
      
      Johan Hedberg says:
      
      ====================
      pull request: bluetooth-next 2017-04-14
      
      Here's the main batch of Bluetooth & 802.15.4 patches for the 4.12
      kernel.
      
       - Many fixes to 6LoWPAN, in particular for BLE
       - New CA8210 IEEE 802.15.4 device driver (accounting for most of the
         lines of code added in this pull request)
       - Added Nokia Bluetooth (UART) HCI driver
       - Some serdev & TTY changes that are dependencies for the Nokia
         driver (with acks from relevant maintainers and an agreement that
         these come through the bluetooth tree)
       - Support for new Intel Bluetooth device
       - Various other minor cleanups/fixes here and there
      
      Please let me know if there are any issues pulling. Thanks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bpf-lru-perf' · d584fec6
      David S. Miller committed
      Martin KaFai Lau says:
      
      ====================
      bpf: LRU performance and test-program improvements
      
      The first 4 patches make a few improvements to the LRU tests.
      
      Patch 5/6 is to improve the performance of BPF_F_NO_COMMON_LRU map.
      
      Patch 6/6 adds an example in using LRU map with map-in-map.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: lru: Add map-in-map LRU example · 3a5795b8
      Martin KaFai Lau committed
      This patch adds a map-in-map LRU example.
      If we know only a subset of cores will use the
      LRU, we can allocate a common LRU list per targeting core
      and store it into an array-of-hashs.
      
      It allows using the common LRU map with map-update performance
      comparable to the BPF_F_NO_COMMON_LRU map but without wasting memory
      on the unused cores that we know will never access the LRU map.
      
      BPF_F_NO_COMMON_LRU:
      > map_perf_test 32 8 10000000 10000000 | awk '{sum += $3}END{print sum}'
      9234314 (9.23M/s)
      
      map-in-map LRU:
      > map_perf_test 512 8 1260000 80000000 | awk '{sum += $3}END{print sum}'
      9962743 (9.96M/s)
      
      Note that the max_entries for the map-in-map LRU test is 1260000 which
      is the max_entries for each inner LRU map.  8 processes have been
      started, so 8 * 1260000 = 10080000 (~10M) which is close to what is
      used in the BPF_F_NO_COMMON_LRU test.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4 · 695ba265
      Martin KaFai Lau committed
      After doing map_perf_test with a much bigger
      BPF_F_NO_COMMON_LRU map, the perf report shows a
      lot of time spent in rotating the inactive list (i.e.
      __bpf_lru_list_rotate_inactive):
      > map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
      19644783 (19M/s)
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      6283930 (6.28M/s)
      
      By inactive, it usually means the element is not in cache.  Hence,
      there is a need to tune the PERCPU_NR_SCANS value.
      
      This patch finds a better number of elements to
      scan during each list rotation.  The PERCPU_NR_SCANS (which
      is defined the same as PERCPU_FREE_TARGET) decreases
      from 16 elements to 4 elements.  This change only
      affects the BPF_F_NO_COMMON_LRU map.
      
      The test_lru_dist does not show meaningful difference
      between 16 and 4.  Our production L4 load balancer which uses
      the LRU map for conntrack-ing also shows little change in cache
      hit rate.  Since both benchmark and production data show no
      cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
      We can consider making it configurable if we find a usecase
      later that shows another value works better and/or use
      a different rotation strategy.
      
      After this change:
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      9240324 (9.2M/s)
      
      i.e. 6.28M/s -> 9.2M/s
      
      The test_lru_dist has not shown meaningful difference:
      > test_lru_dist zipf.100k.a1_01.out 4000 1:
      nr_misses: 31575 (Before) vs 31566 (After)
      
      > test_lru_dist zipf.100k.a0_01.out 40000 1
      nr_misses: 67036 (Before) vs 67031 (After)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: Allow bpf sample programs (*_user.c) to change bpf_map_def · 9fd63d05
      Martin KaFai Lau committed
      The current bpf_map_def is statically defined during compile
      time.  This patch allows the *_user.c program to change it during
      runtime.  It is done by adding load_bpf_file_fixup_map() which
      takes a callback.  The callback will be called before creating
      each map so that it has a chance to modify the bpf_map_def.
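
The callback-before-create pattern can be sketched as follows. This is a self-contained toy under stated assumptions, not the sample loader: `toy_map_def`, `toy_fixup_cb`, `toy_load_maps` and `toy_resize_lru` are hypothetical names, and the real callback signature may differ.

```c
#include <stddef.h>
#include <string.h>

/* Toy map definition; the real bpf_map_def has more fields. */
struct toy_map_def {
	const char *name;
	unsigned int max_entries;
};

/* Callback type: may edit the definition before the map is created. */
typedef void (*toy_fixup_cb)(struct toy_map_def *def, int idx);

/* Stand-in for the loader: runs the callback on each map, then "creates" it.
 * Returns the total number of entries across all created maps. */
unsigned long toy_load_maps(struct toy_map_def *defs, size_t n, toy_fixup_cb fixup)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++) {
		if (fixup)
			fixup(&defs[i], (int)i);
		total += defs[i].max_entries;	/* map creation would happen here */
	}
	return total;
}

/* Value a *_user.c program might take from a cmdline arg. */
static unsigned int requested_max_entries = 10000000;

/* Example callback: resize only the maps whose name contains "lru". */
void toy_resize_lru(struct toy_map_def *def, int idx)
{
	(void)idx;
	if (strstr(def->name, "lru"))
		def->max_entries = requested_max_entries;
}
```

Passing `NULL` as the callback keeps the compile-time definitions, so callers that do not need runtime changes are unaffected.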
      
      The current usecase is to change max_entries in map_perf_test.
      It is interesting to test with a much bigger map size in
      some cases (e.g. the following patch on bpf_lru_map.c).
      However, it is hard to find one size to fit all testing
      environments.  Hence, it is handy to take the max_entries
      as a cmdline arg and then configure the bpf_map_def during
      runtime.
      
      This patch adds two cmdline args.  One is to configure
      the map's max_entries.  Another is to configure the max_cnt
      which controls how many times a syscall is called.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: lru: Refactor LRU map tests in map_perf_test · bf8db5d2
      Martin KaFai Lau committed
      One more LRU test will be added later in this patch series.
      In this patch, we first move all existing LRU map tests into
      a single syscall (connect) so that the future new
      LRU test can be added without hunting for another syscall.
      
      One of the map names is also changed from percpu_lru_hash_map
      to nocommon_lru_hash_map to avoid confusion with percpu_hash_map.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: lru: Cleanup test_lru_map.c · 6467acbc
      Martin KaFai Lau committed
      This patch does the following cleanup on test_lru_map.c
      1) Fix indentation (Replace spaces by tabs)
      2) Remove redundant BPF_F_NO_COMMON_LRU test
      3) Simplify some comments
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: lru: Add test_lru_sanity6 for BPF_F_NO_COMMON_LRU · 9746f856
      Martin KaFai Lau committed
      test_lru_sanity3 is not applicable to BPF_F_NO_COMMON_LRU.
      It just happens to work when PERCPU_FREE_TARGET == 16.
      
      This patch:
      1) Disable test_lru_sanity3 for BPF_F_NO_COMMON_LRU
      2) Add test_lru_sanity6 to test list rotation for
         the BPF_F_NO_COMMON_LRU map.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net: mvneta: fix failed to suspend if WOL is enabled · 82960fff
      Jisheng Zhang committed
      Recently, suspend/resume and WOL support were added to the mvneta driver.
      If we enable WOL, we get errors like the following on Marvell BG4CT
      platforms during suspend:
      
      [  184.149723] dpm_run_callback(): mdio_bus_suspend+0x0/0x50 returns -16
      [  184.149727] PM: Device f7b62004.mdio-mi:00 failed to suspend: error -16
      
      -16 means -EBUSY, phy_suspend() will return -EBUSY if it finds the
      device has WOL enabled.
      
      We fix this issue by properly setting the netdev's power.can_wakeup
      and power.wakeup, i.e.
      
      1. in mvneta_mdio_probe(), call device_set_wakeup_capable() to set
      power.can_wakeup if the phy supports WOL.
      
      2. in mvneta_ethtool_set_wol(), call device_set_wakeup_enable() to
      set power.wakeup if WOL has been successfully enabled in phy.
      Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      net: bridge: notify on hw fdb takeover · cab93af0
      Nikolay Aleksandrov committed
      Recently we added support for SW fdbs to take over HW ones, but that
      results in changing a user-visible fdb flag, so we need to send a
      notification. This is also consistent with how HW takes over SW entries.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      kcm: remove a useless copy_from_user() · f5001cea
      WANG Cong committed
      struct kcm_clone only contains fd, and kcm_clone() only
      writes this struct, so there is no need to copy it from user.
      
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      MAINTAINERS: rename TC entry and add couple of header files · 6b2af241
      Jiri Pirko committed
      The section is not specific only to "TC classifiers", but applies to the
      whole TC subsystem. Also, add a couple of forgotten headers.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • R
      net: phy: simplify phy_supported_speeds() · 786df9c2
      Russell King committed
      Simplify the loop in phy_supported_speeds().
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • R
      net: phy: improve phylib correctness for non-autoneg settings · d0613037
      Russell King committed
      phylib has some undesirable behaviour when forcing a link mode through
      ethtool.  phylib uses this code:
      
      	idx = phy_find_valid(phy_find_setting(phydev->speed, phydev->duplex),
      			features);
      
      to find an index in the settings table.  phy_find_setting() starts at
      index 0, and scans upwards looking for an exact speed and duplex match.
      When it doesn't find it, it returns MAX_NUM_SETTINGS - 1, which is
      10baseT-Half duplex.
      
      phy_find_valid() then scans from that point (and effectively only checks
      one entry) before bailing out, returning MAX_NUM_SETTINGS - 1.
      
      phy_sanitize_settings() then sets ->speed to SPEED_10 and ->duplex to
      DUPLEX_HALF whether or not 10baseT-Half is supported.  This goes
      against all the comments against these functions, and 10baseT-Half may
      not even be supported by the hardware.
      
      Rework these functions, introducing a new method of scanning the table.
      There are two modes of lookup that phylib wants: exact, and inexact.
      
      - in exact mode, we return either an exact match or failure
      - in inexact mode, we return an exact match if it exists, a match at
        the highest speed that is not greater than the requested speed
        (ignoring duplex), or failing that, the lowest supported speed, or
        failure.
      
      The biggest difference is that we always check whether the entry is
      supported before further consideration, so all unsupported entries are
      not considered as candidates.
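
The two lookup modes can be sketched over a small table. This is an illustration only, assuming a table sorted from highest to lowest speed and a plain bitmask of supported modes; `toy_setting`, `toy_settings` and `toy_lookup` are hypothetical names, not the phylib code.

```c
#include <stddef.h>

#define DUPLEX_HALF 0
#define DUPLEX_FULL 1

struct toy_setting {
	int speed;		/* Mbit/s */
	int duplex;
	unsigned int bit;	/* position in the supported-modes mask */
};

/* Hypothetical table, sorted from highest to lowest speed. */
static const struct toy_setting toy_settings[] = {
	{ 1000, DUPLEX_FULL, 5 },
	{ 1000, DUPLEX_HALF, 4 },
	{  100, DUPLEX_FULL, 3 },
	{  100, DUPLEX_HALF, 2 },
	{   10, DUPLEX_FULL, 1 },
	{   10, DUPLEX_HALF, 0 },
};

const struct toy_setting *toy_lookup(int speed, int duplex,
				     unsigned int supported, int exact)
{
	const struct toy_setting *match = NULL, *last = NULL;
	size_t n = sizeof(toy_settings) / sizeof(toy_settings[0]);

	for (size_t i = 0; i < n; i++) {
		const struct toy_setting *p = &toy_settings[i];

		if (!(supported & (1u << p->bit)))
			continue;	/* unsupported entries are never candidates */
		last = p;		/* ends up as the lowest supported speed seen */
		if (p->speed == speed && p->duplex == duplex)
			return p;	/* exact match */
		if (exact)
			continue;
		if (!match && p->speed <= speed)
			match = p;	/* highest speed not above the request */
		if (p->speed < speed)
			break;		/* no exact match can follow */
	}
	if (exact)
		return NULL;		/* exact mode: match or failure */
	return match ? match : last;	/* inexact: fall back to lowest supported */
}
```

For a PHY that supports only 100/Full and 1000/Full, an inexact request for 10/Half resolves to 100/Full, the lowest supported speed, rather than to an unsupported 10baseT-Half entry.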
      
      This results in arguably saner behaviour, better matches the comments,
      and is probably what users would expect.
      
      This becomes important as ethernet speeds increase, PHYs exist which do
      not support the 10Mbit speeds, and half-duplex is likely to become
      obsolete - it's already not even an option on 10Gbit and faster links.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      Subject: net: allow configuring default qdisc · 8ea3e439
      stephen hemminger committed
      Since 3.12 it has been possible to configure the default queuing
      discipline via sysctl. This patch adds the ability to configure the
      default queue discipline in kernel configuration. This is useful for
      environments where configuring the value from userspace is difficult
      to manage.
      
      The default is still the same as before (pfifo_fast) and it is
      possible to change after kernel init with sysctl. This is similar
      to how TCP congestion control works.
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'qed-arfs' · 7ca95118
      David S. Miller committed
      Manish Chopra says:
      
      ====================
      qed/qede: aRFS support
      
      This series adds support for Accelerated Flow Steering
      in qede driver for TCP/UDP over IPv4/IPv6 protocols.
      
      Please consider applying this series to "net-next"
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      qede: Add aRFS support · e4917d46
      Chopra, Manish committed
      This patch adds support for aRFS for TCP and UDP
      protocols with IPv4/IPv6.
      Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
      Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      qed: aRFS infrastructure support · d51e4af5
      Chopra, Manish committed
      This patch adds the necessary APIs to interface with
      qede aRFS support in the subsequent patch.
      
      It also reserves a separate PTT entry for aRFS hardware access,
      since aRFS runs in the fastpath flow, instead of
      trying to acquire one at run time from the PTT pool.
      Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
      Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      smsc95xx: Add comments to the registers definition · 53a759c8
      Martin Wetterwald committed
      This chip is used by a lot of embedded devices and also by the Raspberry
      Pi 1, 2 & 3 which were created to promote the study of computer
      sciences. Students wanting to learn kernel / network device driver
      programming through those devices can only rely on the Linux kernel
      driver source to make their own.
      
      This commit adds a lot of comments to the registers definition to expand
      the register names.
      
      Cc: Steve Glendinning <steve.glendinning@shawell.net>
      Cc: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
      CC: David Miller <davem@davemloft.net>
      Signed-off-by: Martin Wetterwald <martin@wetterwald.eu>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Acked-by: Steve Glendinning <steve.glendinning@shawell.net>
      Acked-by: Woojung Huh <Woojung.Huh@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • R
      l2tp: device MTU setup, tunnel socket needs a lock · 57240d00
      R. Parameswaran committed
      The MTU overhead calculation in L2TP device set-up
      merged via commit b784e7eb
      needs to be adjusted to lock the tunnel socket while
      referencing the sub-data structures to derive the
      socket's IP overhead.
      Reported-by: Guillaume Nault <g.nault@alphalink.fr>
      Tested-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: R. Parameswaran <rparames@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: ipv6: send unsolicited NA on admin up · 4a6e3c5d
      David Ahern committed
      ndisc_notify is the ipv6 equivalent to arp_notify. When arp_notify is
      set to 1, gratuitous arp requests are sent when the device is brought up.
      The same is expected when ndisc_notify is set to 1 (per ndisc_notify in
      Documentation/networking/ip-sysctl.txt). The NA is not sent on NETDEV_UP
      event; add it.
      
      Fixes: 5cb04436 ("ipv6: add knob to send unsolicited ND on link-layer address change")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 Apr, 2017 8 commits
    • D
      Merge branch 'mlx5-RDMA-netdevice' · 70d40b36
      David S. Miller committed
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 RDMA net device support
      
      This series provides the lower level mlx5 support of RDMA netdevice
      creation API [1] suggested and introduced by Intel's HFI OPA VNIC
      netdevice driver [2], to enable IPoIB mlx5 RDMA netdevice creation.
      
      mlx5 IPoIB RDMA netdev will serve as an acceleration netdevice for the current
      IPoIB ULP generic netdevice, providing:
      	- mlx5 RSS support.
      	- mlx5 HW RX,TX offloads (checksum, TSO, LRO, etc ..).
      	- Full mlx5 HW features transparent to the ULP itself.
      
      The idea here is to reuse and benefit from the already implemented mlx5e netdevice
      management and channels API for both Ethernet and RDMA netdevices. Since both IPoIB
      and Ethernet netdevices share the same common mlx5 HW resources (with some small
      exceptions) and share most of the control/data path logic, it is more natural to
      have them share the same code.
      
      The differences between IPoIB and Ethernet netdevices can be summarized to:
      
      Steering:
      In mlx5, IPoIB traffic is sent and received from an underlay special QP, and in Ethernet
      the traffic is handled by vports and vport steering is managed by e-switch or FW.
      
      For IPoIB traffic to get steered correctly the only thing we need to do is to create RSS
      HW contexts for RX and TX HW contexts for TX (similar to mlx5e) with the underlay QP attached to
      them (underlay QP will be 0 in case of Ethernet).
      
      RX,TX:
      Since IPoIB traffic is different, slightly modified RX and TX handlers are required,
      but we still reuse some data path code via common helper functions.
      
      All of the other generic netdevice and mlx5 aspects will be shared between mlx5 Ethernet
      and IPoIB netdevices, e.g.
      	- Channels creation and handling (RQs, SQs, CQs, NAPI, interrupt moderation, etc.)
      	- Offloads, checksum, GRO, LRO, TSO, and more.
      	- netdevice logic and non-Ethernet-specific ndos (open/close, etc.)
      
      In order to achieve what we want:
      
      In patches 1 to 3, Erez added support for the underlay QP in mlx5_ifc, refactored
      the mlx5 steering code to accept the underlay QP as a parameter for creating steering
      objects, and enabled flow steering for IB link.
      
      Then we are going to use the mlx5e netdevice profile, which is already used to separate
      NIC and VF representor netdevices, to create a new type of IPoIB netdevice profile.
      
      For that, one small refactoring is required to make mlx5e netdevice profile management
      more generic and agnostic to link type, which is done in patch #4.
      
      In patch #5, we introduce ipoib.c to host all of the mlx5 IPoIB (mlx5i) specific logic and a
      skeleton for the IPoIB mlx5 netdevice profile, and we will start filling it in the next patches,
      using the existing mlx5e APIs.
      
      Patch #6 and #7, Implement init/cleanup RX mlx5i netdev profile handlers to create mlx5 RSS
      resources, same as mlx5e but without vlan and L2 steering tables.
      
      Patch #8, Implement init/cleanup TX mlx5i netdev profile handlers, to create TX resources
      same as mlx5e but with one TC (tc = 0) support.
      
      Patch #9, Implement mlx5i open/close ndos, where we reuse the mlx5e channels API to start/stop TX/RX channels.
      
      Patch #10, Create the underlay QP and attach it to mlx5i RSS and TX HW contexts.
      
      Patch #11 and #12, Break down the mlx5e xmit flow into smaller helper functions and implement the
      mlx5i IPoIB xmit routine.
      
      Patch #13 and #14, Have an RX handler per netdevice profile. Before this series we already did this
      in an unclean way to separate NIC netdev and VF representor RX handlers. In patch 13 we make
      the RX handler generic and bound to a profile, and in patch 14 we implement the IPoIB RX handlers.
      
      Patch #15, Small cleanup to avoid e-switch with IPoIB netdev.
      
      In order to enable mlx5 IPoIB, this series must be merged with the IPoIB RDMA netdev offload support [3],
      which was already submitted to the rdma mailing list, plus an extra small patch [4] that
      connects both sides and actually enables the offload.
      
      Once both patch sets are merged into Linux, we will have to submit the extra small patch [4] to enable
      the feature.
      
      Thanks,
      Saeed.
      
      [1] https://patchwork.kernel.org/patch/9676637/
      
      [2] https://lwn.net/Articles/715453/
          https://patchwork.kernel.org/patch/9587815/
      
      [3] https://patchwork.kernel.org/patch/9672069/
      [4] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?id=0141db6a686e32294dee015b7d07706162ba48d8
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70d40b36
    • E
      hw/mlx5: Add New bit to check over QP creation · 93d576af
      Erez Shitrit 提交于
      Add a check for the IB_QP_CREATE_NETIF_QP bit while creating a QP.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93d576af
    • S
      net/mlx5e: E-switch vport manager is valid for ethernet only · 955bc480
      Saeed Mahameed 提交于
      Currently the driver supports only the Ethernet eswitch, and we want to
      protect the downstream IPoIB netdev from trying to access it in IB link.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      955bc480
    • S
      net/mlx5e: IPoIB, RX handler · 9d6bd752
      Saeed Mahameed 提交于
      Implement IPoIB RX SKB handler.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d6bd752
    • S
      net/mlx5e: RX handlers per netdev profile · 20fd0c19
      Saeed Mahameed 提交于
      In order to have a different RX handler per profile, fix and refactor the
      current code to take the RX handler directly from the netdevice profile
      rather than computing it at runtime, as was done with the switchdev
      mode representor RX handler.
      
      This will also remove the current wrong assumption in the mlx5e_alloc_rq
      code that mlx5e_priv->ppriv is of the type vport_rep.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20fd0c19
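      The idea in this commit, taking the RX handler from the profile instead of
      computing it at runtime, can be sketched in a few lines of C. This is only
      an illustration: every name below (rxdemo_profile, rxdemo_rq,
      rxdemo_alloc_rq, and the two handlers) is hypothetical and is not the real
      mlx5e definition.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real mlx5e structures. */
struct rxdemo_rq;
typedef int (*rxdemo_rx_handler)(struct rxdemo_rq *rq);

struct rxdemo_profile {
    const char *name;
    rxdemo_rx_handler handle_rx;   /* fixed per profile, not derived at runtime */
};

struct rxdemo_rq {
    rxdemo_rx_handler handle_rx;   /* copied from the profile at allocation */
    int last_handler_id;           /* records which handler ran (demo only) */
};

static int rxdemo_nic_rx(struct rxdemo_rq *rq)
{
    rq->last_handler_id = 1;
    return 0;
}

static int rxdemo_ipoib_rx(struct rxdemo_rq *rq)
{
    rq->last_handler_id = 2;
    return 0;
}

static const struct rxdemo_profile rxdemo_nic_profile   = { "nic",   rxdemo_nic_rx };
static const struct rxdemo_profile rxdemo_ipoib_profile = { "ipoib", rxdemo_ipoib_rx };

/* Bind the RQ to the handler named by its profile; fail if there is none. */
static int rxdemo_alloc_rq(struct rxdemo_rq *rq, const struct rxdemo_profile *p)
{
    if (p == NULL || p->handle_rx == NULL)
        return -1;
    rq->handle_rx = p->handle_rx;
    rq->last_handler_id = 0;
    return 0;
}
```

      With this shape the allocation code needs no knowledge of what kind of
      private data a netdevice carries; it only copies a function pointer.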
    • S
      net/mlx5e: IPoIB, Xmit flow · 25854544
      Saeed Mahameed 提交于
      Implement mlx5e's IPoIB SKB transmit using the helper functions provided
      by the mlx5e Ethernet TX flow. The only difference in the code between
      mlx5e_xmit and mlx5i_xmit is that IPoIB has some extra fields to fill
      (UD datagram segment) in the TX descriptor (WQE) and it doesn't need
      any VLAN handling.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25854544
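      A rough sketch of the split described in this commit: both transmit paths
      call a shared helper, the Ethernet path adds VLAN handling, and the IPoIB
      path fills an extra datagram segment instead. The descriptor layout and
      all names here (txdemo_wqe, txdemo_fill_common, the remote_qpn field) are
      assumptions made for illustration and do not mirror the real WQE format
      or the mlx5e/mlx5i functions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical TX descriptor; the real WQE layout is different. */
struct txdemo_wqe {
    uint16_t len;               /* common: payload length */
    uint8_t  has_vlan;          /* Ethernet only */
    uint8_t  has_datagram_seg;  /* IPoIB only */
    uint32_t remote_qpn;        /* assumed example of an extra IPoIB field */
};

/* Shared helper: the part both transmit paths have in common. */
static void txdemo_fill_common(struct txdemo_wqe *wqe, uint16_t len)
{
    memset(wqe, 0, sizeof(*wqe));
    wqe->len = len;
}

/* Ethernet path: common fields plus VLAN handling. */
static void txdemo_eth_xmit(struct txdemo_wqe *wqe, uint16_t len, int vlan_tagged)
{
    txdemo_fill_common(wqe, len);
    wqe->has_vlan = vlan_tagged ? 1 : 0;
}

/* IPoIB path: common fields plus the datagram segment, no VLAN handling. */
static void txdemo_ipoib_xmit(struct txdemo_wqe *wqe, uint16_t len, uint32_t remote_qpn)
{
    txdemo_fill_common(wqe, len);
    wqe->has_datagram_seg = 1;
    wqe->remote_qpn = remote_qpn;
}
```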
    • S
      net/mlx5e: Xmit flow break down · 77bdf895
      Saeed Mahameed 提交于
      Break current mlx5e xmit flow into smaller blocks (helper functions)
      in order to reuse them for IPoIB SKB transmission.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77bdf895
    • S
      net/mlx5e: IPoIB, Underlay QP · ec8fd927
      Saeed Mahameed 提交于
      Create the IPoIB underlay QP that the IPoIB netdevice profile needs so that
      the RSS and TX HW contexts can operate on IPoIB traffic.
      
      Reset the underlay QP on the dev_uninit ndo to stop IPoIB traffic going
      through this QP when the ULP IPoIB decides to clean up.
      
      Implement attach/detach mcast RDMA netdev callbacks for later RDMA
      netdev use.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec8fd927