1. 04 6月, 2014 1 次提交
  2. 03 6月, 2014 4 次提交
    • B
      ethtool: Check that reserved fields of struct ethtool_rxfh are 0 · f062a384
      Ben Hutchings 提交于
      We should fail rather than silently ignoring use of these extensions.
      Signed-off-by: NBen Hutchings <ben@decadent.org.uk>
      f062a384
    • B
      ethtool: Replace ethtool_ops::{get,set}_rxfh_indir() with {get,set}_rxfh() · fe62d001
      Ben Hutchings 提交于
      ETHTOOL_{G,S}RXFHINDIR and ETHTOOL_{G,S}RSSH should work for drivers
      regardless of whether they expose the hash key, unless you try to
      set a hash key for a driver that doesn't expose it.
      Signed-off-by: NBen Hutchings <ben@decadent.org.uk>
      Acked-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      fe62d001
    • E
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet 提交于
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f156a6
    • A
      net: Add support for device specific address syncing · 670e5b8e
      Alexander Duyck 提交于
      This change provides a function to be used in order to break the
      ndo_set_rx_mode call into a set of address add and remove calls.  The code
      is based on the implementation of dev_uc_sync/dev_mc_sync.  Since they
      essentially do the same thing but with only one dev I simply named my
      functions __dev_uc_sync/__dev_mc_sync.
      
      I also implemented an unsync version of the functions as well to allow for
      cleanup on close.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      670e5b8e
  3. 02 6月, 2014 2 次提交
    • D
      net: filter: improve filter block macros · f8f6d679
      Daniel Borkmann 提交于
      Commit 9739eef1 ("net: filter: make BPF conversion more readable")
      started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
      macros from classic BPF.
      
      However, quite some statements in the filter conversion functions
      remained in the old style which gives a mixture of block macros and
      non block macros in the code. This patch makes the block macros itself
      more readable by using explicit member initialization, and converts
      the remaining ones where possible to remain in a more consistent state.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8f6d679
    • D
      net: filter: get rid of BPF_S_* enum · 34805931
      Daniel Borkmann 提交于
      This patch finally allows us to get rid of the BPF_S_* enum.
      Currently, the code performs unnecessary encode and decode
      workarounds in seccomp and filter migration itself when a filter
      is being attached in order to overcome BPF_S_* encoding which
      is not used anymore by the new interpreter resp. JIT compilers.
      
      Keeping it around would mean that also in future we would need
      to extend and maintain this enum and related encoders/decoders.
      We can get rid of all that and save us these operations during
      filter attaching. Naturally, also JIT compilers need to be updated
      by this.
      
      Before JIT conversion is being done, each compiler checks if A
      is being loaded at startup to obtain information if it needs to
      emit instructions to clear A first. Since BPF extensions are a
      subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
      for extensions can be removed at that point. To ease and minimalize
      code changes in the classic JITs, we have introduced bpf_anc_helper().
      
      Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
      arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
      unfortunately didn't have access, but changes are analogous to
      the rest.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: NChema Gonzalez <chemag@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34805931
  4. 31 5月, 2014 1 次提交
    • S
      net: tso: Export symbols for modular build · 484611e5
      Sachin Kamat 提交于
      Export the symbols to fix the below errors when built as modules:
      ERROR: "tso_build_data" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
      ERROR: "tso_build_hdr" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
      ERROR: "tso_start" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
      ERROR: "tso_count_descs" [drivers/net/ethernet/marvell/mvneta.ko] undefined!
      ERROR: "tso_build_data" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
      ERROR: "tso_build_hdr" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
      ERROR: "tso_start" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
      ERROR: "tso_count_descs" [drivers/net/ethernet/marvell/mv643xx_eth.ko] undefined!
      Signed-off-by: NSachin Kamat <sachin.kamat@linaro.org>
      Acked-by: NEzequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      484611e5
  5. 24 5月, 2014 4 次提交
    • D
      net: filter: let unattached filters use sock_fprog_kern · b1fcd35c
      Daniel Borkmann 提交于
      The sk_unattached_filter_create() API is used by BPF filters that
      are not directly attached or related to sockets, and are used in
      team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
      internal managment of obtaining filter blocks and thus already
      have them in kernel memory and set up before calling into
      sk_unattached_filter_create(). As a result, due to __user annotation
      in sock_fprog, sparse triggers false positives (incorrect type in
      assignment [different address space]) when filters are set up before
      passing them to sk_unattached_filter_create(). Therefore, let
      sk_unattached_filter_create() API use sock_fprog_kern to overcome
      this issue.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1fcd35c
    • D
      net: filter: remove DL macro · 8556ce79
      Daniel Borkmann 提交于
      Lets get rid of this macro. After commit 5bcfedf0 ("net: filter:
      simplify label names from jump-table"), labels have become more
      readable due to omission of BPF_ prefix but at the same time more
      generic, so that things like `git grep -n` would not find them. As
      a middle path, lets get rid of the DL macro as it's not strictly
      needed and would otherwise just hide the full name.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8556ce79
    • T
      net: Split sk_no_check into sk_no_check_{rx,tx} · 28448b80
      Tom Herbert 提交于
      Define separate fields in the sock structure for configuring disabling
      checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
      The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
      removed UDP_CSUM_* defines since they are no longer necessary.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28448b80
    • S
      net-next:v4: Add support to configure SR-IOV VF minimum and maximum Tx rate through ip tool. · ed616689
      Sucheta Chakraborty 提交于
      o min_tx_rate puts lower limit on the VF bandwidth. VF is guaranteed
        to have a bandwidth of at least this value.
        max_tx_rate puts cap on the VF bandwidth. VF can have a bandwidth
        of up to this value.
      
      o A new handler set_vf_rate for attr IFLA_VF_RATE has been introduced
        which takes 4 arguments:
        netdev, VF number, min_tx_rate, max_tx_rate
      
      o ndo_set_vf_rate replaces ndo_set_vf_tx_rate handler.
      
      o Drivers that currently implement ndo_set_vf_tx_rate should now call
        ndo_set_vf_rate instead and reject attempt to set a minimum bandwidth
        greater than 0 for IFLA_VF_TX_RATE when IFLA_VF_RATE is not yet
        implemented by driver.
      
      o If user enters only one of either min_tx_rate or max_tx_rate, then,
        userland should read back the other value from driver and set both
        for IFLA_VF_RATE.
        Drivers that have not yet implemented IFLA_VF_RATE should always
        return min_tx_rate as 0 when read from ip tool.
      
      o If both IFLA_VF_TX_RATE and IFLA_VF_RATE options are specified, then
        IFLA_VF_RATE should override.
      
      o Idea is to have consistent display of rate values to user.
      
      o Usage example: -
      
        ./ip link set p4p1 vf 0 rate 900
      
        ./ip link show p4p1
        32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
        DEFAULT qlen 1000
          link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
          vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 900 (Mbps), max_tx_rate 900Mbps
          vf 1 MAC f6:c6:7c:3f:3d:6c
          vf 2 MAC 56:32:43:98:d7:71
          vf 3 MAC d6:be:c3:b5:85:ff
          vf 4 MAC ee:a9:9a:1e:19:14
          vf 5 MAC 4a:d0:4c:07:52:18
          vf 6 MAC 3a:76:44:93:62:f9
          vf 7 MAC 82:e9:e7:e3:15:1a
      
        ./ip link set p4p1 vf 0 max_tx_rate 300 min_tx_rate 200
      
        ./ip link show p4p1
        32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
        DEFAULT qlen 1000
          link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
          vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 300 (Mbps), max_tx_rate 300Mbps,
          min_tx_rate 200Mbps
          vf 1 MAC f6:c6:7c:3f:3d:6c
          vf 2 MAC 56:32:43:98:d7:71
          vf 3 MAC d6:be:c3:b5:85:ff
          vf 4 MAC ee:a9:9a:1e:19:14
          vf 5 MAC 4a:d0:4c:07:52:18
          vf 6 MAC 3a:76:44:93:62:f9
          vf 7 MAC 82:e9:e7:e3:15:1a
      
        ./ip link set p4p1 vf 0 max_tx_rate 600 rate 300
      
        ./ip link show p4p1
        32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
        DEFAULT qlen 1000
          link/ether 00:0e:1e:08:b0:f brd ff:ff:ff:ff:ff:ff
          vf 0 MAC 3e:a0:ca:bd:ae:5, tx rate 600 (Mbps), max_tx_rate 600Mbps,
          min_tx_rate 200Mbps
          vf 1 MAC f6:c6:7c:3f:3d:6c
          vf 2 MAC 56:32:43:98:d7:71
          vf 3 MAC d6:be:c3:b5:85:ff
          vf 4 MAC ee:a9:9a:1e:19:14
          vf 5 MAC 4a:d0:4c:07:52:18
          vf 6 MAC 3a:76:44:93:62:f9
          vf 7 MAC 82:e9:e7:e3:15:1a
      Signed-off-by: NSucheta Chakraborty <sucheta.chakraborty@qlogic.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed616689
  6. 23 5月, 2014 1 次提交
  7. 22 5月, 2014 1 次提交
    • A
      net: filter: cleanup invocation of internal BPF · 5fe821a9
      Alexei Starovoitov 提交于
      Kernel API for classic BPF socket filters is:
      
      sk_unattached_filter_create() - validate classic BPF, convert, JIT
      SK_RUN_FILTER() - run it
      sk_unattached_filter_destroy() - destroy socket filter
      
      Cleanup internal BPF kernel API as following:
      
      sk_filter_select_runtime() - final step of internal BPF creation.
        Try to JIT internal BPF program, if JIT is not available select interpreter
      SK_RUN_FILTER() - run it
      sk_filter_free() - free internal BPF program
      
      Disallow direct calls to BPF interpreter. Execution of the BPF program should
      be done with SK_RUN_FILTER() macro.
      
      Example of internal BPF create, run, destroy:
      
        struct sk_filter *fp;
      
        fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
        memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
        fp->len = prog_len;
      
        sk_filter_select_runtime(fp);
      
        SK_RUN_FILTER(fp, ctx);
      
        sk_filter_free(fp);
      
      Sockets, seccomp, testsuite, tracing are using different ways to populate
      sk_filter, so first steps of program creation are not common.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fe821a9
  8. 19 5月, 2014 3 次提交
  9. 17 5月, 2014 5 次提交
  10. 16 5月, 2014 2 次提交
    • A
      net: filter: x86: internal BPF JIT · 62258278
      Alexei Starovoitov 提交于
      Maps all internal BPF instructions into x86_64 instructions.
      This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
      sysctl net.core.bpf_jit_enable is reused as on/off switch.
      
      Performance:
      
      1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
        No performance difference is observed for filters that were JIT-able before
      
      Example assembler code for BPF filter "tcpdump port 22"
      
      original BPF -> old JIT:            original BPF -> internal BPF -> new JIT:
         0:   push   %rbp                      0:     push   %rbp
         1:   mov    %rsp,%rbp                 1:     mov    %rsp,%rbp
         4:   sub    $0x60,%rsp                4:     sub    $0x228,%rsp
         8:   mov    %rbx,-0x8(%rbp)           b:     mov    %rbx,-0x228(%rbp) // prologue
                                              12:     mov    %r13,-0x220(%rbp)
                                              19:     mov    %r14,-0x218(%rbp)
                                              20:     mov    %r15,-0x210(%rbp)
                                              27:     xor    %eax,%eax         // clear A
         c:   xor    %ebx,%ebx                29:     xor    %r13,%r13         // clear X
         e:   mov    0x68(%rdi),%r9d          2c:     mov    0x68(%rdi),%r9d
        12:   sub    0x6c(%rdi),%r9d          30:     sub    0x6c(%rdi),%r9d
        16:   mov    0xd8(%rdi),%r8           34:     mov    0xd8(%rdi),%r10
                                              3b:     mov    %rdi,%rbx
        1d:   mov    $0xc,%esi                3e:     mov    $0xc,%esi
        22:   callq  0xffffffffe1021e15       43:     callq  0xffffffffe102bd75
        27:   cmp    $0x86dd,%eax             48:     cmp    $0x86dd,%rax
        2c:   jne    0x0000000000000069       4f:     jne    0x000000000000009a
        2e:   mov    $0x14,%esi               51:     mov    $0x14,%esi
        33:   callq  0xffffffffe1021e31       56:     callq  0xffffffffe102bd91
        38:   cmp    $0x84,%eax               5b:     cmp    $0x84,%rax
        3d:   je     0x0000000000000049       62:     je     0x0000000000000074
        3f:   cmp    $0x6,%eax                64:     cmp    $0x6,%rax
        42:   je     0x0000000000000049       68:     je     0x0000000000000074
        44:   cmp    $0x11,%eax               6a:     cmp    $0x11,%rax
        47:   jne    0x00000000000000c6       6e:     jne    0x0000000000000117
        49:   mov    $0x36,%esi               74:     mov    $0x36,%esi
        4e:   callq  0xffffffffe1021e15       79:     callq  0xffffffffe102bd75
        53:   cmp    $0x16,%eax               7e:     cmp    $0x16,%rax
        56:   je     0x00000000000000bf       82:     je     0x0000000000000110
        58:   mov    $0x38,%esi               88:     mov    $0x38,%esi
        5d:   callq  0xffffffffe1021e15       8d:     callq  0xffffffffe102bd75
        62:   cmp    $0x16,%eax               92:     cmp    $0x16,%rax
        65:   je     0x00000000000000bf       96:     je     0x0000000000000110
        67:   jmp    0x00000000000000c6       98:     jmp    0x0000000000000117
        69:   cmp    $0x800,%eax              9a:     cmp    $0x800,%rax
        6e:   jne    0x00000000000000c6       a1:     jne    0x0000000000000117
        70:   mov    $0x17,%esi               a3:     mov    $0x17,%esi
        75:   callq  0xffffffffe1021e31       a8:     callq  0xffffffffe102bd91
        7a:   cmp    $0x84,%eax               ad:     cmp    $0x84,%rax
        7f:   je     0x000000000000008b       b4:     je     0x00000000000000c2
        81:   cmp    $0x6,%eax                b6:     cmp    $0x6,%rax
        84:   je     0x000000000000008b       ba:     je     0x00000000000000c2
        86:   cmp    $0x11,%eax               bc:     cmp    $0x11,%rax
        89:   jne    0x00000000000000c6       c0:     jne    0x0000000000000117
        8b:   mov    $0x14,%esi               c2:     mov    $0x14,%esi
        90:   callq  0xffffffffe1021e15       c7:     callq  0xffffffffe102bd75
        95:   test   $0x1fff,%ax              cc:     test   $0x1fff,%rax
        99:   jne    0x00000000000000c6       d3:     jne    0x0000000000000117
                                              d5:     mov    %rax,%r14
        9b:   mov    $0xe,%esi                d8:     mov    $0xe,%esi
        a0:   callq  0xffffffffe1021e44       dd:     callq  0xffffffffe102bd91 // MSH
                                              e2:     and    $0xf,%eax
                                              e5:     shl    $0x2,%eax
                                              e8:     mov    %rax,%r13
                                              eb:     mov    %r14,%rax
                                              ee:     mov    %r13,%rsi
        a5:   lea    0xe(%rbx),%esi           f1:     add    $0xe,%esi
        a8:   callq  0xffffffffe1021e0d       f4:     callq  0xffffffffe102bd6d
        ad:   cmp    $0x16,%eax               f9:     cmp    $0x16,%rax
        b0:   je     0x00000000000000bf       fd:     je     0x0000000000000110
                                              ff:     mov    %r13,%rsi
        b2:   lea    0x10(%rbx),%esi         102:     add    $0x10,%esi
        b5:   callq  0xffffffffe1021e0d      105:     callq  0xffffffffe102bd6d
        ba:   cmp    $0x16,%eax              10a:     cmp    $0x16,%rax
        bd:   jne    0x00000000000000c6      10e:     jne    0x0000000000000117
        bf:   mov    $0xffff,%eax            110:     mov    $0xffff,%eax
        c4:   jmp    0x00000000000000c8      115:     jmp    0x000000000000011c
        c6:   xor    %eax,%eax               117:     mov    $0x0,%eax
        c8:   mov    -0x8(%rbp),%rbx         11c:     mov    -0x228(%rbp),%rbx // epilogue
        cc:   leaveq                         123:     mov    -0x220(%rbp),%r13
        cd:   retq                           12a:     mov    -0x218(%rbp),%r14
                                             131:     mov    -0x210(%rbp),%r15
                                             138:     leaveq
                                             139:     retq
      
      On fully cached SKBs both JITed functions take 12 nsec to execute.
      BPF interpreter executes the program in 30 nsec.
      
      The difference in generated assembler is due to the following:
      
      Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
      inside bpf_jit.S.
      New JIT removes the helper and does it explicitly, so ldx_msh cost
      is the same for both JITs, but generated code looks longer.
      
      New JIT has 4 registers to save, so prologue/epilogue are larger,
      but the cost is within noise on x64.
      
      Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
      New JIT clears %rax unconditionally.
      
      2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
        extensions. New JIT supports all BPF extensions.
        Performance of such filters improves 2-4 times depending on a filter.
        The longer the filter the higher performance gain.
        Synthetic benchmarks with many ancillary loads see 20x speedup
        which seems to be the maximum gain from JIT
      
      Notes:
      
      . net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
        and can be used to see generated assembler
      
      . there are two jit_compile() functions and code flow for classic filters is:
        sk_attach_filter() - load classic BPF
        bpf_jit_compile() - try to JIT from classic BPF
        sk_convert_filter() - convert classic to internal
        bpf_int_jit_compile() - JIT from internal BPF
      
        seccomp and tracing filters will just call bpf_int_jit_compile()
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62258278
    • C
      rtnetlink: wait for unregistering devices in rtnl_link_unregister() · 200b916f
      Cong Wang 提交于
      From: Cong Wang <cwang@twopensource.com>
      
      commit 50624c93 (net: Delay default_device_exit_batch until no
      devices are unregistering) introduced rtnl_lock_unregistering() for
      default_device_exit_batch(). Same race could happen we when rmmod a driver
      which calls rtnl_link_unregister() as we call dev->destructor without rtnl
      lock.
      
      For long term, I think we should clean up the mess of netdev_run_todo()
      and net namespce exit code.
      
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      200b916f
  11. 14 5月, 2014 4 次提交
    • H
      net: avoid dependency of net_get_random_once on nop patching · 3d440522
      Hannes Frederic Sowa 提交于
      net_get_random_once depends on the static keys infrastructure to patch up
      the branch to the slow path during boot. This was realized by abusing the
      static keys api and defining a new initializer to not enable the call
      site while still indicating that the branch point should get patched
      up. This was needed to have the fast path considered likely by gcc.
      
      The static key initialization during boot up normally walks through all
      the registered keys and either patches in ideal nops or enables the jump
      site but omitted that step on x86 if ideal nops where already placed at
      static_key branch points. Thus net_get_random_once branches not always
      became active.
      
      This patch switches net_get_random_once to the ordinary static_key
      api and thus places the kernel fast path in the - by gcc considered -
      unlikely path.  Microbenchmarks on Intel and AMD x86-64 showed that
      the unlikely path actually beats the likely path in terms of cycle cost
      and that different nop patterns did not make much difference, thus this
      switch should not be noticeable.
      
      Fixes: a48e4292 ("net: introduce new macro net_get_random_once")
      Reported-by: NTuomas Räsänen <tuomasjjrasanen@tjjr.fi>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d440522
    • M
      net: ptp: mark filter as __initdata · 0f49ff07
      Mathias Krause 提交于
      sk_unattached_filter_create() will copy the filter's instructions so we
      don't need to have the master copy hanging around after initialization.
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f49ff07
    • D
      net: filter: Fix redefinition warnings on x86-64. · 1268e253
      David S. Miller 提交于
      Do not collide with the x86-64 PTRACE user API namespace.
      
      net/core/filter.c:57:0: warning: "R8" redefined [enabled by default]
      arch/x86/include/uapi/asm/ptrace-abi.h:38:0: note: this is the location of the previous definition
      
      Fix by adding a BPF_ prefix to the register macros.
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1268e253
    • D
      neigh: set nud_state to NUD_INCOMPLETE when probing router reachability · 2176d5d4
      Duan Jiong 提交于
      Since commit 7e980569("ipv6: router reachability probing"), a router falls
      into NUD_FAILED will be probed.
      
      Now if function rt6_select() selects a router which neighbour state is NUD_FAILED,
      and at the same time function rt6_probe() changes the neighbour state to NUD_PROBE,
      then function dst_neigh_output() can directly send packets, but actually the
      neighbour still is unreachable. If we set nud_state to NUD_INCOMPLETE instead
      NUD_PROBE, packets will not be sent out until the neihbour is reachable.
      
      In addition, because the route should be probes with a single NS, so we must
      set neigh->probes to neigh_max_probes(), then the neigh timer timeout and function
      neigh_timer_handler() will not send other NS Messages.
      Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2176d5d4
  12. 13 5月, 2014 1 次提交
  13. 12 5月, 2014 1 次提交
  14. 08 5月, 2014 1 次提交
  15. 06 5月, 2014 1 次提交
    • R
      unregister_netdevice : move RTM_DELLINK to until after ndo_uninit · 56bfa7ee
      Roopa Prabhu 提交于
      This patch fixes ordering of rtnl notifications during unregister_netdevice
      by moving RTM_DELLINK notification to until after ndo_uninit.
      
      The problem was seen with unregistering bond netdevices.
      
      bond ndo_uninit callback generates a few RTM_NEWLINK notifications for
      NETDEV_CHANGEADDR and NETDEV_FEAT_CHANGE. This is seen mostly when the
      bond is deleted with slaves still enslaved to the bond.
      
      During unregister netdevice (rollback_registered_many to be specific)
      bond ndo_uninit is called after RTM_DELLINK notification goes out.
      This results in userspace seeing RTM_DELLINK followed by a couple of
      RTM_NEWLINK's.
      
      In userspace problem was seen with libnl. libnl cache deletes the bond
      when it sees RTM_DELLINK and re-adds the bond with the following
      RTM_NEWLINK. Resulting in a stale bond entry in libnl cache when the kernel
      has already deleted the bond.
      
      This patch has been tested for bond, bridges and vlan devices.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56bfa7ee
  16. 05 5月, 2014 3 次提交
  17. 29 4月, 2014 1 次提交
  18. 27 4月, 2014 1 次提交
  19. 25 4月, 2014 3 次提交
    • D
      rtnetlink: Only supply IFLA_VF_PORTS information when RTEXT_FILTER_VF is set · c53864fd
      David Gibson 提交于
      Since 115c9b81 (rtnetlink: Fix problem with
      buffer allocation), RTM_NEWLINK messages only contain the IFLA_VFINFO_LIST
      attribute if they were solicited by a GETLINK message containing an
      IFLA_EXT_MASK attribute with the RTEXT_FILTER_VF flag.
      
      That was done because some user programs broke when they received more data
      than expected - because IFLA_VFINFO_LIST contains information for each VF
      it can become large if there are many VFs.
      
      However, the IFLA_VF_PORTS attribute, supplied for devices which implement
      ndo_get_vf_port (currently the 'enic' driver only), has the same problem.
      It supplies per-VF information and can therefore become large, but it is
      not currently conditional on the IFLA_EXT_MASK value.
      
      Worse, it interacts badly with the existing EXT_MASK handling.  When
      IFLA_EXT_MASK is not supplied, the buffer for netlink replies is fixed at
      NLMSG_GOODSIZE.  If the information for IFLA_VF_PORTS exceeds this, then
      rtnl_fill_ifinfo() returns -EMSGSIZE on the first message in a packet.
      netlink_dump() will misinterpret this as having finished the listing and
      omit data for this interface and all subsequent ones.  That can cause
      getifaddrs(3) to enter an infinite loop.
      
      This patch addresses the problem by only supplying IFLA_VF_PORTS when
      IFLA_EXT_MASK is supplied with the RTEXT_FILTER_VF flag set.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c53864fd
    • D
      rtnetlink: Warn when interface's information won't fit in our packet · 973462bb
      David Gibson 提交于
      Without IFLA_EXT_MASK specified, the information reported for a single
      interface in response to RTM_GETLINK is expected to fit within a netlink
      packet of NLMSG_GOODSIZE.
      
      If it doesn't, however, things will go badly wrong,  When listing all
      interfaces, netlink_dump() will incorrectly treat -EMSGSIZE on the first
      message in a packet as the end of the listing and omit information for
      that interface and all subsequent ones.  This can cause getifaddrs(3) to
      enter an infinite loop.
      
      This patch won't fix the problem, but it will WARN_ON() making it easier to
      track down what's going wrong.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      973462bb
    • E
      net: Use netlink_ns_capable to verify the permisions of netlink messages · 90f62cf3
      Eric W. Biederman 提交于
      It is possible by passing a netlink socket to a more privileged
      executable and then to fool that executable into writing to the socket
      data that happens to be valid netlink message to do something that
      privileged executable did not intend to do.
      
      To keep this from happening replace bare capable and ns_capable calls
      with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
      Which act the same as the previous calls except they verify that the
      opener of the socket had the desired permissions as well.
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90f62cf3