1. 02 Jun 2015, 2 commits
    • vlan: Add GRO support for non hardware accelerated vlan · 66e5133f
      Toshiaki Makita authored
      Currently packets with non-hardware-accelerated vlan cannot be handled
      by GRO. This causes low performance for 802.1ad and stacked vlan, as their
      vlan tags are currently not stripped by hardware.
      
      This patch adds GRO support for non-hardware-accelerated vlan and
      improves their receive performance.
      
      Test Environment:
       vlan device (.1Q) on vlan device (.1ad) on ixgbe (82599)
      
      Result:
      
      - Before
      
      $ netperf -t TCP_STREAM -H 192.168.20.2 -l 60
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    60.00    5233.17
      
      Rx side CPU usage:
        %usr      %sys      %irq     %soft     %idle
        0.27     58.03      0.00     41.70      0.00
      
      - After
      
      $ netperf -t TCP_STREAM -H 192.168.20.2 -l 60
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    60.00    7586.85
      
      Rx side CPU usage:
        %usr      %sys      %irq     %soft     %idle
        0.50     25.83      0.00     59.53     14.14
      
      [ Register VLAN offloads with priority 10 -DaveM ]
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
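      For illustration, a minimal sketch of how a VLAN ethertype could be hooked
      into GRO through a packet_offload registration with priority 10, as the
      commit message describes. The callback names (vlan_gro_receive/complete)
      and the init function are assumptions for the sketch, not copied from the diff:

      #include <linux/netdevice.h>
      #include <linux/if_ether.h>

      /* hypothetical callbacks implementing GRO for vlan-tagged frames */
      static struct packet_offload vlan_packet_offloads[] __read_mostly = {
              {
                      .type     = cpu_to_be16(ETH_P_8021Q),
                      .priority = 10,
                      .callbacks = {
                              .gro_receive  = vlan_gro_receive,
                              .gro_complete = vlan_gro_complete,
                      },
              },
              {
                      .type     = cpu_to_be16(ETH_P_8021AD),
                      .priority = 10,
                      .callbacks = {
                              .gro_receive  = vlan_gro_receive,
                              .gro_complete = vlan_gro_complete,
                      },
              },
      };

      static int __init vlan_offload_init(void)
      {
              unsigned int i;

              for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
                      dev_add_offload(&vlan_packet_offloads[i]);
              return 0;
      }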
    • net: Add priority to packet_offload objects. · bdef7de4
      David S. Miller authored
      When we scan a packet for GRO processing, we want to see the most
      common packet types at the front of the offload_base list.
      
      So add a priority field so we can handle this properly.
      
      IPv4/IPv6 get the highest priority with the implicit zero priority
      field.
      
      Next comes ethernet with a priority of 10, and then we have the MPLS
      types with a priority of 15.
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
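      This is essentially what dev_add_offload() does once the priority field
      exists: walk the list until an entry with a larger priority value is found
      and insert in front of it. A minimal sketch (locking omitted, function
      name chosen for the example):

      #include <linux/list.h>
      #include <linux/rculist.h>
      #include <linux/netdevice.h>

      static void offload_add_sorted(struct list_head *head,
                                     struct packet_offload *po)
      {
              struct packet_offload *elem;

              /* Stop at the first entry whose priority value is larger
               * (lower precedence); inserting before it keeps the list
               * ordered: IPv4/IPv6 (0), ethernet (10), MPLS (15). */
              list_for_each_entry(elem, head, list) {
                      if (po->priority < elem->priority)
                              break;
              }
              /* if no such entry exists, this appends at the tail */
              list_add_rcu(&po->list, elem->list.prev);
      }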
  2. 01 Jun 2015, 1 commit
  3. 31 May 2015, 14 commits
  4. 29 May 2015, 1 commit
    • percpu_counter: batch size aware __percpu_counter_compare() · 80188b0d
      Dave Chinner authored
      XFS uses non-standard batch sizes to avoid frequent global
      counter updates on its allocated-inode counters, as they increment
      or decrement in batches of 64 inodes. Hence the standard percpu
      counter batch of 32 means that the counter is effectively a global
      counter. Currently XFS uses a batch size of 128 so that it doesn't
      take the global lock on every single modification.
      
      However, XFS also needs to compare accurately against zero, which
      means it needs percpu_counter_compare(). That function has a
      hard-coded batch size of 32, so it spuriously fails to detect when
      it is supposed to use a precise comparison, and the accounting
      goes wrong.
      
      Add __percpu_counter_compare() to take a custom batch size so we can
      use it sanely in XFS and factor percpu_counter_compare() to use it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
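      A hedged sketch of how a caller with a non-standard batch size can use the
      new helper; the function and constant below are illustrative, not the XFS
      code. percpu_counter_compare() itself simply becomes a wrapper that passes
      the default percpu_counter_batch:

      #include <linux/percpu_counter.h>

      #define MY_INODE_BATCH 128      /* illustrative custom batch size */

      /* Returns 0 if at least 'nr' free inodes remain, -ENOSPC otherwise.
       * __percpu_counter_compare() only falls back to an exact
       * percpu_counter_sum() when the approximate count is within
       * (batch * num_online_cpus()) of the right-hand side. */
      static int check_free_inodes(struct percpu_counter *ifree, s64 nr)
      {
              if (__percpu_counter_compare(ifree, nr, MY_INODE_BATCH) < 0)
                      return -ENOSPC;
              return 0;
      }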
  5. 28 May 2015, 2 commits
    • cpumask_set_cpu_local_first => cpumask_local_spread, lament · f36963c9
      Rusty Russell authored
      da91309e (cpumask: Utility function to set n'th cpu...) created a
      genuinely weird function.  I never saw it before, it went through DaveM.
      (He only does this to make us other maintainers feel better about our own
      mistakes.)
      
      cpumask_set_cpu_local_first's purpose is to say "I need to spread things
      across N online cpus, choose the ones on this numa node first"; you call
      it in a loop.
      
      It can fail.  One of the two callers ignores this, the other aborts and
      fails the device open.
      
      It can fail in two ways: allocating the off-stack cpumask, or through a
      convoluted codepath which AFAICT can only occur if cpu_online_mask
      changes.  Which shouldn't happen, because if cpu_online_mask can change
      while you call this, it could return a now-offline cpu anyway.
      
      It contains a nonsensical test "!cpumask_of_node(numa_node)".  This was
      drawn to my attention by Geert, who said this causes a warning on Sparc.
      It sets a single bit in a cpumask instead of returning a cpu number,
      because that's what the callers want.
      
      It could be made more efficient by passing the previous cpu rather than
      an index, but that would be more invasive to the callers.
      
      Fixes: da91309e
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (then rebased)
      Tested-by: Amir Vadai <amirv@mellanox.com>
      Acked-by: Amir Vadai <amirv@mellanox.com>
      Acked-by: David S. Miller <davem@davemloft.net>
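      A hedged sketch of the renamed API in use: spreading per-queue IRQ
      affinities across online CPUs, local NUMA node first. The surrounding
      function and its arguments are assumptions made for the example:

      #include <linux/cpumask.h>
      #include <linux/interrupt.h>

      static void spread_queue_irqs(int node, const unsigned int *irqs,
                                    unsigned int nr_queues)
      {
              unsigned int i;

              for (i = 0; i < nr_queues; i++) {
                      /* i-th online CPU, counting CPUs on 'node' first;
                       * unlike the old helper, this cannot fail */
                      unsigned int cpu = cpumask_local_spread(i, node);

                      irq_set_affinity_hint(irqs[i], cpumask_of(cpu));
              }
      }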
    • pci: Add Cavium PCI vendor id · e5c4708b
      Sunil Goutham authored
      This vendor id will be used by network (vNIC), USB (xHCI),
      SATA (AHCI), GPIO, I2C, MMC and maybe other drivers
      for the ThunderX SoC.
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
      Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
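      A hedged sketch of how such a vendor id is typically consumed in a
      driver's PCI match table; the device id 0xA01E below is made up for the
      example:

      #include <linux/module.h>
      #include <linux/pci.h>

      static const struct pci_device_id my_vnic_id_table[] = {
              { PCI_DEVICE(PCI_VENDOR_ID_CAVIUM, 0xA01E) },
              { 0, }  /* end of table */
      };
      MODULE_DEVICE_TABLE(pci, my_vnic_id_table);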
  6. 27 May 2015, 1 commit
  7. 26 May 2015, 1 commit
  8. 25 May 2015, 3 commits
  9. 23 May 2015, 1 commit
    • tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Eric Dumazet authored
      Taking socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused, as all TCP sockets share the same lockdep classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Let's use the u64_sync infrastructure instead. As a bonus, 64-bit
      arches get optimized, as these operations are a nop for them.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
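      A hedged sketch of the u64_stats_sync pattern the fix switches to: the
      writer brackets its update, the reader retries if it raced with a writer,
      and on 64-bit arches the sync primitives compile to nothing. The struct
      and field names are illustrative, not the tcp_sock layout:

      #include <linux/u64_stats_sync.h>

      struct acct {
              u64                     bytes_acked;
              struct u64_stats_sync   syncp;
      };

      /* writer: packet processing path */
      static void acct_add(struct acct *a, u64 acked)
      {
              u64_stats_update_begin(&a->syncp);
              a->bytes_acked += acked;
              u64_stats_update_end(&a->syncp);
      }

      /* reader: e.g. filling in tcp_info, no socket spinlock required */
      static u64 acct_read(const struct acct *a)
      {
              unsigned int start;
              u64 val;

              do {
                      start = u64_stats_fetch_begin(&a->syncp);
                      val = a->bytes_acked;
              } while (u64_stats_fetch_retry(&a->syncp, start));
              return val;
      }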
  10. 22 May 2015, 2 commits
    • tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner authored
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may compare these against the retransmission counters to
      get an idea of connection quality.
      
      RFC 4898 names these tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.

      Each is a 32-bit field and can be fetched both from TCP_INFO
      getsockopt(), if one has a handle on a TCP socket, and from the inet_diag
      netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Joint work with Eric Dumazet.

      Note that received SYNs are accounted on the listener, but sent SYNACKs
      are not.
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
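      A hedged user-space sketch of reading the new counters on a connected
      socket via getsockopt(TCP_INFO); whether struct tcp_info in your libc
      headers already exposes tcpi_segs_in/tcpi_segs_out depends on their
      version:

      #include <stdio.h>
      #include <string.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <sys/socket.h>

      static void print_seg_counters(int fd)
      {
              struct tcp_info info;
              socklen_t len = sizeof(info);

              memset(&info, 0, sizeof(info));
              if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
                      printf("segs_in=%u segs_out=%u retrans=%u\n",
                             info.tcpi_segs_in, info.tcpi_segs_out,
                             info.tcpi_total_retrans);
      }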
    • bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov authored
      Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function,
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that this is not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding an
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops, therefore
      tail_call_cnt is used to limit the number of calls; it is currently set to 32.
      
      Use cases:
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for the TC use case, bpf_tail_call() makes it possible to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        the BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay a performance penalty for this feature, and
          the tail call itself would be slower, since a mandatory stack unwind,
          return, and stack allocation would be done for every tail call.
        . tail calls would be limited to programs running with preemption disabled,
          since the generic 'void *ctx' doesn't have room for 'tail_call_cnt', which
          would need to be either a global per_cpu variable accessed by the helper
          and the wrapper, or a global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that the prog_type of callee and caller
        are the same and that the JITed/non-JITed flag matches, since calling a JITed
        program from a non-JITed one is invalid because the stack frames are
        different. Similarly, calling a kprobe-type program from a socket-type
        program is invalid.
      
      - the jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse the 'map'
        abstraction, its user space API and all of the verifier logic.
        It lives in the existing arraymap.c file, since several functions are
        shared with the regular array map.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
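      A hedged user-space sketch of populating a BPF_MAP_TYPE_PROG_ARRAY with
      program FDs through the raw bpf(2) syscall, which is what gives
      bpf_tail_call() somewhere to jump; the wrapper names are chosen for the
      example and no helper library is assumed:

      #include <stdint.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      static long bpf_sys(int cmd, union bpf_attr *attr, unsigned int size)
      {
              return syscall(__NR_bpf, cmd, attr, size);
      }

      /* jmp_table[idx] = prog_fd: the program loaded earlier with
       * BPF_PROG_LOAD becomes the tail-call target for this index.
       * (The map itself is created with BPF_MAP_CREATE, map_type
       * BPF_MAP_TYPE_PROG_ARRAY, key_size = value_size = 4.) */
      static int prog_array_set(int map_fd, uint32_t idx, int prog_fd)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_fd = map_fd;
              attr.key    = (uint64_t)(unsigned long)&idx;
              attr.value  = (uint64_t)(unsigned long)&prog_fd;
              return bpf_sys(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
      }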
  11. 20 May 2015, 1 commit
  12. 18 May 2015, 1 commit
    • net: fix two sparse errors · c91d4606
      Eric Dumazet authored
      The first one, in __skb_checksum_validate_complete(), fixes the following
      (and similar warnings in other callers):
      
      make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
        CHECK   net/ipv4/tcp_ipv4.c
      include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
      include/linux/skbuff.h:3052:24:    expected restricted __sum16
      include/linux/skbuff.h:3052:24:    got int
      
      The second fixes gso_make_checksum():
      
        CHECK   net/ipv4/gre_offload.c
      include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
      include/linux/skbuff.h:3360:14:    expected unsigned short [unsigned] [usertype] csum
      include/linux/skbuff.h:3360:14:    got restricted __sum16
      include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
      include/linux/skbuff.h:3365:16:    expected restricted __sum16
      include/linux/skbuff.h:3365:16:    got unsigned short [unsigned] [usertype] csum
      
      Fixes: 5a212329 ("net: Support for csum_bad in skbuff")
      Fixes: 7e2b10c1 ("net: Support for multiple checksums with gso")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      CC: Tom Herbert <tom@herbertland.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
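      For illustration only (not the actual diff): __sum16 and __wsum are
      sparse "__bitwise" types, so mixing them with plain integers is what
      produces the "different base types" warnings quoted above. A hedged
      before/after sketch:

      #include <linux/skbuff.h>
      #include <net/checksum.h>

      /* sparse: "expected restricted __sum16, got int" */
      static __sum16 bad_complete(const struct sk_buff *skb)
      {
              return skb->csum_valid ? 0 : 1;   /* plain int, not __sum16 */
      }

      /* clean: stay in the annotated types end to end */
      static __sum16 good_complete(__wsum partial)
      {
              return csum_fold(partial);        /* csum_fold() yields __sum16 */
      }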
  13. 17 May 2015, 1 commit
    • rhashtable: Add cap on number of elements in hash table · 07ee0722
      Herbert Xu authored
      We currently have no limit on the number of elements in a hash table.
      This is a problem because some users (tipc) set a ceiling on the
      maximum table size, and when that is reached the hash table may
      degenerate.  Others may encounter OOM when growing, and if we allow
      insertions when that happens the hash table performance may also
      suffer.
      
      This patch adds a new parameter, insecure_max_entries, which becomes
      the cap on the table.  If unset it defaults to max_size * 2.  If
      it is also zero it means that there is no cap on the number of
      elements in the table.  However, the table will grow whenever the
      utilisation hits 100%, and if that growth fails, you will get ENOMEM
      on insertion.
      
      As allowing oversubscription is potentially dangerous, the name
      contains the word insecure.
      
      Note that the cap is not a hard limit.  This is done for performance
      reasons as enforcing a hard limit will result in use of atomic ops
      that are heavier than the ones we currently use.
      
      The reasoning is that we're only guarding against a gross
      oversubscription of the table, rather than a small breach of the limit.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
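      A hedged sketch of a user setting the new cap in its rhashtable
      parameters; struct my_obj and the sizes are illustrative:

      #include <linux/rhashtable.h>

      struct my_obj {
              u32                     key;
              struct rhash_head       node;
      };

      static const struct rhashtable_params my_ht_params = {
              .key_len              = sizeof(u32),
              .key_offset           = offsetof(struct my_obj, key),
              .head_offset          = offsetof(struct my_obj, node),
              .max_size             = 1 << 16,
              /* soft cap on elements; left at 0 it defaults to max_size * 2 */
              .insecure_max_entries = 1 << 17,
      };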
  14. 16 May 2015, 1 commit
    • netfilter: x_tables: add context to know if extension runs from nft_compat · 55917a21
      Pablo Neira Ayuso authored
      Currently, we have four xtables extensions that cannot be used from the
      xt over nft compat layer. The problem is that they need real access to
      the full blown xt_entry to validate that the rule comes with the right
      dependencies. This check was introduced to overcome the lack of
      sufficient userspace dependency validation in iptables.
      
      To resolve this problem, this patch introduces a new field to the
      xt_tgchk_param structure that tells us if the extension is run from
      the nft_compat context.
      
      The three affected extensions are:
      
      1) CLUSTERIP, this target has been superseded by xt_cluster. So just
         bail out by returning -EINVAL.
      
      2) TCPMSS. Relax the checking when used from nft_compat. If used with
         the wrong configuration, it will corrupt !syn packets by adding the
         TCP MSS option.
      
      3) ebt_stp. Relax the check to make sure it uses the reserved
         destination MAC address for STP.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
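      A hedged sketch of the pattern an affected extension's checkentry can
      follow, assuming the new field is exposed as par->nft_compat as the
      commit describes; my_entry_has_required_match() is hypothetical:

      #include <linux/netfilter/x_tables.h>

      static int my_tg_check(const struct xt_tgchk_param *par)
      {
              /* Called through nft_compat: there is no full xt_entry behind
               * par->entryinfo, so relax the dependency validation. */
              if (par->nft_compat)
                      return 0;

              /* Legacy ip(6)tables path keeps the strict check. */
              if (!my_entry_has_required_match(par->entryinfo))
                      return -EINVAL;

              return 0;
      }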
  15. 15 May 2015, 5 commits
    • mdio-gpio: Propagate mii_bus.phy_ignore_ta_mask · ef7f3a5c
      Bert Vermeulen authored
      This also changes mii_bus.phy_mask to u32 for consistency.
      Signed-off-by: Bert Vermeulen <bert@biot.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • test_bpf: add tests related to BPF_MAXINSNS · a4afd37b
      Daniel Borkmann authored
      A couple of torture test cases related to the bug fixed in 0b59d880
      ("ARM: net: delegate filter to kernel interpreter when imm_offset()
      return value can't fit into 12bits.").
      
      I've added a helper to allocate and fill the insn space. Output on
      x86_64 from my laptop:
      
      test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:0 7 PASS
      test_bpf: #234 BPF_MAXINSNS: Single literal jited:0 8 PASS
      test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:0 11553 PASS
      test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
      test_bpf: #237 BPF_MAXINSNS: Very long jump jited:0 9 PASS
      test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:0 20329 20398 PASS
      test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:0 32178 32475 PASS
      test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:0 10518 PASS
      
      test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:1 4 PASS
      test_bpf: #234 BPF_MAXINSNS: Single literal jited:1 4 PASS
      test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:1 1625 PASS
      test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
      test_bpf: #237 BPF_MAXINSNS: Very long jump jited:1 8 PASS
      test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:1 3301 3174 PASS
      test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:1 24107 23491 PASS
      test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:1 8651 PASS
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Nicolas Schichan <nschichan@freebox.fr>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • uidgid: make uid_valid and gid_valid work with !CONFIG_MULTIUSER · 929aa5b2
      Josh Triplett authored
      {u,g}id_valid call {u,g}id_eq, which calls __k{u,g}id_val on both
      arguments and compares.  With !CONFIG_MULTIUSER, __k{u,g}id_val return a
      constant 0, which makes {u,g}id_valid always return false.  Change
      {u,g}id_valid to compare their argument against -1 instead.  That produces
      identical results in the normal CONFIG_MULTIUSER=y case, but with
      !CONFIG_MULTIUSER will make {u,g}id_valid constant-fold into "return
      true;" rather than "return false;".
      
      This fixes uses of devpts without CONFIG_MULTIUSER.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
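      A hedged sketch of the before/after shape described above (paraphrased,
      with _old/_new suffixes added for the comparison; not the literal diff):

      #include <linux/uidgid.h>

      /* before: with !CONFIG_MULTIUSER __kuid_val() is constant 0 for any
       * kuid_t, so uid_eq() is always true and this always returns false */
      static inline bool uid_valid_old(kuid_t uid)
      {
              return !uid_eq(uid, INVALID_UID);
      }

      /* after: identical result with CONFIG_MULTIUSER=y, but with
       * __kuid_val() constant 0 this constant-folds to "return true" */
      static inline bool uid_valid_new(kuid_t uid)
      {
              return __kuid_val(uid) != (uid_t) -1;
      }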
    • gfp: add __GFP_NOACCOUNT · 8f4fc071
      Vladimir Davydov authored
      Not all kmem allocations should be accounted to memcg.  The following
      patch gives an example when accounting of a certain type of allocations to
      memcg can effectively result in a memory leak.  This patch adds the
      __GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
      allocation to go through the root cgroup.  It will be used by the next
      patch.
      
      Note that, since with kmemleak enabled each kmalloc implies yet another
      allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
      gfp_kmemleak_mask.
      
      Alternatively, we could introduce a per kmem cache flag disabling
      accounting for all allocations of a particular kind, but (a) we would not
      be able to bypass accounting for kmalloc then and (b) a kmem cache with
      this flag set could not be merged with a kmem cache without this flag,
      which would increase the number of global caches and therefore
      fragmentation even if the memory cgroup controller is not used.
      
      Despite its generic name, currently __GFP_NOACCOUNT disables accounting
      only for kmem allocations while user page allocations are always charged.
      To catch abusing of this flag, a warning is issued on an attempt of
      passing it to mem_cgroup_try_charge.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
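      A hedged sketch of the intended use: opting one particular allocation out
      of memcg kmem accounting. The wrapper name is made up for the example:

      #include <linux/slab.h>

      static void *alloc_unaccounted(size_t size)
      {
              /* charged to the root cgroup regardless of the caller's memcg;
               * this affects kmem accounting only, user page allocations are
               * still charged as usual */
              return kmalloc(size, GFP_KERNEL | __GFP_NOACCOUNT);
      }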
    • net: phy: Add phy_ignore_ta_mask to account for broken turn-around · 922f2dd1
      Florian Fainelli authored
      Some PHY devices/switches will not release the turn-around line as they
      should do at the end of a MDIO transaction. To help with such
      situations, allow MDIO bus drivers to be made aware of such
      restrictions.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
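      A hedged sketch of an MDIO bus driver flagging PHY addresses with broken
      turn-around; the bus name and addresses are illustrative:

      #include <linux/phy.h>
      #include <linux/bitops.h>

      static struct mii_bus *my_mdio_alloc(void)
      {
              struct mii_bus *bus = mdiobus_alloc();

              if (!bus)
                      return NULL;
              bus->name = "my-mdio";
              /* The pseudo-PHYs at addresses 30 and 31 never release the
               * turn-around line; don't treat the missing TA as an error. */
              bus->phy_ignore_ta_mask = BIT(30) | BIT(31);
              return bus;
      }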
  16. 14 May 2015, 3 commits
    • netfilter: ipset: deinline ip_set_put_extensions() · a3b1c1eb
      Denys Vlasenko authored
      On x86 allyesconfig build:
      The function compiles to 489 bytes of machine code.
      It has 25 callsites.
      
          text    data       bss       dec     hex filename
      82441375 22255384 20627456 125324215 7784bb7 vmlinux.before
      82434909 22255384 20627456 125317749 7783275 vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Jan Engelhardt <jengelh@medozas.de>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: bridge: neigh_head and physoutdev can't be used at same time · 7fb48c5b
      Florian Westphal authored
      The neigh_header is only needed when we detect DNAT after prerouting
      and neigh cache didn't have a mac address for us.
      
      The output port has not been chosen yet, so we can re-use the storage
      area, bringing the struct size down to 32 bytes on x86_64.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
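      A hedged sketch of the storage-sharing idea (illustrative layout, not the
      actual struct nf_bridge_info): the saved neighbour header is only needed
      before an output port has been chosen, so the two members can share a
      union:

      struct net_device;

      struct bridge_info_sketch {
              struct net_device *physindev;
              union {
                      /* valid only before bridge forwarding picks a port */
                      unsigned char      neigh_header[8];
                      /* valid once the output port is known */
                      struct net_device *physoutdev;
              };
      };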
    • netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira authored
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, which seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter also has its own static key to keep
      the impact on the existing handle_ing() minimal.
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accommodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users, as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
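      A hedged sketch of the static-key pattern that keeps the extra hook off
      the hot path for non-users; the key and function names are simplified for
      the example rather than copied from the tree:

      #include <linux/jump_label.h>
      #include <linux/skbuff.h>

      int run_ingress_hook(struct sk_buff *skb);      /* hypothetical */

      static struct static_key ingress_hook_needed __read_mostly;

      /* hot path: the branch is patched out while no hook is registered */
      static inline int maybe_run_ingress_hook(struct sk_buff *skb)
      {
              if (static_key_false(&ingress_hook_needed))
                      return run_ingress_hook(skb);   /* cold path */
              return 0;
      }

      /* hook registration flips the key on (and _dec on unregistration) */
      static void ingress_hook_added(void)
      {
              static_key_slow_inc(&ingress_hook_needed);
      }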