1. 09 May 2016, 4 commits
  2. 07 May 2016, 2 commits
    • bpf: wire in data and data_end for cls_act_bpf · db58ba45
      Alexei Starovoitov committed
      Allow cls_bpf and act_bpf programs to access the skb->data and
      skb->data_end pointers.
      The bpf helpers that change skb->data need to update the data_end
      pointer as well.
      The verifier checks that programs always reload the data and data_end
      pointers after calls to such bpf helpers.
      We cannot add a 'data_end' pointer to struct qdisc_skb_cb directly,
      since it's embedded as-is by infiniband ipoib, so a wrapper struct is
      needed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
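
      For illustration, a minimal sketch of the reload rule the verifier
      enforces (hypothetical program, not part of this patch);
      bpf_skb_vlan_push() is one helper that may move skb->data:

      int bpf_prog(struct __sk_buff *skb)
      {
        if (skb->data + ETH_HLEN > skb->data_end)
            return 0;
        /* ... parse the packet ... */

        bpf_skb_vlan_push(skb, ETH_P_8021Q, 10); /* may reallocate skb->head */

        /* pointers derived before the call are now stale; the verifier
         * rejects their use and forces a reload plus a fresh bounds check */
        if (skb->data + ETH_HLEN > skb->data_end)
            return 0;
        /* ... continue parsing ... */
        return 1;
      }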
    • bpf: direct packet access · 969bf05e
      Alexei Starovoitov committed
      Extended BPF carried over two instructions from classic BPF to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do a length check for every
      access. When BPF is processing 20M packets per second, a single
      LD_ABS after JIT consumes 3% of cpu. Hence the need to optimize it
      further by amortizing the cost of 'off < skb_headlen' over multiple
      packet accesses.
      One option is to introduce two new eBPF instructions, LD_ABS_DW and
      LD_IND_DW, with similar usage as skb_header_pointer().
      The kernel part for the interpreter and the x64 JIT was implemented
      in [1], but such new insns behave like the old ld_abs and abort the
      program with 'return 0' if the access is beyond linear data. Such
      hidden control flow is hard to work around, and changing JITs and
      rolling out a new llvm is inconvenient.
      
      Therefore, allow cls_bpf/act_bpf programs to access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids the introduction of new instructions. llvm stays
      the same and all JITs stay the same, but the verifier has to work
      extra hard to prove the safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      skb->data is NET_IP_ALIGNED, so for common cases the verifier can
      check the alignment. Complex packet parsers where the packet pointer
      is adjusted incrementally cannot be tracked for alignment, so allow
      byte access in such cases, and misaligned access on architectures
      that define efficient_unaligned_access.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 06 May 2016, 1 commit
    • net/mlx4: Avoid wrong virtual mappings · 73898db0
      Haggai Abramovsky committed
      The dma_alloc_coherent() function returns a virtual address which can
      be used for coherent access to the underlying memory.  On some
      architectures, like arm64, undefined behavior results if this memory is
      also accessed via virtual mappings that are not coherent.  Because of
      their undefined nature, operations like virt_to_page() return garbage
      when passed virtual addresses obtained from dma_alloc_coherent().  Any
      subsequent mappings via vmap() of the garbage page values are unusable
      and result in bad things like bus errors (synchronous aborts in ARM64
      speak).
      
      The mlx4 driver contains code that does the equivalent of
      vmap(virt_to_page(dma_alloc_coherent())), which results in an oops
      when the device is opened.
      
      Prevent the Ethernet driver from running this problematic code by
      forcing it to allocate contiguous memory. As for the Infiniband
      driver, it first tries to allocate contiguous memory, but in case of
      failure falls back to working with fragmented memory.
      Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
      Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
      Reported-by: David Daney <david.daney@cavium.com>
      Tested-by: Sinan Kaya <okaya@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
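
      A sketch of the problematic pattern described above (illustrative
      only, not the driver's actual code):

      static void *broken_map(struct device *dev, size_t size)
      {
        dma_addr_t dma_handle;
        void *cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
        struct page *page;

        /* virt_to_page() is only defined for linear-map addresses; on
         * arm64 the coherent mapping need not be one, so this is garbage */
        page = virt_to_page(cpu_addr);

        /* vmap() of the garbage page yields a mapping that bus-errors */
        return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
      }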
  4. 05 May 2016, 6 commits
  5. 04 May 2016, 5 commits
  6. 03 May 2016, 12 commits
    • net: relax expensive skb_unclone() in iptunnel_handle_offloads() · 9580bf2e
      Eric Dumazet committed
      Locally generated TCP GSO packets that have to traverse a
      GRE/SIT/IPIP tunnel currently go through an expensive skb_unclone().
      
      Reallocating skb->head is a lot of work.
      
      The test should really check whether a 'real clone' of the packet was done.
      
      TCP does not care if the original gso_type is changed while the packet
      travels in the stack.
      
      This adds skb_header_unclone(), a variant of skb_unclone()
      using the skb_header_cloned() check instead of skb_cloned().
      
      This variant can probably be used at other call sites.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
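
      The helper as described reads roughly like this sketch (compare
      skb_unclone(), which tests skb_cloned() instead):

      static inline int skb_header_unclone(struct sk_buff *skb, gfp_t pri)
      {
        might_sleep_if(gfpflags_allow_blocking(pri));

        /* reallocate skb->head only when the header is actually shared */
        if (skb_header_cloned(skb))
            return pskb_expand_head(skb, 0, 0, pri);

        return 0;
      }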
    • netdevice: shrink size of struct netdev_queue · c0ef079c
      Florian Westphal committed
      - trans_timeout is incremented when a tx queue times out (tx watchdog).
      - tx_maxrate is set via sysfs.

      Moving tx_maxrate to the read-mostly part shrinks the struct by 64
      bytes. While at it, also move trans_timeout (it is out of place in
      the 'write-mostly' part).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
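
      Schematically (a field subset only, not the full struct), the idea is
      to keep rarely written fields out of the write-mostly cache line:

      struct netdev_queue {
        /* read-mostly part */
        struct net_device *dev;
        unsigned long     tx_maxrate;    /* set via sysfs */
        unsigned long     trans_timeout; /* bumped only by the tx watchdog */

        /* write-mostly part, on its own cache line */
        spinlock_t        _xmit_lock ____cacheline_aligned_in_smp;
        unsigned long     trans_start;
      } ____cacheline_aligned_in_smp;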
    • bridge: netlink: export per-vlan stats · a60c0903
      Nikolay Aleksandrov committed
      Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
      RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats
      and get_linkxstats_size) in order to export the per-vlan stats.
      The paddings were added because these fields will soon be needed for
      per-port per-vlan stats (or something else if someone beats me to
      it), thus avoiding at least a few more netlink attributes.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
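
      A sketch of the fill callback's shape (vlan iteration elided): nest
      everything under the new per-link-type attribute and track the
      resume index:

      static int br_fill_linkxstats(struct sk_buff *skb,
                                    const struct net_device *dev, int *prividx)
      {
        struct nlattr *nest;

        nest = nla_nest_start(skb, LINK_XSTATS_TYPE_BRIDGE);
        if (!nest)
            return -EMSGSIZE;

        /* for each vlan with index >= *prividx: nla_put() one stats entry;
         * if the skb fills up, record *prividx and return -EMSGSIZE so a
         * later dump can resume from this vlan */

        nla_nest_end(skb, nest);
        return 0;
      }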
    • bridge: vlan: learn to count · 6dada9b1
      Nikolay Aleksandrov committed
      Add support for per-VLAN Tx/Rx statistics. Every global vlan context
      gets allocated per-cpu stats which are then set in each per-port vlan
      context for quick access. The br_allowed_ingress() common function is
      used to account for Rx packets and the br_handle_vlan() common
      function is used to account for Tx packets. Stats accounting is
      performed only if the bridge-wide vlan_stats_enabled option is set,
      either via sysfs or netlink.
      A struct hole between vlan_enabled and vlan_proto is used for the new
      option so it is in the same cache line. Currently it is binary
      (on/off), but it is intentionally restricted to exactly 0 and 1,
      since other values will be used in the future for different purposes
      (e.g. per-port stats).
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
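
      The accounting pattern, sketched (function name hypothetical; the hot
      path stays lock-free via per-cpu counters and u64_stats):

      struct br_vlan_stats {
        u64 rx_bytes;
        u64 rx_packets;
        u64 tx_bytes;
        u64 tx_packets;
        struct u64_stats_sync syncp;
      };

      static void br_vlan_rx_account(struct br_vlan_stats __percpu *pcpu_stats,
                                     const struct sk_buff *skb)
      {
        struct br_vlan_stats *stats = this_cpu_ptr(pcpu_stats);

        u64_stats_update_begin(&stats->syncp);
        stats->rx_bytes += skb->len;
        stats->rx_packets++;
        u64_stats_update_end(&stats->syncp);
      }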
    • net: rtnetlink: add linkxstats callbacks and attribute · 97a47fac
      Nikolay Aleksandrov committed
      Add callbacks to calculate the size of and fill in link extended
      statistics, which can be split into multiple messages and are dumped
      via the new rtnl stats API (RTM_GETSTATS) with the
      IFLA_STATS_LINK_XSTATS attribute.
      Also add that attribute to the idx mask check, since it is expected
      to be able to save state and resume dumping (e.g. future bridge
      per-vlan stats will be dumped via this attribute and callbacks).
      Each link type should nest its private attributes under the
      per-link-type attribute. This allows any number of separate private
      attributes and avoids a call to get the dev link type.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
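
      The callbacks hang off rtnl_link_ops; a link type wires them up
      roughly like this (bridge shown as the expected first user, handler
      names assumed):

      static struct rtnl_link_ops br_link_ops = {
        .kind                = "bridge",
        /* ... setup/newlink/dellink as before ... */
        .get_linkxstats_size = br_get_linkxstats_size,
        .fill_linkxstats     = br_fill_linkxstats,
      };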
    • ipv6: Generic tunnel cleanup · 79ecb90e
      Tom Herbert committed
      A few generic changes to generalize tunnels in IPv6:
        - Export ip6_tnl_change_mtu so that it can be called by ip6_gre
        - Add tun_hlen to ip6_tnl structure.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Create common functions for transmit · 182a352d
      Tom Herbert committed
      Create common functions for both IPv4 and IPv6 GRE in the transmit
      path. These are put into gre.h.

      Common functions are for:
        - GRE checksum calculation. Move gre_checksum to gre.h.
        - Building a GRE header. Move GRE build_header and rename it
          gre_build_header.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
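
      For reference, both paths build on the uapi base header below;
      csum/key/seq words follow it depending on the flags (push sequence
      sketched, function name hypothetical):

      struct gre_base_hdr {
        __be16 flags;
        __be16 protocol;
      };

      static void gre_push_header_sketch(struct sk_buff *skb, int hdr_len,
                                         __be16 flags, __be16 proto)
      {
        struct gre_base_hdr *greh;

        greh = (struct gre_base_hdr *)skb_push(skb, hdr_len);
        greh->flags = flags;      /* GRE_CSUM | GRE_KEY | GRE_SEQ as needed */
        greh->protocol = proto;
        /* optional csum/key/seq words follow greh, depending on flags */
      }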
    • ipv6: Create ip6_tnl_xmit · 8eb30be0
      Tom Herbert committed
      This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
      users like GRE will be able to call this. The original ip6_tnl_xmit
      function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
      function).
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Move utility functions to common headers · 95f5c64c
      Tom Herbert committed
      Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
      for an IPv6 GRE implementation (that is, they are protocol agnostic).
      
      These include:
        - GRE flag handling functions are moved to gre.h
        - GRE build_header is moved to gre.h and renamed gre_build_header
        - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
        - iptunnel_pull_header is taken out of gre_parse_header. This is
          now done by the caller. The header length is returned from
          gre_parse_header in an int* argument.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Cleanup IPv6 tunnel receive path · 0d3c703a
      Tom Herbert committed
      Some basic changes to make the IPv6 tunnel receive path look more
      like the IPv4 path:
        - Make ip6_tnl_rcv non-static and export it so that GREv6 and
          others can call it
        - Make ip6_tnl_rcv look like ip_tunnel_rcv
        - Switch to gro_cells_receive
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet committed
      A large sendmsg()/write() holds the socket lock for the duration of
      the call, unless the sk->sk_sndbuf limit is hit. This is bad because
      incoming packets are parked in the socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out-of-order queue with additional
      cpu overhead, and also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can
      become quite big if the copy from user space triggers IO (page
      faults).
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      The kernel should know better, right?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership, so it is not
      equivalent to a {release_sock(sk); lock_sock(sk);} pair; this
      preserves the implicit atomicity rules that sendmsg() was giving to
      (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
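
      The call-site pattern, simplified from the description above (not the
      verbatim patch):

      new_segment:
        if (!sk_stream_memory_free(sk))
            goto wait_for_sndbuf;

        /* process packets parked in the backlog while keeping socket
         * ownership (unlike a release_sock()/lock_sock() pair) */
        sk_flush_backlog(sk);

        /* ... then allocate the next 64KB skb and copy payload ... */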
    • Minimal fix-up of bad hashing behavior of hash_64() · 689de1d6
      Linus Torvalds committed
      This is a fairly minimal fixup to the horribly bad behavior of hash_64()
      with certain input patterns.
      
      In particular, because the multiplicative value used for the 64-bit hash
      was intentionally bit-sparse (so that the multiply could be done with
      shifts and adds on architectures without hardware multipliers), some
      bits did not get spread out very much.  In particular, certain fairly
      common bit ranges in the input (roughly bits 12-20: commonly with the
      most information in them when you hash things like byte offsets in files
      or memory that have block factors that mean that the low bits are often
      zero) would not necessarily show up much in the result.
      
      There's a bigger patch-series brewing to fix up things more completely,
      but this is the fairly minimal fix for the 64-bit hashing problem.  It
      simply picks a much better constant multiplier, spreading the bits out a
      lot better.
      
      NOTE! For 32-bit architectures, the bad old hash_64() remains the same
      for now, since 64-bit multiplies are expensive.  The bigger hashing
      cleanup will replace the 32-bit case with something better.
      
      The new constants were picked by George Spelvin who wrote that bigger
      cleanup series.  I just picked out the constants and part of the comment
      from that series.
      
      Cc: stable@vger.kernel.org
      Cc: George Spelvin <linux@horizon.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
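
      The fix boils down to a better multiplicative constant; a sketch of
      the 64-bit path, assuming the GOLDEN_RATIO_64 value the fix
      introduces:

      #define GOLDEN_RATIO_64 0x61C8864680B583EBull /* ~2^64/phi, odd, dense bits */

      static inline u32 hash_64(u64 val, unsigned int bits)
      {
        /* the multiply spreads input bits across the product; its high
         * bits are the best mixed, so keep the top 'bits' of them */
        return (u32)((val * GOLDEN_RATIO_64) >> (64 - bits));
      }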
  7. 02 May 2016, 3 commits
  8. 30 April 2016, 5 commits
  9. 29 April 2016, 2 commits
    • numa: fix /proc/<pid>/numa_maps for THP · 28093f9f
      Gerald Schaefer committed
      In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
      because the layouts may differ depending on the architecture.  On s390
      this will lead to inaccurate numa_maps accounting in /proc because of
      misguided pte_present() and pte_dirty() checks on the fake pte.
      
      On other architectures pte_present() and pte_dirty() may work by chance,
      but there may be an issue with direct-access (dax) mappings w/o
      underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
      available.  In vm_normal_page() the fake pte will be checked with
      pte_special() and because there is no "special" bit in a pmd, this will
      always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
      skipped.  On dax mappings w/o struct pages, an invalid struct page
      pointer would then be returned that can crash the kernel.
      
      This patch fixes the numa_maps THP handling by introducing new "_pmd"
      variants of the can_gather_numa_stats() and vm_normal_page() functions.
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
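
      The shape of the fix, sketched from the description (simplified): the
      pmd gets its own helper instead of being cast to a pte.

      static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
                                                    struct vm_area_struct *vma,
                                                    unsigned long addr)
      {
        struct page *page;
        int nid;

        if (!pmd_present(pmd))
            return NULL;

        /* pmd-aware lookup; a pmd has no "special" bit, so the pte-based
         * vm_normal_page() cannot be used here */
        page = vm_normal_page_pmd(vma, addr, pmd);
        if (!page)
            return NULL;

        if (PageReserved(page))
            return NULL;

        nid = page_to_nid(page);
        if (!node_isset(nid, node_states[N_MEMORY]))
            return NULL;

        return page;
      }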
    • thp: keep huge zero page pinned until tlb flush · aa88b68c
      Kirill A. Shutemov committed
      Andrea has found[1] a race condition between MMU-gather based TLB
      flush and split_huge_page() or the shrinker freeing the huge zero
      page under us (patches 1/2 and 2/2 respectively).
      
      With the new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
      page pinned until flush is complete and the pin prevents the page from
      being split under us.
      
      We still need patch 2/2.  This is a simplified version of Andrea's patch.
      We don't need fancy encoding.
      
      [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>