1. 12 May 2016, 2 commits
    • net: l3mdev: Add hook in ip and ipv6 · 74b20582
      Committed by David Ahern
      Currently the VRF driver uses the rx_handler to switch the skb device
      to the VRF device. Switching the dev prior to the ip / ipv6 layer
      means the VRF driver has to duplicate IP/IPv6 processing which adds
      overhead and makes features such as retaining the ingress device index
      more complicated than necessary.
      
      This patch moves the hook to the L3 layer just after the first NF_HOOK
      for PRE_ROUTING. This location makes exposing the original ingress device
      trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
      in the future.
      
      dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
      with the switched device through the packet taps to maintain current
      behavior (tcpdump can be used on either the vrf device or the enslaved
      devices).
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
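
      As a rough illustration of the hook placement described above, a
      minimal sketch along the lines of ip_rcv_finish(); the helper name
      l3mdev_ip_rcv and the surrounding details are simplified assumptions
      based on this description, not a verbatim copy of the patch:

      static int ip_rcv_finish(struct net *net, struct sock *sk,
                               struct sk_buff *skb)
      {
          /* If the ingress device is enslaved to an L3 master (VRF)
           * device, let it switch skb->dev; it may also consume the
           * skb entirely.
           */
          skb = l3mdev_ip_rcv(skb);
          if (!skb)
              return NET_RX_SUCCESS;

          /* ... normal route lookup and delivery continue here ... */
          return dst_input(skb);
      }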
    • tcp: replace cnt & rtt with struct in pkts_acked() · 756ee172
      Committed by Lawrence Brakmo
      Replace 2 arguments (cnt and rtt) in the congestion control modules'
      pkts_acked() function with a struct. This will allow adding more
      information without having to modify existing congestion control
      modules (tcp_nv in particular needs the bytes in flight at the time
      the packet was sent).
      
      As proposed by Neal Cardwell in his comments to the tcp_nv patch.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
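
      A hedged sketch of what such a struct argument might look like; the
      field names below are illustrative assumptions, not quoted from the
      patch:

      /* one sample of acked packets, handed to the CC module */
      struct ack_sample {
          u32 pkts_acked;
          s32 rtt_us;
          /* future fields (e.g. bytes in flight at send time, for
           * tcp_nv) can be added without touching every module */
      };

      /* in struct tcp_congestion_ops the callback then becomes: */
      void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);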
  2. 11 May 2016, 3 commits
  3. 10 May 2016, 2 commits
  4. 09 May 2016, 8 commits
  5. 07 May 2016, 4 commits
    • udp_offload: Set encapsulation before inner completes. · 229740c6
      Committed by Jarno Rajahalme
      UDP tunnel segmentation code relies on the inner offsets being set for
      a UDP tunnel GSO packet, but the inner *_complete() functions will
      set the inner offsets only if 'encapsulation' is set before calling
      them.  Currently, udp_gro_complete() sets 'encapsulation' only after
      the inner *_complete() functions are done.  This leaves the inner
      offsets with invalid values after udp_gro_complete() returns, which
      in turn makes it impossible to properly segment the packet in case
      it needs to be forwarded, visible to the user either as invalid
      packets being sent or as packet loss.
      
      This patch fixes this by setting skb's 'encapsulation' in
      udp_gro_complete() before calling into the inner complete functions,
      and by making each possible UDP tunnel gro_complete() callback set the
      inner_mac_header to the beginning of the tunnel payload.
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
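
      A minimal sketch of the ordering this fix establishes; the body is
      heavily simplified and the details are assumptions, not the literal
      patch:

      static int udp_gro_complete(struct sk_buff *skb, int nhoff)
      {
          struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);

          /* mark the skb as encapsulated *before* the inner
           * *_complete() callbacks run, so that they set the inner
           * offsets */
          skb->encapsulation = 1;
          skb_set_inner_mac_header(skb, nhoff + sizeof(*uh));

          /* ... then call the tunnel's gro_complete callback, which
           * ends in the inner tcp/ipv4/ipv6 *_complete() ... */
          return 0;
      }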
    • udp_tunnel: Remove redundant udp_tunnel_gro_complete(). · 43b8448c
      Committed by Jarno Rajahalme
      The setting of the UDP tunnel GSO type is already performed by
      udp[46]_gro_complete().
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: wire in data and data_end for cls_act_bpf · db58ba45
      Committed by Alexei Starovoitov
      Allow cls_bpf and act_bpf programs to access the skb->data and
      skb->data_end pointers.
      The bpf helpers that change skb->data need to update the data_end
      pointer as well. The verifier checks that programs always reload the
      data and data_end pointers after calls to such bpf helpers.
      We cannot add a 'data_end' pointer to struct qdisc_skb_cb directly,
      since that struct is embedded as-is by infiniband ipoib, so a wrapper
      struct is needed (see the sketch after this entry).
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
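
      A hedged sketch of the wrapper arrangement mentioned above; the
      struct name is an assumption for illustration:

      /* qdisc_skb_cb cannot grow (infiniband ipoib embeds it as-is),
       * so data_end rides in a wrapper that overlays skb->cb[] */
      struct bpf_skb_data_end {
          struct qdisc_skb_cb qdisc_cb;
          void *data_end;
      };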
    • bpf: direct packet access · 969bf05e
      Committed by Alexei Starovoitov
      Extended BPF carried over two instructions from classic BPF to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do a length check for every
      access. When BPF is processing 20M packets per second, a single
      LD_ABS after JIT consumes 3% of the CPU. Hence the need to optimize
      further by amortizing the cost of 'off < skb_headlen' over multiple
      packet accesses.
      One option is to introduce two new eBPF instructions LD_ABS_DW and
      LD_IND_DW with usage similar to skb_header_pointer(). The kernel part
      for the interpreter and the x64 JIT was implemented in [1], but such
      new insns behave like the old ld_abs and abort the program with
      'return 0' if the access is beyond linear data. Such hidden control
      flow is hard to work around, and changing JITs and rolling out a new
      llvm is inconvenient.
      
      Therefore allow cls_bpf/act_bpf programs to access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids introduction of new instructions. llvm stays
      the same and all JITs stay the same, but verifier has to work extra hard
      to prove safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      The skb->data is NET_IP_ALIGNED, so for common cases the verifier can
      check the alignment. Complex packet parsers where the packet pointer
      is adjusted incrementally cannot be tracked for alignment, so allow
      byte access in such cases, and misaligned access on architectures
      that define efficient_unaligned_access.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 May 2016, 6 commits
    • netfilter: conntrack: use a single expectation table for all namespaces · 0a93aaed
      Committed by Florian Westphal
      We already include netns address in the hash and compare the netns pointers
      during lookup, so even if namespaces have overlapping addresses entries
      will be spread across the expectation table.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
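
      A simplified sketch of the idea, with hypothetical names and a
      reduced hash; the real hash covers the full expectation tuple:

      /* mixing the netns pointer into the hash spreads entries from
       * different namespaces across the shared table; lookup also
       * compares the netns pointers themselves */
      static unsigned int expect_hash(const struct net *net,
                                      const struct nf_conntrack_tuple *t)
      {
          return jhash_2words((__force __u32)(unsigned long)net,
                              t->src.u.all ^ t->dst.u.all,
                              expect_hashrnd) % nf_ct_expect_hsize;
      }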
    • net/mlx4: Avoid wrong virtual mappings · 73898db0
      Committed by Haggai Abramovsky
      The dma_alloc_coherent() function returns a virtual address which can
      be used for coherent access to the underlying memory.  On some
      architectures, like arm64, undefined behavior results if this memory is
      also accessed via virtual mappings that are not coherent.  Because of
      their undefined nature, operations like virt_to_page() return garbage
      when passed virtual addresses obtained from dma_alloc_coherent().  Any
      subsequent mappings via vmap() of the garbage page values are unusable
      and result in bad things like bus errors (synchronous aborts in ARM64
      speak).
      
      The mlx4 driver contains code that does the equivalent of
      vmap(virt_to_page(dma_alloc_coherent())); this results in an oops
      when the device is opened.
      
      Prevent the Ethernet driver from running this problematic code by
      forcing it to allocate contiguous memory. The InfiniBand driver
      first tries to allocate contiguous memory, but falls back to
      fragmented memory on failure.
      Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
      Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
      Reported-by: David Daney <david.daney@cavium.com>
      Tested-by: Sinan Kaya <okaya@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
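
      The broken pattern, reduced to its essence (an illustrative
      fragment, not the driver's actual code):

      /* dma_alloc_coherent() does not return a linear-map address on
       * all architectures, so virt_to_page() on it is undefined */
      void *buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
      struct page *page = virt_to_page(buf);   /* garbage on e.g. arm64 */

      /* vmap() of that garbage page gives an unusable mapping and,
       * later, bus errors (synchronous aborts on arm64) */
      void *remap = vmap(&page, 1, VM_MAP, PAGE_KERNEL);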
    • byteswap: try to avoid __builtin_constant_p gcc bug · 7322dd75
      Committed by Arnd Bergmann
      This is another attempt to avoid a regression in wwn_to_u64() after
      it started using get_unaligned_be64(), which in turn ran into a bug
      in gcc-4.9 through 6.1.
      
      The regression got introduced due to the combination of two separate
      workarounds (commits e3bde956: "include/linux/unaligned: force
      inlining of byteswap operations" and ef3fb242: "scsi: fc: use
      get/put_unaligned64 for wwn access") that each try to sidestep distinct
      problems with gcc behavior (code growth and increased stack usage).
      
      Unfortunately after both have been applied, a more serious gcc bug has
      been uncovered, leading to incorrect object code that discards part of a
      function and causes undefined behavior.
      
      Since part of the problem is how __builtin_constant_p gets evaluated
      on an argument passed by reference into an inline function, this
      avoids the use of __builtin_constant_p() for all architectures that
      set CONFIG_ARCH_USE_BUILTIN_BSWAP.  Most architectures do not set
      ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not
      suffer from the problem in the qla2xxx driver, but they might still
      run into it elsewhere.
      
      Both of the original workarounds were only merged in the 4.6 kernel, and
      the bug that is fixed by this patch should only appear if both are
      there, so we probably don't need to backport the fix.  On the other
      hand, it works by simplifying the code path and should not have any
      negative effects.
      
      [arnd@arndb.de: fix older gcc warnings]
        (http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel)
      Link: https://lkml.org/lkml/headers/2016/4/12/1103
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
      Fixes: e3bde956 ("include/linux/unaligned: force inlining of byteswap operations")
      Fixes: ef3fb242 ("scsi: fc: use get/put_unaligned64 for wwn access")
      Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfel
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3
      Tested-by: Quinn Tran <quinn.tran@qlogic.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Jan Hubicka <hubicka@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
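
      A hedged sketch of the shape of the change in the swab macros;
      treat the exact macro spellings as recalled assumptions rather
      than quotations:

      /* when the architecture opts in to the compiler builtin
       * (CONFIG_ARCH_USE_BUILTIN_BSWAP), use it unconditionally and
       * avoid routing constants through __builtin_constant_p() */
      #ifdef __HAVE_BUILTIN_BSWAP32__
      #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
      #else
      #define __swab32(x)                          \
          (__builtin_constant_p((__u32)(x)) ?      \
           ___constant_swab32(x) :                 \
           __fswab32(x))
      #endif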
    • mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Committed by Andrea Arcangeli
      After the THP refcounting change, obtaining a compound page from
      get_user_pages() no longer allows us to assume the entire compound
      page is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally, instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm"
      passed to get_user_pages().  However, it's a non-trivial change to
      pass the "pmd" status belonging to the "mm" walked by
      get_user_pages up the stack (up to the caller of get_user_pages).
      So the fix just checks whether there is no single pte mapping on
      the page returned by get_user_pages, and in turn whether the caller
      can assume that the whole compound page is mapped in the current
      "mm" (in a pmd_trans_huge()).  In that case the entire compound
      page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  Besides
      being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the
      pmd split happened as a result of memory pressure.
      
      Without this fix, after a MADV_DONTNEED (as invoked by QEMU during
      postcopy live migration or ballooning) or after generic swapping
      (with a failure in split_huge_page() that would only result in pmd
      splitting and not a physical page split), KVM would map the whole
      compound page into the shadow pagetables, even though regular faults
      or userfaults (like UFFDIO_COPY) may map regular pages into the
      primary MMU as a result of the pte faults, leading to guest mode and
      userland mode going out of sync and not working on the same memory
      at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of many
      MMU notifier users) will need the same information if it doesn't
      want to run a flood of get_user_pages_fast calls and can support
      multiple granularities in its secondary MMU mappings, so I think it
      is justified to expose this not just to KVM.

      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c, but that currently has all kinds of KVM data
      structures in it, so it's definitely not a cut-and-paste job, and I
      couldn't do a cleaner fix than this one for 4.6.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
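
      A heavily hedged sketch of the kind of check described above; the
      helper name is hypothetical and the _mapcount convention is an
      assumption for illustration:

      /* a compound page whose _mapcount records no pte mappings can be
       * assumed to be mapped by a single huge pmd in this mm, so a
       * secondary MMU may map the entire compound page at once */
      static inline bool page_mapped_as_huge(struct page *page)
      {
          return PageTransCompound(page) &&
                 atomic_read(&page->_mapcount) < 0;
      }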
    • rapidio/mport_cdev: fix uapi type definitions · 4e1016da
      Committed by Alexandre Bounine
      Fix problems in the uapi definitions reported by Gabriel Laskar (see
      https://lkml.org/lkml/2016/4/5/205 for details):
      
       - move public header file rio_mport_cdev.h to include/uapi/linux directory
       - change types in data structures passed as IOCTL parameters
       - improve parameter checking in some IOCTL service routines
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Reported-by: Gabriel Laskar <gabriel@lse.epita.fr>
      Tested-by: Barry Wood <barry.wood@idt.com>
      Cc: Gabriel Laskar <gabriel@lse.epita.fr>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
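
      To illustrate the type-fixing bullet, a hypothetical uapi struct
      (not one of the driver's actual structures):

      #include <linux/types.h>

      /* structures crossing the ioctl boundary must use fixed-width
       * __u32/__u64 types, never kernel-internal u32/u64 or
       * variable-width long, so that 32- and 64-bit userspace see the
       * same layout */
      struct rio_example_transfer {
          __u64 rio_addr;   /* device-side address */
          __u32 length;     /* transfer length in bytes */
          __u32 pad0;       /* explicit padding keeps the layout unambiguous */
      };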
    • mm: memcontrol: let v2 cgroups follow changes in system swappiness · 4550c4e1
      Committed by Johannes Weiner
      Cgroup2 currently doesn't have a per-cgroup swappiness setting.  We
      might want to add one later - that's a different discussion - but until
      we do, the cgroups should always follow the system setting.  Otherwise
      it will be unchangeably set to whatever the ancestor inherited from the
      system setting at the time of cgroup creation.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
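
      A hedged sketch of the resulting behavior (simplified; helper and
      field details are assumptions):

      static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
      {
          /* cgroup2 has no per-cgroup swappiness knob: always follow
           * the live system setting instead of a value captured at
           * cgroup creation */
          if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
              return vm_swappiness;

          /* root cgroup or memcg disabled: system setting as well */
          if (mem_cgroup_disabled() || !memcg->css.parent)
              return vm_swappiness;

          return memcg->swappiness;
      }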
  7. 05 May 2016, 10 commits
  8. 04 May 2016, 5 commits