1. 12 Dec, 2016 1 commit
  2. 10 Dec, 2016 2 commits
    • udp: add batching to udp_rmem_release() · 6b229cf7
      Eric Dumazet committed
      If udp_recvmsg() constantly releases sk_rmem_alloc
      for every read packet, it gives opportunity for
      producers to immediately grab spinlocks and desperately
      try adding another packet, causing false sharing.

      We can add a simple heuristic to give the signal
      by batches of ~25 % of the queue capacity.

      This patch considerably increases performance under
      flood by about 50 %, since the thread draining the queue
      is no longer slowed by false sharing.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
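The batching heuristic described in 6b229cf7 can be pictured with a small user-space sketch. This is not the kernel code: `struct rx_sock`, `rmem_release()` and the `forward_deficit` field are hypothetical names chosen for illustration, and the shared counter is a plain `int` here where a concurrent implementation would need an atomic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the receiver-side socket state. */
struct rx_sock {
    int rcvbuf;          /* queue capacity in bytes */
    int rmem_alloc;      /* bytes charged to the queue (shared with producers) */
    int forward_deficit; /* bytes consumed but not yet released (consumer-private) */
};

/*
 * Record that `size` bytes were consumed. While `partial` is true the
 * release is deferred until the accumulated deficit reaches a quarter
 * of the queue capacity, so the shared counter is written far less often.
 * Returns the number of bytes actually released by this call.
 */
static int rmem_release(struct rx_sock *sk, int size, bool partial)
{
    assert(size >= 0);

    sk->forward_deficit += size;
    if (partial && sk->forward_deficit < (sk->rcvbuf >> 2))
        return 0; /* keep batching; shared counter untouched */

    int released = sk->forward_deficit;
    sk->forward_deficit = 0;
    sk->rmem_alloc -= released; /* would be an atomic subtraction in concurrent code */
    return released;
}
```

With a capacity of 1000 bytes, three 100-byte reads produce a single 300-byte release instead of three separate writes to the shared counter.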
    • udp: copy skb->truesize in the first cache line · c84d9490
      Eric Dumazet committed
      In the UDP RX handler, we currently clear skb->dev before the skb
      is added to the receive queue, because the device pointer is no longer
      available once we exit the RCU section.

      Since this first cache line is always hot, let's reuse this space
      to store skb->truesize and thus avoid a cache line miss at
      udp_recvmsg()/udp_skb_destructor time while the receive queue
      spinlock is held.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 09 Dec, 2016 8 commits
  4. 08 Dec, 2016 1 commit
  5. 07 Dec, 2016 7 commits
    • acpi, nfit, libnvdimm: fix / harden ars_status output length handling · efda1b5d
      Dan Williams committed
      Given ambiguities in the ACPI 6.1 definition of the "Output (Size)"
      field of the ARS (Address Range Scrub) Status command, a firmware
      implementation may in practice return 0, 4, or 8 to indicate that there
      is no output payload to process.
      
      The specification states "Size of Output Buffer in bytes, including this
      field.". However, 'Output Buffer' is also the name of the entire
      payload, and earlier in the specification it states "Max Query ARS
      Status Output Buffer Size: Maximum size of buffer (including the Status
      and Extended Status fields)".
      
      Without this fix, if the BIOS happens to return 0, it causes memory
      corruption, as evidenced by this result from the acpi_nfit_ctl() unit
      test.
      
       ars_status00000000: 00020000 00000000                    ........
       BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff)
       kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC
       task: ffff8803332d2ec0 task.stack: ffffc9000174c000
       RIP: 0010:[<ffffffff814cfe72>]  [<ffffffff814cfe72>] __memcpy+0x12/0x20
       RSP: 0018:ffffc9000174f9a8  EFLAGS: 00010246
       RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56
       RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000
       RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8
       R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0
       FS:  00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0
       Stack:
        ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac
        0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000
        0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246
       Call Trace:
        [<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit]
        [<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test]
      
      Cc: <stable@vger.kernel.org>
      Fixes: 747ffe11 ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
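The failure mode in efda1b5d is an unsigned length that underflows when the reported size is smaller than the header being subtracted from it. A minimal sketch of a hardened length computation follows. It assumes an 8-byte header (a 4-byte status field plus the 4-byte size field); `ARS_HDR_LEN` and `ars_payload_len()` are hypothetical names, and the clamp to `max_payload` is an extra safety measure of this sketch, not something the commit message describes.

```c
#include <stdint.h>

/* Assumption: the header is a 4-byte status plus the 4-byte size field. */
#define ARS_HDR_LEN 8u

/*
 * Return how many payload bytes follow the header.
 * Reported sizes of 0, 4 or 8 all mean "no payload", so anything up to
 * the header length yields 0 instead of wrapping around to a huge
 * unsigned value.
 */
static uint32_t ars_payload_len(uint32_t reported_size, uint32_t max_payload)
{
    if (reported_size <= ARS_HDR_LEN)
        return 0;

    uint32_t len = reported_size - ARS_HDR_LEN;
    return len < max_payload ? len : max_payload; /* never exceed the caller's buffer */
}
```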
    • netfilter: ingress: translate 0 nf_hook_slow retval to -1 · df122f58
      Florian Westphal committed
      The caller assumes that < 0 means that skb was stolen (or free'd).
      
      All other return values continue skb processing.
      
      nf_hook_slow returns 3 different return value types:
      
      A) a (negative) errno value: the skb was dropped (NF_DROP, e.g.
      by iptables '-j DROP' rule).
      
      B) 0. The skb was stolen by the hook or queued to userspace.
      
      C) 1. all hooks returned NF_ACCEPT so the caller should invoke
         the okfn so packet processing can continue.
      
      nft ingress facility currently doesn't have the 'okfn' that
      the NF_HOOK() macros use; there is no nfqueue support either.
      
      So 1 means that nf_hook_ingress() caller should go on processing the skb.
      
      In order to allow use of NF_STOLEN from ingress we need to translate
      this to an errno number, else we'd crash because we continue with an
      already-free'd (or about to be free'd) skb.

      The errno value isn't checked; it's just important that it's less than 0,
      so return -1.

      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
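The return-value convention described in df122f58 can be summarized in a few lines. The sketch below is illustrative only: `ingress_translate()` and `caller_should_continue()` are hypothetical helpers, not kernel functions, and the hook result is passed in as a plain integer rather than obtained from a real hook traversal.

```c
#include <assert.h>

/*
 * Map a hook-traversal result onto what the ingress caller expects:
 *   negative errno -> skb was dropped, stop processing
 *   0              -> skb was stolen (or queued), stop processing
 *   1              -> all hooks accepted, keep processing
 * The caller only tests for "< 0", so 0 has to become a negative value.
 */
static int ingress_translate(int hook_ret)
{
    assert(hook_ret <= 1); /* only the three result types above exist */

    if (hook_ret == 0)
        return -1; /* the exact negative value is never inspected */
    return hook_ret;
}

/* The caller-side check: anything below zero means the skb is gone. */
static int caller_should_continue(int ret)
{
    return ret >= 0;
}
```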
    • netfilter: x_tables: pack percpu counter allocations · ae0ac0ed
      Florian Westphal committed
      Instead of allocating each xt_counter individually, allocate 4k chunks
      and then use these for counter allocation requests.

      This should speed up rule evaluation by increasing data locality, and
      should also speed up ruleset loading because we reduce calls to the
      percpu allocator.

      As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
      arches with 64k page size.

      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
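The chunked allocation in ae0ac0ed amounts to a bump allocator over fixed 4 KiB blocks. The following user-space sketch uses `calloc()` in place of the percpu allocator; `struct counter`, `struct alloc_state` and `counter_alloc()` are hypothetical names, and freeing is left out (in this layout the first counter handed out from a chunk is also the chunk's base address).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096 /* fixed chunk size, deliberately not the page size */

struct counter {
    uint64_t pcnt; /* packets */
    uint64_t bcnt; /* bytes */
};

_Static_assert(BLOCK_SIZE >= 2 * sizeof(struct counter),
               "a chunk must hold at least two counters");

struct alloc_state {
    char  *mem; /* current chunk, or NULL when a fresh one is needed */
    size_t off; /* offset of the next free slot inside the chunk */
};

/* Hand out the next counter slot, starting a new chunk when necessary. */
static struct counter *counter_alloc(struct alloc_state *st)
{
    if (!st->mem) {
        st->mem = calloc(1, BLOCK_SIZE);
        if (!st->mem)
            return NULL;
        st->off = 0;
    }

    struct counter *c = (struct counter *)(st->mem + st->off);
    st->off += sizeof(*c);

    /* No room for another counter: force a new chunk on the next call. */
    if (st->off > BLOCK_SIZE - sizeof(*c))
        st->mem = NULL;
    return c;
}
```

Counters allocated back to back end up adjacent in memory, which is the data-locality effect the commit message refers to.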
    • netfilter: x_tables: pass xt_counters struct to counter allocator · f28e15ba
      Florian Westphal committed
      Keeps some noise away from a followup patch.

      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: x_tables: pass xt_counters struct instead of packet counter · 4d31eef5
      Florian Westphal committed
      On SMP we overload the packet counter (unsigned long) to contain
      the percpu offset.  Hide this from callers and pass the xt_counters
      address instead.

      Preparation patch to allocate the percpu counters in page-sized batch
      chunks.

      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: decouple nf_hook_entry and nf_hook_ops · d415b9eb
      Aaron Conole committed
      During nfhook traversal we only need a very small subset of
      nf_hook_ops members.
      
      We need:
      - next element
      - hook function to call
      - hook function priv argument
      
      Bridge netfilter also needs 'thresh'; can be obtained via ->orig_ops.
      
      nf_hook_entry struct is now 32 bytes on x86_64.
      
      A followup patch will turn the run-time list into an array that only
      stores hook functions plus their priv arguments, eliminating the ->next
      element.

      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: introduce accessor functions for hook entries · 0aa8c57a
      Aaron Conole committed
      This allows easier future refactoring.

      Signed-off-by: Aaron Conole <aconole@bytheb.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  6. 06 Dec, 2016 5 commits
    • locking/ww_mutex: Use relaxed atomics · f4ec57b6
      Peter Zijlstra committed
      The stamp is a sequence number; we don't care about memory ordering.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
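The reasoning in f4ec57b6 carries over to C11 atomics: a counter that only needs to hand out unique, increasing values requires atomicity but no ordering against other memory accesses. The sketch below is a user-space analogue, not the kernel change itself; `stamp` and `next_stamp()` are hypothetical names.

```c
#include <stdatomic.h>

/* Global sequence counter; only the uniqueness of each value matters. */
static atomic_ulong stamp;

/*
 * Return the next sequence number. The increment is atomic, so no two
 * callers get the same value, but memory_order_relaxed imposes no
 * ordering on surrounding loads and stores.
 */
static unsigned long next_stamp(void)
{
    return atomic_fetch_add_explicit(&stamp, 1, memory_order_relaxed) + 1;
}
```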
    • x86/uaccess, sched/preempt: Verify access_ok() context · 7c478895
      Peter Zijlstra committed
      I recently encountered wreckage because access_ok() was used where it
      should not be; add an explicit WARN when access_ok() is used wrongly.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: add prog_digest and expose it via fdinfo/netlink · 7bd509e3
      Daniel Borkmann committed
      When loading a BPF program via bpf(2), calculate the digest over
      the program's instruction stream and store it in struct bpf_prog's
      digest member. This is done at a point in time before any instructions
      are rewritten by the verifier. Any unstable map file descriptor
      number in the imm field is zeroed for the hash.
      
      fdinfo example output for progs:
      
        # cat /proc/1590/fdinfo/5
        pos:          0
        flags:        02000002
        mnt_id:       11
        prog_type:    1
        prog_jited:   1
        prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
        memlock:      4096
      
      When programs are pinned and retrieved by an ELF loader, the loader
      can check the program's digest through fdinfo and compare it against
      one that was generated over the ELF file's program section to see
      if the program needs to be reloaded. Furthermore, this can also be
      exposed through other means such as netlink in case of a tc cls/act
      dump (or xdp in future), but also through tracepoints or other
      facilities to identify the program. Other than that, the digest can
      also serve as a base name for the work in progress kallsyms support
      of programs. The digest doesn't depend/select the crypto layer, since
      we need to keep dependencies to a minimum. iproute2 will get support
      for this facility.

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
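The normalization step described in 7bd509e3 (zero out unstable map file descriptors, then hash the instruction stream) can be sketched as follows. Several things here are assumptions for illustration: `struct insn` mirrors the 8-byte BPF instruction layout, `prog_digest()` and `MAX_INSNS` are hypothetical, and FNV-1a is only a stand-in so the example stays self-contained; it is not the hash the kernel uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct insn {               /* mirrors the 8-byte BPF instruction layout */
    uint8_t code;
    uint8_t dst_reg:4;
    uint8_t src_reg:4;
    int16_t off;
    int32_t imm;
};
_Static_assert(sizeof(struct insn) == 8, "unexpected padding");

#define OP_LD_IMM64   0x18  /* BPF_LD | BPF_IMM | BPF_DW, a two-slot instruction */
#define PSEUDO_MAP_FD 1     /* src_reg marker: imm holds a map file descriptor */
#define MAX_INSNS     64    /* arbitrary bound for this sketch */

/* Stand-in hash (64-bit FNV-1a). */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Hash a copy of the program in which map-fd immediates are zeroed. */
static uint64_t prog_digest(const struct insn *prog, size_t n)
{
    struct insn copy[MAX_INSNS];

    assert(n <= MAX_INSNS);
    memcpy(copy, prog, n * sizeof(*prog));

    for (size_t i = 0; i < n; i++) {
        if (copy[i].code != OP_LD_IMM64)
            continue;
        if (copy[i].src_reg == PSEUDO_MAP_FD) {
            copy[i].imm = 0;            /* low half of the fd-carrying immediate */
            if (i + 1 < n)
                copy[i + 1].imm = 0;    /* high half lives in the next slot */
        }
        i++;                            /* skip the second slot of the instruction */
    }
    return fnv1a64(copy, n * sizeof(*copy));
}
```

Two loads of the same program that differ only in the map file descriptor number then yield the same digest, which is what lets a loader compare a pinned program against the ELF section it came from.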
    • tcp: tsq: move tsq_flags close to sk_wmem_alloc · 7aa5470c
      Eric Dumazet committed
      Having tsq_flags in the same cache line as sk_wmem_alloc
      makes a lot of sense. Both fields are changed from tcp_wfree()
      and more generally by various TSQ related functions.

      A prior patch made room in struct sock and added sk_tsq_flags;
      this patch deletes tsq_flags from struct tcp_sock.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tsq: add tsq_flags / tsq_enum · 40fc3423
      Eric Dumazet committed
      This is a cleanup, to ease code review of following patches.

      Old 'enum tsq_flags' is renamed, and a new enumeration is added
      with the flags used in cmpxchg() operations as opposed to
      single bit operations.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
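The split described in 40fc3423, one enumeration of bit numbers for single-bit operations and one of bit masks for compare-and-exchange, might look like the sketch below. It is a user-space approximation with C11 atomics: only two flags are shown, and `try_queue()` is a hypothetical helper rather than a kernel function.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers, for single-bit operations such as test-and-set. */
enum tsq_enum {
    TSQ_THROTTLED,
    TSQ_QUEUED,
};

/* Bit masks, for whole-word compare-and-exchange updates. */
enum tsq_flags {
    TSQF_THROTTLED = 1UL << TSQ_THROTTLED,
    TSQF_QUEUED    = 1UL << TSQ_QUEUED,
};

/*
 * Atomically move the word from "throttled" to "queued".
 * Returns false if it was not throttled or was already queued.
 */
static bool try_queue(atomic_ulong *flags)
{
    unsigned long oval = atomic_load(flags);

    for (;;) {
        if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
            return false;

        unsigned long nval = (oval & ~(unsigned long)TSQF_THROTTLED) | TSQF_QUEUED;

        /* On failure, oval is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(flags, &oval, nval))
            return true;
    }
}
```

Changing two bits in one step is only possible with the mask form, which is why both enumerations are useful side by side.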
  7. 05 Dec, 2016 1 commit
    • netfilter: conntrack: built-in support for DCCP · c51d3901
      Davide Caratti committed
      CONFIG_NF_CT_PROTO_DCCP is no longer a tristate. When set to y, connection
      tracking support for the DCCP protocol is built into nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)||  dccp  |  ipv4  |  ipv6  | nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 469140 | 828755 | 828676 | 6141434
      DCCP     ||   -    | 830566 | 829935 | 6533526

      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  8. 04 Dec, 2016 1 commit
  9. 03 Dec, 2016 6 commits
  10. 02 Dec, 2016 3 commits
    • bpf: BPF for lightweight tunnel infrastructure · 3a0af8fd
      Thomas Graf committed
      Registers new BPF program types which correspond to the LWT hooks:
        - BPF_PROG_TYPE_LWT_IN   => dst_input()
        - BPF_PROG_TYPE_LWT_OUT  => dst_output()
        - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
      
      The separate program types are required to differentiate between the
      capabilities each LWT hook allows:
      
       * Programs attached to dst_input() or dst_output() are restricted and
         may only read the data of an skb. This prevents modification and
         possible invalidation of already validated packet headers on receive
         and the construction of illegal headers while the IP headers are
         still being assembled.
      
       * Programs attached to lwtunnel_xmit() are allowed to modify packet
         content as well as prepending an L2 header via a newly introduced
         helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
         invoked after the IP header has been assembled completely.
      
      All BPF programs receive an skb with L3 headers attached and may return
      one of the following return codes:
      
       BPF_OK - Continue routing as per nexthop
       BPF_DROP - Drop skb and return EPERM
       BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                      (Only valid in lwtunnel_xmit() context)
      
      The return codes are binary compatible with their TC_ACT_
      relatives to ease compatibility.

      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
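The return-code handling described in 3a0af8fd can be illustrated with a small dispatcher. The numeric values below are believed to match the uapi headers (where the BPF codes equal their TC_ACT_ counterparts) but should be checked against the tree in use. `lwt_dispatch()` is a hypothetical helper, and rejecting BPF_REDIRECT outside the xmit context with -EINVAL is a choice made for this sketch, not a statement about kernel behavior.

```c
#include <errno.h>
#include <stdbool.h>

/* LWT program return codes (values assumed from the uapi headers). */
enum { BPF_OK = 0, BPF_DROP = 2, BPF_REDIRECT = 7 };

/* Their traffic-control relatives. */
enum { TC_ACT_OK = 0, TC_ACT_SHOT = 2, TC_ACT_REDIRECT = 7 };

_Static_assert(BPF_OK == TC_ACT_OK &&
               BPF_DROP == TC_ACT_SHOT &&
               BPF_REDIRECT == TC_ACT_REDIRECT,
               "return codes are meant to be binary compatible");

/*
 * Turn a program's return code into a result for the caller:
 *   0            -> continue routing as per nexthop
 *   -EPERM       -> skb dropped
 *   BPF_REDIRECT -> skb handed to another device (xmit context only)
 *   -EINVAL      -> unknown code, or redirect where it is not allowed
 */
static int lwt_dispatch(int prog_ret, bool in_xmit)
{
    switch (prog_ret) {
    case BPF_OK:
        return 0;
    case BPF_DROP:
        return -EPERM;
    case BPF_REDIRECT:
        return in_xmit ? BPF_REDIRECT : -EINVAL;
    default:
        return -EINVAL;
    }
}
```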
    • net/mlx5e: Implement Fragmented Work Queue (WQ) · 1c1b5228
      Tariq Toukan committed
      Add a new type, struct mlx5_frag_buf, which is used to allocate fragmented
      buffers rather than contiguous ones, and make the Completion Queues (CQs)
      use it, as they are big (default of 2MB per CQ in Striding RQ).

      This fixes failures of the type:
      "mlx5e_open_locked: mlx5e_open_channels failed, -12"
      caused by dma_zalloc_coherent() finding insufficient contiguous coherent
      memory to satisfy the driver's request when the user tries to set up more
      or larger rings.

      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
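The idea behind 1c1b5228, replacing one large contiguous allocation with many small fragments and translating an entry index into a fragment plus an offset, can be sketched in user space. Everything named here is hypothetical (`struct frag_buf`, `frag_buf_alloc()`, `frag_buf_entry()`, `FRAG_SIZE`), `calloc()` stands in for the DMA-coherent allocator, and the entry size is assumed to divide the fragment size evenly.

```c
#include <assert.h>
#include <stdlib.h>

#define FRAG_SIZE 4096 /* each fragment is small enough to allocate easily */

struct frag_buf {
    void  **frags;  /* array of independently allocated fragments */
    size_t  nfrags;
    size_t  stride; /* size of one queue entry in bytes */
};

/* Allocate enough fragments to hold nentries entries of stride bytes each. */
static int frag_buf_alloc(struct frag_buf *b, size_t nentries, size_t stride)
{
    assert(nentries > 0 && stride > 0 && FRAG_SIZE % stride == 0);

    size_t per_frag = FRAG_SIZE / stride;

    b->stride = stride;
    b->nfrags = (nentries + per_frag - 1) / per_frag;
    b->frags = calloc(b->nfrags, sizeof(*b->frags));
    if (!b->frags)
        return -1;

    for (size_t i = 0; i < b->nfrags; i++) {
        b->frags[i] = calloc(1, FRAG_SIZE);
        if (!b->frags[i]) {
            while (i--)
                free(b->frags[i]); /* undo the fragments allocated so far */
            free(b->frags);
            return -1;
        }
    }
    return 0;
}

static void frag_buf_free(struct frag_buf *b)
{
    for (size_t i = 0; i < b->nfrags; i++)
        free(b->frags[i]);
    free(b->frags);
}

/* Translate an entry index into its address: pick the fragment, then the slot. */
static void *frag_buf_entry(const struct frag_buf *b, size_t idx)
{
    size_t per_frag = FRAG_SIZE / b->stride;

    return (char *)b->frags[idx / per_frag] + (idx % per_frag) * b->stride;
}
```

Because no entry straddles a fragment boundary, callers can treat the queue as a flat array of entries while the allocator never needs more than `FRAG_SIZE` contiguous bytes at a time.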
    • net: phy: add mdix_ctrl to hold the user configuration. · f4ed2fe3
      Raju Lakkaraju committed
      Add a new parameter, mdix_ctrl, to hold the user configuration.
      The existing mdix field holds the current status, i.e. whether an
      MDI(X) crossover was performed or not.
      mdix_ctrl can be set to ETH_TP_MDI, ETH_TP_MDI_X, or ETH_TP_MDI_AUTO.

      Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 01 Dec, 2016 5 commits