1. 10 3月, 2016 11 次提交
  2. 09 3月, 2016 29 次提交
    • A
      samples/bpf: add map performance test · 26e90931
      Alexei Starovoitov 提交于
      performance tests for hash map and per-cpu hash map
      with and without pre-allocation
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26e90931
    • A
      samples/bpf: stress test bpf_get_stackid · 7dcc42b6
      Alexei Starovoitov 提交于
      increase stress by also calling bpf_get_stackid() from
      various *spin* functions
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7dcc42b6
    • A
      samples/bpf: add bpf map stress test · 9d8b612d
      Alexei Starovoitov 提交于
      this test calls bpf programs from different contexts:
      from inside of slub, from rcu, from pretty much everywhere,
      since it kprobes all spin_lock functions.
      It stresses the bpf hash and percpu map pre-allocation,
      deallocation logic and call_rcu mechanisms.
      User space part adding more stress by walking and deleting map elements.
      
      Note that due to nature bpf_load.c the earlier kprobe+bpf programs are
      already active while loader loads new programs, creates new kprobes and
      attaches them.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d8b612d
    • D
      ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET · e28e87ed
      Daniel Borkmann 提交于
      Helpers like ip_tunnel_info_opts_{get,set}() are only available if
      CONFIG_INET is set, thus add an empty definition into the header for
      the !CONFIG_INET case, where already other empty inline helpers are
      defined.
      
      This avoids ifdef kludge inside filter.c, but also vxlan and geneve
      themself where this facility can only be used with, depend on INET
      being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
      set in flags.
      
      Fixes: 14ca0751 ("bpf: support for access to tunnel options")
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e28e87ed
    • D
      Merge branch 'bpf-map-prealloc' · f14b488d
      David S. Miller 提交于
      Alexei Starovoitov says:
      
      ====================
      bpf: map pre-alloc
      
      v1->v2:
      . fix few issues spotted by Daniel
      . converted stackmap into pre-allocation as well
      . added a workaround for lockdep false positive
      . added pcpu_freelist_populate to be used by hashmap and stackmap
      
      this path set switches bpf hash map to use pre-allocation by default
      and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases
      where full map pre-allocation is too memory expensive.
      
      Some time back Daniel Wagner reported crashes when bpf hash map is
      used to compute time intervals between preempt_disable->preempt_enable
      and recently Tom Zanussi reported a dead lock in iovisor/bcc/funccount
      tool if it's used to count the number of invocations of kernel
      '*spin*' functions. Both problems are due to the recursive use of
      slub and can only be solved by pre-allocating all map elements.
      
      A lot of different solutions were considered. Many implemented,
      but at the end pre-allocation seems to be the only feasible answer.
      As far as pre-allocation goes it also was implemented 4 different ways:
      - simple free-list with single lock
      - percpu_ida with optimizations
      - blk-mq-tag variant customized for bpf use case
      - percpu_freelist
      For bpf style of alloc/free patterns percpu_freelist is the best
      and implemented in this patch set.
      Detailed performance numbers in patch 3.
      Patch 2 introduces percpu_freelist
      Patch 1 fixes simple deadlocks due to missing recursion checks
      Patch 5: converts stackmap to pre-allocation
      Patches 6-9: prepare test infra
      Patch 10: stress test for hash map infra. It attaches to spin_lock
      functions and bpf_map_update/delete are called from different contexts
      Patch 11: stress for bpf_get_stackid
      Patch 12: map performance test
      Reported-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Reported-by: NTom Zanussi <tom.zanussi@linux.intel.com>
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f14b488d
    • A
      samples/bpf: test both pre-alloc and normal maps · c3f85cff
      Alexei Starovoitov 提交于
      extend test coveraged to include pre-allocated and run-time alloc maps
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3f85cff
    • A
      samples/bpf: add map_flags to bpf loader · 89b97607
      Alexei Starovoitov 提交于
      note old loader is compatible with new kernel.
      map_flags are optional
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89b97607
    • A
      samples/bpf: move ksym_search() into library · 3622e7e4
      Alexei Starovoitov 提交于
      move ksym search from offwaketime into library to be reused
      in other tests
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3622e7e4
    • A
      samples/bpf: make map creation more verbose · 618ec9a7
      Alexei Starovoitov 提交于
      map creation is typically the first one to fail when rlimits are
      too low, not enough memory, etc
      Make this failure scenario more verbose
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      618ec9a7
    • A
      bpf: convert stackmap to pre-allocation · 557c0c6e
      Alexei Starovoitov 提交于
      It was observed that calling bpf_get_stackid() from a kprobe inside
      slub or from spin_unlock causes similar deadlock as with hashmap,
      therefore convert stackmap to use pre-allocated memory.
      
      The call_rcu is no longer feasible mechanism, since delayed freeing
      causes bpf_get_stackid() to fail unpredictably when number of actual
      stacks is significantly less than user requested max_entries.
      Since elements are no longer freed into slub, we can push elements into
      freelist immediately and let them be recycled.
      However the very unlikley race between user space map_lookup() and
      program-side recycling is possible:
           cpu0                          cpu1
           ----                          ----
      user does lookup(stackidX)
      starts copying ips into buffer
                                         delete(stackidX)
                                         calls bpf_get_stackid()
      				   which recyles the element and
                                         overwrites with new stack trace
      
      To avoid user space seeing a partial stack trace consisting of two
      merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
      to preserve consistent stack trace delivery to user space.
      Now we can move memset(,0) of left-over element value from critical
      path of bpf_get_stackid() into slow-path of user space lookup.
      Also disallow lookup() from bpf program, since it's useless and
      program shouldn't be messing with collected stack trace.
      
      Note that similar race between user space lookup and kernel side updates
      is also present in hashmap, but it's not a new race. bpf programs were
      always allowed to modify hash and array map elements while user space
      is copying them.
      
      Fixes: d5a3b1f6 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      557c0c6e
    • A
    • A
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov 提交于
      If kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following dead lock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      and deadlocks.
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      At the end pre-allocation of all map elements turned out to be the simplest
      solution and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user space visible behavior.
      
      Since it's impossible to tell whether kprobe is triggered in a safe
      location from kmalloc point of view, use pre-allocation by default
      and introduce new BPF_F_NO_PREALLOC flag.
      
      While testing of per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      Turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is very common pattern used
      in many of iovisor/bcc/tools, so there is additional benefit of
      pre-allocation, since such use cases are must faster.
      
      Since all hash map elements are now pre-allocated we can remove
      atomic increment of htab->count and save few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simlified and customized
      for bpf use case. bt w/smp_align is using cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks continuously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist
      
      hlist+spinlock is the simplest free list with single spinlock.
      As expeceted it has very bad scaling in SMP.
      
      kmalloc is existing implementation which is still available via
      BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
      in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but saves memory, so in cases where map->max_entries can be large
      and number of map update/delete per second is low, it may make
      sense to use it.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c905981
    • A
      bpf: introduce percpu_freelist · e19494ed
      Alexei Starovoitov 提交于
      Introduce simple percpu_freelist to keep single list of elements
      spread across per-cpu singly linked lists.
      
      /* push element into the list */
      void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
      
      /* pop element from the list */
      struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
      
      The object is pushed to the current cpu list.
      Pop first trying to get the object from the current cpu list,
      if it's empty goes to the neigbour cpu list.
      
      For bpf program usage pattern the collision rate is very low,
      since programs push and pop the objects typically on the same cpu.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e19494ed
    • A
      bpf: prevent kprobe+bpf deadlocks · b121d1e7
      Alexei Starovoitov 提交于
      if kprobe is placed within update or delete hash map helpers
      that hold bucket spin lock and triggered bpf program is trying to
      grab the spinlock for the same bucket on the same cpu, it will
      deadlock.
      Fix it by extending existing recursion prevention mechanism.
      
      Note, map_lookup and other tracing helpers don't have this problem,
      since they don't hold any locks and don't modify global data.
      bpf_trace_printk has its own recursive check and ok as well.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b121d1e7
    • D
      Merge branch 'ipv6-per-netns-gc' · 8aba8b83
      David S. Miller 提交于
      Michal Kubecek says:
      
      ====================
      ipv6: per netns FIB6 walkers and garbage collector
      
      Commit 2ac3ac8f ("ipv6: prevent fib6_run_gc() contention") reduced
      the risk of contention on FIB6 garbage collector lock on systems with
      many CPUs. However, one of our customers can still observe heavy
      contention on fib6_gc_lock which can even trigger the soft lockup
      detector.
      
      This is caused by garbage collector running in forced mode from a timer.
      While there is one timer per network namespace, the instances of
      fib6_run_gc() running from them are protected by one global spinlock so
      that only one garbage collector can run at any moment and other
      namespaces have to wait. As most relevant data structures are separated
      per netns, there is little reason for garbage collectors blocking each
      other.
      
      Similar problem exists for walkers: changes in one tree do not need to
      adjust (and block) walkers traversing FIB trees in other namespaces.
      
      This series separates both the walkers infrastructure and garbage
      collector so that they work independently in network namespaces.
      
      v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from
      callers instead
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8aba8b83
    • M
      ipv6: per netns FIB garbage collection · 3dc94f93
      Michal Kubeček 提交于
      One of our customers observed issues with FIB6 garbage collectors
      running in different network namespaces blocking each other, resulting
      in soft lockups (fib6_run_gc() initiated from timer runs always in
      forced mode).
      
      Now that FIB6 walkers are separated per namespace, there is no more need
      for instances of fib6_run_gc() in different namespaces blocking each
      other. There is still a call to icmp6_dst_gc() which operates on shared
      data but this function is protected by its own shared lock.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3dc94f93
    • M
      ipv6: per netns fib6 walkers · 9a03cd8f
      Michal Kubeček 提交于
      The IPv6 FIB data structures are separated per network namespace but
      there is still only one global walkers list and one global walker list
      lock. This means changes in one namespace unnecessarily interfere with
      walkers in other namespaces.
      
      Replace the global list with per-netns lists (and give each its own
      lock).
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a03cd8f
    • M
      ipv6: replace global gc_args with local variable · 3570df91
      Michal Kubeček 提交于
      Global variable gc_args is only used in fib6_run_gc() and functions
      called from it. As fib6_run_gc() makes sure there is at most one
      instance of fib6_clean_all() running at any moment, we can replace
      gc_args with a local variable which will be needed once multiple
      instances (per netns) of garbage collector are allowed.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3570df91
    • D
      Merge branch 'bnxt_en-next' · 02daec7c
      David S. Miller 提交于
      Michael Chan says:
      
      ====================
      bnxt_en: Updates for net-next.
      
      Updates to support autoneg for all supported speeds, add PF port statistics,
      and Advanced Error Reporting.
      
      v2: Fixed patch 3 to not use parentheses on function return.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02daec7c
    • S
      bnxt_en: Enable AER support. · 6316ea6d
      Satish Baddipadige 提交于
      Add pci_error_handler callbacks to support for pcie advanced error
      recovery.
      Signed-off-by: NSatish Baddipadige <sbaddipa@broadcom.com>
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6316ea6d
    • M
      bnxt_en: Include hardware port statistics in ethtool -S. · 8ddc9aaa
      Michael Chan 提交于
      Include the more useful port statistics in ethtool -S for the PF device.
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ddc9aaa
    • M
      bnxt_en: Include some hardware port statistics in ndo_get_stats64(). · 9947f83f
      Michael Chan 提交于
      Include some of the port error counters (e.g. crc) in ->ndo_get_stats64()
      for the PF device.
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9947f83f
    • M
      bnxt_en: Add port statistics support. · 3bdf56c4
      Michael Chan 提交于
      Gather periodic port statistics if the device is PF and link is up.  This
      is triggered in bnxt_timer() every one second to request firmware to DMA
      the counters.
      Signed-off-by: NMichael Chan <michael.chan@broadocm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3bdf56c4
    • M
      bnxt_en: Extend autoneg to all speeds. · f1a082a6
      Michael Chan 提交于
      Allow all autoneg speeds aupported by firmware to be advertised.  If
      the advertising parameter is 0, then all supported speeds will be
      advertised.
      
      Remove BNXT_ALL_COPPER_ETHTOOL_SPEED which is no longer used as all
      supported speeds can be advertised.
      Signed-off-by: NMichael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1a082a6
    • M
      bnxt_en: Use common function to get ethtool supported flags. · 4b32cacc
      Michael Chan 提交于
      The supported bits and advertising bits in ethtool have the same
      definitions.  The same is true for the firmware bits.  So use the
      common function to handle the conversion for both supported and
      advertising bits.
      
      v2: Don't use parentheses on function return.
      Signed-off-by: NMichael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b32cacc
    • M
      bnxt_en: Add reporting of link partner advertisement. · 3277360e
      Michael Chan 提交于
      And report actual pause settings to ETHTOOL_GPAUSEPARAM to let ethtool
      resolve the actual pause settings.
      Signed-off-by: NMichael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3277360e
    • M
      bnxt_en: Refactor bnxt_fw_to_ethtool_advertised_spds(). · 27c4d578
      Michael Chan 提交于
      Include the conversion of pause bits and add one extra call layer so
      that the same refactored function can be reused to get the link partner
      advertisement bits.
      Signed-off-by: NMichael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27c4d578
    • K
      net_sched: dsmark: use qdisc_dequeue_peeked() · f8b33d8e
      Kyeong Yoo 提交于
      This fix is for dsmark similar to commit 3557619f
      ("net_sched: prio: use qdisc_dequeue_peeked")
      and makes use of qdisc_dequeue_peeked() instead of direct dequeue() call.
      
      First time, wrr peeks dsmark, which will then peek into sfq.
      sfq dequeues an skb and it's stored in sch->gso_skb.
      Next time, wrr tries to dequeue from dsmark, which will call sfq dequeue
      directly. This results skipping the previously peeked skb.
      
      So changed dsmark dequeue to call qdisc_dequeue_peeked() instead to use
      peeked skb if exists.
      Signed-off-by: NKyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8b33d8e
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 4c38cd61
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      The following patchset contains Netfilter updates for your net-next tree,
      they are:
      
      1) Remove useless debug message when deleting IPVS service, from
         Yannick Brosseau.
      
      2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
         several spots of the IPVS code, from Arnd Bergmann.
      
      3) Add prandom_u32 support to nft_meta, from Florian Westphal.
      
      4) Remove unused variable in xt_osf, from Sudip Mukherjee.
      
      5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
         since fixing af_packet defragmentation issues, from Joe Stringer.
      
      6) On-demand hook registration for iptables from netns. Instead of
         registering the hooks for every available netns whenever we need
         one of the support tables, we register this on the specific netns
         that needs it, patchset from Florian Westphal.
      
      7) Add missing port range selection to nf_tables masquerading support.
      
      BTW, just for the record, there is a typo in the description of
      5f6c253e ("netfilter: bridge: register hooks only when bridge
      interface is added") that refers to the cluster match as deprecated, but
      it is actually the CLUSTERIP target (which registers hooks
      inconditionally) the one that is scheduled for removal.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c38cd61