1. 16 Nov, 2016 2 commits
    • M
      bpf: Add percpu LRU list · 961578b6
      Committed by Martin KaFai Lau
      Instead of having a common LRU list, this patch allows a
      percpu LRU list, which can be selected by specifying a map
      attribute.  The map attribute will be added in a later
      patch.
      
      While the common use case for LRU is #reads >> #updates,
      a percpu LRU list allows a bpf prog to absorb unusual #updates
      under pathological cases (e.g. an external-traffic-facing machine
      which could be under attack).
      
      The percpu LRUs are isolated from one another.  The LRU nodes (including
      free nodes) cannot be moved across different LRU lists.
      
      Here is the update performance comparison between the
      common LRU list and the percpu LRU list (the test code is
      in the last patch):
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bpf: LRU List · 3a08c2fd
      Committed by Martin KaFai Lau
      Introduce bpf_lru_list, which will provide LRU capability to
      the bpf_htab in a later patch.
      
      * General Thoughts:
      1. Target use case.  Reads are more frequent than updates
         (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
         If a bpf_prog does a bpf_lookup_elem() first and then an in-place
         update, it still counts as a read operation as far as the LRU list
         is concerned.
      2. It may be useful to think of it as an LRU cache.
      3. Optimize the read case
         3.1 No lock in read case
         3.2 The LRU maintenance is only done during bpf_update_elem()
      4. If there is a percpu LRU list, it will lose the system-wide LRU
         property.  A completely isolated percpu LRU list has the best
         performance but the memory utilization is not ideal considering
         the workload may be imbalanced.
      5. Hence, this patch starts the LRU implementation with a global LRU
         list and batches operations before accessing the global LRU list.
         As an LRU cache, where #read >> #update/#insert operations, it will
         work well.
      6. There is a local list (for each cpu) which is named
         'struct bpf_lru_locallist'.  This local list is not used to maintain
         the LRU ordering.  Instead, the local list is used to batch enough
         operations before acquiring the lock of the global LRU list.  More
         details on this later.
      7. A later patch allows a percpu LRU list, selected by specifying a
         map attribute, for scalability reasons and for use cases that need to
         prepare for the worst (and pathological) case, such as a DoS attack.
         The percpu LRU lists are completely isolated from each other and the
         LRU nodes (including free nodes) cannot be moved across lists.  The
         following description is for the global LRU list but is mostly
         applicable to the percpu LRU list as well.
      
      * Global LRU List:
      1. It has three sub-lists: active-list, inactive-list and free-list.
      2. The two-list idea, active and inactive, is borrowed from the
         page cache.
      3. All nodes are pre-allocated and all sit on the free-list (of the
         global LRU list) at the beginning.  The pre-allocation reasoning
         is similar to the existing BPF_MAP_TYPE_HASH.  However,
         opting out of prealloc (BPF_F_NO_PREALLOC) is not supported in
         the LRU map.
      
      * Active/Inactive List (of the global LRU list):
      1. The active list, as its name says, maintains the active set of
         the nodes.  We can think of it as the working set, or the more
         frequently accessed nodes.  The access frequency is approximated by
         a ref-bit.  The ref-bit is set during bpf_lookup_elem().
      2. The inactive list, as its name also says, maintains a less
         active set of nodes.  They are the candidates to be removed
         from the bpf_htab when we are running out of free nodes.
      3. The ordering of these two lists acts as a rough clock.
         The tail of the inactive list holds the older nodes, which
         should be released first if the bpf_htab needs a free element.
      
      * Rotating the Active/Inactive List (of the global LRU list):
      1. It is the basic operation to maintain the LRU property of
         the global list.
      2. The active list is only rotated when the inactive list is running
         low.  This idea is similar to the current page cache.
         Inactive running low is currently defined as
         "# of inactive < # of active".
      3. The active list rotation always starts from the tail.  It moves
         nodes without the ref-bit set to the head of the inactive list.
         It moves nodes with the ref-bit set back to the head of the active
         list and then clears their ref-bit.
      4. The inactive rotation is pretty simple.
         It walks the inactive list and moves a node back to the head of the
         active list if its ref-bit is set.  The ref-bit is cleared after
         moving to the active list.
         If the node does not have the ref-bit set, it is left where it is
         because it is already in the inactive list.
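
      The active-list rotation in steps 2 and 3 can be sketched in userspace C. This is a minimal model under stated assumptions, not the code from bpf_lru_list.c: `struct node`, `struct list`, `list_add_head()`, `list_del()` and `rotate_active()` are hypothetical names, and the sketch walks the whole active list in one pass, whereas a real implementation may bound the number of nodes scanned per call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names are illustrative only. */
struct node {
	struct node *prev, *next;
	int key;
	bool ref;		/* set on lookup, cleared on rotation */
};

/* Circular doubly linked list with a sentinel head node. */
struct list {
	struct node head;
	size_t len;
};

static void list_init(struct list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->len = 0;
}

static void list_del(struct list *l, struct node *n)
{
	assert(l->len > 0);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	l->len--;
}

static void list_add_head(struct list *l, struct node *n)
{
	n->next = l->head.next;
	n->prev = &l->head;
	l->head.next->prev = n;
	l->head.next = n;
	l->len++;
}

/* Rotate the active list from its tail, visiting each node once. */
static void rotate_active(struct list *active, struct list *inactive)
{
	size_t todo = active->len;

	/* Only rotate when inactive is running low:
	 * "# of inactive < # of active". */
	if (inactive->len >= active->len)
		return;

	while (todo--) {
		struct node *n = active->head.prev;	/* current tail */

		list_del(active, n);
		if (n->ref) {
			/* Recently read: keep it active, clear the ref-bit. */
			n->ref = false;
			list_add_head(active, n);
		} else {
			/* Not read since the last pass: demote it. */
			list_add_head(inactive, n);
		}
	}
}
```

      With three active nodes of which only one has its ref-bit set, a single call leaves that node alone on the active list (ref-bit cleared) and moves the other two to the inactive list.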
      
      * Shrinking the Inactive List (of the global LRU list):
      1. Shrinking is the operation to get free nodes when the bpf_htab is
         full.
      2. It usually only shrinks the inactive list to get free nodes.
      3. During shrinking, it will walk the inactive list from the tail and
         delete the nodes without the ref-bit set from the bpf_htab.
      4. If no free node is found after step (3), it will forcefully get
         one node from the tail of the inactive or active list.  "Forcefully"
         means that it ignores the ref-bit.
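
      The shrinking steps can be illustrated with a small array-based model. Everything here is an assumption made for illustration: `struct list`, `list_remove_at()`, `list_push_head()` and `shrink()` are hypothetical names, deleting a node from the bpf_htab is modeled as moving it to a free list, and nodes with the ref-bit set are simply skipped during the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8	/* illustrative capacity */

struct node {
	int key;
	bool ref;
};

/* Index 0 is the head; index len - 1 is the tail (oldest). */
struct list {
	struct node *slot[MAX_NODES];
	size_t len;
};

static struct node *list_remove_at(struct list *l, size_t idx)
{
	struct node *n;

	assert(idx < l->len);
	n = l->slot[idx];
	for (size_t i = idx; i + 1 < l->len; i++)
		l->slot[i] = l->slot[i + 1];
	l->len--;
	return n;
}

static void list_push_head(struct list *l, struct node *n)
{
	assert(l->len < MAX_NODES);
	for (size_t i = l->len; i > 0; i--)
		l->slot[i] = l->slot[i - 1];
	l->slot[0] = n;
	l->len++;
}

/* Try to free up to `want` nodes; returns how many were freed. */
static size_t shrink(struct list *active, struct list *inactive,
		     struct list *free_list, size_t want)
{
	size_t freed = 0;
	struct list *victim;

	if (want == 0)
		return 0;

	/* Step 3: walk inactive from the tail, reclaim unreferenced nodes. */
	for (size_t i = inactive->len; i > 0 && freed < want; i--) {
		if (!inactive->slot[i - 1]->ref) {
			list_push_head(free_list, list_remove_at(inactive, i - 1));
			freed++;
		}
	}
	if (freed)
		return freed;

	/* Step 4: forced reclaim from the tail of inactive, else active,
	 * ignoring the ref-bit. */
	victim = inactive->len ? inactive : active;
	if (victim->len) {
		struct node *n = list_remove_at(victim, victim->len - 1);

		n->ref = false;
		list_push_head(free_list, n);
		freed = 1;
	}
	return freed;
}
```

      The forced path only runs when the normal walk frees nothing, so referenced nodes are evicted only as a last resort.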
      
      * Local List:
      1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
         batch enough operations before acquiring the lock of the
         global LRU.
      2. A local list has two sub-lists, free-list and pending-list.
      3. During bpf_update_elem(), it will try to get a free node from the
         free-list (of the current CPU's local list).
      4. If the local free-list is empty, it will acquire free nodes from the
         global LRU list.  The global LRU list can satisfy the request either
         from its global free-list or by shrinking the global inactive
         list.  Since the global LRU list lock has been acquired anyway,
         it will try to move at most LOCAL_FREE_TARGET elements
         to the local free list.
      5. When a new element is added to the bpf_htab, it will
         first sit on the pending-list (of the local list).
         The pending-list will be flushed to the global LRU list
         the next time free nodes need to be acquired from the global list.
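
      The batching effect of the local list can be shown with a counter-only model. This is a sketch under assumptions: `struct global_lru`, `struct local_list` and `local_get_free()` are hypothetical names, the value 4 for LOCAL_FREE_TARGET is illustrative, lists are reduced to node counts, the lock is modeled as a counter, and shrinking the inactive list is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOCAL_FREE_TARGET 4	/* illustrative value only */

struct global_lru {
	size_t nr_free;		/* nodes on the global free-list */
	size_t nr_listed;	/* nodes on the global active/inactive lists */
	size_t lock_acquired;	/* times the modeled lru_lock was taken */
};

struct local_list {
	size_t nr_free;		/* nodes on this CPU's free-list */
	size_t nr_pending;	/* nodes on this CPU's pending-list */
};

/* Take one free node for an insert on this CPU; false if none is left. */
static bool local_get_free(struct global_lru *g, struct local_list *l)
{
	assert(g && l);

	if (l->nr_free == 0) {
		size_t batch;

		g->lock_acquired++;		/* models taking lru_lock */

		/* Flush the pending-list while the lock is held anyway. */
		g->nr_listed += l->nr_pending;
		l->nr_pending = 0;

		/* Refill the local free-list with one batch. */
		batch = g->nr_free < LOCAL_FREE_TARGET ?
			g->nr_free : LOCAL_FREE_TARGET;
		g->nr_free -= batch;
		l->nr_free += batch;
		/* The modeled lock would be released here. */

		if (l->nr_free == 0)
			return false;	/* shrinking is omitted in this sketch */
	}

	l->nr_free--;
	l->nr_pending++;	/* new element waits on the pending-list */
	return true;
}
```

      In this model, eight inserts on one CPU touch the global lock only twice, which is the amortization the local list is meant to provide.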
      
      * Lock Consideration:
      The LRU list has a lock (lru_lock).  Each bucket of htab has a
      lock (buck_lock).  If both locks need to be acquired together,
      the lock order is always lru_lock -> buck_lock and this only
      happens in the bpf_lru_list.c logic.
      
      In hashtab.c, the two locks are never held together (i.e. one
      lock is always released before the other is acquired).
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Nov, 2016 28 commits
  3. 14 Nov, 2016 10 commits
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 7d384846
      Committed by David S. Miller
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains a second batch of Netfilter updates for
      your net-next tree. This includes a rework of the core hook
      infrastructure that improves Netfilter performance by ~15% according to
      synthetic benchmarks. Then, a large batch with ipset updates, including
      a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a
      couple of assorted updates.
      
      Regarding the core hook infrastructure rework to improve performance,
      using this simple drop-all packets ruleset from ingress:
      
              nft add table netdev x
              nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; }
              nft add rule netdev x y drop
      
      And generating traffic through Jesper Brouer's
      samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using the
      -i option, perf report shows nf_tables calls in its top 10:
      
          17.30%  kpktgend_0   [nf_tables]            [k] nft_do_chain
          15.75%  kpktgend_0   [kernel.vmlinux]       [k] __netif_receive_skb_core
          10.39%  kpktgend_0   [nf_tables_netdev]     [k] nft_do_chain_netdev
      
      I'm measuring here an improvement of ~15% in performance with this
      patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R)
      Core(TM) i5-3320M CPU @ 2.60GHz 4-cores.
      
      More specifically, this rework contains these patches, in strict order:
      
      1) Remove compile-time debugging from core.
      
      2) Remove obsolete comments that predate the rcu era. These days it is
         well known that a Netfilter hook always runs under rcu_read_lock().
      
      3) Remove threshold handling; this is only used by br_netfilter.
         We already have specific code to handle this from br_netfilter,
         so remove this code from the core path.
      
      4) Deprecate NF_STOP, as this is only used by br_netfilter.
      
      5) Place nf_state_hook pointer into xt_action_param structure, so
         this structure fits into one single cacheline according to pahole.
         This also implicitly affects nftables since it also relies on the
         xt_action_param structure.
      
      6) Move state->hook_entries into nf_queue entry. The hook_entries
         pointer is only required by nf_queue(), so we can store this in the
         queue entry instead.
      
      7) Use switch() statement to handle verdict cases.
      
      8) Remove hook_entries field from nf_hook_state structure, this is only
         required by nf_queue, so store it in nf_queue_entry structure.
      
      9) Merge nf_iterate() into nf_hook_slow(), which results in a much
         simpler and more readable function.
      
      10) Handle NF_REPEAT away from the core, so far the only client is
          nf_conntrack_in() and we can restart the packet processing using a
          simple goto to jump back there when the TCP requires it.
          This update required a second pass to fix fallout, fix from
          Arnd Bergmann.
      
      11) Set random seed from nft_hash when no seed is specified from
          userspace.
      
      12) Simplify nf_tables expression registration, in a much smarter way
          to save lots of boiler plate code, by Liping Zhang.
      
      13) Simplify layer 4 protocol conntrack tracker registration, from
          Davide Caratti.
      
      14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due
          to recent generalization of the socket infrastructure, from Arnd
          Bergmann.
      
      15) Then, the ipset batch from Jozsef, which he describes as follows:
      
      * Cleanup: Remove extra whitespaces in ip_set.h
      * Cleanup: Mark some of the helpers arguments as const in ip_set.h
      * Cleanup: Group counter helper functions together in ip_set.h
      * struct ip_set_skbinfo is introduced instead of open-coded fields
        in skbinfo get/init helper functions.
      * Use kmalloc() in comment extension helper instead of kzalloc()
        because it is unnecessary to zero out the area just before
        explicit initialization.
      * Cleanup: Split extensions into separate files.
      * Cleanup: Separate memsize calculation code into dedicated function.
      * Cleanup: group ip_set_put_extensions() and ip_set_get_extensions()
        together.
      * Add element count to hash headers by Eric B Munson.
      * Add element count to all set types header for uniform output
        across all set types.
      * Count non-static extension memory into memsize calculation for
        userspace.
      * Cleanup: Remove redundant mtype_expire() arguments, because
        they can be derived from other parameters.
      * Cleanup: Simplify mtype_expire() for hash types by removing
        one level of indentation.
      * Make NLEN compile time constant for hash types.
      * Make sure element data size is a multiple of u32 for the hash set
        types.
      * Optimize hash creation routine, exit as early as possible.
      * Make struct htype per ipset family so nets array becomes fixed size
        and thus simplifies the struct htype allocation.
      * Collapse same condition body into a single one.
      * Fix reported memory size for hash:* types, base hash bucket structure
        was not taken into account.
      * hash:ipmac type support added to ipset by Tomasz Chilinski.
      * Use setup_timer() and mod_timer() instead of init_timer()
        by Muhammad Falak R Wani, individually for the set type families.
      
      16) Remove useless connlabel field in struct netns_ct, patch from
          Florian Westphal.
      
      17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify
          {ip,ip6,arp}tables code that uses this.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      mlxsw: spectrum_router: Add FIB abort warning · 8d419324
      Committed by Jiri Pirko
      Add a warning that the abort mechanism was triggered for the device.
      Also avoid going through the procedure if the abort was already done.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'dsa-mv88e6xxx-post-refactor-fixes' · 5aad5b42
      Committed by David S. Miller
      Andrew Lunn says:
      
      ====================
      dsa: mv88e6xxx: Fixes for port refactoring
      
      The patches which refactored setting up the switch MACs introduced a
      couple of regressions. The RGMII delays for a port can be set using
      mechanisms other than just phy-mode. Don't overwrite the delays unless
      explicitly asked to. This broke my Armada 370 RD. Also, the mv88e6351
      family supports setting RGMII delays, but is missing the necessary
      entries in the ops structures to allow this.
      
      These fixes are to patches currently in net-next. No need for stable
      etc.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: dsa: mv88e6xxx: 6351 family also has RGMII delays · 94d66ae6
      Committed by Andrew Lunn
      The recent refactoring of setting the MAC configuration broke setting
      of RGMII delays, via the phy-mode, on the 6351 family. Add the missing
      ops to the structure.
      
      Fixes: 7340e5ecdbb1 ("net: dsa: mv88e6xxx: setup port's MAC")
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: dsa: mv88e6xxx: Don't modify RGMII delays when not RGMII mode · fedf1865
      Committed by Andrew Lunn
      The RGMII mode delays can be set via strapping pins or EEPROM.
      Don't change them unless explicitly asked to change them.  The recent
      refactoring of setting the MAC configuration changed this behaviour,
      in that CPU and DSA ports have any pre-configured RGMII delays
      removed. This breaks the Armada 370RD board. Restore the previous
      behaviour, in that RGMII delays are only applied/removed when
      explicitly asked for via a phy-mode of PHY_INTERFACE_MODE_RGMII*.
      
      Fixes: 7340e5ecdbb1 ("net: dsa: mv88e6xxx: setup port's MAC")
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ntb_perf: potential info leak in debugfs · 819baf88
      Committed by Dan Carpenter
      This is a static checker warning, not something I'm desperately
      concerned about.  But snprintf() returns the number of bytes that
      would have been copied if there were space.  We really care about the
      number of bytes that actually were copied so we should use scnprintf()
      instead.
      
      It probably won't overrun, and in that case we may as well just use
      sprintf() but these sorts of things make static checkers and code
      reviewers happier.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
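
      The difference in return values described in the ntb_perf commit can be reproduced in userspace. The sketch below is an illustration only: `my_scnprintf()` is a hypothetical stand-in that mimics the documented behaviour of the kernel's scnprintf() on top of the standard vsnprintf(); it is not the kernel implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for scnprintf(): returns the number
 * of characters actually stored in buf, excluding the trailing NUL.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);	/* "would have written" count */
	va_end(ap);

	if (n < 0)
		return 0;
	/* Clamp to what fits, leaving room for the NUL terminator. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

      If the value returned by snprintf() is later used as the length of the data to copy out, a truncated write reports more bytes than the buffer holds; the clamped return value avoids that.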
    • D
      ntb: ntb_hw_intel: init peer_addr in struct intel_ntb_dev · 25ea9f2b
      Committed by Dave Jiang
      The peer_addr member of intel_ntb_dev is not set, therefore when
      acquiring ntb_peer_db and ntb_peer_spad we only get the offset rather
      than the actual physical address. Add a fix to correct that.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: Allen Hubbe <Allen.Hubbe@emc.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
    • N
      ntb: make DMA_OUT_RESOURCE_TO HZ independent · cdc08982
      Committed by Nicholas Mc Guire
      schedule_timeout_* takes a timeout in jiffies, but the code currently
      passes in a constant, which makes this timeout HZ dependent, so pass it
      through msecs_to_jiffies() to fix this up.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Acked-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
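
      Why a raw constant makes the timeout HZ dependent can be seen with a small arithmetic model. This is a sketch, not kernel code: `model_msecs_to_jiffies()` and `ticks_to_ms()` are hypothetical helpers, the rounding-up conversion is a simplification of what msecs_to_jiffies() does, and the value 100 used in the usage note is an arbitrary example rather than the driver's actual constant.

```c
#include <assert.h>

/* Simplified model: convert milliseconds to ticks at `hz` ticks per
 * second, rounding up so the sleep is never shorter than requested. */
static unsigned long model_msecs_to_jiffies(unsigned int ms, unsigned int hz)
{
	assert(hz > 0);
	return ((unsigned long)ms * hz + 999) / 1000;
}

/* Wall-clock time, in milliseconds, covered by `ticks` ticks at `hz`. */
static unsigned long ticks_to_ms(unsigned long ticks, unsigned int hz)
{
	assert(hz > 0);
	return ticks * 1000 / hz;
}
```

      Passing a raw 100 as the tick count lasts 1000 ms at HZ=100 but only 100 ms at HZ=1000, while converting 100 ms first gives 10, 25 and 100 ticks at HZ=100, 250 and 1000, each of which covers 100 ms.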
    • N
      ntb_transport: make DMA_OUT_RESOURCE_TO HZ independent · c0a88032
      Committed by Nicholas Mc Guire
      schedule_timeout_* takes a timeout in jiffies, but the code currently
      passes in a constant, which makes this timeout HZ dependent, so pass it
      through msecs_to_jiffies() to fix this up.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>
    • W
      NTB: ntb_hw_intel: Fix typo in module parameter descriptions · 49b89de4
      Committed by Wei Yongjun
      Fix typo in module parameter descriptions.
      Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
      Acked-by: Allen Hubbe <Allen.Hubbe@emc.com>
      Signed-off-by: Jon Mason <jdmason@kudzu.us>