1. 16 Jun, 2015 27 commits
    • E
      net/mlx4_core: Reset counters data when freed · b72ca7e9
      Eran Ben Elisha committed
      Reset the counter data in the free-counter flow, so that the counter's
      data is no longer accessible when the counter is queried. Also, on the next
      counter allocation (to another VM, for example), it will be fresh and clear.
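A minimal user-space sketch of the idea, not the mlx4 code: a toy counter pool that zeroes a counter's data when it is freed, so the next owner never sees stale values. The class and method names are hypothetical.

```python
class CounterPool:
    """Toy model: a bitmap of allocated indices plus per-counter data."""

    def __init__(self, size):
        self.bitmap = [False] * size   # True means the index is allocated
        self.data = [0] * size         # per-counter statistics

    def alloc(self):
        # Hand out the lowest free index.
        for idx, used in enumerate(self.bitmap):
            if not used:
                self.bitmap[idx] = True
                return idx
        raise RuntimeError("no free counters")

    def free(self, idx):
        # Reset the data as part of the free flow, then release the index.
        self.data[idx] = 0
        self.bitmap[idx] = False
```

With this ordering, a counter handed to a new owner always starts at zero, whoever used it before.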
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/mlx4_core: Check before cleaning counters bitmap · efa6bc91
      Eran Ben Elisha committed
      If counters are not supported by the device, the indices bitmap table is not
      allocated during initialization. Add the symmetrical check before cleaning
      the counters bitmap table or freeing a counter.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      bridge: del external_learned fdbs from device on flush or ageout · b4ad7baa
      Scott Feldman committed
      We need to delete externally learned fdbs from the offload device when any
      one of these events happens:
      
      1) Bridge ages out fdb.  (When bridge is doing ageing vs. device doing
      ageing.  If device is doing ageing, it would send SWITCHDEV_FDB_DEL
      directly).
      
      2) STP state change flushes fdbs on port.
      
      3) User uses sysfs interface to flush fdbs from bridge or bridge port:
      
      	echo 1 >/sys/class/net/BR_DEV/bridge/flush
      	echo 1 >/sys/class/net/BR_PORT/brport/flush
      
      4) Offload driver sends event SWITCHDEV_FDB_DEL to delete fdb entry.
      
      For rocker, we can now get called to delete fdb entry in wait and nowait
      contexts, so set NOWAIT flag when deleting fdb entry.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge tag 'nfc-next-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next · 023033b1
      David S. Miller committed
      Samuel Ortiz says:
      
      ====================
      NFC 4.2 pull request
      
      This is the NFC pull request for 4.2.
      
      - NCI drivers can now define their own handlers for processing
        proprietary NCI responses and notifications.
      
      - NFC vendors can use a dedicated netlink API to send their own
        proprietary commands, e.g. all commands needed to implement
        vendor-specific manufacturing tools.
      
      - A new generic NCI over UART driver against which any NCI chipset
        running on top of a serial interface can register.
      
      - The st21nfcb driver is renamed to st-nci as it can and will support
        most of ST Microelectronics NCI chipsets.
      
      - The st21nfcb driver can put its CLF in hibernate mode and save
        a significant amount of power.
      
      - A few st21nfcb minor fixes.
      
      - The NXP NCI driver now supports ACPI enumeration.
      
      - The Marvell NCI driver now supports both USB and serial
        physical interfaces.
      
      - The Marvell NCI driver also supports NCI frames being muxed
        over HCI. This is a setting that can be defined by a DT property.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bond-netlink-3ad-attrs' · cadfaf43
      David S. Miller committed
      Nikolay Aleksandrov says:
      
      ====================
      bonding: extend the 3ad exported attributes
      
      These are two small patches that export actor_oper_port_state and
      partner_oper_port_state via netlink and sysfs; until now they were only
      exported via bond's proc entry. If this set gets accepted, I have an iproute2
      patch prepared that exports them, which I used to test these changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      bonding: export slave's partner_oper_port_state via sysfs and netlink · 46ea297e
      Nikolay Aleksandrov committed
      Export the partner_oper_port_state of each port via sysfs and netlink.
      In 802.3ad mode it is valuable for the user to be able to check the
      partner_oper state; it is already exported via bond's proc entry.
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      bonding: export slave's actor_oper_port_state via sysfs and netlink · 254cb6db
      Nikolay Aleksandrov committed
      Export the actor_oper_port_state of each port via sysfs and netlink.
      In 802.3ad mode it is valuable for the user to be able to check the
      actor_oper state; it is already exported via bond's proc entry.
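For readers consuming these values from user space, here is a small decoding sketch. It assumes the exported value is the raw LACP state octet and uses the bit layout of the Actor_State/Partner_State field in IEEE 802.1AX (formerly 802.3ad). The function name is hypothetical and nothing here is taken from the patch itself.

```python
# Bit layout of the LACP state octet (IEEE 802.1AX Actor_State / Partner_State).
LACP_STATE_BITS = [
    (0x01, "activity"),
    (0x02, "timeout"),
    (0x04, "aggregation"),
    (0x08, "synchronization"),
    (0x10, "collecting"),
    (0x20, "distributing"),
    (0x40, "defaulted"),
    (0x80, "expired"),
]


def decode_port_state(state):
    """Return the names of the bits set in an oper port state value."""
    return [name for bit, name in LACP_STATE_BITS if state & bit]
```

For example, a value of 0x3d decodes to activity, aggregation, synchronization, collecting and distributing.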
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'rocker-no-wait' · 4d367963
      David S. Miller committed
      Scott Feldman says:
      
      ====================
      rocker: revert back to support for nowait processes
      
      One of the items removed from the rocker driver in the Spring Cleanup patch
      series was the ability to mark processing in the driver as "no wait" for
      those contexts where we cannot sleep.  Turns out, we have "no wait"
      contexts where we want to program the device and we don't want to defer the
      processing to a process context.  So re-add the ROCKER_OP_FLAG_NOWAIT flag
      to mark such processes, and propagate flags to mem allocator and to the
      device cmd executor.  With NOWAIT, mem allocs are GFP_ATOMIC and device
      cmds are queued to the device, but the driver will not wait (sleep) for the
      response back from the device.
      
      My bad for removing NOWAIT support in the first place; I thought we could
      swing non-sleep contexts to process context using a work queue, for
      example, but there is push-back to keep processing in original context.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      rocker: move port stop to 'no wait' processing · f66feaa9
      Scott Feldman committed
      rocker_port_stop can be called from atomic and non-atomic contexts.  Since
      we can't test what context we're getting called in, do the processing as
      'no wait', which will cover all cases.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      rocker: mark STP update as 'no wait' processing · ac28393e
      Scott Feldman committed
      We can get STP updates from the bridge driver in atomic and non-atomic
      contexts.  Since we can't test what context we're getting called in,
      do the STP processing as 'no wait', which will cover all cases.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      rocker: mark neigh update event processing as 'no wait' · 02a9fbfc
      Scott Feldman committed
      Neigh update event handler runs in a context where we can't sleep, so mark
      processing in driver with ROCKER_OP_FLAG_NOWAIT.  NOWAIT will use
      GFP_ATOMIC for allocations and will queue cmds to the device's cmd ring but
      will not wait (sleep) for cmd response back from device.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      rocker: revert back to support for nowait processes · 179f9a25
      Scott Feldman committed
      One of the items removed from the rocker driver in the Spring Cleanup patch
      series was the ability to mark processing in the driver as "no wait" for
      those contexts where we cannot sleep.  Turns out, we have "no wait"
      contexts where we want to program the device.  So re-add the
      ROCKER_OP_FLAG_NOWAIT flag to mark such processes, and propagate flags to
      mem allocator and to the device cmd executor.  With NOWAIT, mem allocs are
      GFP_ATOMIC and device cmds are queued to the device, but the driver will
      not wait (sleep) for the response back from the device.
      
      My bad for removing NOWAIT support in the first place; I thought we could
      swing non-sleep contexts to process context using a work queue, for
      example, but there is push-back to keep processing in original context.
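The flag propagation described above can be modelled roughly as follows. This is a user-space sketch, not the driver code: only the name ROCKER_OP_FLAG_NOWAIT and GFP_ATOMIC come from the commit message. The flag value, the GFP_KERNEL default, and both function names are assumptions made for illustration.

```python
ROCKER_OP_FLAG_NOWAIT = 1 << 0   # name from the commit; the bit value is an assumption


def alloc_mode(flags):
    """Pick the allocation mode implied by the op flags."""
    # Assumption: the sleeping default is GFP_KERNEL.
    return "GFP_ATOMIC" if flags & ROCKER_OP_FLAG_NOWAIT else "GFP_KERNEL"


def exec_cmd(cmd_ring, cmd, flags, wait_for_response):
    """Queue a command; only sleep for the response when waiting is allowed."""
    cmd_ring.append(cmd)                  # the command is always queued
    if flags & ROCKER_OP_FLAG_NOWAIT:
        return None                       # caller cannot sleep: do not wait
    return wait_for_response(cmd)         # process context: wait for the device
```

The point of passing the flags all the way down is that a single caller-side decision controls both the allocator and the command executor.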
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      rocker: fix neigh tbl index increment race · 4d81db41
      Scott Feldman committed
      rocker->neigh_tbl_next_index is used to generate unique indices for neigh
      entries programmed into the device.  The way new indices were generated was
      racy with the new prepare-commit transaction model.  A simple fix here
      removes the race.  The race was with two processes getting the same index,
      one process using prepare-commit, the other not:
      
      Proc A					Proc B
      
      PREPARE phase
      get neigh_tbl_next_index
      
      					NONE phase
      					get neigh_tbl_next_index
      					neigh_tbl_next_index++
      
      COMMIT phase
      neigh_tbl_next_index++
      
      Both A and B got the same index.  The fix is to store and increment
      neigh_tbl_next_index in the PREPARE (or NONE) phase and use value in COMMIT
      phase:
      
      Proc A					Proc B
      
      PREPARE phase
      get neigh_tbl_next_index
      neigh_tbl_next_index++
      
      					NONE phase
      					get neigh_tbl_next_index
      					neigh_tbl_next_index++
      
      COMMIT phase
      // use value stashed in PREPARE phase
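The two interleavings above can be replayed in a small single-threaded model. This is a sketch only: the class and function names are hypothetical, and it does not model how the driver serializes access to the index.

```python
class NeighTable:
    def __init__(self):
        self.next_index = 0

    def peek_index(self):
        """Old behaviour: read the index without reserving it."""
        return self.next_index

    def bump_index(self):
        """Old behaviour: increment later (COMMIT or NONE phase)."""
        self.next_index += 1

    def take_index(self):
        """Fixed behaviour: read and increment together (PREPARE or NONE)."""
        idx = self.next_index
        self.next_index += 1
        return idx


def old_interleaving():
    tbl = NeighTable()
    a = tbl.peek_index()        # Proc A, PREPARE: get index
    b = tbl.peek_index()        # Proc B, NONE: get index
    tbl.bump_index()            # Proc B, NONE: increment
    tbl.bump_index()            # Proc A, COMMIT: increment
    return a, b


def fixed_interleaving():
    tbl = NeighTable()
    stashed = tbl.take_index()  # Proc A, PREPARE: get and increment, stash value
    b = tbl.take_index()        # Proc B, NONE: get and increment
    a = stashed                 # Proc A, COMMIT: use the stashed value
    return a, b
```

In the old ordering both processes end up with the same index; in the fixed ordering each gets its own.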
      Reported-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      rocker: gaurd against NULL rocker_port when removing ports · a0720310
      Scott Feldman committed
      The ports array is filled in as ports are probed, but if probing doesn't
      finish, we need to stop only those ports that were probed successfully.
      Check the ports array for NULL to skip un-probed ports when stopping.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: make u64_stats_init() a function · 9464ca65
      Eric Dumazet committed
      Using a function instead of a macro is cleaner and removes the
      following W=1 warnings (extract):
      
      In file included from net/ipv6/ip6_vti.c:29:0:
      net/ipv6/ip6_vti.c: In function ‘vti6_dev_init_gen’:
      include/linux/netdevice.h:2029:18: warning: variable ‘stat’ set but not
      used [-Wunused-but-set-variable]
          typeof(type) *stat;   \
                        ^
      net/ipv6/ip6_vti.c:862:16: note: in expansion of macro
      ‘netdev_alloc_pcpu_stats’
        dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
                      ^
        CC [M]  net/ipv6/sit.o
      In file included from net/ipv6/sit.c:30:0:
      net/ipv6/sit.c: In function ‘ipip6_tunnel_init’:
      include/linux/netdevice.h:2029:18: warning: variable ‘stat’ set but not
      used [-Wunused-but-set-variable]
          typeof(type) *stat;   \
                        ^
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      bridge: use either ndo VLAN ops or switchdev VLAN ops to install MASTER vlans · 7f109539
      Scott Feldman committed
      v2:
      
      Move struct switchdev_obj automatics to inner scope where they're used.
      
      v1:
      
      To maintain backward compatibility with the existing iproute2 "bridge vlan"
      command, let bridge's setlink/dellink handler call into either the port
      driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops.
      
      This allows port driver to choose 8021q ops or the newer
      bridge_setlink/dellink ops when implementing VLAN add/del filtering on the
      device.  The iproute "bridge vlan" command does not need to be modified.
      
      To summarize using the "bridge vlan" command examples, we have:
      
      1) bridge vlan add|del vid VID dev DEV
      
      Here iproute2 sets MASTER flag.  Bridge's bridge_setlink/dellink is called.
      Vlan is set on bridge for port.  If port driver implements ndo 8021q ops,
      call those so port driver can install vlan filter on device.  Otherwise, if
      port driver implements bridge_setlink/dellink ops, call those to install
      vlan filter to device.  This option only works if port is bridged.
      
      2) bridge vlan add|del vid VID dev DEV master
      
      Same as 1)
      
      3) bridge vlan add|del vid VID dev DEV self
      
      Bridge's bridge_setlink/dellink isn't called.  Port driver's
      bridge_setlink/dellink is called, if implemented.  This option works if
      port is bridged or not.  If port is not bridged, a VLAN can still be
      added/deleted to device filter using this variant.
      
      4) bridge vlan add|del vid VID dev DEV master self
      
      This is a combination of 1) and 3), but will only work if port is bridged.
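The four variants can be summarized as a small decision function. This is a user-space sketch of the description above, not the bridge code: the function name, the handler labels, the call order, and the choice to return None for a request that cannot work are all assumptions.

```python
def vlan_add_calls(master, self_, bridged, has_8021q_ops, has_bridge_ops):
    """Return the handlers invoked for 'bridge vlan add', or None on failure."""
    if not master and not self_:
        master = True                        # case 1: iproute2 sets MASTER itself
    calls = []
    if master:
        if not bridged:
            return None                      # MASTER only works on a bridged port
        calls.append("bridge:setlink")       # vlan is set on the bridge for the port
        if has_8021q_ops:
            calls.append("port:8021q_ndo")   # preferred when implemented
        elif has_bridge_ops:
            calls.append("port:bridge_setlink")
    if self_ and has_bridge_ops:
        calls.append("port:bridge_setlink")  # case 3: works bridged or not
    return calls
```

Cases 1 and 2 take the same path; case 4 simply runs the master path followed by the self path.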
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bpf-share-helpers' · 9f42c8b3
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      v1->v2: switched to init_user_ns from current_user_ns as suggested by Andy
      
      Introduce new helpers to access 'struct task_struct'->pid, tgid, uid, gid, comm
      fields in tracing and networking.
      
      Share bpf_trace_printk() and bpf_get_smp_processor_id() helpers between
      tracing and networking.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: let kprobe programs use bpf_get_smp_processor_id() helper · ab1973d3
      Alexei Starovoitov committed
      It's useful to do per-cpu histograms.
      Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: allow networking programs to use bpf_trace_printk() for debugging · 0756ea3e
      Alexei Starovoitov committed
      bpf_trace_printk() is a helper function used to debug eBPF programs.
      Let socket and TC programs use it as well.
      Note, it's a DEBUG ONLY helper. If it's used in the program,
      the kernel will print a warning banner to make sure users don't use
      it in production.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: introduce current->pid, tgid, uid, gid, comm accessors · ffeedafb
      Alexei Starovoitov committed
      eBPF programs attached to kprobes need to filter based on
      current->pid, uid and other fields, so introduce helper functions:
      
      u64 bpf_get_current_pid_tgid(void)
      Return: current->tgid << 32 | current->pid
      
      u64 bpf_get_current_uid_gid(void)
      Return: current_gid << 32 | current_uid
      
      bpf_get_current_comm(char *buf, int size_of_buf)
      stores current->comm into buf
      
      They can be used from the programs attached to TC as well to classify packets
      based on current task fields.
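A consumer of these helpers has to split the returned 64-bit value back into its two 32-bit halves. The packing below follows the "Return:" lines above; the Python function names are hypothetical and only illustrate the bit layout.

```python
MASK32 = 0xFFFFFFFF


def pack_pid_tgid(tgid, pid):
    """Model of the return value: tgid in the upper 32 bits, pid in the lower."""
    return ((tgid & MASK32) << 32) | (pid & MASK32)


def unpack_pid_tgid(value):
    """Split the u64 into (tgid, pid)."""
    return value >> 32, value & MASK32


def pack_uid_gid(gid, uid):
    """Model of the return value: gid in the upper 32 bits, uid in the lower."""
    return ((gid & MASK32) << 32) | (uid & MASK32)


def unpack_uid_gid(value):
    """Split the u64 into (gid, uid)."""
    return value >> 32, value & MASK32
```

Note that the lower half is the pid (or uid), so a filter that masks the value with 0xFFFFFFFF compares against the pid, not the tgid.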
      
      Update tracex2 example to print histogram of write syscalls for each process
      instead of aggregated for all.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · ada6c1de
      David S. Miller committed
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      This is a somewhat large (and late) patchset that contains Netfilter updates for
      net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
      x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
      specifically, they are:
      
      1) Warn the user when there is a better protocol conntracker available, from
         Marcelo Ricardo Leitner.
      
      2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
         Thaler. This comes with several patches to prepare the change in the first place.
      
      3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
         is not needed anymore since now we use the largest fragment size to
         refragment, from Florian Westphal.
      
      4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.
      
      5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
         follow up patch to refine it from Eric Dumazet.
      
      6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.
      
      7) Get rid of parens in Netfilter Kconfig files.
      
      8) Attach the net_device to the basechain as opposed to the initial per table
         approach in the nf_tables netdev family.
      
      9) Subscribe to netdev events to detect the removal and registration of a
         device that is referenced by a basechain.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      netfilter: nf_tables_netdev: unregister hooks on net_device removal · 835b8033
      Pablo Neira Ayuso committed
      In case the net_device is gone, we have to unregister the hooks and put back
      the reference on the net_device object. Once it comes back, register them
      again. This also covers the device rename case.
      
      This patch also adds a new flag to indicate that the basechain is disabled, so
      their hooks are not registered. This flag is used by the netdev family to
      handle the case where the net_device object is gone. Currently this flag is not
      exposed to userspace.
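The lifecycle described above can be modelled as a tiny state machine. This is a sketch under assumptions, not the nf_tables code: the event names, the dict used as a stand-in for a net_device, and the class itself are hypothetical. Only the "disabled" flag and the dev_name binding reflect the commit text.

```python
class BaseChain:
    def __init__(self, dev_name):
        self.dev_name = dev_name
        self.dev = None        # reference held on the device while hooks are registered
        self.disabled = True   # True means the hooks are not registered

    def _register(self, dev):
        self.dev = dev         # take a reference and register the hooks
        self.disabled = False

    def _unregister(self):
        self.dev = None        # unregister the hooks and put back the reference
        self.disabled = True

    def handle_event(self, event, dev):
        if event == "UNREGISTER" and self.dev is dev:
            self._unregister()
        elif event == "REGISTER" and self.disabled and dev["name"] == self.dev_name:
            self._register(dev)
        elif event == "CHANGENAME":
            # A rename can take the device away from the chain or bring it back.
            if self.dev is dev and dev["name"] != self.dev_name:
                self._unregister()
            elif self.disabled and dev["name"] == self.dev_name:
                self._register(dev)
```

The chain itself survives the device going away; only its hooks and its device reference come and go.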
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      netfilter: nf_tables: add nft_register_basechain() and nft_unregister_basechain() · d8ee8f7c
      Pablo Neira Ayuso committed
      These wrapper functions take care of hook registration for basechains.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      netfilter: nf_tables: attach net_device to basechain · 2cbce139
      Pablo Neira Ayuso committed
      The device is part of the hook configuration, so instead of a global
      configuration per table, set it to each of the basechain that we create.
      
      This patch reworks ebddf1a8 ("netfilter: nf_tables: allow to bind table to
      net_device").
      
      Note that this adds a dev_name field in the nft_base_chain structure, which is
      required by the netdev notification subscription that follows in a later patch
      to handle gone net_devices.
      Suggested-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • E
      netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference. · 711bdde6
      Eric Dumazet committed
      After Florian's patches, there is no need for XT_TABLE_INFO_SZ anymore:
      only one copy of the table is kept, instead of one copy per cpu.
      
      We also can avoid a dereference if we put table data right after
      xt_table_info. It reduces register pressure and helps compiler.
      
      Then, we attempt a kmalloc() if total size is under order-3 allocation,
      to reduce TLB pressure, as in many cases, rules fit in 32 KB.
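The size-based choice can be sketched as follows. This is a model under stated assumptions, not the x_tables code: it assumes 4 KiB pages (so an order-3 allocation is 32 KiB), treats the limit as inclusive, and assumes that a failed kmalloc() attempt falls back to vmalloc(). The function name is hypothetical.

```python
PAGE_SIZE = 4096                # assumption: 4 KiB pages
ORDER3_BYTES = PAGE_SIZE << 3   # an order-3 allocation: 8 pages = 32 KiB


def choose_allocator(total_size, kmalloc_ok=True):
    """Pick the allocator for a table of total_size bytes."""
    # Small tables: try physically contiguous memory first.
    if total_size <= ORDER3_BYTES and kmalloc_ok:
        return "kmalloc"
    # Large tables, or a failed attempt (assumed fallback): use vmalloc.
    return "vmalloc"
```

A 16 KiB ruleset therefore takes the kmalloc path, while a 64 KiB one goes straight to vmalloc.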
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • P
      Merge branch 'master' of git://blackhole.kfki.hu/nf-next · 53b87627
      Pablo Neira Ayuso committed
      Jozsef Kadlecsik says:
      
      ====================
      ipset patches for nf-next
      
      Please consider applying the next bunch of patches for ipset. First
      comes the small changes, then the bugfixes and at the end the RCU
      related patches.
      
      * Use MSEC_PER_SEC consistently instead of the number.
      * Use SET_WITH_*() helpers to test set extensions from Sergey Popovich.
      * Check extensions attributes before getting extensions from Sergey Popovich.
      * Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich.
      * Make sure we always return line number on batch in the case of error
        from Sergey Popovich.
      * Check CIDR value only when attribute is given from Sergey Popovich.
      * Fix cidr handling for hash:*net* types, reported by Jonathan Johnson.
      * Fix parallel resizing and listing of the same set so that the original
        set is kept for the whole dumping.
      * Make sure listing doesn't grab a set which is just being destroyed.
      * Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU.
      * Replace rwlock_t with spinlock_t in "struct ip_set", change the locking
        in the core and simplifications in the timeout routines.
      * Introduce RCU locking in bitmap:* types with a slight modification in the
        logic on how an element is added.
      * Introduce RCU locking in hash:* types. This is the most complex part of
        the changes.
      * Introduce RCU locking in list type where standard rculist is used.
      * Fix coding styles reported by checkpatch.pl.
      ====================
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  2. 15 Jun, 2015 2 commits
  3. 14 Jun, 2015 11 commits
    • J
      netfilter: ipset: Introduce RCU locking in list type · 00590fdd
      Jozsef Kadlecsik committed
      Standard rculist is used.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter: ipset: Introduce RCU locking in hash:* types · 18f84d41
      Jozsef Kadlecsik committed
      Three types of data need to be protected in the case of the hash types:
      
      a. The hash buckets: standard rcu pointer operations are used.
      b. The element blobs in the hash buckets are stored in an array and
         a bitmap is used for book-keeping to tell which elements in the array
         are used or free.
      c. Networks per cidr values and the cidr values themselves are stored
         in fixed-size arrays and need no protection. The values are modified
         in such an order that in the worst case an element testing is repeated
         once with the same cidr value.
      
      The ipset hash approach uses arrays instead of lists and therefore is
      incompatible with rhashtable.
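Item b can be pictured with a toy bucket that keeps its elements in an array and tracks occupancy in a bitmap. This is a user-space sketch with hypothetical names, not the ipset data structure. The write-then-publish order in add() is an assumption chosen for illustration, and no RCU or memory-barrier behaviour is modelled.

```python
class Bucket:
    """Toy hash bucket: elements in a fixed array, occupancy in a bitmap."""

    def __init__(self, size):
        self.slots = [None] * size
        self.used = 0                      # bit i set: slots[i] holds a live element

    def add(self, elem):
        for i in range(len(self.slots)):
            if not self.used & (1 << i):
                self.slots[i] = elem       # write the element first
                self.used |= 1 << i        # then mark the slot as used
                return True
        return False                       # bucket is full

    def delete(self, elem):
        for i, e in enumerate(self.slots):
            if self.used & (1 << i) and e == elem:
                self.used &= ~(1 << i)     # clearing the bit frees the slot
                return True
        return False

    def test(self, elem):
        # Only slots whose bit is set are considered live.
        return any(self.used & (1 << i) and e == elem
                   for i, e in enumerate(self.slots))
```

Because a slot is addressed by position in the array rather than linked into a list, freeing it is a single bit clear and the slot can be reused in place.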
      
      Performance is tested by Jesper Dangaard Brouer:
      
      Simple drop in FORWARD
      ~~~~~~~~~~~~~~~~~~~~~~
      
      Dropping via simple iptables net-mask match::
      
       iptables -t raw -N simple || iptables -t raw -F simple
       iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
       iptables -t raw -D PREROUTING -j simple
       iptables -t raw -I PREROUTING -j simple
      
      Drop performance in "raw": 11.3Mpps
      
      Generator: sending 12.2Mpps (tx:12264083 pps)
      
      Drop via original ipset in RAW table
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Create a set with lots of elements::
      
       sudo ./ipset destroy test
       echo "create test hash:ip hashsize 65536" > test.set
       for x in `seq 0 255`; do
          for y in `seq 0 255`; do
              echo "add test 198.18.$x.$y" >> test.set
          done
       done
       sudo ./ipset restore < test.set
      
      Dropping via ipset::
      
       iptables -t raw -F
       iptables -t raw -N net198 || iptables -t raw -F net198
       iptables -t raw -I net198 -m set --match-set test src -j DROP
       iptables -t raw -I PREROUTING -j net198
      
      Drop performance in "raw" with ipset: 8Mpps
      
      Perf report numbers ipset drop in "raw"::
      
       +   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
       -   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
          - _raw_read_lock_bh
             + 99.88% ip_set_test
       -   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
          - _raw_read_unlock_bh
             + 99.72% ip_set_test
       +    4.31%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_kadt
       +    2.27%  ksoftirqd/1  [ixgbe]            [k] ixgbe_fetch_rx_buffer
       +    2.18%  ksoftirqd/1  [ip_tables]        [k] ipt_do_table
       +    1.81%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_test
       +    1.61%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
       +    1.44%  ksoftirqd/1  [kernel.kallsyms]  [k] build_skb
       +    1.42%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
       +    1.36%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
       +    1.16%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_gro_receive
       +    1.09%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_unlock
       +    0.96%  ksoftirqd/1  [ixgbe]            [k] ixgbe_clean_rx_irq
       +    0.95%  ksoftirqd/1  [kernel.kallsyms]  [k] __netdev_alloc_frag
       +    0.88%  ksoftirqd/1  [kernel.kallsyms]  [k] kmem_cache_alloc
       +    0.87%  ksoftirqd/1  [xt_set]           [k] set_match_v3
       +    0.85%  ksoftirqd/1  [kernel.kallsyms]  [k] inet_gro_receive
       +    0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_iterate
       +    0.76%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
       +    0.75%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_lock
      
      Drop via ipset in RAW table with RCU-locking
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      With RCU locking, the RW-lock is gone.
      
      Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
      Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter: ipset: Introduce RCU locking in bitmap:* types · 96f51428
      Jozsef Kadlecsik committed
      There's nothing much required because the bitmap types use atomic
      bit operations. However, the logic of adding elements changed slightly:
      first the MAC address is updated (which is not atomic), then the element
      is activated (added). The extensions may call kfree_rcu(), therefore we
      call rcu_barrier() at module removal.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter: ipset: Prepare the ipset core to use RCU at set level · b57b2d1f
      Jozsef Kadlecsik committed
      Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking
      accordingly. Convert the comment extension into an rcu-aware object. Also,
      simplify the timeout routines.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter:ipset Remove rbtree from hash:net,iface · bd55389c
      Jozsef Kadlecsik committed
      Remove rbtree in order to introduce RCU instead of rwlock in ipset.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter: ipset: Make sure listing doesn't grab a set which is just being destroyed. · 9c1ba5c8
      Jozsef Kadlecsik committed
      There was a small window when all sets are destroyed and a concurrent
      listing of all sets could grab a set which is just being destroyed.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter: ipset: Fix parallel resizing and listing of the same set · c4c99783
      Jozsef Kadlecsik committed
      When elements are added to a hash:* type of set and resizing is triggered,
      a parallel listing could start to list the original set (before resizing)
      and "continue" with listing the new set. Fix it by taking references and
      using the original hash table for listing. Therefore the destruction of
      the original hash table may happen from the resizing or listing functions.
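The reference-based fix can be illustrated with a toy model. This is a user-space sketch, not the ipset implementation: the class and method names are hypothetical. It only shows why the last party to drop its reference, whether the resize or the listing, ends up destroying the original table.

```python
class HashTable:
    def __init__(self, entries):
        self.entries = list(entries)
        self.refs = 1              # reference held by the set itself
        self.destroyed = False

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.destroyed = True  # last reference gone: free the table


class IpSet:
    def __init__(self, entries):
        self.table = HashTable(entries)

    def start_listing(self):
        # Listing pins the table it started on for the whole dump.
        return self.table.get()

    def resize(self):
        old = self.table
        self.table = HashTable(old.entries)  # a real resize would rehash here
        old.put()                            # drop the set's reference to the old table
```

If a listing is in progress, resize() leaves the old table alive and the listing's final put() frees it; otherwise resize() frees it directly.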
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • J
      netfilter: ipset: Fix cidr handling for hash:*net* types · f690cbae
      Jozsef Kadlecsik committed
      Commit "Simplify cidr handling for hash:*net* types" broke the cidr
      handling for the hash:*net* types when the sets were used by the SET
      target: entries with invalid cidr values were added to the sets.
      Reported by Jonathan Johnson.
      
      Testsuite entry is added to verify the fix.
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • S
      netfilter: ipset: Check CIDR value only when attribute is given · aff22758
      Sergey Popovich committed
      There is no reason to check the CIDR value unless the attribute
      specifying CIDR is given.
      
      Initialize cidr array in element structure on element structure
      declaration to let more freedom to the compiler to optimize
      initialization right before element structure is used.
      
      Remove local variables cidr and cidr2 for netnet and netportnet
      hashes as we do not use packed cidr value for such set types and
      can store value directly in e.cidr[].
      Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
    • S
      netfilter: ipset: Make sure we always return line number on batch · a212e08e
      Sergey Popovich committed
      Even if we return with generic IPSET_ERR_PROTOCOL, it is a good idea
      to return the line number if we are called in batch mode.
      
      Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For
      example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED
      or IPSET_ERR_INVALID_CIDR.
      Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>