1. 16 Jul, 2015 2 commits
    • F
      netfilter: move tee_active to core · e7c8899f
      Florian Westphal committed
      This prepares for a TEE-like expression in nftables.
      We want to ensure only one duplicate is sent, so both will
      use the same percpu variable to detect duplication.
      
      The other use case is detection of recursive calls to xtables, but since
      we don't want a dependency from nft to the xtables core, it's put into
      core.c instead of the x_tables core.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
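The duplication guard described in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: a thread-local flag stands in for the per-CPU variable, and `dup_active`, `duplicate_once`, `deliver` and `sent_count` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU flag: one instance per thread. */
_Thread_local bool dup_active;

/* Counts packets that were "sent"; exists only for the demonstration. */
int sent_count;

void deliver(int depth);

/* Send one duplicate unless this thread is already duplicating. */
bool duplicate_once(int depth)
{
    if (dup_active)
        return false;       /* re-entered from our own duplicate: skip */
    dup_active = true;
    deliver(depth + 1);     /* the copy traverses the same path again */
    dup_active = false;
    return true;
}

/* Delivery path in which every packet hits a duplicating rule. */
void deliver(int depth)
{
    assert(depth < 8);      /* would trip if the guard failed to stop recursion */
    sent_count++;
    duplicate_once(depth);
}
```

Calling `deliver(0)` sends the original and exactly one copy: the copy finds the flag already set and does not duplicate itself again.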
    • E
      netfilter: Per network namespace netfilter hooks. · 085db2c0
      Eric W. Biederman committed
      - Add a new set of functions for registering and unregistering per
        network namespace hooks.
      
      - Modify the old global namespace hook functions to use the per
        network namespace hooks in their implementation, so there remains a
        single list that needs to be walked for any hook (this is important
        for keeping the hook priority working and for keeping the code
        walking the hooks simple).
      
      - Only allow registering the per netdevice hooks in the network
        namespace where the network device lives.
      
      - Dynamically allocate the structures in the per network namespace
        hook list in nf_register_net_hook, and unregister them in
        nf_unregister_net_hook.
      
        Dynamic allocation is required somewhere, as the number of network
        namespaces is not fixed, so we might as well allocate them in the
        registration function.
      
        The chain of registered hooks on any list is expected to be small so
        the cost of walking that list to find the entry we are unregistering
        should also be small.
      
        Performing the management of the dynamically allocated list entries
        in the registration and unregistration functions keeps the complexity
        from spreading.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
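The registration scheme described above (entries allocated on register, found by a short list walk and freed on unregister, with one priority-ordered list per hook) might look roughly like the following userspace sketch. The types and the names `hook_ops`, `hook_entry`, `hook_register` and `hook_unregister` are hypothetical simplifications, not the kernel's structures, and no locking or RCU is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified hook description supplied by the caller. */
struct hook_ops {
    int priority;               /* lower value runs first */
    const char *name;
};

/* List entry owned by the registration code, one per (list, ops) pair. */
struct hook_entry {
    const struct hook_ops *ops;
    struct hook_entry *next;
};

/* Allocate an entry and insert it in priority order. Returns 0 or -1. */
int hook_register(struct hook_entry **head, const struct hook_ops *ops)
{
    struct hook_entry *e;

    assert(head != NULL && ops != NULL);
    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->ops = ops;
    /* Skip entries that must run before the new one. */
    while (*head != NULL && (*head)->ops->priority <= ops->priority)
        head = &(*head)->next;
    e->next = *head;
    *head = e;
    return 0;
}

/* Walk the (short) list, unlink the entry for ops, and free it. */
void hook_unregister(struct hook_entry **head, const struct hook_ops *ops)
{
    for (; *head != NULL; head = &(*head)->next) {
        if ((*head)->ops == ops) {
            struct hook_entry *e = *head;

            *head = e->next;
            free(e);
            return;
        }
    }
}
```

Because the caller only ever hands over its `hook_ops`, the allocation stays confined to these two functions, which is the point the commit message makes about keeping the complexity from spreading.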
  2. 15 Jul, 2015 1 commit
  3. 19 Jun, 2015 1 commit
    • P
      netfilter: don't pull include/linux/netfilter.h from netns headers · a263653e
      Pablo Neira Ayuso committed
      This pulls the full hook netfilter definitions from all those that include
      net_namespace.h.
      
      Instead let's just include the bare minimum required in the new
      linux/netfilter_defs.h file, and use it from the netfilter netns header files.
      
      I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
      this compilation error:
      
      In file included from include/linux/netfilter_defs.h:4:0,
                       from include/net/netns/netfilter.h:4,
                       from include/net/net_namespace.h:22,
                       from include/linux/netdevice.h:43,
                       from net/netfilter/nfnetlink_queue_core.c:23:
      include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type
        struct in_addr in;
      
      Also explicitly include linux/netfilter.h in several spots.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  4. 14 May, 2015 4 commits
    • P
      netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira committed
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, which seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter still also has its own static key for
      minimal impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accommodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      netfilter: add nf_hook_list_active() · b8d0aad0
      Pablo Neira committed
      In preparation to have netfilter ingress per-device hook list.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      f7191483
    • P
      87d5c18c
  5. 08 Apr, 2015 3 commits
    • D
      netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller committed
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one that encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      netfilter: Add socket pointer to nf_hook_state. · 1c984f8a
      David Miller committed
      It is currently always set to NULL, but nf_queue is adjusted to be
      prepared for it being set to a real socket by taking and releasing a
      reference to that socket when necessary.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      netfilter: Add nf_hook_state initializer function. · 107a9f4d
      David Miller committed
      This way we can consolidate where we set up new nf_hook_state objects,
      to make sure the entire thing is initialized.
      
      The only other place an nf_hook_state object is instantiated is nf_queue,
      wherein a structure copy is used.
      Signed-off-by: David S. Miller <davem@davemloft.net>
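The idea of a single initializer that fills in every member can be illustrated with a small sketch. The structure below is a hypothetical, simplified stand-in (device names as strings, the socket as an opaque pointer); the field list and the name `hook_state_init` are assumptions made for illustration, not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-traversal hook state. */
struct hook_state {
    unsigned int hook;          /* hook number being traversed */
    int thresh;                 /* priority threshold */
    unsigned char pf;           /* protocol family */
    const char *in;             /* input device name, may be NULL */
    const char *out;            /* output device name, may be NULL */
    void *sk;                   /* socket, NULL when there is none */
    int (*okfn)(void *pkt);     /* continuation called on accept */
};

/* Single place that sets every member, so none is left uninitialized. */
void hook_state_init(struct hook_state *p, unsigned int hook, int thresh,
                     unsigned char pf, const char *in, const char *out,
                     void *sk, int (*okfn)(void *pkt))
{
    assert(p != NULL);
    p->hook = hook;
    p->thresh = thresh;
    p->pf = pf;
    p->in = in;
    p->out = out;
    p->sk = sk;
    p->okfn = okfn;
}
```

When a member is added later, the initializer's signature changes and every caller fails to compile until it passes a value, which is harder to get wrong than scattered open-coded assignments.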
  6. 05 Apr, 2015 2 commits
  7. 25 Aug, 2014 1 commit
  8. 14 Oct, 2013 2 commits
    • P
      netfilter: add nftables · 96518518
      Patrick McHardy committed
      This patch adds nftables which is the intended successor of iptables.
      This packet filtering framework reuses the existing netfilter hooks,
      the connection tracking system, the NAT subsystem, the transparent
      proxying engine, the logging infrastructure and the userspace packet
      queueing facilities.
      
      In a nutshell, nftables provides a pseudo-state machine with 4 general
      purpose registers of 128 bits and 1 specific purpose register to store
      verdicts. This pseudo-machine comes with an extensible instruction set,
      a.k.a. "expressions" in the nftables jargon. The expressions included
      in this patch provide the basic functionality, they are:
      
      * bitwise: to perform bitwise operations.
      * byteorder: to change from host/network endianness.
      * cmp: to compare data with the content of the registers.
      * counter: to enable counters on rules.
      * ct: to store conntrack keys into register.
      * exthdr: to match IPv6 extension headers.
      * immediate: to load data into registers.
      * limit: to limit matching based on packet rate.
      * log: to log packets.
      * meta: to match metainformation that usually comes with the skbuff.
      * nat: to perform Network Address Translation.
      * payload: to fetch data from the packet payload and store it into
        registers.
      * reject (IPv4 only): to explicitly close connection, e.g. TCP RST.
      
      Using this instruction-set, the userspace utility 'nft' can transform
      the rules expressed in human-readable text representation (using a
      new syntax, inspired by tcpdump) to nftables bytecode.
      
      nftables also inherits the table, chain and rule objects from
      iptables, but in a more configurable way, and it also includes the
      original datatype-agnostic set infrastructure with mapping support.
      This set infrastructure is enhanced in the follow up patch (netfilter:
      nf_tables: add netlink set API).
      
      This patch includes the following components:
      
      * the netlink API: net/netfilter/nf_tables_api.c and
        include/uapi/netfilter/nf_tables.h
      * the packet filter core: net/netfilter/nf_tables_core.c
      * the expressions (described above): net/netfilter/nft_*.c
      * the filter tables: arp, IPv4, IPv6 and bridge:
        net/ipv4/netfilter/nf_tables_ipv4.c
        net/ipv6/netfilter/nf_tables_ipv6.c
        net/ipv4/netfilter/nf_tables_arp.c
        net/bridge/netfilter/nf_tables_bridge.c
      * the NAT table (IPv4 only):
        net/ipv4/netfilter/nf_table_nat_ipv4.c
      * the route table (similar to mangle):
        net/ipv4/netfilter/nf_table_route_ipv4.c
        net/ipv6/netfilter/nf_table_route_ipv6.c
      * internal definitions under:
        include/net/netfilter/nf_tables.h
        include/net/netfilter/nf_tables_core.h
      * It also includes a skeleton expression:
        net/netfilter/nft_expr_template.c
        and the preliminary implementation of the meta target
        net/netfilter/nft_meta_target.c
      
      It also includes a change in struct nf_hook_ops to add a new
      pointer to store private data to the hook, that is used to store
      the rule list per chain.
      
      This patch is based on the patch from Patrick McHardy, plus the merged
      accumulated cleanups, fixes and small enhancements made to the nftables
      code since 2009, which are:
      
      From Patrick McHardy:
      * nf_tables: adjust netlink handler function signatures
      * nf_tables: only retry table lookup after successful table module load
      * nf_tables: fix event notification echo and avoid unnecessary messages
      * nft_ct: add l3proto support
      * nf_tables: pass expression context to nft_validate_data_load()
      * nf_tables: remove redundant definition
      * nft_ct: fix maxattr initialization
      * nf_tables: fix invalid event type in nf_tables_getrule()
      * nf_tables: simplify nft_data_init() usage
      * nf_tables: build in more core modules
      * nf_tables: fix double lookup expression unregistation
      * nf_tables: move expression initialization to nf_tables_core.c
      * nf_tables: build in payload module
      * nf_tables: use NFPROTO constants
      * nf_tables: rename pid variables to portid
      * nf_tables: save 48 bits per rule
      * nf_tables: introduce chain rename
      * nf_tables: check for duplicate names on chain rename
      * nf_tables: remove ability to specify handles for new rules
      * nf_tables: return error for rule change request
      * nf_tables: return error for NLM_F_REPLACE without rule handle
      * nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
      * nf_tables: fix NLM_F_MULTI usage in netlink notifications
      * nf_tables: include NLM_F_APPEND in rule dumps
      
      From Pablo Neira Ayuso:
      * nf_tables: fix stack overflow in nf_tables_newrule
      * nf_tables: nft_ct: fix compilation warning
      * nf_tables: nft_ct: fix crash with invalid packets
      * nft_log: group and qthreshold are 2^16
      * nf_tables: nft_meta: fix socket uid,gid handling
      * nft_counter: allow to restore counters
      * nf_tables: fix module autoload
      * nf_tables: allow to remove all rules placed in one chain
      * nf_tables: use 64-bits rule handle instead of 16-bits
      * nf_tables: fix chain after rule deletion
      * nf_tables: improve deletion performance
      * nf_tables: add missing code in route chain type
      * nf_tables: rise maximum number of expressions from 12 to 128
      * nf_tables: don't delete table if in use
      * nf_tables: fix basechain release
      
      From Tomasz Bursztyka:
      * nf_tables: Add support for changing users chain's name
      * nf_tables: Change chain's name to be fixed sized
      * nf_tables: Add support for replacing a rule by another one
      * nf_tables: Update uapi nftables netlink header documentation
      
      From Florian Westphal:
      * nft_log: group is u16, snaplen u32
      
      From Phil Oester:
      * nf_tables: operational limit match
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
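To make the "pseudo-state machine" description concrete, here is a toy interpreter with four 128-bit data registers and one verdict register. It is a sketch under loud assumptions: the opcodes, the `struct expr` layout, the verdict codes and `run_rule` are all hypothetical and far simpler than the real nftables expressions; only the register model comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REG_COUNT = 4, REG_BYTES = 16 };          /* 4 registers of 128 bits */
enum verdict { V_CONTINUE, V_ACCEPT, V_DROP };   /* hypothetical verdict codes */
enum opcode { OP_PAYLOAD, OP_CMP_EQ, OP_IMMEDIATE_VERDICT };

/* One "expression", i.e. one instruction of the rule's bytecode. */
struct expr {
    enum opcode op;
    uint8_t reg;                /* data register operand */
    uint8_t offset, len;        /* OP_PAYLOAD: packet bytes to load */
    uint8_t data[REG_BYTES];    /* OP_CMP_EQ: value to compare against */
    enum verdict verdict;       /* OP_IMMEDIATE_VERDICT: value to store */
};

/* Run one rule. A failed load or compare ends the rule with V_CONTINUE,
 * which in this sketch means "no decision, try the next rule". */
enum verdict run_rule(const struct expr *e, size_t n,
                      const uint8_t *pkt, size_t pkt_len)
{
    uint8_t regs[REG_COUNT][REG_BYTES] = { { 0 } };
    enum verdict vreg = V_CONTINUE;     /* the dedicated verdict register */

    for (size_t i = 0; i < n; i++) {
        assert(e[i].reg < REG_COUNT && e[i].len <= REG_BYTES);
        switch (e[i].op) {
        case OP_PAYLOAD:                /* packet bytes -> register */
            if ((size_t)e[i].offset + e[i].len > pkt_len)
                return V_CONTINUE;
            memcpy(regs[e[i].reg], pkt + e[i].offset, e[i].len);
            break;
        case OP_CMP_EQ:                 /* register vs. constant */
            if (memcmp(regs[e[i].reg], e[i].data, e[i].len) != 0)
                return V_CONTINUE;
            break;
        case OP_IMMEDIATE_VERDICT:      /* constant -> verdict register */
            vreg = e[i].verdict;
            break;
        }
    }
    return vreg;
}
```

A rule such as "drop if bytes 2..3 equal 0x1122" becomes three expressions: a payload load into register 0, a compare on register 0, and an immediate that stores the drop verdict.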
    • P
      netfilter: pass hook ops to hookfn · 795aa6ef
      Patrick McHardy committed
      Pass the hook ops to the hookfn to allow for generic hook
      functions. This change is required by nf_tables.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
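Why passing the ops to the hook function enables generic hooks can be shown with a short sketch: one function serves several registrations and finds its per-registration data through the ops it is handed. The types here (`hook_ops` with a `priv` pointer, `struct chain`, the packet reduced to a length) and the names `generic_hook`, `SK_DROP` and `SK_ACCEPT` are hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>

struct hook_ops;

/* Hook function signature that receives its own registration record. */
typedef int (*hookfn_t)(const struct hook_ops *ops, int pkt_len);

/* Hypothetical, simplified registration record. */
struct hook_ops {
    hookfn_t hook;
    void *priv;                 /* private data, e.g. a chain's rule list */
};

/* Hypothetical per-registration state reached through ops->priv. */
struct chain {
    int max_len;                /* toy "rule": drop longer packets */
    unsigned int dropped;
};

enum { SK_DROP = 0, SK_ACCEPT = 1 };    /* illustrative verdict values */

/* One generic function; its behaviour depends only on ops->priv. */
int generic_hook(const struct hook_ops *ops, int pkt_len)
{
    struct chain *c = ops->priv;

    assert(c != NULL);
    if (pkt_len > c->max_len) {
        c->dropped++;
        return SK_DROP;
    }
    return SK_ACCEPT;
}
```

Without the `ops` argument each chain would need its own function (or a global lookup) to find its rules; with it, two registrations can share `generic_hook` and still behave differently.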
  9. 27 Sep, 2013 1 commit
    • J
      netfilter: Remove extern from function prototypes · a0f4ecf3
      Joe Perches committed
      There is a mix of function prototypes with and without extern
      in the kernel sources.  Standardize on not using extern for
      function prototypes.
      
      Function prototypes don't need to be written with extern.
      extern is assumed by the compiler.  Its use is as unnecessary as
      using auto to declare automatic/local variables in a block.
      Signed-off-by: Joe Perches <joe@perches.com>
  10. 28 Aug, 2013 1 commit
  11. 13 Aug, 2013 1 commit
  12. 01 Aug, 2013 1 commit
  13. 31 Jul, 2013 1 commit
  14. 23 May, 2013 1 commit
  15. 06 Apr, 2013 1 commit
  16. 13 Oct, 2012 1 commit
  17. 30 Aug, 2012 1 commit
  18. 22 Jun, 2012 1 commit
  19. 21 Jun, 2012 1 commit
  20. 16 Jun, 2012 2 commits
  21. 07 Jun, 2012 1 commit
  22. 21 Apr, 2012 1 commit
  23. 24 Feb, 2012 1 commit
    • I
      static keys: Introduce 'struct static_key', static_key_true()/false() and... · c5905afb
      Ingo Molnar committed
      static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
      
      So here's a boot tested patch on top of Jason's series that does
      all the cleanups I talked about and turns jump labels into a
      more intuitive to use facility. It should also address the
      various misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note
      that I (intentionally) have not pushed through the rename
      blindly through to the lowest levels: the actual jump-label
      patching arch facility should be named like that, so we want to
      decouple jump labels from the static-key facility a bit.
      
      On non-jump-label enabled architectures static keys default to
      likely()/unlikely() branches.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: a.p.zijlstra@chello.nl
      Cc: mathieu.desnoyers@efficios.com
      Cc: davem@davemloft.net
      Cc: ddaney.cavm@gmail.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  24. 22 Nov, 2011 1 commit
    • E
      netfilter: use jump_label for nf_hooks · a2d7ec58
      Eric Dumazet committed
      On configs where CONFIG_JUMP_LABEL=y, we can replace in fast path a
      load/compare/conditional jump by a single jump with no dcache reference.
      
      Jump target is modified as soon as nf_hooks[pf][hook] switches from
      empty state to non empty states. jump_label state is kept outside of
      nf_hooks array so has no cost on cpu caches.
      
      This patch removes the test on CONFIG_NETFILTER_DEBUG : No need to call
      nf_hook_slow() at all if nf_hooks[pf][hook] is empty; this didn't give
      useful information, but slowed things down a lot.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 28 May, 2011 1 commit
  26. 04 Apr, 2011 2 commits
  27. 18 Jan, 2011 2 commits
    • F
      netfilter: allow NFQUEUE bypass if no listener is available · 94b27cc3
      Florian Westphal committed
      If an skb is to be NF_QUEUE'd, but no program has opened the queue, the
      packet is dropped.
      
      This adds a v2 target revision of xt_NFQUEUE that allows packets to
      continue through the ruleset instead.
      
      Because the actual queueing happens outside of the target context, the
      'bypass' flag has to be communicated back to the netfilter core.
      
      Unfortunately the only choice to do this without adding a new function
      argument is to use the target function return value (i.e. the verdict).
      
      In the NF_QUEUE case, the upper 16 bits already contain the queue number
      to use.  The previous patch reduced NF_VERDICT_MASK to 0xff, i.e.
      we now have extra room for a new flag.
      
      If a hook issued a NF_QUEUE verdict, then the netfilter core will
      continue packet processing if the queueing hook
      returns -ESRCH (== "this queue does not exist") and the new
      NF_VERDICT_FLAG_QUEUE_BYPASS flag is set in the verdict value.
      
      Note: If the queue exists, but userspace does not consume packets fast
      enough, the skb will still be dropped.
      Signed-off-by: Florian Westphal <fwestphal@astaro.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
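The bypass decision described in this commit can be summarized as a small pure function. This is a sketch, not the kernel's code: the constants are local copies (with an `SK_` prefix) of values assumed to match the netfilter header, and `after_queue` and `enum outcome` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local copies for the sketch; assumed to mirror the real header values. */
#define SK_NF_QUEUE                  3u
#define SK_VERDICT_MASK              0x000000ffu
#define SK_VERDICT_FLAG_QUEUE_BYPASS 0x00008000u

enum outcome { OUT_QUEUED, OUT_CONTINUE, OUT_DROPPED };

/*
 * Decide what happens to a packet whose hook returned an NF_QUEUE verdict,
 * given the result of the queueing attempt (0 or a negative errno).
 */
enum outcome after_queue(uint32_t verdict, int queue_err)
{
    assert((verdict & SK_VERDICT_MASK) == SK_NF_QUEUE);

    if (queue_err == 0)
        return OUT_QUEUED;              /* handed to the userspace listener */
    if (queue_err == -ESRCH && (verdict & SK_VERDICT_FLAG_QUEUE_BYPASS))
        return OUT_CONTINUE;            /* no such queue, bypass requested */
    return OUT_DROPPED;                 /* any other failure still drops */
}
```

Note that only the "queue does not exist" error triggers the bypass; a queue that exists but cannot accept the packet produces a different error and the packet is dropped, matching the note at the end of the commit message.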
    • F
      netfilter: reduce NF_VERDICT_MASK to 0xff · f615df76
      Florian Westphal committed
      NF_VERDICT_MASK is currently 0xffff. This is because the upper
      16 bits are used to store errno (for NF_DROP) or the queue number
      (NF_QUEUE verdict).
      
      As there are up to 0xffff different queues available, there is no more
      room to store additional flags.
      
      At the moment there are only 6 different verdicts, i.e. we can reduce
      NF_VERDICT_MASK to 0xff to allow storing additional flags in the 0xff00 space.
      
      NF_VERDICT_BITS would then be reduced to 8, but because the value is
      exported to userspace, this might cause breakage; e.g.:
      
      'queuenr = (1 << NF_VERDICT_BITS) | NF_QUEUE' would now break.
      
      Thus, remove NF_VERDICT_BITS usage in the kernel and move the old value
      to the 'userspace compat' section.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
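The resulting 32-bit verdict layout (verdict code in the low 8 bits, flag space in 0xff00, queue number in the upper 16 bits) can be worked through in a few lines. The constants below are local copies with an `SK_` prefix, assumed rather than taken from the header, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies for the sketch; assumed to mirror the layout described. */
#define SK_NF_QUEUE          3u
#define SK_VERDICT_MASK      0x000000ffu    /* verdict code */
#define SK_VERDICT_FLAG_MASK 0x0000ff00u    /* room freed for flags */
#define SK_VERDICT_QSHIFT    16             /* queue number starts here */

/* Build an NF_QUEUE verdict for the given queue number and flags. */
uint32_t make_queue_verdict(uint16_t queue, uint32_t flags)
{
    assert((flags & ~SK_VERDICT_FLAG_MASK) == 0);
    return ((uint32_t)queue << SK_VERDICT_QSHIFT) | flags | SK_NF_QUEUE;
}

/* Extract the verdict code from the low 8 bits. */
uint32_t verdict_code(uint32_t v)
{
    return v & SK_VERDICT_MASK;
}

/* Extract the queue number from the upper 16 bits. */
uint16_t verdict_queue(uint32_t v)
{
    return (uint16_t)(v >> SK_VERDICT_QSHIFT);
}
```

With these helpers, queue 1 encodes as `0x00010003`. A userspace program that shifted by 8 instead of 16 would produce `0x00000103`, which decodes as queue 0 with a stray bit in the flag space; that is the breakage the commit avoids by keeping the old NF_VERDICT_BITS value for userspace.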
  28. 18 Nov, 2010 1 commit
  29. 16 Nov, 2010 1 commit