1. 20 6月, 2012 1 次提交
  2. 18 6月, 2012 1 次提交
  3. 16 6月, 2012 3 次提交
    • P
      netfilter: add user-space connection tracking helper infrastructure · 12f7a505
      Pablo Neira Ayuso 提交于
      There are good reasons to supports helpers in user-space instead:
      
      * Rapid connection tracking helper development, as developing code
        in user-space is usually faster.
      
      * Reliability: A buggy helper does not crash the kernel. Moreover,
        we can monitor the helper process and restart it in case of problems.
      
      * Security: Avoid complex string matching and mangling in kernel-space
        running in privileged mode. Going further, we can even think about
        running user-space helpers as a non-root process.
      
      * Extensibility: It allows the development of very specific helpers (most
        likely non-standard proprietary protocols) that are very likely not to be
        accepted for mainline inclusion in the form of kernel-space connection
        tracking helpers.
      
      This patch adds the infrastructure to allow the implementation of
      user-space conntrack helpers by means of the new nfnetlink subsystem
      `nfnetlink_cthelper' and the existing queueing infrastructure
      (nfnetlink_queue).
      
      I had to add the new hook NF_IP6_PRI_CONNTRACK_HELPER to register
      ipv[4|6]_helper which results from splitting ipv[4|6]_confirm into
      two pieces. This change is required not to break NAT sequence
      adjustment and conntrack confirmation for traffic that is enqueued
      to our user-space conntrack helpers.
      
      Basic operation, in a few steps:
      
      1) Register user-space helper by means of `nfct':
      
       nfct helper add ftp inet tcp
      
       [ It must be a valid existing helper supported by conntrack-tools ]
      
      2) Add rules to enable the FTP user-space helper which is
         used to track traffic going to TCP port 21.
      
      For locally generated packets:
      
       iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp
      
      For non-locally generated packets:
      
       iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp
      
      3) Run the test conntrackd in helper mode (see example files under
         doc/helper/conntrackd.conf
      
       conntrackd
      
      4) Generate FTP traffic going, if everything is OK, then conntrackd
         should create expectations (you can check that with `conntrack':
      
       conntrack -E expect
      
          [NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
      [DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
      
      This confirms that our test helper is receiving packets including the
      conntrack information, and adding expectations in kernel-space.
      
      The user-space helper can also store its private tracking information
      in the conntrack structure in the kernel via the CTA_HELP_INFO. The
      kernel will consider this a binary blob whose layout is unknown. This
      information will be included in the information that is transfered
      to user-space via glue code that integrates nfnetlink_queue and
      ctnetlink.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      12f7a505
    • P
      netfilter: nfnetlink_queue: add NAT TCP sequence adjustment if packet mangled · 8c88f87c
      Pablo Neira Ayuso 提交于
      User-space programs that receive traffic via NFQUEUE may mangle packets.
      If NAT is enabled, this usually puzzles sequence tracking, leading to
      traffic disruptions.
      
      With this patch, nfnl_queue will make the corresponding NAT TCP sequence
      adjustment if:
      
      1) The packet has been mangled,
      2) the NFQA_CFG_F_CONNTRACK flag has been set, and
      3) NAT is detected.
      
      There are some records on the Internet complaning about this issue:
      http://stackoverflow.com/questions/260757/packet-mangling-utilities-besides-iptables
      
      By now, we only support TCP since we have no helpers for DCCP or SCTP.
      Better to add this if we ever have some helper over those layer 4 protocols.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8c88f87c
    • P
      netfilter: nf_ct_helper: implement variable length helper private data · 1afc5679
      Pablo Neira Ayuso 提交于
      This patch uses the new variable length conntrack extensions.
      
      Instead of using union nf_conntrack_help that contain all the
      helper private data information, we allocate variable length
      area to store the private helper data.
      
      This patch includes the modification of all existing helpers.
      It also includes a couple of include header to avoid compilation
      warnings.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1afc5679
  4. 15 6月, 2012 1 次提交
    • D
      ipv4: Handle PMTU in all ICMP error handlers. · 36393395
      David S. Miller 提交于
      With ip_rt_frag_needed() removed, we have to explicitly update PMTU
      information in every ICMP error handler.
      
      Create two helper functions to facilitate this.
      
      1) ipv4_sk_update_pmtu()
      
         This updates the PMTU when we have a socket context to
         work with.
      
      2) ipv4_update_pmtu()
      
         Raw version, used when no socket context is available.  For this
         interface, we essentially just pass in explicit arguments for
         the flow identity information we would have extracted from the
         socket.
      
         And you'll notice that ipv4_sk_update_pmtu() is simply implemented
         in terms of ipv4_update_pmtu()
      
      Note that __ip_route_output_key() is used, rather than something like
      ip_route_output_flow() or ip_route_output_key().  This is because we
      absolutely do not want to end up with a route that does IPSEC
      encapsulation and the like.  Instead, we only want the route that
      would get us to the node described by the outermost IP header.
      Reported-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36393395
  5. 13 6月, 2012 2 次提交
    • M
      net-next: add dev_loopback_xmit() to avoid duplicate code · 95603e22
      Michel Machado 提交于
      Add dev_loopback_xmit() in order to deduplicate functions
      ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
      ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).
      
      I was about to reinvent the wheel when I noticed that
      ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
      need and are not IP-only functions, but they were not available to reuse
      elsewhere.
      
      ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
      understand that this is harmless, and should be in dev_loopback_xmit().
      Signed-off-by: NMichel Machado <michel@digirati.com.br>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95603e22
    • T
      ipv4: Add interface option to enable routing of 127.0.0.0/8 · d0daebc3
      Thomas Graf 提交于
      Routing of 127/8 is tradtionally forbidden, we consider
      packets from that address block martian when routing and do
      not process corresponding ARP requests.
      
      This is a sane default but renders a huge address space
      practically unuseable.
      
      The RFC states that no address within the 127/8 block should
      ever appear on any network anywhere but it does not forbid
      the use of such addresses outside of the loopback device in
      particular. For example to address a pool of virtual guests
      behind a load balancer.
      
      This patch adds a new interface option 'route_localnet'
      enabling routing of the 127/8 address block and processing
      of ARP requests on a specific interface.
      
      Note that for the feature to work, the default local route
      covering 127/8 dev lo needs to be removed.
      
      Example:
        $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
        $ ip route del 127.0.0.0/8 dev lo table local
        $ ip addr add 127.1.0.1/16 dev eth0
        $ ip route flush cache
      
      V2: Fix invalid check to auto flush cache (thanks davem)
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0daebc3
  6. 11 6月, 2012 6 次提交
  7. 10 6月, 2012 4 次提交
  8. 09 6月, 2012 4 次提交
  9. 08 6月, 2012 1 次提交
    • V
      snmp: fix OutOctets counter to include forwarded datagrams · 2d8dbb04
      Vincent Bernat 提交于
      RFC 4293 defines ipIfStatsOutOctets (similar definition for
      ipSystemStatsOutOctets):
      
         The total number of octets in IP datagrams delivered to the lower
         layers for transmission.  Octets from datagrams counted in
         ipIfStatsOutTransmits MUST be counted here.
      
      And ipIfStatsOutTransmits:
      
         The total number of IP datagrams that this entity supplied to the
         lower layers for transmission.  This includes datagrams generated
         locally and those forwarded by this entity.
      
      Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
      IPSTATS_MIB_OUTFORWDATAGRAMS.
      
      IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
      include forwarded datagrams:
      
         The total number of IP datagrams that local IP user-protocols
         (including ICMP) supplied to IP in requests for transmission.  Note
         that this counter does not include any datagrams counted in
         ipIfStatsOutForwDatagrams.
      Signed-off-by: NVincent Bernat <bernat@luffy.cx>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d8dbb04
  10. 07 6月, 2012 8 次提交
  11. 04 6月, 2012 4 次提交
  12. 02 6月, 2012 2 次提交
    • E
      tcp: reflect SYN queue_mapping into SYNACK packets · fff32699
      Eric Dumazet 提交于
      While testing how linux behaves on SYNFLOOD attack on multiqueue device
      (ixgbe), I found that SYNACK messages were dropped at Qdisc level
      because we send them all on a single queue.
      
      Obvious choice is to reflect incoming SYN packet @queue_mapping to
      SYNACK packet.
      
      Under stress, my machine could only send 25.000 SYNACK per second (for
      200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.
      
      After patch, not a single SYNACK is dropped.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fff32699
    • E
      tcp: do not create inetpeer on SYNACK message · 7433819a
      Eric Dumazet 提交于
      Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting
      larger and larger, using lots of memory and cpu time.
      
      tcp_v4_send_synack()
      ->inet_csk_route_req()
       ->ip_route_output_flow()
        ->rt_set_nexthop()
         ->rt_init_metrics()
          ->inet_getpeer( create = true)
      
      This is a side effect of commit a4daad6b (net: Pre-COW metrics for
      TCP) added in 2.6.39
      
      Possible solution :
      
      Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS
      
      Before patch :
      
      # grep peer /proc/slabinfo
      inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0
      
      Samples: 41K of event 'cycles', Event count (approx.): 30716565122
      +  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
      +   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
      +   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
      +   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
      +   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
      +   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
      +   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
      +   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
      +   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
      +   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
      +   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
      +   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
      +   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
      +   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
      +   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
      +   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
      +   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7433819a
  13. 30 5月, 2012 1 次提交
    • G
      memcg: decrement static keys at real destroy time · 3f134619
      Glauber Costa 提交于
      We call the destroy function when a cgroup starts to be removed, such as
      by a rmdir event.
      
      However, because of our reference counters, some objects are still
      inflight.  Right now, we are decrementing the static_keys at destroy()
      time, meaning that if we get rid of the last static_key reference, some
      objects will still have charges, but the code to properly uncharge them
      won't be run.
      
      This becomes a problem specially if it is ever enabled again, because now
      new charges will be added to the staled charges making keeping it pretty
      much impossible.
      
      We just need to be careful with the static branch activation: since there
      is no particular preferred order of their activation, we need to make sure
      that we only start using it after all call sites are active.  This is
      achieved by having a per-memcg flag that is only updated after
      static_key_slow_inc() returns.  At this time, we are sure all sites are
      active.
      
      This is made per-memcg, not global, for a reason: it also has the effect
      of making socket accounting more consistent.  The first memcg to be
      limited will trigger static_key() activation, therefore, accounting.  But
      all the others will then be accounted no matter what.  After this patch,
      only limited memcgs will have its sockets accounted.
      
      [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                                  document enum sock_flag_bits,
                                  convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                                  redo tcp_update_limit() comment to 80 cols]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f134619
  14. 27 5月, 2012 1 次提交
  15. 24 5月, 2012 1 次提交
    • E
      tcp: take care of overlaps in tcp_try_coalesce() · 1ca7ee30
      Eric Dumazet 提交于
      Sergio Correia reported following warning :
      
      WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
      
      WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
           "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
           tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
      
      It appears TCP coalescing, and more specifically commit b081f85c
      (net: implement tcp coalescing in tcp_queue_rcv()) should take care of
      possible segment overlaps in receive queue. This was properly done in
      the case of out_or_order_queue by the caller.
      
      For example, segment at tail of queue have sequence 1000-2000, and we
      add a segment with sequence 1500-2500.
      This can happen in case of retransmits.
      
      In this case, just don't do the coalescing.
      Reported-by: NSergio Correia <lists@uece.net>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Tested-by: NSergio Correia <lists@uece.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ca7ee30