1. 13 12月, 2011 1 次提交
  2. 12 12月, 2011 1 次提交
  3. 22 11月, 2011 1 次提交
    • P
      netfilter: nf_conntrack: make event callback registration per-netns · 70e9942f
      Pablo Neira Ayuso 提交于
      This patch fixes an oops that can be triggered following this recipe:
      
      0) make sure nf_conntrack_netlink and nf_conntrack_ipv4 are loaded.
      1) container is started.
      2) connect to it via lxc-console.
      3) generate some traffic with the container to create some conntrack
         entries in its table.
      4) stop the container: you hit one oops because the conntrack table
         cleanup tries to report the destroy event to user-space but the
         per-netns nfnetlink socket has already gone (as the nfnetlink
         socket is per-netns but event callback registration is global).
      
      To fix this situation, we make the ctnl_notifier per-netns so the
      callback is registered/unregistered if the container is
      created/destroyed.
      
      Alex Bligh and Alexey Dobriyan originally proposed one small patch to
      check if the nfnetlink socket is gone in nfnetlink_has_listeners,
      but this is a very visited path for events, thus, it may reduce
      performance and it looks a bit hackish to check for the nfnetlink
      socket only to workaround this situation. As a result, I decided
      to follow the bigger path choice, which seems to look nicer to me.
      
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Reported-by: NAlex Bligh <alex@alex.org.uk>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      70e9942f
  4. 14 11月, 2011 1 次提交
    • E
      ipv6: reduce percpu needs for icmpv6msg mibs · 2a24444f
      Eric Dumazet 提交于
      Reading /proc/net/snmp6 on a machine with a lot of cpus is very
      expensive (can be ~88000 us).
      
      This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding
      values for all possible cpus can read 16 Mbytes of memory (32MBytes on
      non x86 arches)
      
      ICMP messages are not considered as fast path on a typical server, and
      eventually few cpus handle them anyway. We can afford an atomic
      operation instead of using percpu data.
      
      This saves 4096 bytes per cpu and per network namespace.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a24444f
  5. 10 11月, 2011 1 次提交
    • E
      ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3
      Eric Dumazet 提交于
      Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
      (can be ~88000 us).
      
      This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
      for all possible cpus can read 16 Mbytes of memory.
      
      ICMP messages are not considered as fast path on a typical server, and
      eventually few cpus handle them anyway. We can afford an atomic
      operation instead of using percpu data.
      
      This saves 4096 bytes per cpu and per network namespace.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acb32ba3
  6. 27 7月, 2011 1 次提交
  7. 14 5月, 2011 1 次提交
    • V
      net: ipv4: add IPPROTO_ICMP socket kind · c319b4d7
      Vasiliy Kulikov 提交于
      This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
      ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
      without any special privileges.  In other words, the patch makes it
      possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
      order not to increase the kernel's attack surface, the new functionality
      is disabled by default, but is enabled at bootup by supporting Linux
      distributions, optionally with restriction to a group or a group range
      (see below).
      
      Similar functionality is implemented in Mac OS X:
      http://www.manpagez.com/man/4/icmp/
      
      A new ping socket is created with
      
          socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
      
      Message identifiers (octets 4-5 of ICMP header) are interpreted as local
      ports. Addresses are stored in struct sockaddr_in. No port numbers are
      reserved for privileged processes, port 0 is reserved for API ("let the
      kernel pick a free number"). There is no notion of remote ports, remote
      port numbers provided by the user (e.g. in connect()) are ignored.
      
      Data sent and received include ICMP headers. This is deliberate to:
      1) Avoid the need to transport headers values like sequence numbers by
      other means.
      2) Make it easier to port existing programs using raw sockets.
      
      ICMP headers given to send() are checked and sanitized. The type must be
      ICMP_ECHO and the code must be zero (future extensions might relax this,
      see below). The id is set to the number (local port) of the socket, the
      checksum is always recomputed.
      
      ICMP reply packets received from the network are demultiplexed according
      to their id's, and are returned by recv() without any modifications.
      IP header information and ICMP errors of those packets may be obtained
      via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
      quenches and redirects are reported as fake errors via the error queue
      (IP_RECVERR); the next hop address for redirects is saved to ee_info (in
      network order).
      
      socket(2) is restricted to the group range specified in
      "/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
      that nobody (not even root) may create ping sockets.  Setting it to "100
      100" would grant permissions to the single group (to either make
      /sbin/ping g+s and owned by this group or to grant permissions to the
      "netadmins" group), "0 4294967295" would enable it for the world, "100
      4294967295" would enable it for the users, but not daemons.
      
      The existing code might be (in the unlikely case anyone needs it)
      extended rather easily to handle other similar pairs of ICMP messages
      (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
      etc.).
      
      Userspace ping util & patch for it:
      http://openwall.info/wiki/people/segoon/ping
      
      For Openwall GNU/*/Linux it was the last step on the road to the
      setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
      is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
      http://mirrors.kernel.org/openwall/Owl/current/iso/
      
      Initially this functionality was written by Pavel Kankovsky for
      Linux 2.4.32, but unfortunately it was never made public.
      
      All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
      the patch.
      
      PATCH v3:
          - switched to flowi4.
          - minor changes to be consistent with raw sockets code.
      
      PATCH v2:
          - changed ping_debug() to pr_debug().
          - removed CONFIG_IP_PING.
          - removed ping_seq_fops.owner field (unused for procfs).
          - switched to proc_net_fops_create().
          - switched to %pK in seq_printf().
      
      PATCH v1:
          - fixed checksumming bug.
          - CAP_NET_RAW may not create icmp sockets anymore.
      
      RFC v2:
          - minor cleanups.
          - introduced sysctl'able group range to restrict socket(2).
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c319b4d7
  8. 25 3月, 2011 1 次提交
  9. 15 3月, 2011 1 次提交
  10. 19 1月, 2011 1 次提交
    • P
      netfilter: nf_conntrack_tstamp: add flow-based timestamp extension · a992ca2a
      Pablo Neira Ayuso 提交于
      This patch adds flow-based timestamping for conntracks. This
      conntrack extension is disabled by default. Basically, we use
      two 64-bits variables to store the creation timestamp once the
      conntrack has been confirmed and the other to store the deletion
      time. This extension is disabled by default, to enable it, you
      have to:
      
      echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp
      
      This patch allows to save memory for user-space flow-based
      loogers such as ulogd2. In short, ulogd2 does not need to
      keep a hashtable with the conntrack in user-space to know
      when they were created and destroyed, instead we use the
      kernel timestamp. If we want to have a sane IPFIX implementation
      in user-space, this nanosecs resolution timestamps are also
      useful. Other custom user-space applications can benefit from
      this via libnetfilter_conntrack.
      
      This patch modifies the /proc output to display the delta time
      in seconds since the flow start. You can also obtain the
      flow-start date by means of the conntrack-tools.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      a992ca2a
  11. 14 1月, 2011 1 次提交
  12. 13 1月, 2011 17 次提交
  13. 22 11月, 2010 1 次提交
  14. 18 10月, 2010 1 次提交
    • E
      netns: reorder fields in struct net · 8e602ce2
      Eric Dumazet 提交于
      In a network bench, I noticed an unfortunate false sharing between
      'loopback_dev' and 'count' fields in "struct net".
      
      'count' is written each time a socket is created or destroyed, while
      loopback_dev might be often read in routing code.
      
      Move loopback_dev in a read mostly section of "struct net"
      
      Note: struct netns_xfrm is cache line aligned on SMP.
      (It contains a "struct dst_ops")
      Move it at the end to avoid holes, and reduce sizeof(struct net) by 128
      bytes on ia32.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e602ce2
  15. 11 5月, 2010 4 次提交
    • P
      ipv6: ip6mr: support multiple tables · d1db275d
      Patrick McHardy 提交于
      This patch adds support for multiple independant multicast routing instances,
      named "tables".
      
      Userspace multicast routing daemons can bind to a specific table instance by
      issuing a setsockopt call using a new option MRT6_TABLE. The table number is
      stored in the raw socket data and affects all following ip6mr setsockopt(),
      getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
      is created with a default routing rule pointing to it. Newly created pim6reg
      devices have the table number appended ("pim6regX"), with the exception of
      devices created in the default table, which are named just "pim6reg" for
      compatibility reasons.
      
      Packets are directed to a specific table instance using routing rules,
      similar to how regular routing rules work. Currently iif, oif and mark
      are supported as keys, source and destination addresses could be supported
      additionally.
      
      Example usage:
      
      - bind pimd/xorp/... to a specific table:
      
      uint32_t table = 123;
      setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));
      
      - create routing rules directing packets to the new table:
      
      # ip -6 mrule add iif eth0 lookup 123
      # ip -6 mrule add oif eth0 lookup 123
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      d1db275d
    • P
      6bd52143
    • P
      f30a7784
    • P
      ipv6: ip6mr: move unres_queue and timer to per-namespace data · c476efbc
      Patrick McHardy 提交于
      The unres_queue is currently shared between all namespaces. Following patches
      will additionally allow to create multiple multicast routing tables in each
      namespace. Having a single shared queue for all these users seems to excessive,
      move the queue and the cleanup timer to the per-namespace data to unshare it.
      
      As a side-effect, this fixes a bug in the seq file iteration functions: the
      first entry returned is always from the current namespace, entries returned
      after that may belong to any namespace.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      c476efbc
  16. 08 5月, 2010 1 次提交
  17. 28 4月, 2010 1 次提交
  18. 14 4月, 2010 4 次提交