1. 14 Jan 2014, 2 commits
    • ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Authored by Hannes Frederic Sowa
      This new ip_no_pmtu_disc mode only allows fragmentation-needed errors
      to be honored by protocols which do more stringent validation of the
      ICMP packet's payload. This knob is useful for people who, e.g., want to
      run an unmodified DNS server in a namespace where they need to use PMTU
      for TCP connections (as those are used for zone transfers or as a fallback
      for requests) but don't want to trust possibly spoofed UDP PMTU information.
      
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
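
      As a hedged illustration, the knob could be flipped per namespace much
      like the other pmtu sysctls (assuming the hardened behaviour is the
      mode documented as value 3 in ip-sysctl.txt):

        # inside the namespace running the unmodified DNS server
        echo 3 > /proc/sys/net/ipv4/ip_no_pmtu_disc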
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ed1dc44
    • ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Authored by Hannes Frederic Sowa
      While forwarding, we should not use the protocol path MTU to calculate
      the MTU for a forwarded packet; instead we should use the interface MTU.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in the unicast code path, it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written software which probes destinations
      manually and expects the kernel to honour those path MTUs, I introduced
      a new per-namespace "ip_forward_use_pmtu" knob so that this new
      behaviour can be disabled. We also still use MTUs which are locked on a
      route for forwarding.
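
      As a sketch (assuming the knob lands at the usual per-namespace sysctl
      path), the old behaviour of honouring path MTUs while forwarding could
      be restored with:

        # honour learned/injected path MTUs on the forwarding path again
        echo 1 > /proc/sys/net/ipv4/ip_forward_use_pmtu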
      
      The reason for this change is that path MTU information can be injected
      into the kernel via e.g. the icmp_err protocol handler without validation
      against local sockets. As such, this could cause the IPv4 forwarding path
      to wrongfully emit fragmentation-needed notifications or start to
      fragment packets along a path.
      
      Tunnel and IPsec output paths clear the IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path MTU to
      determine the fragmentation size. They also recheck packet size with
      the help of path MTU discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f87c10a8
  2. 09 Jan 2014, 1 commit
    • batman-adv: add isolation_mark sysfs attribute · c42edfe3
      Authored by Antonio Quartulli
      This attribute can be used to set and read the value and the
      mask of the skb mark which will be used to classify the
      source non-mesh client as ISOLATED. In this way a client can
      be advertised as such and the mark can potentially be
      restored at the receiving node before delivering the skb.
      
      This can be helpful for creating network-wide netfilter
      policies.
      
      This sysfs file expects a string of the form "$mark/$mask",
      where $mark is a 32-bit number in any base and $mask is a
      32-bit mask expressed in hex. Only bits in $mark covered
      by the mask are actually stored.
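
      As a hedged example (assuming the attribute sits in the usual
      batman-adv mesh sysfs directory and the mesh interface is named bat0):

        # advertise clients marked with bit 0x40 as isolated
        echo "0x40/0x40" > /sys/class/net/bat0/mesh/isolation_mark
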
      Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      c42edfe3
  3. 08 Jan 2014, 1 commit
  4. 04 Jan 2014, 1 commit
    • netfilter: x_tables: lightweight process control group matching · 82a37132
      Authored by Daniel Borkmann
      It would be useful, e.g. in a server or desktop environment, to have
      a facility for fine-grained "per application" or "per application
      group" firewall policies. Users in the mobile/embedded area (e.g.
      Android based) with different security policy requirements for
      application groups could benefit greatly from that as well. For
      example, with a little bit of configuration effort, an admin could
      whitelist well-known applications, and thus block otherwise unwanted
      "hard-to-track" applications like [1] from a user's machine. Blocking
      is just one example; netfilter allows us many scenarios/policies
      beyond that, e.g. fine-grained settings for where applications are
      allowed to connect/send traffic to, application traffic
      marking/conntracking, application-specific packet mangling, and so on.
      
      Implementing PID-based matching would not be appropriate as PIDs
      frequently change, and child tracking would make that even more
      complex and ugly. Cgroups are a perfect candidate for accomplishing
      this, as they associate a set of tasks with a set of parameters for
      one or more subsystems, in our case the netfilter subsystem, which,
      of course, can be combined with other cgroup subsystems into
      something more complex if needed.
      
      As mentioned, to overcome this constraint, such processes could
      be placed into one or multiple cgroups where different fine-grained
      rules can be defined depending on the application scenario, while
      e.g. everything else that is not part of that could be dropped (or
      vice versa), thus making life harder for unwanted processes to
      communicate with the outside world. So we make use of cgroups here
      to track jobs and limit their resources in terms of iptables
      policies; in other words, limiting and tracking what they are
      allowed to communicate.
      
      In our case we're working on outgoing traffic based on the local
      socket it originated from. Also, one doesn't even need a priori
      knowledge of the application internals regarding their particular
      use of ports or protocols. Matching is *extremely* lightweight as
      we just test for the sk_classid marker of sockets, originating from
      net_cls. net_cls and netfilter do not contradict each other; in
      fact, each construct can live standalone or they can be used in
      combination with each other, which is perfectly fine, plus it
      serves Tejun's requirement not to introduce a new cgroups
      subsystem. The result is a very minimal and efficient module that
      adds nothing except netfilter code.
      
      One possible, minimal usage example (many other iptables options
      can obviously be applied):
      
       1) Configuring cgroups if not already done, e.g.:
      
        mkdir /sys/fs/cgroup/net_cls
        mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
        mkdir /sys/fs/cgroup/net_cls/0
        echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
        (resp. a real flow handle id for tc)
      
       2) Configuring netfilter (iptables-nftables), e.g.:
      
        iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP
      
       3) Running applications, e.g.:
      
        ping 208.67.222.222  <pid:1799>
        echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
        64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
        [...]
        ping 208.67.220.220  <pid:1804>
        ping: sendmsg: Operation not permitted
        [...]
        echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
        64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
        [...]
      
      Of course, real-world deployments would make use of the cgroups
      userspace tool suite, or their own custom policy daemons that
      dynamically move applications from/to various cgroups.
      
        [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf
      
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      82a37132
  5. 01 Jan 2014, 1 commit
  6. 31 Dec 2013, 1 commit
  7. 22 Dec 2013, 2 commits
  8. 21 Dec 2013, 2 commits
  9. 19 Dec 2013, 2 commits
  10. 18 Dec 2013, 1 commit
  11. 17 Dec 2013, 2 commits
  12. 16 Dec 2013, 1 commit
  13. 13 Dec 2013, 1 commit
  14. 12 Dec 2013, 2 commits
  15. 11 Dec 2013, 1 commit
  16. 10 Dec 2013, 4 commits
    • net: phy: consolidate PHY reset in phy_init_hw() · 87aa9f9c
      Authored by Florian Fainelli
      Quite a lot of drivers touch a PHY device's MII_BMCR register to
      reset the PHY without taking care of:
      
      1) ensuring that BMCR_RESET is cleared within a given timeout
      2) the PHY state machine resuming to the proper state and re-applying
      potentially changed settings such as auto-negotiation
      
      Introduce phy_poll_reset(), which polls MII_BMCR until the BMCR_RESET
      bit is cleared, or returns a timeout error code after a given
      timeout.
      
      In order to make sure the PHY is in a correct state, phy_init_hw() first
      issues a software reset through MII_BMCR and then applies any fixups.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87aa9f9c
    • packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Authored by Daniel Borkmann
      This patch introduces a PACKET_QDISC_BYPASS socket option that
      allows using an xmit() function similar to pktgen's instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications need to be used in a scenario similar to
      pktgen's, but with a fully flexible packet payload that needs to
      be provided, for example.
      
      By default, nothing changes in behaviour for normal PF_PACKET
      TX users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed to i) prevent
      their own packets from re-entering packet_rcv() and ii) push the
      frame directly to the driver.
      
      In doing so we can increase pps (here with 64-byte packets) for
      PF_PACKET a bit:
      
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       [...]
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
      
       [**]: qdisc path with packet_rcv(), which is probably how most
             people use it (hopefully not anymore if not needed)
      
      The test was done using a modified trafgen, sending a simple
      static 64-byte packet, on all CPUs. The trick in the fast
      "qdisc path" case is to avoid re-entering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      
      Tradeoffs are documented in this patch as well: clearly, if
      queues are busy we will drop more packets, tc disciplines are
      ignored, and these packets are no longer visible to taps. For
      a pktgen-like scenario, we argue that this is acceptable.
      
      The pointer to the xmit function has been placed in a packet
      socket structure hole between cached_dev and prot_hook, which
      is hot anyway as we're working on cached_dev in each send path.
      
      Done in joint work with Jesper Dangaard Brouer.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d346a3fa
    • packet: fix send path when running with proto == 0 · 66e56cd4
      Authored by Daniel Borkmann
      Commit e40526cb introduced a cached dev pointer that gets
      updated in register_prot_hook()/__unregister_prot_hook() to
      track the device used for the send path.
      
      We need to fix this up, as otherwise it will not work with
      sockets created with protocol = 0 and with sll_protocol = 0
      passed via sockaddr_ll when doing the bind.
      
      So instead, assign the pointer directly. The compiler can inline
      these helper functions automagically.
      
      While at it, also mark the cached dev fast path as likely(),
      and document this variant of socket creation, as it seems it is
      not widely used (it seems not even the author of TX_RING was aware
      of it in his reference example [1]). Tested with the reproducer
      from e40526cb.
      
       [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
      
      Fixes: e40526cb ("packet: fix use after free race in send path when dev is released")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66e56cd4
    • [media] videobuf2: Add support for file access mode flags for DMABUF exporting · c1b96a23
      Authored by Philipp Zabel
      Currently it is not possible for userspace to map a DMABUF-exported
      buffer with write permissions. This patch allows O_RDONLY/O_RDWR to
      also be passed when exporting the buffer, so that userspace may map
      it with write permissions.
      Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
      Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com>
      Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
      c1b96a23
  17. 07 Dec 2013, 3 commits
    • ether_addr_equal: Optimize implementation, remove unused compare_ether_addr · 0d74c42f
      Authored by Joe Perches
      Add a new check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to reduce
      the number of ORs used in the ether_addr_equal comparison and very
      slightly improve function performance.
      
      Simplify the ether_addr_equal_64bits implementation.
      Integrate and remove the zap_last_2bytes helper as it's now
      used only once.
      
      Remove the now unused compare_ether_addr function.
      
      Update the unaligned-memory-access documentation to remove the
      compare_ether_addr description and show how unaligned accesses
      could occur with ether_addr_equal.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d74c42f
    • Documentation: update Ethernet PHY devices binding with 'max-speed' · 9f2b0936
      Authored by Florian Fainelli
      The 'max-speed' property is optional but defined in the ePAPR
      specification and now supported by the Linux Device Tree parsing
      infrastructure.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f2b0936
    • tcp: auto corking · f54b3111
      Authored by Eric Dumazet
      With the introduction of TCP Small Queues, TSO auto sizing, and TCP
      pacing, we can implement Automatic Corking in the kernel, to help
      applications doing small write()/sendmsg() to TCP sockets.
      
      The idea is to change tcp_push() to check whether the current skb
      payload is under the skb's optimal size (a multiple of MSS bytes).
      
      If under 'size_goal', and at least one packet is still in Qdisc or
      NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
      will be delayed up to TX completion time.
      
      This delay might allow the application to coalesce more bytes
      in the skb in following write()/sendmsg()/sendfile() system calls.
      
      The exact duration of the delay depends on the dynamics
      of the system, and might be zero if no packet for this flow
      is actually held in the Qdisc or NIC TX ring.
      
      Using FQ/pacing is a way to increase the probability of
      autocorking being triggered.
      
      Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
      this feature, defaulting to 1 (enabled).
      
      Add a new SNMP counter: nstat -a | grep TcpExtTCPAutoCorking
      This counter is incremented every time we detect that an skb was
      underused and its flush was deferred.
      
      Tested:
      
      There are interesting effects when using line-buffered commands under ssh.
      
      Excellent performance results in terms of CPU usage and total throughput.
      
      lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      9410.39
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      
            35209.439626 task-clock                #    2.901 CPUs utilized
                   2,294 context-switches          #    0.065 K/sec
                     101 CPU-migrations            #    0.003 K/sec
                   4,079 page-faults               #    0.116 K/sec
          97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
          51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
          25,697,986,603 stalled-cycles-backend    #   26.24% backend  cycles idle    [66.70%]
         102,225,978,536 instructions              #    1.04  insns per cycle
                                                   #    0.51  stalled cycles per insn [83.38%]
          18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
              91,679,646 branch-misses             #    0.49% of all branches         [83.40%]
      
            12.136204899 seconds time elapsed
      
      lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      6624.89
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
            40045.864494 task-clock                #    3.301 CPUs utilized
                     171 context-switches          #    0.004 K/sec
                      53 CPU-migrations            #    0.001 K/sec
                   4,080 page-faults               #    0.102 K/sec
         111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
          61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
          29,295,522,759 stalled-cycles-backend    #   26.31% backend  cycles idle    [66.67%]
         108,654,349,355 instructions              #    0.98  insns per cycle
                                                   #    0.57  stalled cycles per insn [83.34%]
          19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
             157,875,417 branch-misses             #    0.81% of all branches         [83.34%]
      
            12.130267788 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f54b3111
  18. 06 Dec 2013, 1 commit
    • net: davinci_emac: Fix platform data handling and make usable for am3517 · dd0df47d
      Authored by Tony Lindgren
      When booted with device tree, we may still have platform data passed
      as auxdata. For am3517 this is needed for passing the interrupt_enable
      and interrupt_disable callbacks that access the omap system control module
      registers. These callback functions will eventually go away when we have
      a separate system control module driver.
      
      Some of the things that are currently passed as platform data don't
      need to be set up as device tree properties, as they are always the
      same on am3517. So let's use a new compatible flag for those so we
      can get them from the device tree match data.
      
      Also note that we need to fix the setting of phy_dev to NULL instead
      of an empty string, as the code later on uses that to find the first
      PHY on the MDIO bus. This seems to have been caused by 5d69e007
      ("net: davinci_emac: switch to new mdio").
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd0df47d
  19. 03 Dec 2013, 10 commits
  20. 02 Dec 2013, 1 commit
    • KEYS: Fix multiple key add into associative array · 23fd78d7
      Authored by David Howells
      If sufficient keys (or keyrings) are added into a keyring such that a node in
      the associative array's tree overflows (each node has a capacity N, currently
      16) and such that all N+1 keys have the same index key segment for that level
      of the tree (the level'th nibble of the index key), then assoc_array_insert()
      calls ops->diff_objects() to indicate at which bit position the two index keys
      vary.
      
      However, __key_link_begin() passes a NULL object to assoc_array_insert()
      with the intention of supplying the correct pointer later, before we
      commit the change. This means that keyring_diff_objects() is given a
      NULL pointer as one of its arguments, which it does not expect. This
      results in an oops like the one attached.
      
      With the previous patch to fix the keyring hash function, this can be forced
      much more easily by creating a keyring and only adding keyrings to it.  Add any
      other sort of key and a different insertion path is taken - all 16+1 objects
      must want to cluster in the same node slot.
      
      This can be tested by:
      
      	r=`keyctl newring sandbox @s`
      	for ((i=0; i<=16; i++)); do keyctl newring ring$i $r; done
      
      This should work fine, but oopses when the 17th keyring is added.
      
      Since ops->diff_objects() is always called with the first pointer pointing to
      the object to be inserted (ie. the NULL pointer), we can fix the problem by
      changing the to-be-inserted object pointer to point to the index key passed
      into assoc_array_insert() instead.
      
      Whilst we're at it, we also switch the arguments so that they are the same as
      for ->compare_object().
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
      IP: [<ffffffff81191ee4>] hash_key_type_and_desc+0x18/0xb0
      ...
      RIP: 0010:[<ffffffff81191ee4>] hash_key_type_and_desc+0x18/0xb0
      ...
      Call Trace:
       [<ffffffff81191f9d>] keyring_diff_objects+0x21/0xd2
       [<ffffffff811f09ef>] assoc_array_insert+0x3b6/0x908
       [<ffffffff811929a7>] __key_link_begin+0x78/0xe5
       [<ffffffff81191a2e>] key_create_or_update+0x17d/0x36a
       [<ffffffff81192e0a>] SyS_add_key+0x123/0x183
       [<ffffffff81400ddb>] tracesys+0xdd/0xe2
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Stephen Gallagher <sgallagh@redhat.com>
      23fd78d7