1. 08 Jul, 2014 2 commits
    • ipv6: Implement automatic flow label generation on transmit · cb1ce2ef
      Committed by Tom Herbert
      Automatically generate flow labels for IPv6 packets on transmit.
      The flow label is computed based on skb_get_hash. The flow label will
      only automatically be set when it is zero otherwise (i.e. flow label
      manager hasn't set one). This supports the transmit side functionality
      of RFC 6438.
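
      The transmit-side rule described above can be sketched in a few lines
      of Python. This is only an illustration, not the kernel code:
      auto_flow_label and its parameters are hypothetical names, and
      flow_hash stands in for the value the kernel obtains from
      skb_get_hash. The sketch assumes only that the IPv6 flow label is a
      20-bit field and that a non-zero label is left untouched.

```python
FLOWLABEL_MASK = 0x000FFFFF  # the IPv6 flow label field is 20 bits wide


def auto_flow_label(current_label: int, flow_hash: int, enabled: bool) -> int:
    """Return the flow label to transmit with (hypothetical helper).

    current_label: label already chosen, e.g. by the flow label manager
    flow_hash:     stand-in for the packet hash (skb_get_hash in the kernel)
    enabled:       True if the sysctl or the per-socket option is on
    """
    if current_label != 0 or not enabled:
        # Never override an existing label; do nothing when disabled.
        return current_label
    # Only the low 20 bits of the hash fit into the flow label field.
    return flow_hash & FLOWLABEL_MASK


if __name__ == "__main__":
    print(hex(auto_flow_label(0, 0xDEADBEEF, True)))
```

      A label of zero is treated as "not set", which is why the function
      only fills in a value in that case.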
      
      Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
      system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
      functionality per socket.
      
      By default, auto flowlabels are disabled to avoid possible conflicts
      with the flow label manager; however, if this feature proves useful we
      may want to enable it by default.
      
      It should also be noted that FreeBSD has already implemented automatic
      flow labels (including the sysctl and socket option). In FreeBSD,
      automatic flow labels default to enabled.
      
      Performance impact:
      
      Running super_netperf with 200 flows for TCP_RR and UDP_RR over
      IPv6. Note that in the UDP case, __skb_get_hash will be called for
      every packet, which explains the slight regression. In the TCP case
      the hash is saved in the socket, so there is no regression.
      
      Automatic flow labels disabled:
      
        TCP_RR:
          86.53% CPU utilization
          127/195/322 90/95/99% latencies
          1.40498e+06 tps
      
        UDP_RR:
          90.70% CPU utilization
          118/168/243 90/95/99% latencies
          1.50309e+06 tps
      
      Automatic flow labels enabled:
      
        TCP_RR:
          85.90% CPU utilization
          128/199/337 90/95/99% latencies
          1.40051e+06 tps
      
        UDP_RR:
          92.61% CPU utilization
          115/164/236 90/95/99% latencies
          1.4687e+06 tps
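
      As a worked example, the relative change in transactions per second
      can be computed directly from the figures quoted above. The helper
      name pct_change is an assumption made for this sketch; the numbers
      are the ones reported in this commit message.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Transactions per second, copied from the results above.
TCP_RR = {"disabled": 1.40498e06, "enabled": 1.40051e06}
UDP_RR = {"disabled": 1.50309e06, "enabled": 1.4687e06}

tcp_delta = pct_change(TCP_RR["disabled"], TCP_RR["enabled"])
udp_delta = pct_change(UDP_RR["disabled"], UDP_RR["enabled"])

if __name__ == "__main__":
    print(f"TCP_RR: {tcp_delta:+.2f}%")
    print(f"UDP_RR: {udp_delta:+.2f}%")
```

      The TCP_RR change is a fraction of a percent, while the UDP_RR
      change is a little over two percent, in line with the per-packet
      hash computation mentioned for the UDP case.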
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: systemport: add Wake-on-LAN support · 83e82f4c
      Committed by Florian Fainelli
      Support for Wake-on-LAN using Magic Packet, with or without a SecureOn
      password, is implemented by doing the following:
      
      - setting the password in the relevant UniMAC registers
      - flagging the device, as well as its Wake-on-LAN interrupt, as a
        wakeup source for the system
      - preparing the hardware for entering WoL mode
      - enabling the MPD interrupt to wake us
      
      The Device Tree binding documentation is also updated to specify the
      third optional Wake-on-LAN interrupt line.
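
      For context, the frame that the hardware matches on can be sketched
      as follows. A Magic Packet payload is six 0xFF bytes followed by
      sixteen repetitions of the target MAC address; the SecureOn password
      is assumed here to be six bytes appended at the end. The function
      name and the example addresses are hypothetical, and this says
      nothing about how the driver programs the UniMAC registers.

```python
from typing import Optional


def _hex_bytes(text: str) -> bytes:
    """Parse a colon-separated hex string such as '00:11:22:33:44:55'."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("expected exactly 6 bytes")
    return raw


def magic_packet(mac: str, secureon: Optional[str] = None) -> bytes:
    """Build a Magic Packet payload, optionally with a SecureOn password."""
    payload = b"\xff" * 6 + _hex_bytes(mac) * 16
    if secureon is not None:
        # Assumption: the password is 6 bytes, appended after the MAC copies.
        payload += _hex_bytes(secureon)
    return payload


if __name__ == "__main__":
    print(len(magic_packet("00:11:22:33:44:55")))
```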
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 Jul, 2014 2 commits
    • pktgen: document tuning for max NIC performance · 9ceb87fc
      Committed by Jesper Dangaard Brouer
      Using pktgen I'm seeing the ixgbe driver "push back" because the TX
      ring is running full.  Thus, the TX ring is artificially limiting pktgen.
      (Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
      counters.)
      
      With ixgbe, the real reason the TX ring runs full is that the TX ring
      is not being cleaned up fast enough. The ixgbe driver combines
      TX+RX ring cleanups, and the cleanup interval is affected by the
      ethtool --coalesce setting of parameter "rx-usecs".
      
      Do not increase the default NIC TX ring buffer or default cleanup
      interval.  Instead, simply document that pktgen needs special NIC
      tuning for maximum packets-per-second performance.
      
      Performance results with pktgen with clone_skb=100000.
      TX ring size 512 (default), adjusting "rx-usecs":
       (Single CPU performance, E5-2630, ixgbe)
       - 3935002 pps - rx-usecs:  1 (irqs:  9346)
       - 5132350 pps - rx-usecs: 10 (irqs: 99157)
       - 5375111 pps - rx-usecs: 20 (irqs: 50154)
       - 5454050 pps - rx-usecs: 30 (irqs: 33872)
       - 5496320 pps - rx-usecs: 40 (irqs: 26197)
       - 5502510 pps - rx-usecs: 50 (irqs: 21527)
      
      TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
       - 3935002 pps - tx-size:  512
       - 5354401 pps - tx-size:  768
       - 5356847 pps - tx-size: 1024
       - 5327595 pps - tx-size: 1536
       - 5356779 pps - tx-size: 2048
       - 5353438 pps - tx-size: 4096
      
      Notice that after commit 6f25cd47 (pktgen: fix xmit test for BQL enabled
      devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
      the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allows us to put
      more pressure on the TX ring buffers.
      
      It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
      pktgen respects this in its call to netif_xmit_frozen_or_drv_stopped(txq).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Allow accepting RA from local IP addresses. · d9333196
      Committed by Ben Greear
      This can be used in virtual networking applications, and
      may have other uses as well.  The option is disabled by
      default.
      
      A specific use case is setting up virtual routers, bridges, and
      hosts on a single OS without the use of network namespaces or
      virtual machines.  With proper use of ip rules, routing tables,
      veth interface pairs and/or other virtual interfaces,
      and applications that can bind to interfaces and/or IP addresses,
      it is possible to create one or more virtual routers with multiple
      hosts attached.  The host interfaces can act as IPv6 systems,
      with radvd running on the ports in the virtual routers.  With the
      option provided in this patch enabled, those hosts can now properly
      obtain IPv6 addresses from radvd.
      Signed-off-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 25 Jun, 2014 2 commits
  4. 24 Jun, 2014 5 commits
  5. 23 Jun, 2014 1 commit
  6. 19 Jun, 2014 2 commits
  7. 17 Jun, 2014 4 commits
  8. 13 Jun, 2014 2 commits
  9. 12 Jun, 2014 5 commits
  10. 11 Jun, 2014 5 commits
  11. 10 Jun, 2014 4 commits
  12. 09 Jun, 2014 1 commit
  13. 07 Jun, 2014 5 commits
    • mm: mark remap_file_pages() syscall as deprecated · 33041a0d
      Committed by Kirill A. Shutemov
      The remap_file_pages() system call is used to create a nonlinear
      mapping, that is, a mapping in which the pages of the file are mapped
      into a nonsequential order in memory.  The advantage of using
      remap_file_pages() over using repeated calls to mmap(2) is that the
      former approach does not require the kernel to create additional VMA
      (Virtual Memory Area) data structures.
      
      Supporting nonlinear mappings requires a significant amount of
      non-trivial code in the kernel virtual memory subsystem, including hot
      paths.  Also, to make nonlinear mappings work, the kernel needs a way to
      distinguish normal page table entries from entries with a file offset
      (pte_file).  The kernel reserves a flag in the PTE for this purpose.  PTE
      flags are a scarce resource, especially on some CPU architectures.  It
      would be nice to free up the flag for other usage.
      
      Fortunately, there are not many users of remap_file_pages() in the wild.
      It's only known that one enterprise RDBMS implementation uses the
      syscall on 32-bit systems to map files bigger than can linearly fit into
      32-bit virtual address space.  This use-case is not critical anymore
      since 64-bit systems are widely available.
      
      The plan is to deprecate the syscall and replace it with an emulation.
      The emulation will create new VMAs instead of nonlinear mappings.  It's
      going to work slower for rare users of remap_file_pages() but ABI is
      preserved.
      
      One side effect of emulation (apart from performance) is that user can
      hit vm.max_map_count limit more easily due to additional VMAs.  See
      comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.
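
      The idea behind the emulation, one ordinary mapping per file page
      instead of a single nonlinear mapping, can be illustrated from user
      space. This is a rough sketch and not the kernel emulation: the
      helper names are hypothetical, and unlike remap_file_pages() the
      mappings below are not placed contiguously in the address space.
      It does show why the mapping count grows: each out-of-order page
      gets its own mmap call and typically its own VMA.

```python
import mmap
import tempfile

# Offsets passed to mmap must be multiples of the allocation granularity.
PAGE = mmap.ALLOCATIONGRANULARITY


def map_pages(fd, page_order):
    """Map the given file pages one by one, in the requested order."""
    return [
        mmap.mmap(fd, PAGE, access=mmap.ACCESS_READ, offset=idx * PAGE)
        for idx in page_order
    ]


def demo():
    """Write pages 'A'..'D' to a file and read them back as pages 2,0,3,1."""
    with tempfile.TemporaryFile() as f:
        for i in range(4):
            f.write(bytes([ord("A") + i]) * PAGE)
        f.flush()
        views = map_pages(f.fileno(), [2, 0, 3, 1])
        try:
            # The first byte of each view identifies the file page behind it.
            return "".join(chr(v[0]) for v in views)
        finally:
            for v in views:
                v.close()


if __name__ == "__main__":
    print(demo())
```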
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce kmemleak_update_trace() · ffe2c748
      Committed by Catalin Marinas
      The memory allocation stack trace is not always useful for debugging a
      memory leak (e.g.  radix_tree_preload).  This function, when called,
      updates the stack trace for an already allocated object.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: memcg: always use swappiness of the reclaimed memcg · 688eb988
      Committed by Michal Hocko
      Memory reclaim always uses swappiness of the reclaim target memcg
      (origin of the memory pressure) or vm_swappiness for global memory
      reclaim.  This behavior was consistent (except for difference between
      global and hard limit reclaim) because swappiness was enforced to be
      consistent within each memcg hierarchy.
      
      After "mm: memcontrol: remove hierarchy restrictions for swappiness and
      oom_control" each memcg can have its own swappiness independent of
      hierarchical parents, though, so the consistency guarantee is gone.
      This can lead to an unexpected behavior.  Say that a group is explicitly
      configured to not swapout by memory.swappiness=0 but its memory gets
      swapped out anyway when the memory pressure comes from its parent with a
      It is also unexpected that the knob is meaningless without setting the
      hard limit which would trigger the reclaim and enforce the swappiness.
      There are setups where the hard limit is configured higher in the
      hierarchy by an administrator and children groups are under control of
      somebody else who is interested in the swapout behavior but not
      necessarily about the memory limit.
      
      From a semantic point of view, swappiness is an attribute defining anon
      vs. file proportional scanning of the LRU, which is memcg specific (unlike
      charges, which are propagated up the hierarchy), so it should be applied
      to the particular memcg's LRU regardless of where the memory pressure
      comes from.
      
      This patch removes vmscan_swappiness() and stores the swappiness into
      the scan_control structure.  mem_cgroup_swappiness is then used to
      provide the correct value before shrink_lruvec is called.  The global
      vm_swappiness is used for the root memcg.
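
      The change in behaviour can be modelled with a small sketch. The
      Memcg class and both helper functions are hypothetical and exist
      only for illustration; the only outside fact assumed is that the
      default value of the global vm.swappiness is 60.

```python
VM_SWAPPINESS = 60  # default value of the global vm.swappiness sysctl


class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical)."""

    def __init__(self, name, swappiness, parent=None):
        self.name = name
        self.swappiness = swappiness
        self.parent = parent  # None marks the root memcg


def swappiness_before(reclaim_target, scanned):
    """Old rule: the origin of the memory pressure decides."""
    if reclaim_target is None:  # global reclaim
        return VM_SWAPPINESS
    return reclaim_target.swappiness


def swappiness_after(reclaim_target, scanned):
    """New rule: the memcg whose LRU is being scanned decides."""
    if scanned.parent is None:  # root memcg uses the global value
        return VM_SWAPPINESS
    return scanned.swappiness


root = Memcg("root", VM_SWAPPINESS)
parent = Memcg("parent", 80, parent=root)
child = Memcg("child", 0, parent=parent)

if __name__ == "__main__":
    # Pressure originates in `parent`, and `child` is scanned as part of it.
    print(swappiness_before(parent, child), swappiness_after(parent, child))
```

      With the old rule the child configured with swappiness 0 is scanned
      with its parent's value; with the new rule its own setting applies.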
      
      [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: allow for strict write position handling · f4aacea2
      Committed by Kees Cook
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl controls the string contents instead of the
      first:
      
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
      
        $ cat /proc/sys/kernel/modprobe
        /bin/true
      
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      
      The old behavior differs enough from regular POSIX files that audits
      of software that interacts with sysctls can end up in unexpected or
      dangerous situations.  For example, "as long as the input starts with a
      trusted path" turns out to be an insufficient filter, as what must also
      happen is for the input to be entirely contained in a single write
      syscall -- not a common consideration, especially for high level tools.
      
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
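
      The difference between the legacy and the position-respecting
      behavior for string sysctls can be modelled in user space. This is
      a simplified sketch, not the kernel implementation: write_sysctl
      and its parameters are hypothetical, `cap` stands for the maximum
      number of stored characters, and the warning issued in the default
      mode is not modelled.

```python
def write_sysctl(chunks, cap, strict):
    """Apply a sequence of write() payloads to a string sysctl (toy model).

    chunks: payloads of consecutive write() calls on one open file
    cap:    maximum number of characters the sysctl can store
    strict: True models the file-position-respecting mode
    """
    value = ""
    pos = 0  # file position, advanced by every write
    for chunk in chunks:
        if strict:
            if pos <= len(value):
                # Continue the string at the current file position.
                value = (value[:pos] + chunk)[:cap]
            # A write starting beyond the stored string is ignored.
        else:
            # Legacy: every write restarts the string from the beginning.
            value = chunk[:cap]
        pos += len(chunk)
    return value


if __name__ == "__main__":
    writes = ["A" * 4096, "/bin/true"]
    print(write_sysctl(writes, 255, strict=False))
    print(write_sysctl(["/bin", "/true"], 255, strict=True))
```

      In the legacy model the last write wins, as in the strace example
      above; in the strict model the first write wins and short writes
      append.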
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers · f06e5153
      Committed by Masami Hiramatsu
      Add a "crash_kexec_post_notifiers" boot option to run kdump after
      running panic_notifiers and dumping kmsg.  This can help in rare situations
      where kdump fails because of an unstable crashed kernel or hardware failure
      (memory corruption on critical data/code), or where the 2nd kernel is
      already broken by the 1st kernel (it's a broken behavior, but who can
      guarantee that the "crashed" kernel works correctly?).
      
      Usage: add "crash_kexec_post_notifiers" to the kernel boot options.
      
      Note that this actually increases risks of the failure of kdump.  This
      option should be set only if you worry about the rare case of kdump
      failure rather than increasing the chance of success.
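
      The effect of the option on the ordering of the panic path can be
      summarised with a toy model. The function and the step names are
      hypothetical labels chosen for this sketch; it only captures the
      order of the three steps described above, not the real panic()
      code.

```python
def panic_sequence(crash_kexec_post_notifiers: bool) -> list:
    """Return the order of the main panic steps (illustrative only)."""
    steps = []
    if not crash_kexec_post_notifiers:
        # Default: try kdump first, before anything else can go wrong.
        steps.append("crash_kexec")
    steps.append("panic_notifiers")
    steps.append("kmsg_dump")
    if crash_kexec_post_notifiers:
        # With the boot option: kdump runs after notifiers and kmsg dump.
        steps.append("crash_kexec")
    return steps


if __name__ == "__main__":
    print(panic_sequence(False))
    print(panic_sequence(True))
```

      Running kdump last is what increases the risk mentioned above: the
      notifiers and the kmsg dump execute in the crashed kernel before
      the dump kernel is started.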
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
      Cc: Satoru MORIYA <satoru.moriya.br@hitachi.com>
      Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>